Shihao Weng

2 papers · Latest: May 7, 2026

Beyond Accuracy: Policy Invariance as a Reliability Test for LLM Safety Judges

LLM safety judges are unreliable: their verdicts depend on how the policy is worded, not only on the agent's behavior, which leads to flawed safety evaluations.

ARGUS defends LLM agents against context-aware prompt injection by auditing decisions based on provenance, significantly reducing attack success.
