AgentTrust: Runtime Safety Evaluation and Interception for AI Agent Tool Use
TLDR
AgentTrust provides a runtime safety layer for AI agents, intercepting tool calls to prevent unsafe actions like data exfiltration and accidental deletion.
Key contributions
- Intercepts AI agent tool calls at runtime and returns a structured safety verdict: allow, warn, block, or review (see the sketch after this list).
- Combines a shell deobfuscation normalizer, SafeFix suggestions for safer alternatives, RiskChain detection of multi-step attack chains, and a cache-aware LLM-as-Judge for ambiguous inputs.
- Achieves 95.0% verdict accuracy on a 300-scenario internal benchmark and 96.7% on 630 independently constructed adversarial scenarios.
- Open-source (AGPL-3.0) and ships a Model Context Protocol (MCP) server for MCP-compatible agents.
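A minimal sketch of the intercept-and-decide pattern these contributions describe, in Python. All names here (`Verdict`, `evaluate_tool_call`, the toy rule table) are hypothetical illustrations, not AgentTrust's actual API.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    ALLOW = "allow"
    WARN = "warn"
    BLOCK = "block"
    REVIEW = "review"


@dataclass
class Verdict:
    action: Action
    risk_level: str               # e.g. "low", "medium", "high", "critical"
    reason: str
    safe_fix: str | None = None   # optional safer alternative, SafeFix-style


# Toy rule table: substring to match in a shell command -> verdict to return.
RULES = [
    ("rm -rf /", Verdict(Action.BLOCK, "critical", "recursive delete of root",
                         safe_fix="rm -rf ./build  # scope the deletion")),
    ("curl ", Verdict(Action.WARN, "medium", "outbound HTTP request")),
]


def evaluate_tool_call(tool: str, args: dict) -> Verdict:
    """Return a structured verdict before the tool call executes."""
    if tool == "shell":
        command = args.get("command", "")
        for pattern, verdict in RULES:
            if pattern in command:
                return verdict
    return Verdict(Action.ALLOW, "low", "no rule matched")


# The agent runtime consults the verdict before any side effect happens.
v = evaluate_tool_call("shell", {"command": "rm -rf / --no-preserve-root"})
if v.action is Action.BLOCK:
    print(f"blocked: {v.reason}; safer alternative: {v.safe_fix}")
```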
Why it matters
AI agents act on the real world through tools, which poses significant security risks. AgentTrust offers a crucial runtime defense, stopping harmful actions before they execute. This makes AI agents safer and more trustworthy, enabling broader and more secure deployment in sensitive environments.
Original Abstract
Modern AI agents execute real-world side effects through tool calls such as file operations, shell commands, HTTP requests, and database queries. A single unsafe action, including accidental deletion, credential exposure, or data exfiltration, can cause irreversible harm. Existing defenses are incomplete: post-hoc benchmarks measure behavior after execution, static guardrails miss obfuscation and multi-step context, and infrastructure sandboxes constrain where code runs without understanding what an action means. We present AgentTrust, a runtime safety layer that intercepts agent tool calls before execution and returns a structured verdict: allow, warn, block, or review. AgentTrust combines a shell deobfuscation normalizer, SafeFix suggestions for safer alternatives, RiskChain detection for multi-step attack chains, and a cache-aware LLM-as-Judge for ambiguous inputs. We release a 300-scenario benchmark across six risk categories and an additional 630 independently constructed real-world adversarial scenarios. On the internal benchmark, the production-only ruleset achieves 95.0% verdict accuracy and 73.7% risk-level accuracy at low-millisecond end-to-end latency. On the 630-scenario benchmark, evaluated under a patched ruleset and not claimed as zero-shot, AgentTrust achieves 96.7% verdict accuracy, including about 93% on shell-obfuscated payloads. AgentTrust is released under the AGPL-3.0 license and provides a Model Context Protocol server for MCP-compatible agents.
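To make the staged design concrete, here is a rough sketch of the pipeline the abstract outlines: normalize obfuscated shell input, try fast static rules, and fall back to a cached LLM judge only for ambiguous cases. The function names, the single base64 heuristic, and the memoization cache are illustrative assumptions, not the paper's implementation.

```python
import base64
import functools
import re


def normalize_shell(command: str) -> str:
    """Undo one simple obfuscation layer: inline `echo <b64> | base64 -d`
    payloads, a common trick for hiding commands from string matchers."""
    match = re.search(r"echo\s+([A-Za-z0-9+/=]+)\s*\|\s*base64\s+(-d|--decode)", command)
    if match:
        try:
            decoded = base64.b64decode(match.group(1)).decode("utf-8", "replace")
            return command.replace(match.group(0), decoded)
        except Exception:
            pass  # leave undecodable payloads as-is; rules still see the raw text
    return command


def static_rules(command: str) -> str | None:
    """Fast path: return a verdict string, or None if the rules are unsure."""
    if "rm -rf /" in command:
        return "block"
    if command.startswith(("ls", "cat ", "pwd")):
        return "allow"
    return None  # ambiguous -> escalate to the judge


@functools.lru_cache(maxsize=1024)
def llm_judge(command: str) -> str:
    """Stand-in for a cache-aware LLM-as-Judge call; memoized so repeated
    identical inputs skip the expensive model round trip."""
    return "review"  # a real judge would return allow/warn/block/review


def verdict(command: str) -> str:
    normalized = normalize_shell(command)
    return static_rules(normalized) or llm_judge(normalized)


# The obfuscated payload decodes to "rm -rf /" and is blocked after normalization.
print(verdict("echo cm0gLXJmIC8= | base64 -d | sh"))  # -> "block"
```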