Behavioral Integrity Verification for AI Agent Skills
Yuhao Wu, Tung-Ling Li, Hongliang Liu
TLDR
This paper introduces Behavioral Integrity Verification (BIV) to audit AI agent skills directly, finding that 80.0% of skills deviate from their declared behavior and achieving state-of-the-art malicious-skill detection.
Key contributions
- Formalizes Behavioral Integrity Verification (BIV) for AI agent skills.
- BIV framework combines deterministic code analysis and LLM-assisted capability extraction.
- Finds that 80.0% of skills exhibit a description-implementation gap, surfacing four novel compound-threat categories.
- Achieves 0.946 F1 for malicious skill detection, outperforming state-of-the-art baselines.
Why it matters
This work addresses a critical gap in AI agent safety by verifying skill artifacts directly. It reveals pervasive discrepancies between declared and actual capabilities, highlighting significant security risks. BIV provides a scalable solution for auditing agent skills, enhancing trust and security in LLM-powered systems.
Original Abstract
Agent skills extend LLM agents with privileged third-party capabilities such as filesystem access, credentials, network calls, and shell execution. Existing safety work catches malicious prompts and risky runtime actions, but the skill artifact itself goes unverified. We formalize this as the behavioral integrity verification (BIV) problem: a typed set comparison between declared and actual capabilities over a shared taxonomy that bridges code, instructions, and metadata. The BIV framework instantiates this comparison by pairing deterministic code analysis with LLM-assisted capability extraction. The resulting structured evidence supports three downstream analyses: deviation taxonomy, root-cause classification, and malicious-skill detection. On 49,943 skills from the OpenClaw registry, the deviation taxonomy reveals a pervasive description-implementation gap: 80.0% of skills deviate from declared behavior, with four novel compound-threat categories surfaced. Root-cause classification finds that deviations are mostly oversight, not malice: 81.1% trace to developer oversight and 18.9% to adversarial intent, with 5.0% of skills carrying predicted multi-stage attack chains. On a 906-skill malicious-skill detection benchmark, BIV reaches an F1 of 0.946, outperforming state-of-the-art rule-based and single-pass LLM baselines. These results demonstrate behavioral integrity auditing for agent skills at scale.
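The core of BIV, as the abstract describes it, is a typed set comparison between declared and actual capabilities over a shared taxonomy. A minimal sketch of that comparison is below; the capability names, class names, and `compare` function are illustrative assumptions, not the paper's actual taxonomy or API:

```python
# Hypothetical sketch of BIV-style typed set comparison: declared and actual
# capabilities live in one shared taxonomy, and deviations are the two halves
# of the set difference. All identifiers here are illustrative.
from dataclasses import dataclass
from enum import Enum

class Capability(Enum):
    # A toy capability taxonomy spanning the privileged actions the
    # abstract mentions: filesystem, credentials, network, shell.
    FILESYSTEM_READ = "filesystem:read"
    FILESYSTEM_WRITE = "filesystem:write"
    CREDENTIAL_ACCESS = "credentials:access"
    NETWORK_CALL = "network:call"
    SHELL_EXEC = "shell:exec"

@dataclass(frozen=True)
class Deviation:
    undeclared: frozenset    # implemented but not declared (potential threat)
    unimplemented: frozenset # declared but not implemented (stale metadata)

def compare(declared: set, actual: set) -> Deviation:
    # Declared capabilities come from skill metadata/instructions; actual
    # capabilities would come from code analysis + LLM-assisted extraction.
    return Deviation(
        undeclared=frozenset(actual - declared),
        unimplemented=frozenset(declared - actual),
    )

# Example: a skill that declares only filesystem reads, but whose code also
# makes network calls and runs shell commands.
declared = {Capability.FILESYSTEM_READ}
actual = {Capability.FILESYSTEM_READ, Capability.NETWORK_CALL, Capability.SHELL_EXEC}
dev = compare(declared, actual)
print(sorted(c.value for c in dev.undeclared))  # → ['network:call', 'shell:exec']
```

In this framing, a nonempty `undeclared` set is the description-implementation gap the paper measures; the structured evidence (which capabilities deviate, and of what type) is what feeds the downstream deviation taxonomy, root-cause classification, and malicious-skill detection.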