Proteus: A Self-Evolving Red Team for Agent Skill Ecosystems

May 12, 2026 · arXiv:2605.11891

cs.CR, cs.AI

TLDR

Proteus is a self-evolving red-team framework that uncovers adaptive leakage in LLM agent skills, showing current vetting underestimates risk.

Key contributions

Defines 'adaptive leakage' to describe iterative attacks on LLM agent skills that evade audits.
Introduces Proteus, a self-evolving red-team framework to measure and exploit adaptive leakage.
Proteus searches a five-axis attack space, using audit feedback for iterative skill mutation.
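Reaches 40-90% Attack Success Rate within 5 rounds (ASR@5) across eight phase-1 cells; in phase 2, SkillVetter is bypassed at ≥93% in every cell, and AI-Infra-Guard still admits up to 41.3% joint success.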
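As a rough illustration of the headline metric, the sketch below computes an ASR@k curve from per-seed outcomes. It assumes ASR@k is the fraction of attack seeds whose first jointly successful candidate (passes audit and produces verified harm) appears at or before round k; the paper's exact definition may differ. The function name `asr_at_k` and the sample data are hypothetical and are not taken from the paper.

```python
from typing import Optional, Sequence


def asr_at_k(first_success_round: Sequence[Optional[int]], k: int) -> float:
    """Fraction of seeds whose first joint success occurs at or before round k.

    Assumed definition, for illustration only. Each entry is the 1-based round
    of the first success for one seed, or None if the seed never succeeded.
    """
    if not first_success_round:
        raise ValueError("need at least one seed")
    hits = sum(1 for r in first_success_round if r is not None and r <= k)
    return hits / len(first_success_round)


# Hypothetical outcomes for ten seeds (not results from the paper).
rounds = [1, 3, None, 5, 2, None, 4, None, 7, 2]

# A rising curve over k corresponds to a positive learning-curve slope.
curve = [asr_at_k(rounds, k) for k in range(1, 6)]
print(curve)
```

Under this reading, a seed that first succeeds in round 7 does not count toward ASR@5, which is why the round budget matters when comparing auditors.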

Why it matters

This paper highlights a critical security vulnerability in LLM agent skill ecosystems. It demonstrates that current static auditing methods are insufficient against adaptive attackers who can iteratively refine malicious skills. The findings urge developers and marketplaces to adopt more dynamic and feedback-driven security measures.

Original Abstract

Agent skills extend LLM agents with reusable instructions, tool interfaces, and executable code, and users increasingly install third-party skills from marketplaces, repositories, and community channels. Because a skill exposes both executable behavior and context-setting documentation, its deployment risk cannot be measured by single-shot audits or prompt-level red teams alone: a realistic attacker can use audit and runtime feedback to repeatedly rewrite the skill. We frame this risk as adaptive leakage (whether a budgeted attacker can iteratively revise a skill until it passes audit and produces verified runtime harm) and present Proteus, a grey-box self-evolving red-team framework for measuring it. Proteus searches a formalized five-axis skill-attack space. Each candidate is evaluated through a unified audit-sandbox-oracle pipeline that returns structured audit findings and runtime evidence to guide cross-round mutation. Beyond initial evasion, Proteus performs path expansion, which finds alternative implementations of successful attacks, and surface expansion, which transfers learned implementation patterns to new attack objectives beyond the original seed catalogue. Across eight phase-1 cells, Proteus reaches 40-90% Attack Success Rate at 5 rounds (ASR@5) with positive learning-curve slopes on both evaluated auditors. Phase-2 path/surface expansion produces 438 jointly bypassing and lethal variants, with SkillVetter bypassed at ≥ 93% in every cell and AI-Infra-Guard, the strongest public auditor we evaluate, still admitting up to 41.3% joint-success. These results show that current skill vetting substantially underestimates residual risk when evaluated against adaptive, feedback-driven attackers.
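The budgeted, feedback-driven loop described in the abstract can be sketched abstractly as follows. This is a toy model under stated assumptions, not the Proteus implementation: `adaptive_search`, `toy_audit`, `toy_oracle`, `toy_mutate`, and the opaque variant labels are all hypothetical stand-ins, and the real pipeline returns structured audit findings and sandbox evidence rather than simple labels.

```python
from typing import Callable, List, Optional, Tuple


def adaptive_search(
    seed: str,
    mutate: Callable[[str, List[str]], str],
    audit: Callable[[str], List[str]],
    oracle: Callable[[str], bool],
    budget: int = 5,
) -> Tuple[Optional[int], str]:
    """Revise a candidate for up to `budget` rounds using audit feedback.

    Returns (round, candidate) on joint success, meaning no audit findings
    and an oracle-verified effect, or (None, last_candidate) if the budget
    is exhausted first.
    """
    candidate = seed
    for round_no in range(1, budget + 1):
        findings = audit(candidate)   # stand-in for structured audit findings
        verified = oracle(candidate)  # stand-in for sandbox + oracle evidence
        if not findings and verified:
            return round_no, candidate
        candidate = mutate(candidate, findings)  # feedback guides next round
    return None, candidate


# Toy stand-ins: candidates are opaque labels, not real skills.
VARIANTS = ["variant_a", "variant_b", "variant_c"]
FLAGGED = {"variant_a", "variant_b"}  # what the toy auditor recognizes


def toy_audit(candidate: str) -> List[str]:
    return [candidate] if candidate in FLAGGED else []


def toy_oracle(candidate: str) -> bool:
    return candidate in VARIANTS


def toy_mutate(candidate: str, findings: List[str]) -> str:
    # Move to the next variant, staying on the last one once reached.
    i = VARIANTS.index(candidate)
    return VARIANTS[min(i + 1, len(VARIANTS) - 1)]


result = adaptive_search("variant_a", toy_mutate, toy_audit, toy_oracle, budget=5)
print(result)
```

The point of the sketch is that a single-shot audit of `variant_a` would report a detection, while the same auditor admits a later variant once the attacker is allowed several feedback rounds.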
