ArXiv TLDR

RedVLA: Physical Red Teaming for Vision-Language-Action Models

2604.22591

Yuhao Zhang, Borong Zhang, Jiaming Fan, Jiachen Shen, Yishuai Cai + 2 more

cs.RO

TLDR

This paper introduces RedVLA, the first physical red teaming framework for VLA models, to proactively detect and mitigate real-world safety risks.

Key contributions

  • Proposes RedVLA, the first framework for physical red teaming of Vision-Language-Action (VLA) models.
  • Employs a two-stage process: Risk Scenario Synthesis and Risk Amplification to find unsafe behaviors.
  • Uncovers diverse unsafe behaviors, achieving an attack success rate of up to 95.5% across six representative VLA models.
  • Introduces SimpleVLA-Guard, a lightweight safety guard built from RedVLA-generated data.

Why it matters

Real-world VLA model deployment is hindered by unpredictable physical safety risks, which currently lack proactive detection. RedVLA fills this critical gap by systematically uncovering unsafe behaviors, enabling developers to mitigate risks before deployment and enhance model safety. This is crucial for building trust and enabling safer adoption of VLA technologies.

Original Abstract

The real-world deployment of Vision-Language-Action (VLA) models remains limited by the risk of unpredictable and irreversible physical harm. However, we currently lack effective mechanisms to proactively detect these physical safety risks before deployment. To address this gap, we propose RedVLA, the first red teaming framework for physical safety in VLA models. We systematically uncover unsafe behaviors through a two-stage process: (I) Risk Scenario Synthesis constructs a valid and task-feasible initial risk scene. Specifically, it identifies critical interaction regions from benign trajectories and positions the risk factor within these regions, aiming to entangle it with the VLA's execution flow and elicit a target unsafe behavior. (II) Risk Amplification ensures stable elicitation across heterogeneous models. It iteratively refines the risk factor's state through gradient-free optimization guided by trajectory features. Experiments on six representative VLA models show that RedVLA uncovers diverse unsafe behaviors and achieves an attack success rate (ASR) of up to 95.5% within 10 optimization iterations. To mitigate these risks, we further propose SimpleVLA-Guard, a lightweight safety guard built from RedVLA-generated data. Our data, assets, and code are available at https://redvla.github.io.
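To make the Risk Amplification stage concrete: the abstract describes iteratively refining a risk factor's state via gradient-free optimization guided by trajectory features. The paper's actual objective and state representation are not given here, so the sketch below is only an illustrative stand-in: a simple random-search hill climb over a 2-D risk-factor position, with `risk_score` as a hypothetical placeholder for the trajectory-feature-guided objective.

```python
import random

def risk_score(state):
    # Hypothetical stand-in for the paper's trajectory-feature objective:
    # here, simply how close the risk factor sits to a critical
    # interaction region assumed to be at the origin.
    x, y = state
    return -(x ** 2 + y ** 2)

def amplify_risk(initial_state, iterations=10, step=0.5, seed=0):
    """Gradient-free hill climbing: randomly perturb the risk factor's
    state and keep a perturbation only if the risk score improves."""
    rng = random.Random(seed)
    state, best = initial_state, risk_score(initial_state)
    for _ in range(iterations):
        candidate = tuple(v + rng.uniform(-step, step) for v in state)
        score = risk_score(candidate)
        if score > best:  # accept only improving moves
            state, best = candidate, score
    return state, best

# Start the risk factor away from the interaction region and refine.
state, score = amplify_risk((2.0, -1.5))
```

The 10-iteration budget mirrors the abstract's reported setting; any derivative-free method (evolutionary strategies, CMA-ES, Bayesian optimization) could fill the same role, and which one RedVLA actually uses is not specified in this summary.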
