ArXiv TLDR

SafeManip: A Property-Driven Benchmark for Temporal Safety Evaluation in Robotic Manipulation

arXiv: 2605.12386

Chengyue Huang, Khang Vo Huynh, Sebastian Elbaum, Zsolt Kira, Lu Feng

cs.RO

TLDR

SafeManip is a new benchmark that uses LTLf to evaluate temporal safety in robotic manipulation, revealing that even strong models often behave unsafely.

Key contributions

  • Introduces SafeManip, a property-driven benchmark for temporal safety in robotic manipulation.
  • Uses LTLf and symbolic predicate traces to evaluate 8 safety categories (e.g., collision, contamination).
  • Provides reusable safety templates that generalize across diverse tasks and environments.
  • Reveals that strong vision-language-action policies often fail temporal safety checks.
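To make the "reusable safety templates" idea concrete, here is a minimal sketch of how an LTLf template might be instantiated with task-specific symbols. The template string, placeholder names, and `instantiate` helper are illustrative assumptions, not the paper's actual API.

```python
# Hypothetical reusable safety template: an object may only be
# released once it is inside its target container (cf. the paper's
# "release stability" category). Placeholders are bound per task.
RELEASE_STABILITY = "G(release({obj}) -> inside({obj}, {container}))"

def instantiate(template: str, **symbols: str) -> str:
    """Bind template placeholders to concrete task objects/regions."""
    return template.format(**symbols)

# The same template generalizes across tasks by swapping symbols:
spec = instantiate(RELEASE_STABILITY, obj="mug", container="cabinet")
print(spec)  # G(release(mug) -> inside(mug, cabinet))
```

The instantiated formula would then be handed to an LTLf monitor and checked against the rollout's symbolic predicate trace.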

Why it matters

Current robotic evaluation often overlooks temporal safety, focusing only on task completion. SafeManip fills this gap by providing a systematic way to diagnose and measure safe success. This is vital for developing robust and trustworthy robotic systems that can operate reliably in real-world scenarios.

Original Abstract

Robotic manipulation is typically evaluated by task success, but successful completion does not guarantee safe execution. Many safety failures are temporal: a robot may touch a clean surface after contamination or release an object before it is fully inside an enclosure. We introduce SafeManip, a property-driven benchmark to explicitly evaluate temporal safety properties in robotic manipulation, moving beyond prior evaluations that largely focus on task completion or per-state constraint violations. SafeManip defines reusable safety templates over finite executions using Linear Temporal Logic over finite traces (LTLf). It maps observed rollouts to symbolic predicate traces and evaluates them with LTLf-based monitors. Its property suite covers eight manipulation safety categories: collision and contact safety, grasp stability, release stability, cross-contamination, action onset, mechanism recovery, object containment, and enclosure access. Templates can be instantiated with task-specific objects, fixtures, regions, or skills, allowing the same safety specifications to generalize across tasks and environments. We evaluate SafeManip on six vision-language-action policies, including $π_0$, $π_{0.5}$, GR00T, and their training variants, across 50 RoboCasa365 household tasks. Results show that even strong models often behave unsafely. Task-success gains do not reliably translate into safer execution: many successful rollouts remain unsafe, while longer-horizon or more complex tasks expose more violations. SafeManip provides a reusable evaluation layer for diagnosing temporal safety failures and measuring safe success beyond task completion.
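The abstract's pipeline, mapping a rollout to a symbolic predicate trace and checking it with an LTLf monitor, can be sketched in a few lines. The predicate names, trace encoding, and monitor below are assumptions for illustration; the paper's actual formulas and tooling may differ.

```python
# Hypothetical monitor for the cross-contamination example from the
# abstract: once 'contaminated' holds, 'touch_clean_surface' must
# never hold again, i.e. G(contaminated -> G(not touch_clean_surface)).
# A trace is a list of per-step dicts mapping predicate names to bools.

def violates_contamination_safety(trace):
    """Return True iff the finite trace violates the property."""
    contaminated = False  # latch: contamination persists once observed
    for state in trace:
        contaminated = contaminated or state.get("contaminated", False)
        if contaminated and state.get("touch_clean_surface", False):
            return True  # clean surface touched after contamination
    return False  # property holds over the whole finite trace

# Unsafe rollout: the gripper touches a clean surface after contamination.
unsafe = [
    {"contaminated": False, "touch_clean_surface": False},
    {"contaminated": True,  "touch_clean_surface": False},
    {"contaminated": False, "touch_clean_surface": True},  # still latched
]
# Safe rollout: contact happens only before any contamination.
safe = [
    {"contaminated": False, "touch_clean_surface": True},
    {"contaminated": True,  "touch_clean_surface": False},
]
```

Note that this check is independent of task success, which is exactly the gap the benchmark targets: a rollout can complete its task yet still register as unsafe.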

📬 Weekly AI Paper Digest

Get the top 10 AI/ML arXiv papers from the week — summarized, scored, and delivered to your inbox every Monday.