ArXiv TLDR

Using large language models for embodied planning introduces systematic safety risks

arXiv: 2604.18463

Tao Zhang, Kaixian Qu, Zhibin Li, Jiajun Wu, Marco Hutter + 2 more

cs.AI · cs.LG · cs.RO

TLDR

LLMs used as robotic planners carry significant safety risks: even the best-planning models generate dangerous plans, exposing a critical challenge for deployment.

Key contributions

  • Introduces DESPITE, a benchmark of 12,279 tasks for evaluating the safety of LLM planners.
  • Shows that high planning ability in LLMs does not ensure safety: the best-planning model produced dangerous plans on 28.3% of tasks.
  • Open-source LLMs improve planning ability substantially with scale, yet their safety awareness stays flat (38–57%).
  • Proprietary reasoning models reach notably higher safety awareness (71–81%) than other model classes.

Why it matters

This paper reveals a critical safety gap in using LLMs for robotic planning. It demonstrates that improving planning ability alone doesn't ensure safety, urging a dedicated focus on danger avoidance. This is vital for the safe and reliable deployment of LLM-based robotic systems.

Original Abstract

Large language models are increasingly used as planners for robotic systems, yet how safely they plan remains an open question. To evaluate safe planning systematically, we introduce DESPITE, a benchmark of 12,279 tasks spanning physical and normative dangers with fully deterministic validation. Across 23 models, even near-perfect planning ability does not ensure safety: the best-planning model fails to produce a valid plan on only 0.4% of tasks but produces dangerous plans on 28.3%. Among 18 open-source models from 3B to 671B parameters, planning ability improves substantially with scale (0.4-99.3%) while safety awareness remains relatively flat (38-57%). We identify a multiplicative relationship between these two capacities, showing that larger models complete more tasks safely primarily through improved planning, not through better danger avoidance. Three proprietary reasoning models reach notably higher safety awareness (71-81%), while non-reasoning proprietary models and open-source reasoning models remain below 57%. As planning ability approaches saturation for frontier models, improving safety awareness becomes a central challenge for deploying language-model planners in robotic systems.
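The multiplicative relationship the abstract describes can be sketched with a small calculation. This is an illustration using the figures reported above, not the paper's exact metric: the assumption here is that the safe-completion rate is simply the product of planning ability and safety awareness.

```python
# Illustrative sketch of the multiplicative relationship described in the
# abstract: safe-completion rate ≈ planning ability × safety awareness.
# The product formula is an assumption for illustration; the input figures
# come from the abstract (best-planning model: 0.4% invalid plans,
# 28.3% dangerous plans).

def safe_completion_rate(planning_ability: float, safety_awareness: float) -> float:
    """Fraction of tasks completed both validly and safely under a
    simple multiplicative model of the two capacities."""
    return planning_ability * safety_awareness

planning = 1.0 - 0.004   # valid plan on 99.6% of tasks
safety = 1.0 - 0.283     # safe (non-dangerous) plan on 71.7% of tasks

print(f"{safe_completion_rate(planning, safety):.1%}")  # 71.4%
```

Under this simple model, scaling that raises planning ability toward 1.0 yields diminishing returns: once planning saturates, the safe-completion rate is capped by safety awareness alone, which is the paper's central point.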
