SafetyALFRED: Evaluating Safety-Conscious Planning of Multimodal Large Language Models
Josue Torres-Fonseca, Naihao Deng, Yinpei Dai, Shane Storks, Yichi Zhang, et al.
TLDR
SafetyALFRED evaluates multimodal LLMs' embodied safety planning, revealing a gap between hazard recognition and active mitigation in real-world scenarios.
Key contributions
- Introduces SafetyALFRED, a new benchmark augmenting ALFRED with six categories of real-world kitchen hazards for embodied agents.
- Evaluates 11 state-of-the-art MLLMs (from the Qwen, Gemma, and Gemini families) on both hazard recognition (QA) and active risk mitigation (embodied planning).
- Reveals a significant "alignment gap": high hazard recognition but low mitigation success rates in embodied tasks.
- Advocates for a paradigm shift to embodied, action-oriented safety benchmarks for MLLMs.
Why it matters
This paper exposes a significant gap in MLLM safety: models reliably recognize hazards in QA settings but struggle to actively mitigate them in embodied environments. SafetyALFRED provides a concrete benchmark for assessing and improving the practical safety of autonomous agents, and motivates more realistic, action-oriented evaluations.
Original Abstract
Multimodal Large Language Models are increasingly adopted as autonomous agents in interactive environments, yet their ability to proactively address safety hazards remains insufficient. We introduce SafetyALFRED, built upon the embodied agent benchmark ALFRED, augmented with six categories of real-world kitchen hazards. While existing safety evaluations focus on hazard recognition through disembodied question answering (QA) settings, we evaluate eleven state-of-the-art models from the Qwen, Gemma, and Gemini families on not only hazard recognition, but also active risk mitigation through embodied planning. Our experimental results reveal a significant alignment gap: while models can accurately recognize hazards in QA settings, average mitigation success rates for these hazards are low in comparison. Our findings demonstrate that static evaluations through QA are insufficient for physical safety, thus we advocate for a paradigm shift toward benchmarks that prioritize corrective actions in embodied contexts. We open-source our code and dataset at https://github.com/sled-group/SafetyALFRED.git.