SafetyALFRED: Evaluating Safety-Conscious Planning of Multimodal Large Language Models
Josue Torres-Fonseca, Naihao Deng, Yinpei Dai, Shane Storks, Yichi Zhang, et al.
TLDR
SafetyALFRED evaluates multimodal LLMs' embodied safety planning, revealing a gap between hazard recognition and active mitigation in real-world scenarios.
Key contributions
- Introduces SafetyALFRED, a new benchmark augmenting ALFRED with six categories of real-world kitchen hazards for embodied agents.
- Evaluates 11 state-of-the-art MLLMs (from the Qwen, Gemma, and Gemini families) on both hazard recognition (QA) and active risk mitigation (embodied planning).
- Reveals a significant "alignment gap": high hazard recognition but low mitigation success rates in embodied tasks.
- Advocates for a paradigm shift to embodied, action-oriented safety benchmarks for MLLMs.
Why it matters
This paper exposes a significant gap in MLLM safety: models reliably recognize hazards in QA settings but struggle to actively mitigate them in embodied environments. SafetyALFRED provides a concrete benchmark for assessing and improving the practical safety of autonomous agents, and motivates more realistic, action-oriented evaluations.
Original Abstract
Multimodal Large Language Models are increasingly adopted as autonomous agents in interactive environments, yet their ability to proactively address safety hazards remains insufficient. We introduce SafetyALFRED, built upon the embodied agent benchmark ALFRED, augmented with six categories of real-world kitchen hazards. While existing safety evaluations focus on hazard recognition through disembodied question answering (QA) settings, we evaluate eleven state-of-the-art models from the Qwen, Gemma, and Gemini families on not only hazard recognition, but also active risk mitigation through embodied planning. Our experimental results reveal a significant alignment gap: while models can accurately recognize hazards in QA settings, average mitigation success rates for these hazards are low in comparison. Our findings demonstrate that static evaluations through QA are insufficient for physical safety, thus we advocate for a paradigm shift toward benchmarks that prioritize corrective actions in embodied contexts. We open-source our code and dataset at https://github.com/sled-group/SafetyALFRED.git.