ADAPT: Benchmarking Commonsense Planning under Unspecified Affordance Constraints
Pei-An Chen, Yong-Ching Liang, Jia-Fong Yeh, Hung-Ting Su, Yi-Ting Chen, et al.
TLDR
This paper introduces a benchmark (DynAfford) and a plug-and-play module (ADAPT) that let embodied agents reason about dynamic, unspecified object affordances, improving task success in complex environments.
Key contributions
- Introduces DynAfford, a benchmark that evaluates embodied agents in dynamic environments where object affordances change over time and are not specified in the instruction.
- Presents ADAPT, a plug-and-play module that augments planners with explicit affordance reasoning.
- Demonstrates that ADAPT significantly improves robustness and task success in both seen and unseen environments.
- Shows that a LoRA-finetuned, task-aligned VLM used for affordance inference outperforms a commercial LLM (GPT-4o).
Why it matters
This paper addresses a critical gap in embodied AI: agents that can adapt to dynamic, real-world conditions in which object affordances are not specified. It introduces a new benchmark and a modular affordance-reasoning system that makes agent behavior markedly more robust. The work also shows that a specialized, task-aligned VLM can outperform a general-purpose commercial LLM (GPT-4o) on this task.
Original Abstract
Intelligent embodied agents should not simply follow instructions, as real-world environments often involve unexpected conditions and exceptions. However, existing methods usually focus on directly executing instructions, without considering whether the target objects can actually be manipulated, meaning they fail to assess available affordances. To address this limitation, we introduce DynAfford, a benchmark that evaluates embodied agents in dynamic environments where object affordances may change over time and are not specified in the instruction. DynAfford requires agents to perceive object states, infer implicit preconditions, and adapt their actions accordingly. To enable this capability, we introduce ADAPT, a plug-and-play module that augments existing planners with explicit affordance reasoning. Experiments demonstrate that incorporating ADAPT significantly improves robustness and task success across both seen and unseen environments. We also show that a domain-adapted, LoRA-finetuned vision-language model used as the affordance inference backend outperforms a commercial LLM (GPT-4o), highlighting the importance of task-aligned affordance grounding.
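To make the plug-and-play idea concrete, here is a minimal sketch of how an affordance-reasoning filter could wrap the output of an existing planner. Every name below (`Action`, `AffordanceChecker`, `plan_with_affordances`) is hypothetical, and the toy lookup stands in for the VLM backend; the paper does not publish ADAPT's actual interface.

```python
# Hedged sketch of a plug-and-play affordance filter around a planner's plan.
# All names are hypothetical; this is not ADAPT's published interface.
from dataclasses import dataclass

@dataclass(frozen=True)  # frozen so Action is hashable and can key the recovery map
class Action:
    verb: str    # e.g. "open"
    target: str  # e.g. "microwave"

class AffordanceChecker:
    """Stand-in for a VLM backend that infers whether an object's current
    state affords an action (here: a toy lookup over perceived states)."""
    def __init__(self, afforded: dict[str, set[str]]):
        self.afforded = afforded  # object -> verbs it currently affords

    def feasible(self, action: Action) -> bool:
        return action.verb in self.afforded.get(action.target, set())

def plan_with_affordances(plan, checker, recover):
    """Insert recovery actions before any step whose implicit precondition
    fails, leaving the base planner itself untouched."""
    amended = []
    for step in plan:
        if not checker.feasible(step):
            amended.extend(recover.get(step, []))  # e.g. unlock before open
        amended.append(step)
    return amended

# Toy run: the microwave is locked, so "open" is not currently afforded.
checker = AffordanceChecker({"microwave": {"unlock"}, "mug": {"put"}})
plan = [Action("open", "microwave"), Action("put", "mug")]
recover = {Action("open", "microwave"): [Action("unlock", "microwave")]}
print(plan_with_affordances(plan, checker, recover))
# prints the amended plan: unlock microwave, open microwave, put mug
```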
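Similarly, the abstract's domain-adapted, LoRA-finetuned VLM backend could be set up along the lines below with Hugging Face transformers and peft. The base model and all LoRA hyperparameters here are assumptions for illustration, not the paper's actual configuration.

```python
# Hedged sketch: attaching LoRA adapters to an off-the-shelf VLM so it can be
# fine-tuned on affordance-inference data. Base model and hyperparameters are
# illustrative assumptions; the paper's configuration is not reproduced here.
from transformers import AutoModelForVision2Seq, AutoProcessor
from peft import LoraConfig, get_peft_model

base = "llava-hf/llava-1.5-7b-hf"  # assumed base VLM
model = AutoModelForVision2Seq.from_pretrained(base)
processor = AutoProcessor.from_pretrained(base)

lora = LoraConfig(
    r=16,                                 # adapter rank (assumed)
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # adapt attention projections only
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)  # freezes base weights, adds adapters
model.print_trainable_parameters()   # typically well under 1% of weights train
```

Fine-tuning only the low-rank adapters keeps the update cheap while aligning the model with task-specific affordance grounding, which is the property the abstract credits for beating GPT-4o.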