ArXiv TLDR

ADAPT: Benchmarking Commonsense Planning under Unspecified Affordance Constraints

arXiv:2604.14902

Pei-An Chen, Yong-Ching Liang, Jia-Fong Yeh, Hung-Ting Su, Yi-Ting Chen + 2 more

cs.AI · cs.CL · cs.CV · cs.RO

TLDR

ADAPT introduces a benchmark and module for embodied agents to reason about dynamic, unspecified object affordances, improving task success in complex environments.

Key contributions

  • Introduces DynAfford, a benchmark for embodied agents in dynamic environments where object affordances change over time and are left unspecified in the instruction.
  • Presents ADAPT, a plug-and-play module that augments existing planners with explicit affordance reasoning.
  • Demonstrates that ADAPT significantly improves robustness and task success in both seen and unseen environments.
  • Shows that a LoRA-finetuned vision-language model used for affordance inference outperforms a commercial LLM (GPT-4o).

Why it matters

This paper addresses a critical gap in embodied AI: agents must adapt to dynamic, real-world conditions in which object affordances are not spelled out in the instruction. It contributes a new benchmark (DynAfford) and a modular affordance reasoning system (ADAPT) that significantly improves agent robustness, and shows that a task-adapted VLM outperforms a general-purpose commercial LLM on affordance inference.

Original Abstract

Intelligent embodied agents should not simply follow instructions, as real-world environments often involve unexpected conditions and exceptions. However, existing methods usually focus on directly executing instructions, without considering whether the target objects can actually be manipulated, meaning they fail to assess available affordances. To address this limitation, we introduce DynAfford, a benchmark that evaluates embodied agents in dynamic environments where object affordances may change over time and are not specified in the instruction. DynAfford requires agents to perceive object states, infer implicit preconditions, and adapt their actions accordingly. To enable this capability, we introduce ADAPT, a plug-and-play module that augments existing planners with explicit affordance reasoning. Experiments demonstrate that incorporating ADAPT significantly improves robustness and task success across both seen and unseen environments. We also show that a domain-adapted, LoRA-finetuned vision-language model used as the affordance inference backend outperforms a commercial LLM (GPT-4o), highlighting the importance of task-aligned affordance grounding.
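To make the abstract's idea concrete, here is a minimal sketch of what a plug-and-play affordance check wrapped around a planner could look like. All names (`ObjectState`, `infer_affordances`, `adapt_plan`) and the rule-based backend are illustrative assumptions for this digest, not the paper's actual API; the paper uses a LoRA-finetuned VLM where this sketch uses hand-written rules.

```python
# Hypothetical sketch: filter/repair a planner's proposed steps using
# inferred affordances, in the spirit of the ADAPT abstract. Names and
# rules are assumptions, not the paper's implementation.
from dataclasses import dataclass

@dataclass
class ObjectState:
    name: str
    properties: dict  # observed state, e.g. {"locked": True, "open": False}

def infer_affordances(obj: ObjectState) -> set:
    """Stand-in for the paper's VLM-based affordance backend.
    Maps the object's current state to the actions it currently affords."""
    if obj.properties.get("locked"):
        return {"unlock"}                      # a locked door must be unlocked first
    if obj.properties.get("open"):
        return {"close", "pass_through"}
    return {"open"}

def adapt_plan(plan, world):
    """Before each (action, target) step, check inferred affordances and
    insert a precondition-satisfying step when the action is not afforded."""
    repaired = []
    for action, target in plan:
        valid = infer_affordances(world[target])
        if action not in valid and "unlock" in valid and action == "open":
            repaired.append(("unlock", target))        # satisfy implicit precondition
            world[target].properties["locked"] = False  # update believed state
        repaired.append((action, target))
    return repaired

world = {"door": ObjectState("door", {"locked": True, "open": False})}
plan = [("open", "door"), ("pass_through", "door")]
print(adapt_plan(plan, world))
# → [('unlock', 'door'), ('open', 'door'), ('pass_through', 'door')]
```

The key design point the abstract emphasizes is modularity: the affordance backend is queried per step, so it can be swapped (rules, GPT-4o, or a finetuned VLM) without changing the planner itself.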
