ArXiv TLDR

CompassAD: Intent-Driven 3D Affordance Grounding in Functionally Competing Objects

2604.02060

Jingliang Li, Jindou Jia, Tuo An, Chuhao Zhou, Xiangyu Chen + 5 more

cs.CV, cs.RO

TLDR

The paper introduces CompassAD, a benchmark, and CompassNet, a framework, for intent-driven 3D affordance grounding in cluttered scenes with functionally competing objects.

Key contributions

  • Introduces a new 3D affordance grounding setting for intent-driven instructions in multi-object scenes.
  • Presents CompassAD, the first benchmark for implicit intent in confusable multi-object scenes.
  • Proposes CompassNet, a framework with Instance-bounded Cross Injection (ICI) and Bi-level Contrastive Refinement (BCR) modules for robust affordance grounding.
  • Reports state-of-the-art results on both seen and unseen queries and demonstrates transfer to real-world robotic grasping.

Why it matters

Existing 3D affordance methods mostly evaluate isolated single objects, often with the object category named in the query. This paper targets the harder case in which a robot must pick the right object among functionally competing ones in a cluttered scene, guided only by implicit intent. That is a step toward more context-aware robotic manipulation in complex real-world environments.

Original Abstract

When told to "cut the apple," a robot must choose the knife over nearby scissors, despite both objects affording the same cutting function. In real-world scenes, multiple objects may share identical affordances, yet only one is appropriate under the given task context. We call such cases confusing pairs. However, existing 3D affordance methods largely sidestep this challenge by evaluating isolated single objects, often with explicit category names provided in the query. We formalize Multi-Object Affordance Grounding under Intent-Driven Instructions, a new 3D affordance setting that requires predicting a per-point affordance mask on the correct object within a cluttered multi-object point cloud, conditioned on implicit natural language intent. To study this problem, we construct CompassAD, the first benchmark centered on implicit intent in confusable multi-object scenes. It comprises 30 confusing object pairs spanning 16 affordance types, 6,422 scenes, and 88K+ query-answer pairs. Furthermore, we propose CompassNet, a framework that incorporates two dedicated modules tailored to this task. Instance-bounded Cross Injection (ICI) constrains language-geometry alignment within object boundaries to prevent cross-object semantic leakage. Bi-level Contrastive Refinement (BCR) enforces discrimination at both geometric-group and point levels, sharpening distinctions between target and confusable surfaces. Extensive experiments demonstrate state-of-the-art results on both seen and unseen queries, and deployment on a robotic manipulator confirms effective transfer to real-world grasping in confusing multi-object scenes.
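The abstract says ICI keeps language-geometry alignment inside object boundaries. The following is a minimal NumPy sketch of that general idea only, not the paper's implementation: the function name `instance_bounded_injection`, the dot-product scoring, and the per-instance softmax are all illustrative assumptions. It normalizes language-to-point weights separately within each object instance, so points on one object cannot influence how the instruction is distributed over another.

```python
import numpy as np

def instance_bounded_injection(point_feats, instance_ids, text_feat):
    """Hypothetical sketch: add a text feature to per-point features,
    with attention weights normalized inside each instance only.

    point_feats:  (N, D) per-point features
    instance_ids: (N,)   integer instance label for every point
    text_feat:    (D,)   pooled feature of the instruction
    """
    scores = point_feats @ text_feat          # (N,) point-text similarity
    fused = point_feats.copy()
    weights = np.zeros(len(point_feats))
    for inst in np.unique(instance_ids):
        mask = instance_ids == inst
        s = scores[mask]
        w = np.exp(s - s.max())               # numerically stable softmax
        w /= w.sum()                          # normalized within this instance
        weights[mask] = w
        fused[mask] = point_feats[mask] + w[:, None] * text_feat
    return fused, weights

# Toy scene: 6 points, two instances (e.g. a knife and scissors), 4-dim features.
rng = np.random.default_rng(0)
point_feats = rng.normal(size=(6, 4))
instance_ids = np.array([0, 0, 0, 1, 1, 1])
text_feat = rng.normal(size=4)

fused, weights = instance_bounded_injection(point_feats, instance_ids, text_feat)
```

Because the softmax is computed per instance, changing the features of the second object leaves the fused features of the first object untouched. A single softmax over all points would not have that property.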
