ArXiv TLDR

AffordSim: A Scalable Data Generator and Benchmark for Affordance-Aware Robotic Manipulation

arXiv:2604.11674

Mingyang Li, Haofan Xu, Haowen Sun, Xinzhe Chen, Sihua Ren + 7 more

cs.RO, cs.AI

TLDR

AffordSim is a simulation framework that generates affordance-aware robotic manipulation data, pairing open-vocabulary 3D affordance prediction with automated task generation so that trajectories interact with the semantically correct parts of objects.

Key contributions

  • Introduces AffordSim, a simulation framework for generating affordance-aware robotic manipulation data.
  • Utilizes VoxAfford, an open-vocabulary 3D affordance detector, to guide grasp pose estimation.
  • Features VLM-powered task generation and domain randomization based on DA3-driven 3D Gaussian reconstruction from real photographs, enabling scalable data generation.
  • Establishes a 50-task benchmark, revealing challenges in affordance-demanding tasks for baselines.

Why it matters

Robotic manipulation policies often struggle with tasks requiring precise interaction with object parts. AffordSim addresses this by providing a scalable way to generate data that incorporates object affordances, leading to more robust and capable robots. This work highlights critical gaps in current imitation learning for complex, affordance-aware tasks.

Original Abstract

Simulation-based data generation has become a dominant paradigm for training robotic manipulation policies, yet existing platforms do not incorporate object affordance information into trajectory generation. As a result, tasks requiring precise interaction with specific functional regions (grasping a mug by its handle, pouring from a cup's rim, or hanging a mug on a hook) cannot be automatically generated with semantically correct trajectories. We introduce AffordSim, the first simulation framework that integrates open-vocabulary 3D affordance prediction into the manipulation data generation pipeline. AffordSim uses our VoxAfford model, an open-vocabulary 3D affordance detector that enhances MLLM output tokens with multi-scale geometric features, to predict affordance maps on object point clouds, guiding grasp pose estimation toward task-relevant functional regions. Built on NVIDIA Isaac Sim with cross-embodiment support (Franka FR3, Panda, UR5e, Kinova), VLM-powered task generation, and novel domain randomization using DA3-based 3D Gaussian reconstruction from real photographs, AffordSim enables automated, scalable generation of affordance-aware manipulation data. We establish a benchmark of 50 tasks across 7 categories (grasping, placing, stacking, pushing/pulling, pouring, mug hanging, long-horizon composite) and evaluate 4 imitation learning baselines (BC, Diffusion Policy, ACT, Pi 0.5). Our results reveal that while grasping is largely solved (53-93% success), affordance-demanding tasks such as pouring into narrow containers (1-43%) and mug hanging (0-47%) remain significantly more challenging for current imitation learning methods, highlighting the need for affordance-aware data generation. Zero-shot sim-to-real experiments on a real Franka FR3 validate the transferability of the generated data.
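The abstract's core mechanism, using per-point affordance scores to bias grasp pose selection toward functional regions, can be illustrated with a minimal sketch. The paper summary does not expose VoxAfford's actual interface, so the function, its parameters, and the neighborhood-averaging heuristic below are all illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def select_grasp(points, affordance, candidates, radius=0.05):
    """Rank candidate grasp positions by the mean affordance score of
    nearby object points, and return the index of the best candidate.

    points     : (N, 3) object point cloud
    affordance : (N,) per-point affordance scores in [0, 1]
    candidates : (M, 3) candidate grasp positions
    radius     : neighborhood radius (same units as the point cloud)
    """
    scores = []
    for c in candidates:
        # Distance from this candidate to every point in the cloud.
        dist = np.linalg.norm(points - c, axis=1)
        mask = dist < radius
        # Candidates with no nearby points score zero.
        scores.append(affordance[mask].mean() if mask.any() else 0.0)
    return int(np.argmax(scores))

# Toy example: a "body" cluster (low affordance) and a "handle"
# cluster (high affordance); the grasp near the handle should win.
points = np.array([[0.0, 0, 0], [0.01, 0, 0], [1.0, 0, 0], [1.01, 0, 0]])
affordance = np.array([0.0, 0.0, 1.0, 1.0])
candidates = np.array([[0.0, 0, 0], [1.0, 0, 0]])
best = select_grasp(points, affordance, candidates)  # → 1 (the handle)
```

In the actual pipeline, the affordance map would come from VoxAfford's open-vocabulary predictions and the candidates from a grasp pose estimator; this sketch only shows how a per-point affordance map can re-rank geometrically valid grasps.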
