Toward Visually Realistic Simulation: A Benchmark for Evaluating Robot Manipulation in Simulation
Yixin Zhu, Zixiong Wang, Jian Yang, Jin Xie, Jingyi Yu + 2 more
TL;DR
VISER is a visually realistic benchmark for evaluating robot manipulation policies in simulation; its high-fidelity PBR assets narrow the sim-to-real visual gap, and its scores correlate strongly (average Pearson r = 0.92) with real-world performance.
Key contributions
- Introduces VISER, a visually realistic benchmark for robot manipulation to reduce the sim-to-real visual gap.
- Features a high-fidelity dataset of 1,000+ 3D assets with physically-based rendering (PBR) materials.
- Proposes an MLLM-powered pipeline for scalable generation of physically plausible assets and scenes.
- Enables diverse evaluation tasks (grasping, placing, long-horizon) and achieves an average sim-to-real Pearson correlation of 0.92 across policies.
Why it matters
This paper addresses a critical issue in robotics: simulation-based evaluation becomes unreliable when a visual domain gap separates simulation from reality. VISER provides an evaluation tool whose scores correlate strongly with real-world performance, helping accelerate the development and deployment of reliable robot manipulation systems.
Original Abstract
Reliable simulation evaluation of robot manipulation policies serves as a high-fidelity proxy for real-world performance. Although existing benchmarks cover a wide range of task categories, they lack visual realism, creating a large domain gap between simulation and reality. This undermines the reliability of simulation-based evaluation in predicting real-world performance. To mitigate the sim-to-real visual gap, we conduct a systematic analysis to isolate the effects of lighting and material. Our results show that these factors play a critical role in geometric reasoning and spatial grounding, yet are largely overlooked in existing benchmarks. Motivated by the analysis, we propose VISER, a visually realistic benchmark for evaluating robot manipulation in simulation. VISER features a high-fidelity dataset of over 1,000 3D assets with physically-based rendering (PBR) materials, along with 3D scenes created from these assets through curated layouts or generation. To this end, we propose an automated pipeline leveraging Multi-modal Large Language Models (MLLMs) for material-aware part segmentation and material retrieval, enabling scalable generation of physically plausible assets. Building on the high-fidelity 3D asset dataset, we construct diverse evaluation tasks, such as grasping, placing, and long-horizon tasks, enabling scalable and reproducible assessment of Vision-Language-Action (VLA) models. Our benchmark shows a strong correlation between simulation and real-world performance, achieving an average Pearson correlation coefficient of 0.92 across different policies.
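The abstract describes the asset pipeline only at a high level: an MLLM performs material-aware part segmentation, and a retrieval step maps each predicted material class to a PBR material. The sketch below shows one plausible way those two steps could be wired together; every name in it (`query_mllm`, `render_views`, `MATERIAL_LIBRARY`, the canned responses) is a hypothetical stand-in for illustration, not an API or data from the paper.

```python
# Minimal sketch of a material-aware asset pipeline in the style the abstract
# describes: (1) MLLM part segmentation with material labels, then
# (2) retrieval of a matching PBR material for each part.
# All helpers and data here are hypothetical stand-ins, not from VISER.
from dataclasses import dataclass

# Hypothetical PBR material library keyed by material class.
MATERIAL_LIBRARY = {
    "glazed ceramic": "pbr/ceramic_glazed_01",
    "brushed metal": "pbr/metal_brushed_03",
    "matte plastic": "pbr/plastic_matte_02",
}

@dataclass
class PartMaterial:
    part_name: str        # e.g. "mug handle"
    material_class: str   # MLLM-predicted real-world material
    pbr_material_id: str  # retrieved entry from the PBR library

def render_views(asset_mesh) -> list:
    """Stand-in for a multi-view renderer; a real pipeline would render the
    mesh from several camera poses for the MLLM to inspect."""
    return []

def query_mllm(prompt: str, images: list) -> list[dict]:
    """Stand-in for a vision-language model call. Returns a canned response
    so the sketch runs end to end."""
    return [
        {"part": "cup body", "material": "glazed ceramic"},
        {"part": "handle", "material": "glazed ceramic"},
    ]

def assign_pbr_materials(asset_mesh) -> list[PartMaterial]:
    views = render_views(asset_mesh)
    # Step 1: material-aware part segmentation via the MLLM.
    parts = query_mllm(
        "Segment this object into parts and name each part's material.", views
    )
    # Step 2: retrieve the closest PBR material for each predicted class.
    return [
        PartMaterial(p["part"], p["material"],
                     MATERIAL_LIBRARY.get(p["material"], "pbr/default"))
        for p in parts
    ]

print(assign_pbr_materials(asset_mesh=None))
```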
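The headline 0.92 figure is an average Pearson correlation between simulated and real-world performance across policies. For readers who want the computation spelled out, here is the standard Pearson coefficient applied to paired sim/real success rates; the rates below are invented for illustration and are not results from the paper.

```python
import math

def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient:
    r = sum((x-mx)(y-my)) / sqrt(sum((x-mx)^2) * sum((y-my)^2))"""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Invented per-task success rates for one policy (sim vs. real):
sim_success  = [0.80, 0.55, 0.30, 0.70, 0.45]
real_success = [0.75, 0.50, 0.25, 0.72, 0.40]
print(f"sim-to-real Pearson r = {pearson(sim_success, real_success):.2f}")
```

An r close to 1 means that rankings and success rates measured in simulation transfer to the real robot, which is exactly the property a proxy benchmark needs.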