Beyond Localization: A Comprehensive Diagnosis of Perspective-Conditioned Spatial Reasoning in MLLMs from Omnidirectional Images
Yuangong Chen, Wai Keung Wong, Jiaxing Li, Ioannis Patras, Xu Zheng
TLDR
MLLMs struggle with viewpoint-dependent spatial reasoning; a new benchmark, PCSR-Bench, reveals a significant perception-reasoning gap.
Key contributions
- Introduces PCSR-Bench, a diagnostic benchmark for Perspective-Conditioned Spatial Reasoning (PCSR) in MLLMs using 360-degree images.
- PCSR-Bench includes 8 tasks, from foundational perception to advanced spatial reasoning like egocentric rotation and compositional chains.
- Evaluations of 14 MLLMs reveal a substantial perception-reasoning gap: accuracy falls from 57.59% on foundational relative direction to 0.64% on open-ended compositional reasoning (a hedged evaluation sketch follows this list).
- An RL-based diagnostic study shows that targeted optimization can partially recover PCSR, indicating "partial plasticity", though the gains are task-selective.
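To make the per-task evaluation concrete, here is a minimal scoring-harness sketch. The item schema, task identifiers, and `model.answer` interface are illustrative assumptions, not the released PCSR-Bench format; only the eight task names follow the paper's description.

```python
from collections import defaultdict
from dataclasses import dataclass

# Task names follow the paper's description; the schema and model
# interface below are assumptions for illustration, not the real format.
TASKS = [
    "object_counting", "relative_distance", "relative_direction",    # foundational
    "compositional_chain", "egocentric_rotation",                    # advanced PCSR
    "perspective_reanchoring", "ego_distortion", "limited_fov_visibility",
]

@dataclass
class PCSRItem:
    image_path: str  # one of the 2,600 omnidirectional (360-degree) images
    task: str        # one of TASKS
    question: str
    answer: str      # gold answer (choice label or free-form string)

def per_task_accuracy(model, items: list[PCSRItem]) -> dict[str, float]:
    """Score a model separately per task to expose the perception-reasoning gap."""
    hits, totals = defaultdict(int), defaultdict(int)
    for item in items:
        pred = model.answer(item.image_path, item.question)  # assumed MLLM call
        hits[item.task] += int(pred.strip().lower() == item.answer.strip().lower())
        totals[item.task] += 1
    return {task: hits[task] / totals[task] for task in totals}
```

Reporting accuracy per task rather than in aggregate is what surfaces the gap: a single pooled score would hide the collapse from foundational perception to advanced PCSR.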
Why it matters
MLLMs show strong visual perception but struggle with viewpoint-dependent spatial reasoning. This paper introduces PCSR-Bench, a diagnostic benchmark for this limitation built on 360-degree images. Evaluation reveals a significant gap in current models: while targeted optimization recovers some of this ability, PCSR remains a major bottleneck for advanced spatial understanding.
Original Abstract
Multimodal Large Language Models (MLLMs) show strong visual perception, yet remain limited in reasoning about space under changing viewpoints. We study this challenge as Perspective-Conditioned Spatial Reasoning (PCSR) in 360-degree omnidirectional images, where broad scene coverage reduces ambiguity from partial observations without eliminating the need for viewpoint-dependent inference. To assess this capability, we introduce PCSR-Bench, a diagnostic benchmark of 84,373 question-answer pairs from 2,600 omnidirectional images across 26 indoor environments. PCSR-Bench contains eight tasks spanning foundational perception (e.g., object counting, relative distance, and relative direction) and advanced PCSR, including compositional chains, egocentric rotation, perspective re-anchoring, ego-distortion, and limited-FOV visibility. We evaluate 14 representative MLLMs and observe a substantial perception-reasoning gap: accuracy reaches 57.59% on foundational relative direction, but drops to 13.49% on egocentric rotation, 7.13% on egocentric distortion, and 0.64% on open-ended compositional reasoning. To probe the plasticity of this gap, we conduct an RL-based diagnostic study on a 7B-scale model. Reward shaping improves a matched 7B baseline from 31.10% to 60.06% under a controlled setting, suggesting that PCSR exhibits partial plasticity rather than being fully immutable. Still, the gains are task-selective, sensitive to reward design, including both weight allocation and reward formulation, and partially dependent on the evaluation protocol. These results position PCSR as a key bottleneck in current MLLMs and highlight limited but meaningful room for recovery under targeted optimization.
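The abstract's RL diagnostic hinges on reward shaping, and its results are sensitive to both weight allocation and reward formulation. The sketch below shows one plausible shaped reward combining an answer-accuracy term with a format term; the weight split, the `"Answer:"` output convention, and exact-match scoring are assumptions for illustration, not the paper's actual design.

```python
def shaped_reward(pred: str, gold: str,
                  w_acc: float = 0.8, w_fmt: float = 0.2) -> float:
    """Illustrative shaped reward for RL fine-tuning on a PCSR task.

    The abstract notes that gains are sensitive to exactly these choices:
    how the weights are allocated and how each reward term is formulated.
    """
    fmt_ok = "Answer:" in pred                       # assumed output-format check
    answer = pred.split("Answer:", 1)[-1].strip().lower() if fmt_ok else ""
    acc = float(answer == gold.strip().lower())      # exact-match accuracy term
    return w_acc * acc + w_fmt * float(fmt_ok)
```

For example, `shaped_reward("Answer: left", "Left")` returns 1.0, a well-formatted wrong answer earns 0.2, and a correct answer without the expected format earns 0.0; shifting `w_acc`/`w_fmt` or swapping exact match for a softer formulation changes what the policy is optimized toward, which is one way the task-selective sensitivity the abstract reports could arise.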