ArXiv TLDR

When Relations Break: Analyzing Relation Hallucination in Vision-Language Model Under Rotation and Noise

arXiv: 2605.05045

Philip Wootaek Shin, Ajay Narayanan Sridhar, Sivani Devarapalli, Rui Zhang, Jack Sampson + 1 more

cs.CV, cs.CL

TLDR

Visual perturbations such as rotation and noise significantly degrade vision-language models' relational reasoning, exposing a gap between their perceptual robustness and their relational understanding.

Key contributions

  • Visual perturbations (rotation, noise) severely degrade VLM relational reasoning (a minimal probing sketch follows this list).
  • Even mild distortions cause significant relation hallucination across models and datasets.
  • Prompt-based augmentation and preprocessing strategies offer only partial improvements.
  • Highlights a critical gap between VLM perceptual robustness and relational understanding.
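The first two points describe the probing setup at a high level: perturb an image geometrically or photometrically and check whether the model's answer about an unchanged object relation flips. Below is a minimal sketch of that setup assuming a PIL/NumPy pipeline; the image path, rotation angles, noise level, question wording, and the `vlm.generate` call are illustrative placeholders, not the paper's code.

```python
# Hedged sketch: apply the two perturbation types from the paper (rotation,
# additive noise) to one image, then pose the same relation question to a VLM
# for each variant. A flipped answer on a perturbed copy indicates relation
# hallucination, since the underlying scene relation never changed.
import numpy as np
from PIL import Image


def rotate(img: Image.Image, degrees: float) -> Image.Image:
    # Rotate around the image center; expand=True keeps the full frame visible.
    return img.rotate(degrees, expand=True)


def add_gaussian_noise(img: Image.Image, sigma: float = 25.0) -> Image.Image:
    # Add zero-mean Gaussian noise in pixel space and clip back to [0, 255].
    arr = np.asarray(img).astype(np.float32)
    noisy = arr + np.random.normal(0.0, sigma, arr.shape)
    return Image.fromarray(np.clip(noisy, 0, 255).astype(np.uint8))


image = Image.open("scene.jpg")  # hypothetical input with two related objects
variants = {
    "clean": image,
    "rot_15": rotate(image, 15),          # a "mild" distortion
    "rot_90": rotate(image, 90),
    "noise_25": add_gaussian_noise(image, 25.0),
}
question = "Is the cup on top of the table? Answer yes or no."  # relation query

for name, img in variants.items():
    # answer = vlm.generate(image=img, prompt=question)  # placeholder VLM call
    # Compare answers across variants: disagreement with the clean image
    # signals relation hallucination under perturbation.
    pass
```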

Why it matters

This paper reveals a critical vulnerability in vision-language models: they fail to maintain relational understanding under common visual distortions such as rotation and noise. It underscores the need to develop more robust, geometry-aware VLMs as a prerequisite for reliable multimodal AI systems.

Original Abstract

Vision-language models (VLMs) achieve strong multimodal performance but remain prone to relation hallucination, which requires accurate reasoning over inter-object interactions. We study the impact of visual perturbations, specifically rotation and noise, and show that even mild distortions significantly degrade relational reasoning across models and datasets. We further evaluate prompt-based augmentation and preprocessing strategies (orientation correction and denoising), finding that while they offer partial improvements, they do not fully resolve hallucinations. Our results reveal a gap between perceptual robustness and relational understanding, highlighting the need for more robust, geometry-aware VLMs.
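The abstract names two mitigation families that help only partially: prompt-based augmentation and input preprocessing (orientation correction and denoising). The following is a hedged sketch of what these could look like in practice; the hint wording, the externally supplied angle estimate, and the median filter are illustrative assumptions, not the paper's actual strategies.

```python
# Hedged sketch of the two mitigation families evaluated in the abstract:
# (1) a prompt-based cue warning the model about possible distortion, and
# (2) preprocessing that undoes it (orientation correction + denoising).
from PIL import Image, ImageFilter

ROBUSTNESS_HINT = (
    "The image may be rotated or contain noise. "
    "Judge the spatial relation between the objects themselves, "
    "not their position on the screen. "
)


def augment_prompt(question: str) -> str:
    # Prompt-based augmentation: prepend the robustness cue to the relation question.
    return ROBUSTNESS_HINT + question


def preprocess(img: Image.Image, estimated_angle: float = 0.0) -> Image.Image:
    # Orientation correction: rotate back by an (assumed known) angle estimate.
    corrected = img.rotate(-estimated_angle, expand=True)
    # Denoising: a median filter as a simple stand-in for a learned denoiser.
    return corrected.filter(ImageFilter.MedianFilter(size=3))


# Usage (the model call is a placeholder):
# answer = vlm.generate(image=preprocess(perturbed_img, estimated_angle=90),
#                       prompt=augment_prompt("Is the cup on the table?"))
```

Per the abstract, both strategies improve results only partially, which is what motivates the call for geometry-aware VLMs rather than post-hoc fixes.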
