Senses Wide Shut: A Representation-Action Gap in Omnimodal LLMs
Trung Nguyen Quang, Yiming Gao, Fanyi Pu, Kaichen Zhang, Shuo Sun, et al.
TLDR
Omnimodal LLMs struggle to reject false textual claims contradicting sensory input, revealing a "Representation-Action Gap" in grounding.
Key contributions
- Introduces IMAVB, a 500-clip benchmark for testing conflict detection in omnimodal LLMs.
- Discovers a "Representation-Action Gap": models encode mismatches but fail to reject false claims.
- Identifies two failure modes: under-rejection (accepting false premises) and over-rejection (rejecting valid questions).
- Proposes Probe-Guided Logit Adjustment (PGLA) to improve rejection behavior by re-injecting mismatch signals.
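The PGLA idea in the last bullet can be sketched in a few lines: a linear probe reads the mismatch signal already present in the hidden state, and its confidence is used to boost the logits of rejection tokens at decoding time. This is a minimal illustrative sketch, not the paper's implementation; the probe parameters (`probe_w`, `probe_b`), the boost strength `alpha`, and the `reject_ids` token set are all hypothetical names chosen here for illustration.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def pgla_adjust(logits, hidden, probe_w, probe_b, reject_ids, alpha=5.0):
    """Probe-guided logit adjustment (illustrative sketch).

    A linear probe on the hidden state estimates the probability that the
    textual premise conflicts with the sensory input; that probability
    scales an additive boost on the logits of rejection tokens.
    All names and the exact adjustment rule are assumptions, not the
    paper's published method.
    """
    # Probe confidence that premise and perception mismatch (sigmoid of a
    # linear readout of the hidden state).
    p_mismatch = 1.0 / (1.0 + np.exp(-(hidden @ probe_w + probe_b)))
    # Re-inject the encoded mismatch signal into decoding by boosting
    # rejection-token logits in proportion to probe confidence.
    adjusted = logits.copy()
    adjusted[reject_ids] += alpha * p_mismatch
    return adjusted, p_mismatch
```

When the probe fires (hidden state encodes a mismatch), probability mass shifts toward rejection tokens; when it is near zero, decoding is essentially unchanged, which is how such a scheme could curb over-rejection on standard questions.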
Why it matters
This paper highlights a critical flaw in omnimodal LLMs: their inability to act on internal knowledge of sensory contradictions. It reveals that models "know" when text conflicts with perception but fail to translate this into correct rejection behavior. This suggests future work should focus on improving the translation layer rather than just perception.
Original Abstract
When an omnimodal large language model accepts a question whose textual premise contradicts what it actually sees or hears, does the failure lie in perception or in action? Recent omnimodal models are positioned as perception-grounded agents that jointly process video, audio, and text, yet a basic form of grounding remains untested: catching a textual claim that conflicts with the model's own sensory input. We introduce IMAVB, a curated 500-clip benchmark of long-form movies with a 2x2 design crossing target modality (vision, audio) and premise condition (standard, misleading), which lets us measure conflict detection separately from ordinary multimodal comprehension. Across eight open-source omnimodal LLMs and Gemini 3.1 Pro, we document a Representation-Action Gap: hidden states reliably encode premise-perception mismatches even when the same models almost never reject the false claim in their outputs. Behaviorally, models fall into two failure modes: under-rejection, in which they answer misleading questions as if the false premise were true; and over-rejection, in which they reject more often but also reject standard questions, sacrificing ordinary comprehension accuracy. The gap is modality-asymmetric (audio grounding underperforms vision) and prompt-resistant across seven variants. As an initial diagnostic intervention, a probe-guided logit adjustment (PGLA) re-injects the encoded mismatch signal into decoding and consistently improves rejection behavior. Together, these results suggest the bottleneck for omnimodal grounding lies in translation, not perception.