The Expense of Seeing: Attaining Trustworthy Multimodal Reasoning Within the Monolithic Paradigm
TLDR
This paper reveals "functional blindness" in VLMs and introduces the Modality Translation Protocol, along with new metrics (ToS, CoS, FoS), to ground trustworthy multimodal reasoning.
Key contributions
- Reveals "functional blindness" in VLMs, where models exploit strong language priors to bypass severe visual representation bottlenecks.
- Proposes the Modality Translation Protocol (MTP) and three novel metrics, the Toll (ToS), Curse (CoS), and Fallacy (FoS) of Seeing, to quantify how much visual knowledge actually reaches the model.
- Introduces the Semantic Sufficiency Criterion (SSC) and the Divergence Law of Multimodal Scaling.
- Advocates for using SSC as an architectural blueprint for trustworthy multimodal AI systems.
Why it matters
This paper critically examines the trustworthiness of current Vision-Language Models, revealing their "functional blindness" to visual data. It introduces a novel evaluation protocol and metrics to quantify these limitations, advocating for a new architectural paradigm. This is crucial for developing reliable multimodal AI that truly grounds reasoning in visual input.
Original Abstract
The rapid proliferation of Vision-Language Models (VLMs) is widely celebrated as the dawn of unified multimodal knowledge discovery, but its foundation operates on a dangerous, unquestioned axiom: that current VLMs faithfully synthesise multimodal data. We argue they do not. Instead, a profound crisis of trustworthiness underlies the dominant Vision Encoder-Projector-LLM paradigm. Rather than extracting grounded knowledge from visual inputs, state-of-the-art models frequently exhibit functional blindness, i.e., exploiting strong language priors to bypass severe visual representation bottlenecks. In this work, we challenge the conventional methodology of multimodal evaluation, which relies on data ablation or new dataset creation and therefore fatally conflates dataset biases with architectural incapacity. We propose a radical, information-theoretic departure: the Modality Translation Protocol, designed to quantifiably unmask the Expense of Seeing. By translating semantic payloads rather than ablating them, we formulate three novel metrics -- the Toll (ToS), Curse (CoS), and Fallacy (FoS) of Seeing -- culminating in the Semantic Sufficiency Criterion (SSC). Furthermore, we posit a provocative Divergence Law of Multimodal Scaling, hypothesising that as the underlying language engines scale to unprecedented reasoning capabilities, the mathematical penalty of the visual knowledge bottleneck paradoxically increases. We challenge the KDD community to abandon the illusory pursuit of "multimodal gain". By elevating the SSC from a passive diagnostic constraint to an active architectural blueprint, we provide the rigorous, trustworthy foundation required to force the next generation of AI systems to truly see the data, achieving true multimodal reasoning.
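The abstract names the three metrics but this digest gives no formal definitions. The sketch below is a minimal, purely illustrative reading of how such quantities might be computed under the Modality Translation Protocol, assuming ToS, CoS, and FoS reduce to accuracy gaps between three evaluation conditions: the original image, a faithful textual translation of its semantic payload, and a blind (language-only) baseline. Every name, signature, and number here is a hypothetical stand-in, not the paper's actual formulation.

```python
"""Illustrative sketch only: ToS/CoS/FoS are assumed here to be score deltas
between three conditions of the Modality Translation Protocol. All names
and definitions below are hypothetical."""

from dataclasses import dataclass


@dataclass
class ProtocolScores:
    """Task accuracy under three assumed evaluation conditions."""
    with_image: float        # image fed through the vision encoder
    with_translation: float  # same semantics delivered as text
    blind: float             # language-only baseline, no visual payload


def expense_of_seeing(s: ProtocolScores) -> dict[str, float]:
    """Assumed stand-ins for the Toll, Curse, and Fallacy of Seeing.

    - ToS: accuracy lost by routing semantics through the visual pathway
      instead of text (translation minus image).
    - CoS: how little the image adds over the blind baseline (image minus
      blind); a value near zero suggests functional blindness.
    - FoS: apparent "multimodal gain" that the textual translation alone
      already explains (translation minus blind).
    """
    return {
        "ToS": round(s.with_translation - s.with_image, 3),
        "CoS": round(s.with_image - s.blind, 3),
        "FoS": round(s.with_translation - s.blind, 3),
    }


if __name__ == "__main__":
    # Hypothetical numbers: the model barely improves over its blind
    # baseline given the image, yet jumps when the same semantic content
    # arrives as text, i.e., a large Toll and a near-zero Curse.
    scores = ProtocolScores(with_image=0.62, with_translation=0.81, blind=0.58)
    print(expense_of_seeing(scores))  # {'ToS': 0.19, 'CoS': 0.04, 'FoS': 0.23}
```

On this reading, the Semantic Sufficiency Criterion would ask whether the visual pathway preserves enough of the translated payload that ToS stays small; how the paper formally defines that threshold is not stated in this digest.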