ArXiv TLDR

Don't Show Pixels, Show Cues: Unlocking Visual Tool Reasoning in Language Models via Perception Programs

arXiv: 2604.12896

Muhammad Kamran Janjua, Hugo Silva, Di Niu, Bahador Rashidi

cs.CV, cs.LG

TLDR

Perception Programs (P$^2$) improve MLLM visual reasoning by converting raw vision tool outputs into compact, language-native summaries, yielding a 22% average accuracy gain across perception-centric tasks.

Key contributions

  • MLLMs often fail to benefit from raw pixel-level vision tool outputs due to representational misalignment.
  • Introduces Perception Programs (P$^2$), a training-free, model-agnostic method for MLLM visual reasoning.
  • P$^2$ converts dense tool outputs into compact, structured, language-native summaries for MLLMs.
  • Achieves a 22% average gain across tasks and lifts GPT-5 Mini from 41.35% to 86.47% on multi-view reasoning.

Why it matters

Current MLLMs struggle to benefit from vision tools because raw, pixel-level tool outputs are misaligned with their language-native reasoning. P$^2$ offers a training-free, model-agnostic fix: it translates visual cues into a structured, language-native format the model can parse directly. This substantially improves visual reasoning across models of different sizes and sets new state-of-the-art results without any training or model modifications.
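The core move is representational: instead of handing the MLLM a dense depth or flow map, the tool output is rewritten into a short textual cue the model can read. Below is a minimal sketch of that idea for a depth cue; the function name, the assumed tool outputs (a depth map plus detector boxes), and the summary format are illustrative assumptions, not the paper's actual implementation.

```python
# Illustrative sketch only: summarize_depth(), the input format, and the summary
# wording are assumptions, not the paper's released code.
import numpy as np

def summarize_depth(depth_map: np.ndarray,
                    boxes: dict[str, tuple[int, int, int, int]]) -> str:
    """Rewrite a dense depth map into a compact, language-native summary.

    depth_map: HxW array from a monocular-depth tool (assumed: larger = farther).
    boxes: object name -> (x1, y1, x2, y2) pixel box from a detector.
    """
    lines = []
    for name, (x1, y1, x2, y2) in boxes.items():
        region = depth_map[y1:y2, x1:x2]
        lines.append(f"{name}: median depth {float(np.median(region)):.2f}")
    # Order objects from nearest to farthest so the relative-depth cue is explicit.
    lines.sort(key=lambda s: float(s.rsplit(" ", 1)[-1]))
    return "Relative depth (nearest first): " + "; ".join(lines)

# The summary string, not the raw depth map, is appended to the MLLM prompt, e.g.
# "Relative depth (nearest first): cup: median depth 0.42; chair: median depth 1.87"
```

The point of the sketch is the interface, not the heuristic: the MLLM receives a few tokens of structured text it can reason over instead of a pixel grid it parses poorly.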

Original Abstract

Multimodal language models (MLLMs) are increasingly paired with vision tools (e.g., depth, flow, correspondence) to enhance visual reasoning. However, despite access to these tool-generated visual cues, MLLMs often fail to benefit from them. Existing approaches typically feed raw tool outputs into the model, but these dense, pixel-level representations are misaligned with the language-native reasoning strengths of LLMs, leading to weak perception and reliance on language priors. We argue that, in problems where vision tools can provide the necessary visual cues, the bottleneck is not more tool calls or larger MLLMs, it is how tool outputs are represented. We introduce Perception Programs (P$^2$), a training-free, model-agnostic method that rewrites tool outputs into compact, structured, language-native summaries that MLLMs can directly parse and reason over. Across six perception-centric tasks in BLINK, P$^2$ consistently yields large improvements over base models and raw tool-augmented baselines. With GPT-5 Mini as the base model, P$^2$ raises its accuracy from 41.35% to 86.47% on multi-view reasoning, from 52.42% to 81.45% on relative depth, and achieves a 22% average gain across tasks, setting new state-of-the-art results. Even on smaller MLLMs, e.g., InternVL3.5-4B and Qwen3VL-4B, we observe 15-40% absolute gains from P$^2$, surpassing prior agentic, supervised, and RL-based tool-use methods, without any training or model modifications.
