Persistent Visual Memory: Sustaining Perception for Deep Generation in LVLMs
Siyuan Huang, Xiaoye Qu, Yafu Li, Tong Zhu, Zefeng He, et al.
TLDR
Persistent Visual Memory (PVM) is a lightweight module for LVLMs that counteracts visual signal dilution, ensuring sustained perception during deep generation.
Key contributions
- PVM counteracts "Visual Signal Dilution" in LVLMs, where visual attention decays with text length.
- Introduces Persistent Visual Memory (PVM), a lightweight module for sustained, on-demand visual perception.
- PVM integrates as a parallel FFN branch, providing direct visual embeddings to mitigate signal suppression.
- Achieves consistent accuracy gains on Qwen3-VL models, improving complex reasoning tasks.
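The parallel-branch integration described above can be sketched in a minimal form. Everything below is an illustrative assumption, not the paper's actual implementation: the paper specifies only that PVM runs alongside the FFN and provides a distance-agnostic retrieval pathway to visual embeddings, so the single-head attention lookup over a cached visual memory, the class name `PVMBranch`, and all dimensions here are hypothetical.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

class PVMBranch:
    """Hypothetical PVM sketch: an on-demand lookup over cached visual
    embeddings. The retrieval depends only on the current hidden state,
    not on how much text has been generated (distance-agnostic)."""
    def __init__(self, d_model, visual_memory, rng):
        self.memory = visual_memory                     # (n_vis, d) cached visual embeddings
        self.W_q = rng.standard_normal((d_model, d_model)) * 0.02
        self.W_o = rng.standard_normal((d_model, d_model)) * 0.02

    def __call__(self, x):
        q = x @ self.W_q                                # queries from hidden states
        scores = q @ self.memory.T / np.sqrt(x.shape[-1])
        attn = softmax(scores)                          # normalized only over visual memory
        return (attn @ self.memory) @ self.W_o          # retrieved visual signal

def ffn(x, W1, W2):
    # Standard transformer feed-forward sub-layer (ReLU MLP).
    return np.maximum(x @ W1, 0.0) @ W2

rng = np.random.default_rng(0)
d, n_vis, n_tok = 64, 16, 8
visual_memory = rng.standard_normal((n_vis, d))
pvm = PVMBranch(d, visual_memory, rng)
W1 = rng.standard_normal((d, 4 * d)) * 0.02
W2 = rng.standard_normal((4 * d, d)) * 0.02

x = rng.standard_normal((n_tok, d))
# Parallel branches: the residual stream receives both the FFN output
# and the PVM retrieval, so visual signal re-enters at every layer.
out = x + ffn(x, W1, W2) + pvm(x)
print(out.shape)  # (8, 64)
```

Because the PVM softmax is normalized only over the fixed visual memory, its output magnitude does not shrink as the text context grows, which is the structural point of the parallel pathway.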
Why it matters
Large Vision-Language Models often lose visual context during long generations, as attention shifts toward the accumulating textual history. PVM offers a lightweight, structural solution to this "Visual Signal Dilution" problem, improving LVLM reliability and accuracy, particularly on complex reasoning tasks that require sustained visual perception.
Original Abstract
While autoregressive Large Vision-Language Models (LVLMs) demonstrate remarkable proficiency in multimodal tasks, they face a "Visual Signal Dilution" phenomenon, where the accumulation of textual history expands the attention partition function, causing visual attention to decay inversely with generated sequence length. To counteract this, we propose Persistent Visual Memory (PVM), a lightweight learnable module designed to ensure sustained, on-demand visual perception. Integrated as a parallel branch alongside the Feed-Forward Network (FFN) in LVLMs, PVM establishes a distance-agnostic retrieval pathway that directly provides visual embeddings for precise visual perception, thereby structurally mitigating the signal suppression inherent to deep generation. Extensive experiments on Qwen3-VL models demonstrate that PVM brings notable improvements with negligible parameter overhead, delivering consistent average accuracy gains across both 4B and 8B scales, particularly in complex reasoning tasks that demand persistent visual perception. Furthermore, in-depth analysis reveals that PVM can resist length-induced signal decay and accelerate internal prediction convergence.
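The dilution mechanism the abstract describes can be illustrated with a toy calculation (the token counts below are arbitrary, not from the paper): if attention logits were uniform, each generated text token would add one term to the softmax partition function, so the total attention mass on a fixed set of visual tokens decays as n_vis / (n_vis + n_text).

```python
import numpy as np

n_vis = 256  # fixed number of visual tokens from the image encoder
for n_text in (0, 256, 1024, 4096):
    # Uniform logits: every token receives 1/(n_vis + n_text) of the
    # softmax mass, so the visual share shrinks as text accumulates.
    logits = np.zeros(n_vis + n_text)
    attn = np.exp(logits) / np.exp(logits).sum()
    vis_mass = attn[:n_vis].sum()
    print(f"text tokens={n_text:5d}  visual attention mass={vis_mass:.3f}")
```

Real attention is not uniform, but the same partition-function growth applies: unless visual logits grow with sequence length, visual attention decays roughly inversely with the number of generated tokens, which is the decay PVM's separate retrieval pathway is designed to bypass.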