ArXiv TLDR

One Token per Highly Selective Frame: Towards Extreme Compression for Long Video Understanding

2604.14149

Zheyu Zhang, Ziqi Pang, Shixing Chen, Xiang Hao, Vimal Bhat + 1 more

cs.CV

TLDR

X-Comp introduces learnable token-level and question-conditioned frame-level compression to enable efficient, dense long video understanding with VLMs.

Key contributions

  • Introduces LP-Comp for learnable, progressive token-level compression, achieving one token per frame.
  • Proposes QC-Comp for question-conditioned frame selection using the LLM's internal attention scores (see the sketch after this list).
  • Mitigates LLM position bias in long videos by splitting them into short segments and applying local attention.
  • Processes 2x-4x more frames and boosts accuracy on long video benchmarks with minimal fine-tuning data.
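To give a rough intuition for the frame-level half, question-conditioned selection with segmented local attention can be thought of as scoring each frame against the question within short segments and keeping the top-scoring frames of every segment, so no single region of the timeline dominates. The sketch below is a hypothetical illustration, not the paper's implementation; the function name, pooling choices, and tensor shapes are assumptions.

```python
import torch

def select_frames_per_segment(frame_feats, question_feats, segment_len=16, top_k=4):
    """Hypothetical sketch of question-conditioned frame selection.

    frame_feats:    (num_frames, dim)   one pooled token per frame
    question_feats: (num_q_tokens, dim) question token embeddings
    Returns the indices of the selected frames.

    Scores are computed independently within short segments (local attention),
    so early/late frames cannot dominate the ranking (position bias).
    """
    num_frames, dim = frame_feats.shape
    q = question_feats.mean(dim=0)                      # pooled question vector
    selected = []
    for start in range(0, num_frames, segment_len):
        seg = frame_feats[start:start + segment_len]    # local segment of frames
        scores = (seg @ q) / dim ** 0.5                 # attention-like relevance scores
        k = min(top_k, seg.shape[0])
        idx = torch.topk(scores, k).indices + start     # segment-local top-k, global indices
        selected.append(idx)
    return torch.cat(selected).sort().values

# Toy usage: 64 frames with one 256-d token each, 12 question tokens.
frames = torch.randn(64, 256)
question = torch.randn(12, 256)
print(select_frames_per_segment(frames, question))
```

Selecting a fixed quota per segment is one simple way to enforce temporal coverage; the paper derives its scores from the LLM's own attention rather than a pooled dot product.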

Why it matters

Long video understanding is bottlenecked by VLM token limits. This paper offers a novel compression approach, X-Comp, that significantly increases the number of frames VLMs can process. It enables more comprehensive and accurate analysis of long videos with high data efficiency.

Original Abstract

Long video understanding is inherently challenging for vision-language models (VLMs) because of the extensive number of frames. With each video frame typically expanding into tens or hundreds of tokens, the limited context length of large language models (LLMs) forces the VLMs to perceive the frames sparsely and lose temporal information. To address this, we explore extreme video token compression towards one token per frame at the final LLM layer. Our key insight is that heuristic-based compression, widely adopted by previous methods, is prone to information loss, and this necessitates supervising LLM layers into learnable and progressive modules for token-level compression (LP-Comp). Such compression enables our VLM to digest 2x-4x more frames with improved performance. To further increase the token efficiency, we investigate frame-level compression, which selects the frames most relevant to the queries via the internal attention scores of the LLM layers, named question-conditioned compression (QC-Comp). As a notable distinction from previous studies, we mitigate the position bias of LLM attention in long contexts, i.e., the over-concentration on the beginning and end of a sequence, by splitting long videos into short segments and employing local attention. Collectively, our combined token-level and frame-level compression leads to an extreme compression model for long video understanding, named X-Comp, achieving a significantly larger compression ratio and enabling denser frame sampling. Our X-Comp is finetuned from VideoChat-Flash with a data-efficient supervised compression tuning stage that only requires 2.5% of the supervised fine-tuning data, yet boosts the accuracy from 42.9% to 46.2% on LVBench and enhances multiple other long video benchmarks.
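For intuition on the token-level half, LP-Comp progressively reduces the number of visual tokens per frame as they pass through the LLM, ending at one token per frame. The following is a minimal, hypothetical sketch that uses plain average pooling in place of the paper's learnable, supervised compression modules; the function name and the pooling schedule are assumptions.

```python
import torch
import torch.nn.functional as F

def progressive_compress(frame_tokens, schedule=(64, 16, 4, 1)):
    """Hypothetical sketch of progressive token-level compression.

    frame_tokens: (num_frames, tokens_per_frame, dim) visual tokens per frame.
    schedule:     target token counts after successive layer groups,
                  ending at one token per frame.

    Average pooling stands in for the learnable, supervised reduction
    performed inside the LLM layers in the paper.
    """
    x = frame_tokens
    for target in schedule:
        # Pool along the token dimension: (F, T, D) -> (F, target, D).
        x = F.adaptive_avg_pool1d(x.transpose(1, 2), target).transpose(1, 2)
    return x  # (num_frames, 1, dim): one token per frame

# Toy usage: 8 frames, 256 visual tokens per frame, 1024-d features.
tokens = torch.randn(8, 256, 1024)
print(progressive_compress(tokens).shape)  # torch.Size([8, 1, 1024])
```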
