ArXiv TLDR

Tango: Taming Visual Signals for Efficient Video Large Language Models

2604.09547

Shukang Yin, Sirui Zhao, Hanchao Wang, Baozhi Jia, Xianquan Wang + 2 more

cs.CV

TLDR

Tango improves token pruning in Video LLMs through diversity-driven attention selection and structure-preserving similarity clustering, delivering a 1.88x inference speedup while retaining 98.9% of the original performance.

Key contributions

  • Addresses limitations in existing attention-based selection and similarity-based clustering for Video LLMs.
  • Introduces a diversity-driven strategy to enhance attention-based token selection.
  • Proposes Spatio-temporal Rotary Position Embedding (ST-RoPE) to preserve geometric structure in clusters.
  • Achieves a 1.88x inference speedup while retaining 98.9% of the original performance on LLaVA-OV with only 10% of video tokens.
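The diversity-driven selection mentioned above could be realized as a greedy, maximal-marginal-relevance-style pick that trades attention score against redundancy with already-chosen tokens. The summary does not specify the actual criterion, so this is a hypothetical sketch; `diverse_topk` and the `lam` trade-off weight are illustrative names, not the paper's API.

```python
import numpy as np

def diverse_topk(attn, feats, k, lam=0.5):
    """Greedy MMR-style token selection (hypothetical sketch).

    attn:  (n,) attention scores per visual token
    feats: (n, d) token features
    k:     number of tokens to keep
    lam:   weight between attention score and diversity
    """
    # Cosine similarity between token features.
    f = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    sim = f @ f.T

    # Start from the highest-attention token.
    selected = [int(np.argmax(attn))]
    while len(selected) < k:
        # Redundancy = similarity to the closest already-selected token.
        redundancy = sim[:, selected].max(axis=1)
        score = lam * attn - (1.0 - lam) * redundancy
        score[selected] = -np.inf  # never re-pick a token
        selected.append(int(np.argmax(score)))
    return sorted(selected)
```

Unlike plain top-k, which can concentrate on one high-attention region, the redundancy penalty spreads the kept tokens across the spatially multi-modal attention map the abstract describes.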

Why it matters

Efficient Video LLMs are crucial for processing vast amounts of video data. Tango provides a significant advancement in token pruning, enabling faster inference without sacrificing performance. This makes Video LLMs more practical for real-world applications.

Original Abstract

Token pruning has emerged as a mainstream approach for developing efficient Video Large Language Models (Video LLMs). This work revisits and advances the two predominant token-pruning paradigms: attention-based selection and similarity-based clustering. Our study reveals two critical limitations in existing methods: (1) conventional top-k selection strategies fail to fully account for the attention distribution, which is often spatially multi-modal and long-tailed in magnitude; and (2) direct similarity-based clustering frequently generates fragmented clusters, resulting in distorted representations after pooling. To address these bottlenecks, we propose Tango, a novel framework designed to optimize the utilization of visual signals. Tango integrates a diversity-driven strategy to enhance attention-based token selection, and introduces Spatio-temporal Rotary Position Embedding (ST-RoPE) to preserve geometric structure via locality priors. Comprehensive experiments across various Video LLMs and video understanding benchmarks demonstrate the effectiveness and generalizability of our approach. Notably, when retaining only 10% of the video tokens, Tango preserves 98.9% of the original performance on LLaVA-OV while delivering a 1.88x inference speedup.
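The two baseline paradigms the abstract critiques can be sketched minimally as follows. This is a generic illustration under assumed interfaces (function names are hypothetical, not the paper's code): plain top-k over attention scores, and k-means clustering of token features followed by mean pooling.

```python
import numpy as np

def topk_select(attn, k):
    """Baseline attention-based selection: keep the k highest-attention
    tokens. With a long-tailed, spatially multi-modal attention map this
    tends to pick many near-duplicate tokens from one region."""
    return np.argsort(attn)[::-1][:k]

def cluster_pool(feats, k, iters=10, seed=0):
    """Baseline similarity-based clustering: k-means over token features,
    then mean-pool each cluster into a single token. Without locality
    priors (the gap ST-RoPE targets), clusters can fragment spatially,
    distorting the pooled representation."""
    rng = np.random.default_rng(seed)
    centers = feats[rng.choice(len(feats), k, replace=False)]
    for _ in range(iters):
        # Assign each token to its nearest center, then recompute means.
        d = ((feats[:, None] - centers[None]) ** 2).sum(-1)
        assign = d.argmin(1)
        for c in range(k):
            mask = assign == c
            if mask.any():
                centers[c] = feats[mask].mean(0)
    return centers  # (k, d) pooled tokens
```

Tango's contribution, per the abstract, is to fix both baselines: diversifying the selection in the first, and injecting spatio-temporal position structure before clustering in the second.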
