VisPCO: Visual Token Pruning Configuration Optimization via Budget-Aware Pareto-Frontier Learning for Vision-Language Models
Huawei Ji, Yuanhao Sun, Yuan Jin, Cheng Deng, Jiaxin Ding + 2 more
TLDR
VisPCO optimizes visual token pruning in VLMs by learning Pareto-optimal configurations for superior accuracy-efficiency tradeoffs.
Key contributions
- VisPCO formulates visual token pruning as a Pareto configuration optimization problem.
- Employs continuous relaxation and gradient-based search, solved via the Augmented Lagrangian method, to find optimal configurations.
- Effectively approximates empirical Pareto frontiers, generalizing across various pruning methods and VLMs.
- Reveals that multi-step progressive pruning achieves superior accuracy-efficiency trade-offs compared to single-layer pruning.
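The budget-constrained search in the second contribution can be sketched with the Augmented Lagrangian method on a toy surrogate. Everything here is an illustrative assumption, not the paper's objective: per-layer keep-ratios `r`, weights `w` modeling how much each layer's accuracy suffers from pruning, and a compute budget on the mean keep-ratio.

```python
import numpy as np

# Toy sketch (not the paper's implementation): find per-layer keep-ratios r
# minimizing a surrogate "accuracy loss" sum_l w_l * (1 - r_l)^2 subject to
# a compute budget mean(r) <= budget, via the Augmented Lagrangian method.
def augmented_lagrangian_search(w, budget, rho=10.0, lr=0.05, outer=50, inner=100):
    n_layers = len(w)
    r = np.full(n_layers, 0.9)   # continuous relaxation of keep-ratios in [0, 1]
    lam = 0.0                    # Lagrange multiplier for the budget constraint
    for _ in range(outer):
        for _ in range(inner):
            g = r.mean() - budget                # constraint violation (<= 0 wanted)
            viol = max(g, -lam / rho)            # inequality-constraint clamp
            grad = -2 * w * (1 - r) + (lam + rho * viol) / n_layers
            r = np.clip(r - lr * grad, 0.0, 1.0)  # projected gradient step
        lam = max(0.0, lam + rho * (r.mean() - budget))  # dual ascent on multiplier
    return r

w = np.array([1.0, 0.5, 0.25, 0.1])   # assumption: deeper layers tolerate more pruning
r = augmented_lagrangian_search(w, budget=0.5)
```

With these weights the solver keeps more tokens in early layers and prunes deep layers aggressively while the mean keep-ratio converges to the budget, mirroring the multi-step progressive pattern the paper reports.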
Why it matters
Existing visual token pruning methods for VLMs rely on predefined configurations with no guarantee of computation-performance optimality. VisPCO introduces an automated, gradient-based framework that identifies optimal pruning configurations. This matters for building more efficient and accurate vision-language models, especially when processing high-resolution images and videos.
Original Abstract
Visual token pruning methods effectively mitigate the quadratic computational growth caused by processing high-resolution images and video frames in vision-language models (VLMs). However, existing approaches rely on predefined pruning configurations without determining whether they achieve computation-performance optimality. In this work, we introduce VisPCO, a novel framework that formulates visual token pruning as a Pareto configuration optimization problem to automatically identify optimal configurations. Our approach employs continuous relaxation and straight-through estimators to enable gradient-based search, solved via the Augmented Lagrangian method. Extensive experiments across 8 visual benchmarks demonstrate that VisPCO effectively approximates the empirical Pareto frontier obtained through grid search and generalizes well across various pruning methods and VLM architectures. Furthermore, through learnable kernel functions, we investigate layer-wise pruning patterns and reveal that multi-step progressive pruning captures VLMs' hierarchical compression structure, achieving superior accuracy-efficiency trade-offs compared to single-layer approaches.
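The straight-through estimator mentioned in the abstract is a standard trick for differentiating through discrete decisions; this generic numpy sketch (not the paper's code) hard-thresholds continuous keep-scores in the forward pass while letting gradients pass through unchanged in the backward pass:

```python
import numpy as np

# Hedged sketch of a straight-through estimator (STE) for a binary keep-mask,
# as commonly used for discrete pruning decisions; not the paper's code.
def ste_forward(scores, threshold=0.5):
    # Forward: hard-threshold continuous scores into a discrete {0, 1} keep/drop mask.
    return (scores > threshold).astype(scores.dtype)

def ste_backward(grad_output):
    # Backward: treat the hard threshold as the identity, so gradients
    # flow unchanged to the continuous scores.
    return grad_output

scores = np.array([0.2, 0.7, 0.9, 0.4])
mask = ste_forward(scores)           # -> array([0., 1., 1., 0.])
grad_scores = ste_backward(np.array([0.1, -0.3, 0.5, 0.2]))
```

Because the forward pass is discrete, the pruned model's actual compute cost is measured exactly, while the identity backward pass keeps the configuration search end-to-end differentiable.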