VIP: Visual-guided Prompt Evolution for Efficient Dense Vision-Language Inference
Hao Zhu, Shuo Jin, Wenbin Liao, Jiayu Xiao, Yan Zhu, et al.
TLDR
VIP enhances dino.txt for efficient, high-quality open-vocabulary semantic segmentation by evolving text prompts with visual guidance.
Key contributions
- Introduces VIP, a visual-guided prompt evolution method, to improve semantic expressiveness in dino.txt.
- Integrates alias expansion and visual-guided distillation for robust semantic cue aggregation.
- Achieves 1.4%–8.4% higher average mIoU than top-leading methods in open-vocabulary semantic segmentation.
- Offers strong generalization across diverse domains with minimal inference time and memory overhead.
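
The contributions above center on two ideas: expanding each text query into a set of aliases, and aggregating the resulting per-alias responses in a saliency-aware way. The sketch below illustrates that flow in a minimal, self-contained form; the alias table, the peak-response weighting, and the `segment` helper are illustrative assumptions, not the paper's actual implementation or the dino.txt API.

```python
import numpy as np

# Hypothetical alias expansion: each class query maps to several
# synonymous prompts (illustrative table, not from the paper).
ALIASES = {
    "sofa": ["sofa", "couch", "settee"],
    "car": ["car", "automobile", "sedan"],
}

def saliency_aware_aggregate(alias_maps: np.ndarray) -> np.ndarray:
    """Fuse per-alias similarity maps of shape (A, H, W) into one (H, W) map.

    Each alias map is weighted by its peak response — a simple stand-in
    for the saliency-aware aggregation described in the abstract.
    """
    saliency = alias_maps.max(axis=(1, 2))               # (A,) peak score per alias
    weights = np.exp(saliency) / np.exp(saliency).sum()  # softmax over aliases
    return np.tensordot(weights, alias_maps, axes=1)     # (H, W) weighted sum

def segment(query: str, sim_fn, hw=(4, 4)) -> np.ndarray:
    """Score every location against the expanded alias set of `query`.

    `sim_fn(alias, hw)` stands in for a dense text-to-image similarity map;
    a real system would obtain these from a model such as dino.txt.
    """
    maps = np.stack([sim_fn(a, hw) for a in ALIASES.get(query, [query])])
    return saliency_aware_aggregate(maps)

# Toy similarity function: random maps in place of real model scores.
rng = np.random.default_rng(0)
fused = segment("sofa", lambda a, hw: rng.random(hw))
print(fused.shape)  # (4, 4)
```

The key point is that no single prompt has to carry the full semantics of a class: ambiguous queries are broadened by aliases, and the weighting step suppresses aliases whose response maps are weak.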
Why it matters
This paper tackles the challenge of efficient open-vocabulary semantic segmentation by enhancing text query expressiveness in spatially-aware models. VIP significantly boosts performance and generalization, offering a practical and efficient solution for fine-grained object perception.
Original Abstract
Pursuing training-free open-vocabulary semantic segmentation in an efficient and generalizable manner remains challenging due to the deep-seated spatial bias in CLIP. To overcome the limitations of existing solutions, this work moves beyond the CLIP-based paradigm and harnesses the recent spatially-aware dino.txt framework to facilitate more efficient and high-quality dense prediction. While dino.txt exhibits robust spatial awareness, we find that the semantic ambiguity of text queries gives rise to severe mismatch within its dense cross-modal interactions. To address this, we introduce VIsual-guided Prompt evolution (VIP) to rectify the semantic expressiveness of text queries in dino.txt, unleashing its potential for fine-grained object perception. Towards this end, VIP integrates alias expansion with a visual-guided distillation mechanism to mine valuable semantic cues, which are robustly aggregated in a saliency-aware manner to yield a high-fidelity prediction. Extensive evaluations demonstrate that VIP: (1) surpasses the top-leading methods by 1.4%–8.4% average mIoU, (2) generalizes well to diverse challenging domains, and (3) requires marginal inference time and memory overhead. Our code is publicly available at https://github.com/MiSsU-HH/VIP.