CoME-VL: Scaling Complementary Multi-Encoder Vision-Language Learning
Ankan Deria, Komal Kumar, Xilin He, Imran Razzak, Hisham Cholakkal + 2 more
TLDR
CoME-VL introduces a novel multi-encoder framework that fuses complementary contrastive and self-supervised vision representations for improved VLM performance.
Key contributions
- Integrates contrastive (CLIP-style) and self-supervised (DINO) vision encoders.
- Uses entropy-guided multi-layer aggregation with orthogonality-constrained projections (see the sketch after this list).
- Employs RoPE-enhanced cross-attention for aligning heterogeneous token grids.
- Achieves state-of-the-art detection performance on RefCOCO, with average gains of 4.9% on visual understanding tasks and 5.4% on grounding tasks.
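
The aggregation idea can be pictured roughly as follows. This is a minimal, hypothetical sketch rather than the authors' implementation: it assumes one learnable weight per encoder layer modulated by token-level feature entropy, and a linear projection with a soft orthogonality penalty to discourage redundant feature directions. Names such as `EntropyGuidedAggregator` and `orthogonality_penalty`, and the exact sign/scale of the entropy term, are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EntropyGuidedAggregator(nn.Module):
    """Illustrative sketch: mix several encoder layers with entropy-guided
    weights, then project with a matrix pushed toward orthogonality."""

    def __init__(self, num_layers: int, dim: int, out_dim: int):
        super().__init__()
        self.layer_logits = nn.Parameter(torch.zeros(num_layers))  # learnable prior over layers
        self.proj = nn.Linear(dim, out_dim, bias=False)

    def forward(self, layer_feats: torch.Tensor) -> torch.Tensor:
        # layer_feats: (num_layers, batch, tokens, dim) — hidden states from several encoder layers.
        probs = layer_feats.softmax(dim=-1)                         # treat each token's features as a distribution
        ent = -(probs * probs.clamp_min(1e-9).log()).sum(dim=-1)    # (L, B, T) token-level entropy
        ent = ent.mean(dim=(1, 2))                                  # one entropy score per layer
        weights = F.softmax(self.layer_logits - ent, dim=0)         # entropy-guided layer weights (sign is an assumption)
        fused = torch.einsum("l,lbtd->btd", weights, layer_feats)   # weighted mixture across layers
        return self.proj(fused)

    def orthogonality_penalty(self) -> torch.Tensor:
        # Soft constraint: push W W^T toward identity so projected directions stay non-redundant.
        w = self.proj.weight                                        # (out_dim, dim)
        gram = w @ w.t()
        eye = torch.eye(gram.size(0), device=w.device)
        return ((gram - eye) ** 2).mean()
```

In training, `orthogonality_penalty()` would be added to the task loss with a small coefficient; the paper's exact regularization may differ.
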
Why it matters
Most VLMs rely on a single contrastive vision encoder and therefore miss the dense semantics that self-supervised encoders capture. CoME-VL combines these complementary visual representations, yielding more robust and accurate vision-language models and substantially better performance on visual understanding and grounding tasks.
Original Abstract
Recent vision-language models (VLMs) typically rely on a single vision encoder trained with contrastive image-text objectives, such as CLIP-style pretraining. While contrastive encoders are effective for cross-modal alignment and retrieval, self-supervised visual encoders often capture richer dense semantics and exhibit stronger robustness on recognition and understanding tasks. In this work, we investigate how to scale the fusion of these complementary visual representations for vision-language modeling. We propose CoME-VL: Complementary Multi-Encoder Vision-Language, a modular fusion framework that integrates a contrastively trained vision encoder with a self-supervised DINO encoder. Our approach performs representation-level fusion by (i) entropy-guided multi-layer aggregation with orthogonality-constrained projections to reduce redundancy, and (ii) RoPE-enhanced cross-attention to align heterogeneous token grids and produce compact fused visual tokens. The fused tokens can be injected into a decoder-only LLM with minimal changes to standard VLM pipelines. Extensive experiments across diverse vision-language benchmarks demonstrate that CoME-VL consistently outperforms single-encoder baselines. In particular, we observe an average improvement of 4.9% on visual understanding tasks and 5.4% on grounding tasks. Our method achieves state-of-the-art performance on RefCOCO for detection while improving over the baseline by a large margin. Finally, we conduct ablation studies on layer merging, non-redundant feature mixing, and fusion capacity to evaluate how complementary contrastive and self-supervised signals affect VLM performance.
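
The cross-attention fusion described in the abstract can be sketched in a similar spirit. This is an illustrative approximation, not the released implementation: it assumes CLIP-stream tokens act as queries over DINO-stream keys and values, with rotary embeddings computed from normalised 1D positions so the two token grids share one coordinate frame (the actual model presumably uses 2D grid positions and may apply RoPE differently). The names `RoPECrossFusion` and `rope` are made up for the example.

```python
import math
import torch
import torch.nn as nn

def rope(x: torch.Tensor, pos: torch.Tensor) -> torch.Tensor:
    """Rotate feature pairs of x (B, T, D) by angles derived from fractional
    positions pos (T,); normalised positions let different grids share a frame."""
    d = x.size(-1)
    freqs = 1.0 / (10000 ** (torch.arange(0, d, 2, device=x.device, dtype=x.dtype) / d))
    ang = pos[:, None] * freqs[None, :]                   # (T, D/2)
    cos, sin = ang.cos(), ang.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

class RoPECrossFusion(nn.Module):
    """Illustrative sketch: CLIP-stream tokens query DINO-stream tokens under a
    shared rotary coordinate frame, producing fused visual tokens for the LLM."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        assert dim % num_heads == 0 and (dim // num_heads) % 2 == 0
        self.h, self.dh = num_heads, dim // num_heads
        self.q_proj, self.k_proj, self.v_proj = (nn.Linear(dim, dim) for _ in range(3))
        self.o_proj = nn.Linear(dim, dim)

    def forward(self, clip_tokens: torch.Tensor, dino_tokens: torch.Tensor) -> torch.Tensor:
        # clip_tokens: (B, Tq, D) on one grid; dino_tokens: (B, Tk, D) on another.
        B, Tq, D = clip_tokens.shape
        Tk = dino_tokens.size(1)
        # Normalised 1D positions so both grids span [0, 1] (2D positions in the real model).
        q_pos = torch.linspace(0, 1, Tq, device=clip_tokens.device)
        k_pos = torch.linspace(0, 1, Tk, device=dino_tokens.device)
        q = rope(self.q_proj(clip_tokens), q_pos)          # rotate queries after projection
        k = rope(self.k_proj(dino_tokens), k_pos)          # rotate keys onto the same frame
        v = self.v_proj(dino_tokens)
        split = lambda x: x.view(B, -1, self.h, self.dh).transpose(1, 2)   # (B, H, T, dh)
        q, k, v = split(q), split(k), split(v)
        attn = (q @ k.transpose(-2, -1)) / math.sqrt(self.dh)
        fused = (attn.softmax(dim=-1) @ v).transpose(1, 2).reshape(B, Tq, D)
        return clip_tokens + self.o_proj(fused)            # residual: DINO detail injected into the CLIP stream
```

The fused tokens would then be projected into the LLM's embedding space, as in standard decoder-only VLM pipelines; how CoME-VL compresses them into a compact token set is not spelled out here.
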