Let ViT Speak: Generative Language-Image Pre-training
Yan Fang, Mengcheng Lan, Zilong Huang, Weixian Lei, Yunqing Zhao, et al.
TLDR
GenLIP is a simple, scalable generative pre-training framework that enables Vision Transformers to directly predict language tokens, achieving strong multimodal performance.
Key contributions
- Introduces GenLIP, a minimalist generative pre-training for ViTs in MLLMs.
- Trains the ViT to directly predict language tokens from visual tokens using a standard language modeling objective (see the sketch after this list).
- Achieves competitive or superior results across diverse multimodal benchmarks despite using substantially less pretraining data than strong baselines.
- Improves detail-sensitive tasks such as OCR and chart understanding via continued pretraining on multi-resolution images at native aspect ratios.
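For concreteness, here is a minimal PyTorch-style sketch of what such an objective could look like: a single transformer consumes visual patch tokens followed by caption tokens and is trained with a standard next-token loss on the text positions. The module names, dimensions, masking scheme, and patch embedding are illustrative assumptions, not the authors' implementation.

```python
# Hedged sketch of a GenLIP-style generative objective (assumptions, not the
# paper's code): one transformer over [visual tokens, text tokens] with a
# next-token loss computed only on the text span.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GenerativeViT(nn.Module):
    def __init__(self, vocab_size=32000, dim=768, depth=12, heads=12, patch=16):
        super().__init__()
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.text_embed = nn.Embedding(vocab_size, dim)
        layer = nn.TransformerEncoderLayer(dim, heads, 4 * dim, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, depth)
        self.lm_head = nn.Linear(dim, vocab_size)

    def forward(self, images, text_ids):
        # Visual tokens from non-overlapping patches: (B, N_img, dim).
        # Positional embeddings are omitted here for brevity.
        v = self.patch_embed(images).flatten(2).transpose(1, 2)
        # Text tokens: (B, N_txt, dim).
        t = self.text_embed(text_ids)
        x = torch.cat([v, t], dim=1)
        # Simple causal mask over the whole sequence; the paper may treat the
        # visual prefix differently (e.g. bidirectional attention).
        n = x.size(1)
        mask = torch.triu(torch.ones(n, n, dtype=torch.bool, device=x.device), 1)
        h = self.blocks(x, mask=mask)
        # Predict each text token from the hidden state of the previous position.
        n_img = v.size(1)
        logits = self.lm_head(h[:, n_img - 1:-1])  # (B, N_txt, vocab)
        return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                               text_ids.reshape(-1))
```

Note how nothing in this sketch requires contrastive batch construction or a separate text decoder: the same transformer that encodes the image also produces the language logits.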
Why it matters
GenLIP simplifies multimodal pre-training by removing components such as contrastive batch construction and separate text decoders. Its direct language-prediction approach makes ViTs a more natural fit for MLLMs, offering a scalable and high-performing foundation for future vision-language models.
Original Abstract
In this paper, we present Generative Language-Image Pre-training (GenLIP), a minimalist generative pretraining framework for Vision Transformers (ViTs) designed for multimodal large language models (MLLMs). To better align vision encoders with the autoregressive nature of LLMs, GenLIP trains a ViT to predict language tokens directly from visual tokens using a standard language modeling objective, without contrastive batch construction or an additional text decoder. This design offers three key advantages: (1) Simplicity: a single transformer jointly models visual and textual tokens; (2) Scalability: it scales effectively with both data and model size; and (3) Performance: it achieves competitive or superior results across diverse multimodal benchmarks. Trained on 8B samples from Recap-DataComp-1B, GenLIP matches or surpasses strong baselines despite using substantially less pretraining data. After continued pretraining on multi-resolution images at native aspect ratios, GenLIP further improves on detail-sensitive tasks such as OCR and chart understanding, making it a strong foundation for vision encoders in MLLMs.
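The continued-pretraining stage feeds images at native aspect ratios rather than fixed square crops. Below is a hedged sketch of one way such inputs could be prepared; the patch size, token budget, and rounding rule are assumptions for illustration, not the paper's recipe.

```python
# Illustrative preprocessing for native-aspect-ratio, variable-resolution
# inputs (an assumption, not the authors' pipeline): resize so both sides are
# multiples of the patch size, roughly preserving aspect ratio and capping
# the number of visual tokens.
from PIL import Image

def to_native_grid(img: Image.Image, patch: int = 16, max_tokens: int = 1024):
    w, h = img.size
    # Largest uniform scale such that (w*s/patch) * (h*s/patch) <= max_tokens.
    scale = min(1.0, (max_tokens * patch * patch / (w * h)) ** 0.5)
    gw = max(1, round(w * scale / patch))   # patch-grid width
    gh = max(1, round(h * scale / patch))   # patch-grid height
    return img.resize((gw * patch, gh * patch)), (gh, gw)  # image, token grid
```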