Let ViT Speak: Generative Language-Image Pre-training
Yan Fang, Mengcheng Lan, Zilong Huang, Weixian Lei, Yunqing Zhao, et al.
TLDR
GenLIP is a simple, scalable generative pre-training framework that enables Vision Transformers to directly predict language tokens, achieving strong multimodal performance.
Key contributions
- Introduces GenLIP, a minimalist generative pre-training for ViTs in MLLMs.
- Trains the ViT to directly predict language tokens from visual tokens using a standard language modeling objective (see the sketch after this list).
- Achieves competitive or superior results across diverse multimodal benchmarks despite using substantially less pretraining data than strong baselines.
- Improves detail-sensitive tasks such as OCR and chart understanding via continued pretraining on multi-resolution images at native aspect ratios.
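For concreteness, here is a minimal PyTorch-style sketch of what such an objective could look like: a single transformer consumes visual patch tokens followed by caption tokens and is trained with a standard next-token loss on the text positions. The module names, dimensions, masking scheme, and patch embedding are illustrative assumptions, not the authors' implementation.

```python
# Hedged sketch of a GenLIP-style generative objective (assumptions, not the
# paper's code): one transformer over [visual tokens, text tokens] with a
# next-token loss computed only on the text span.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GenerativeViT(nn.Module):
    def __init__(self, vocab_size=32000, dim=768, depth=12, heads=12, patch=16):
        super().__init__()
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.text_embed = nn.Embedding(vocab_size, dim)
        layer = nn.TransformerEncoderLayer(dim, heads, 4 * dim, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, depth)
        self.lm_head = nn.Linear(dim, vocab_size)

    def forward(self, images, text_ids):
        # Visual tokens from non-overlapping patches: (B, N_img, dim).
        # Positional embeddings are omitted here for brevity.
        v = self.patch_embed(images).flatten(2).transpose(1, 2)
        # Text tokens: (B, N_txt, dim).
        t = self.text_embed(text_ids)
        x = torch.cat([v, t], dim=1)
        # Simple causal mask over the whole sequence; the paper may treat the
        # visual prefix differently (e.g. bidirectional attention).
        n = x.size(1)
        mask = torch.triu(torch.ones(n, n, dtype=torch.bool, device=x.device), 1)
        h = self.blocks(x, mask=mask)
        # Predict each text token from the hidden state of the previous position.
        n_img = v.size(1)
        logits = self.lm_head(h[:, n_img - 1:-1])  # (B, N_txt, vocab)
        return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                               text_ids.reshape(-1))
```

Note how nothing in this sketch requires contrastive batch construction or a separate text decoder: the same transformer that encodes the image also produces the language logits.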
Why it matters
GenLIP simplifies multimodal pre-training by removing components such as contrastive batch construction and separate text decoders. Its direct language-prediction approach makes ViTs a more natural fit for MLLMs, offering a scalable and high-performing foundation for future vision-language models.
Original Abstract
In this paper, we present Generative Language-Image Pre-training (GenLIP), a minimalist generative pretraining framework for Vision Transformers (ViTs) designed for multimodal large language models (MLLMs). To better align vision encoders with the autoregressive nature of LLMs, GenLIP trains a ViT to predict language tokens directly from visual tokens using a standard language modeling objective, without contrastive batch construction or an additional text decoder. This design offers three key advantages: (1) Simplicity: a single transformer jointly models visual and textual tokens; (2) Scalability: it scales effectively with both data and model size; and (3) Performance: it achieves competitive or superior results across diverse multimodal benchmarks. Trained on 8B samples from Recap-DataComp-1B, GenLIP matches or surpasses strong baselines despite using substantially less pretraining data. After continued pretraining on multi-resolution images at native aspect ratios, GenLIP further improves on detail-sensitive tasks such as OCR and chart understanding, making it a strong foundation for vision encoders in MLLMs.
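The continued-pretraining stage feeds images at native aspect ratios rather than fixed square crops. Below is a hedged sketch of one way such inputs could be prepared; the patch size, token budget, and rounding rule are assumptions for illustration, not the paper's recipe.

```python
# Illustrative preprocessing for native-aspect-ratio, variable-resolution
# inputs (an assumption, not the authors' pipeline): resize so both sides are
# multiples of the patch size, roughly preserving aspect ratio and capping
# the number of visual tokens.
from PIL import Image

def to_native_grid(img: Image.Image, patch: int = 16, max_tokens: int = 1024):
    w, h = img.size
    # Largest uniform scale such that (w*s/patch) * (h*s/patch) <= max_tokens.
    scale = min(1.0, (max_tokens * patch * patch / (w * h)) ** 0.5)
    gw = max(1, round(w * scale / patch))   # patch-grid width
    gh = max(1, round(h * scale / patch))   # patch-grid height
    return img.resize((gw * patch, gh * patch)), (gh, gw)  # image, token grid
```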