ArXiv TLDR

InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks

arXiv:2312.14238

Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen + 10 more

cs.CV

TLDR

InternVL scales a vision foundation model to 6 billion parameters and progressively aligns it with large language models, achieving state-of-the-art results across diverse visual-linguistic tasks.

Key contributions

  • Scales vision foundation model to 6 billion parameters and aligns it progressively with large language models.
  • Trained on web-scale, diverse image-text data from various sources, enabling broad applicability across 32 visual-linguistic benchmarks.
  • Achieves state-of-the-art performance on zero-shot image/video classification and image/video-text retrieval, and can be linked with LLMs to build multi-modal dialogue systems (a generic sketch of the alignment idea follows this list).
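The core idea is aligning a scaled-up vision encoder with a language model using image-text pairs. Below is a minimal sketch of contrastive image-text alignment (the CLIP-style symmetric InfoNCE objective that this line of work builds on). It is an illustration of the general technique, not InternVL's actual training code; the function name, feature dimensions, and random tensors are hypothetical placeholders.

```python
# Generic contrastive image-text alignment sketch (CLIP-style), for illustration only.
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(image_features, text_features, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired image/text embeddings."""
    # Normalize so the dot product becomes cosine similarity.
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

    # Similarity matrix: entry (i, j) compares image i with text j.
    logits = image_features @ text_features.t() / temperature

    # Matching image-text pairs lie on the diagonal.
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2

# Toy usage with random embeddings standing in for encoder outputs.
batch, dim = 8, 512
img = torch.randn(batch, dim)   # would come from the vision encoder
txt = torch.randn(batch, dim)   # would come from the text/LLM-side encoder
print(contrastive_alignment_loss(img, txt).item())
```

In practice, InternVL combines such contrastive alignment with generative objectives during its progressive alignment stages; see the paper and the linked repository for the actual recipe.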

Why it matters

Vision and vision-language foundation models have lagged behind the rapid advances in large language models; this paper narrows that gap by scaling up a vision model and aligning it with LLMs. By demonstrating strong zero-shot capabilities and broad task generalization, InternVL advances the development of truly multi-modal AI systems that integrate visual perception with language understanding, a critical step toward more general and versatile AI.

Original Abstract

The exponential growth of large language models (LLMs) has opened up numerous possibilities for multimodal AGI systems. However, the progress in vision and vision-language foundation models, which are also critical elements of multi-modal AGI, has not kept pace with LLMs. In this work, we design a large-scale vision-language foundation model (InternVL), which scales up the vision foundation model to 6 billion parameters and progressively aligns it with the LLM, using web-scale image-text data from various sources. This model can be broadly applied to and achieve state-of-the-art performance on 32 generic visual-linguistic benchmarks including visual perception tasks such as image-level or pixel-level recognition, vision-language tasks such as zero-shot image/video classification, zero-shot image/video-text retrieval, and link with LLMs to create multi-modal dialogue systems. It has powerful visual capabilities and can be a good alternative to the ViT-22B. We hope that our research could contribute to the development of multi-modal large models. Code and models are available at https://github.com/OpenGVLab/InternVL.
