ArXiv TLDR

Learning Transferable Visual Models From Natural Language Supervision

arXiv:2103.00020

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh + 7 more

cs.CV cs.LG

TLDR

This paper presents CLIP, a model that learns versatile visual representations by training on 400 million image-text pairs, enabling zero-shot transfer to diverse vision tasks without task-specific training.

Key contributions

  • Introduces a scalable pre-training task: predicting which caption goes with which image, trained on 400 million (image, text) pairs collected from the internet.
  • Enables zero-shot transfer to over 30 existing vision datasets spanning OCR, action recognition, geo-localization, and fine-grained classification (see the usage sketch after this list).
  • Matches the accuracy of the original fully supervised ResNet-50 on ImageNet in a zero-shot setting, without using any of its 1.28 million labeled training examples.
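
The repository linked in the abstract (https://github.com/OpenAI/CLIP) releases the code and pre-trained weights. The sketch below shows what zero-shot classification with them typically looks like; it assumes the open-source `clip` Python package, and the image path and label prompts are illustrative placeholders, not from the paper.

```python
# Zero-shot classification sketch with the released CLIP package.
# "example.jpg" and the label prompts are placeholders.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Class names are phrased as natural-language prompts; the model was
# never trained on this label set directly.
labels = ["a photo of a dog", "a photo of a cat", "a photo of a car"]
image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)
text = clip.tokenize(labels).to(device)

with torch.no_grad():
    logits_per_image, _ = model(image, text)         # scaled image-text similarities
    probs = logits_per_image.softmax(dim=-1).cpu()   # probabilities over the prompts

print(dict(zip(labels, probs[0].tolist())))
```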

Why it matters

This work matters because it fundamentally shifts how visual models are trained—from relying on fixed, labeled categories to leveraging natural language supervision at scale. This approach dramatically improves model generality and usability, allowing a single model to perform well across many tasks without additional training, which reduces the need for costly labeled datasets and accelerates deployment in real-world applications.

Original Abstract

State-of-the-art computer vision systems are trained to predict a fixed set of predetermined object categories. This restricted form of supervision limits their generality and usability since additional labeled data is needed to specify any other visual concept. Learning directly from raw text about images is a promising alternative which leverages a much broader source of supervision. We demonstrate that the simple pre-training task of predicting which caption goes with which image is an efficient and scalable way to learn SOTA image representations from scratch on a dataset of 400 million (image, text) pairs collected from the internet. After pre-training, natural language is used to reference learned visual concepts (or describe new ones) enabling zero-shot transfer of the model to downstream tasks. We study the performance of this approach by benchmarking on over 30 different existing computer vision datasets, spanning tasks such as OCR, action recognition in videos, geo-localization, and many types of fine-grained object classification. The model transfers non-trivially to most tasks and is often competitive with a fully supervised baseline without the need for any dataset specific training. For instance, we match the accuracy of the original ResNet-50 on ImageNet zero-shot without needing to use any of the 1.28 million training examples it was trained on. We release our code and pre-trained model weights at https://github.com/OpenAI/CLIP.
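
The pre-training task the abstract describes (predicting which caption goes with which image) is a symmetric contrastive objective over a batch of paired embeddings. Below is a minimal PyTorch sketch of that objective; the image and text encoders are omitted, and the fixed `temperature` is an illustrative stand-in for the learnable temperature parameter used in the paper.

```python
# Sketch of the symmetric contrastive objective behind "predicting which
# caption goes with which image". Encoders omitted; fixed temperature is
# a stand-in for the paper's learnable parameter.
import torch
import torch.nn.functional as F

def contrastive_loss(image_embeds: torch.Tensor,
                     text_embeds: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    # Row i of each tensor comes from the i-th (image, text) pair in the batch.
    image_embeds = F.normalize(image_embeds, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)

    # Pairwise cosine similarities, scaled by the temperature.
    logits = image_embeds @ text_embeds.t() / temperature

    # The matching pair sits on the diagonal: image i <-> text i.
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_image_to_text = F.cross_entropy(logits, targets)
    loss_text_to_image = F.cross_entropy(logits.t(), targets)
    return (loss_image_to_text + loss_text_to_image) / 2
```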
