ArXiv TLDR

Boosting Visual Instruction Tuning with Self-Supervised Guidance

arXiv: 2604.12966

Sophia Sirko-Galouchenko, Monika Wysoczanska, Andrei Bursuc, Nicolas Thome, Spyros Gidaris

cs.CV

TLDR

This paper enhances MLLMs' visual reasoning by integrating self-supervised tasks as natural language instructions during instruction tuning.

Key contributions

  • Augments visual instruction tuning with self-supervised tasks expressed as natural language instructions.
  • Reformulates classic SSL tasks (e.g., rotation prediction, color matching) into image-instruction-response triplets (see the sketch after this list).
  • Requires no human annotations, architectural changes, or additional training stages.
  • Consistently improves MLLM performance on vision-centric benchmarks with minimal added data.
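To make the triplet idea concrete, here is a minimal sketch of how one such pretext task (rotation prediction) might be recast as an instruction-tuning example. The function name, prompt wording, and output format are illustrative assumptions, not the paper's actual data pipeline.

```python
# Hypothetical sketch: turning an SSL pretext task (rotation prediction)
# into an (image, instruction, response) triplet for instruction tuning.
# Prompt wording and field names are assumptions, not the paper's exact format.
import random
from PIL import Image

ROTATIONS = [0, 90, 180, 270]

def make_rotation_triplet(image_path: str) -> dict:
    """Rotate an image by a random multiple of 90 degrees and build a
    training example whose answer requires looking at the image,
    so it cannot be solved from language priors alone."""
    angle = random.choice(ROTATIONS)
    # PIL's rotate() turns the image counterclockwise by `angle` degrees.
    image = Image.open(image_path).convert("RGB").rotate(angle, expand=True)
    return {
        "image": image,
        "instruction": "By how many degrees has this image been rotated "
                       "counterclockwise? Answer with 0, 90, 180, or 270.",
        "response": str(angle),
    }

if __name__ == "__main__":
    triplet = make_rotation_triplet("example.jpg")  # hypothetical path
    print(triplet["instruction"], "->", triplet["response"])
```

Per the abstract, only a small fraction (3-10%) of such visually grounded triplets is mixed into the standard instruction-tuning data, with no architectural changes or extra training stages.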

Why it matters

MLLMs often fail at fine-grained visual reasoning. This work shows that simple adjustments to the training data, namely mixing in self-supervised tasks phrased as instructions, can significantly boost their visual understanding. It offers a lightweight, annotation-free method to improve MLLM capabilities.

Original Abstract

Multimodal large language models (MLLMs) perform well on many vision-language tasks but often struggle with vision-centric problems that require fine-grained visual reasoning. Recent evidence suggests that this limitation arises not from weak visual representations, but from under-utilization of visual information during instruction tuning, where many tasks can be partially solved using language priors alone. We propose a simple and lightweight approach that augments visual instruction tuning with a small number of visually grounded self-supervised tasks expressed as natural language instructions. By reformulating classical self-supervised pretext tasks, such as rotation prediction, color matching, and cross-view correspondence, as image-instruction-response triplets, we introduce supervision that cannot be solved without relying on visual evidence. Our approach requires no human annotations, no architectural modifications, and no additional training stages. Across multiple models, training regimes, and benchmarks, injecting only a small fraction (3-10%) of such visually grounded instructions consistently improves performance on vision-centric evaluations. Our findings highlight instruction tuning with visually grounded SSL tasks as a powerful lever for improving visual reasoning in MLLMs through simple adjustments to the training data distribution. Code available at: https://github.com/sirkosophia/V-GIFT

📬 Weekly AI Paper Digest

Get the top 10 AI/ML arXiv papers from the week — summarized, scored, and delivered to your inbox every Monday.