VisionFoundry: Teaching VLMs Visual Perception with Synthetic Images

April 10, 2026 | arXiv:2604.09531

Guanyu Zhou, Yida Yin, Wenhao Chai, Shengbang Tong, Xingyu Fu + 1 more

cs.CV, cs.AI, cs.CL

TLDR

VisionFoundry uses an LLM-driven pipeline to generate synthetic VQA data from a task name alone; training on it improves VLMs' visual perception (+7% on MMVP, +10% on CV-Bench-3D).

Key contributions

Introduces VisionFoundry, an LLM-driven pipeline for generating synthetic VQA data from task names.
Synthesizes images with T2I models and verifies image-question-answer consistency with a proprietary VLM, without reference images or human annotation.
Constructs VisionFoundry-10K, a 10k synthetic VQA dataset covering 10 visual tasks.
Improves VLM performance by +7% on MMVP and +10% on CV-Bench-3D benchmarks.
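
The generate, synthesize, and verify loop described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `Draft`, `VQASample`, and `build_samples`, the callable signatures, and the `max_attempts` cap are all hypothetical, and the LLM, T2I model, and verifier VLM are replaced by deterministic stand-in functions so the example runs without any external service.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Draft:
    """Text produced by the LLM step (hypothetical structure)."""
    question: str
    answer: str
    t2i_prompt: str


@dataclass
class VQASample:
    """One verified image-question-answer triple."""
    task: str
    question: str
    answer: str
    image: bytes


def build_samples(
    task: str,
    n_samples: int,
    draft_fn: Callable[[str, int], Draft],         # stands in for the LLM
    render_fn: Callable[[str], bytes],             # stands in for the T2I model
    verify_fn: Callable[[bytes, str, str], bool],  # stands in for the verifier VLM
    max_attempts: int = 100,                       # assumed cap, not from the paper
) -> List[VQASample]:
    """Collect up to n_samples triples that pass verification."""
    samples: List[VQASample] = []
    for attempt in range(max_attempts):
        if len(samples) >= n_samples:
            break
        draft = draft_fn(task, attempt)
        image = render_fn(draft.t2i_prompt)
        # Keep the triple only if the verifier agrees that the image
        # is consistent with the question and answer.
        if verify_fn(image, draft.question, draft.answer):
            samples.append(VQASample(task, draft.question, draft.answer, image))
    return samples


# Deterministic stand-ins so the sketch runs offline (all hypothetical).
def fake_draft(task: str, attempt: int) -> Draft:
    return Draft(
        question=f"[{task}] Which object is closer to the camera? (#{attempt})",
        answer="A" if attempt % 2 == 0 else "B",
        t2i_prompt=f"a photo illustrating {task}, variant {attempt}",
    )


def fake_render(prompt: str) -> bytes:
    # A real T2I model would return image data; here the prompt bytes suffice.
    return prompt.encode("utf-8")


def fake_verify(image: bytes, question: str, answer: str) -> bool:
    # Simulate occasional rejections: drop any variant whose number ends in 2.
    return not image.endswith(b"2")


demo = build_samples("Depth Order", 3, fake_draft, fake_render, fake_verify)
```

Passing the three model calls in as plain callables keeps the filtering logic independent of any particular LLM, T2I, or VLM API. If too many drafts are rejected before `max_attempts` is reached, this sketch returns fewer than `n_samples` triples rather than raising an error.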

Why it matters

VLMs still struggle with low-level visual perception, plausibly because natural image datasets provide limited supervision for these skills. This paper shows that targeted synthetic data, generated from a task name alone, can address these weaknesses while preserving broader capabilities. It points toward a more systematic way to train VLMs for visual understanding.

Original Abstract

Vision-language models (VLMs) still struggle with visual perception tasks such as spatial understanding and viewpoint recognition. One plausible contributing factor is that natural image datasets provide limited supervision for low-level visual skills. This motivates a practical question: can targeted synthetic supervision, generated from only a task keyword such as Depth Order, address these weaknesses? To investigate this question, we introduce VisionFoundry, a task-aware synthetic data generation pipeline that takes only the task name as input and uses large language models (LLMs) to generate questions, answers, and text-to-image (T2I) prompts, then synthesizes images with T2I models and verifies consistency with a proprietary VLM, requiring no reference images or human annotation. Using VisionFoundry, we construct VisionFoundry-10K, a synthetic visual question answering (VQA) dataset containing 10k image-question-answer triples spanning 10 tasks. Models trained on VisionFoundry-10K achieve substantial improvements on visual perception benchmarks: +7% on MMVP and +10% on CV-Bench-3D, while preserving broader capabilities and showing favorable scaling behavior as data size increases. Our results suggest that limited task-targeted supervision is an important contributor to this bottleneck and that synthetic supervision is a promising path toward more systematic training for VLMs.
