Audio-Visual Intelligence in Large Foundation Models
You Qin, Kai Liu, Shengqiong Wu, Kai Wang, Shijian Deng + 10 more
TL;DR
This survey provides the first comprehensive review of Audio-Visual Intelligence (AVI) in large foundation models, unifying tasks, methods, and challenges.
Key contributions
- Establishes a unified taxonomy for Audio-Visual Intelligence (AVI) tasks across understanding, generation, and interaction.
- Synthesizes core methodological foundations, including cross-modal fusion, large-scale pretraining, and generation techniques.
- Curates representative datasets, benchmarks, and evaluation metrics for systematic comparison across AVI task families.
- Identifies key open challenges in AVI, such as synchronization, spatial reasoning, controllability, and safety.
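Of the methodological foundations listed above, cross-modal fusion is the most concrete to illustrate. A common pattern (though not a method prescribed by this survey) is cross-attention, where tokens from one modality query tokens from the other. The sketch below is a minimal NumPy version with hypothetical names and toy dimensions, purely to show the mechanism:

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_fusion(video_tokens, audio_tokens, d_model=64, seed=0):
    """Single-head cross-attention fusion (illustrative sketch):
    video tokens act as queries; audio tokens supply keys and values.
    Projection matrices are randomly initialized here; in practice
    they would be learned during large-scale pretraining."""
    rng = np.random.default_rng(seed)
    d_v = video_tokens.shape[-1]
    d_a = audio_tokens.shape[-1]
    Wq = rng.standard_normal((d_v, d_model)) / np.sqrt(d_v)
    Wk = rng.standard_normal((d_a, d_model)) / np.sqrt(d_a)
    Wv = rng.standard_normal((d_a, d_model)) / np.sqrt(d_a)
    Q = video_tokens @ Wq                      # (T_video, d_model)
    K = audio_tokens @ Wk                      # (T_audio, d_model)
    V = audio_tokens @ Wv                      # (T_audio, d_model)
    attn = softmax(Q @ K.T / np.sqrt(d_model)) # each video token attends over audio
    return attn @ V                            # (T_video, d_model)

# Toy example: 8 video-frame tokens (32-dim), 20 audio-frame tokens (16-dim).
video = np.random.default_rng(1).standard_normal((8, 32))
audio = np.random.default_rng(2).standard_normal((20, 16))
fused = cross_modal_fusion(video, audio)
print(fused.shape)
```

The asymmetric sequence lengths (audio is typically sampled at a higher frame rate than video) are exactly what makes attention-based fusion attractive: each video token pools over however many audio tokens fall near it, without requiring the two streams to be pre-aligned.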
Why it matters
This survey consolidates the fragmented field of Audio-Visual Intelligence within large foundation models into a coherent framework. It serves as a foundational reference, enabling systematic comparison across tasks and benchmarks and accelerating future research in this rapidly expanding area of AI.
Original Abstract
Audio-Visual Intelligence (AVI) has emerged as a central frontier in artificial intelligence, bridging auditory and visual modalities to enable machines that can perceive, generate, and interact in the multimodal real world. In the era of large foundation models, joint modeling of audio and vision has become increasingly crucial, i.e., not only for understanding but also for controllable generation and reasoning across dynamic, temporally grounded signals. Recent advances, such as Meta MovieGen and Google Veo-3, highlight the growing industrial and academic focus on unified audio-vision architectures that learn from massive multimodal data. However, despite rapid progress, the literature remains fragmented, spanning diverse tasks, inconsistent taxonomies, and heterogeneous evaluation practices that impede systematic comparison and knowledge integration. This survey provides the first comprehensive review of AVI through the lens of large foundation models. We establish a unified taxonomy covering the broad landscape of AVI tasks, ranging from understanding (e.g., speech recognition, sound localization) to generation (e.g., audio-driven video synthesis, video-to-audio) and interaction (e.g., dialogue, embodied, or agentic interfaces). We synthesize methodological foundations, including modality tokenization, cross-modal fusion, autoregressive and diffusion-based generation, large-scale pretraining, instruction alignment, and preference optimization. Furthermore, we curate representative datasets, benchmarks, and evaluation metrics, offering a structured comparison across task families and identifying open challenges in synchronization, spatial reasoning, controllability, and safety. By consolidating this rapidly expanding field into a coherent framework, this survey aims to serve as a foundational reference for future research on large-scale AVI.