AVGen-Bench: A Task-Driven Benchmark for Multi-Granular Evaluation of Text-to-Audio-Video Generation

April 9, 2026 · arXiv:2604.08540

Ziwei Zhou, Zeyuan Lai, Rui Wang, Yifan Yang, Zhen Xing + 4 more

cs.CV, cs.AI, cs.CL

TLDR

AVGen-Bench introduces a new benchmark and multi-granular evaluation for Text-to-Audio-Video generation, revealing gaps in semantic reliability.

Key contributions

Introduces AVGen-Bench, a task-driven benchmark with 11 real-world categories for Text-to-Audio-Video (T2AV) generation.
Proposes a multi-granular evaluation framework using specialist models and MLLMs for comprehensive assessment.
Reveals a significant gap between aesthetic quality and semantic reliability in current T2AV models.
Highlights specific failures in text rendering, speech, physical reasoning, and musical pitch control.

Why it matters

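This paper addresses the fragmented state of evaluation in Text-to-Audio-Video generation. It identifies concrete limitations in current models, including failures in text rendering, speech coherence, physical reasoning, and musical pitch control, which can guide future research toward more semantically reliable and controllable T2AV systems. Better evaluation of this kind may in turn support progress in media creation.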

Original Abstract

Text-to-Audio-Video (T2AV) generation is rapidly becoming a core interface for media creation, yet its evaluation remains fragmented. Existing benchmarks largely assess audio and video in isolation or rely on coarse embedding similarity, failing to capture the fine-grained joint correctness required by realistic prompts. We introduce AVGen-Bench, a task-driven benchmark for T2AV generation featuring high-quality prompts across 11 real-world categories. To support comprehensive assessment, we propose a multi-granular evaluation framework that combines lightweight specialist models with Multimodal Large Language Models (MLLMs), enabling evaluation from perceptual quality to fine-grained semantic controllability. Our evaluation reveals a pronounced gap between strong audio-visual aesthetics and weak semantic reliability, including persistent failures in text rendering, speech coherence, physical reasoning, and a universal breakdown in musical pitch control. Code and benchmark resources are available at http://aka.ms/avgenbench.
