ArXiv TLDR

Don't Let the Video Speak: Audio-Contrastive Preference Optimization for Audio-Visual Language Models

2604.14129

Ami Baid, Zihui Xue, Kristen Grauman

cs.CV

TLDR

ACPO mitigates video-driven audio hallucination in AVLMs by using a dual-axis preference learning framework for faithful audio grounding.

Key contributions

  • Addresses pervasive video-driven audio hallucination in Audio-Visual Language Models (AVLMs).
  • Introduces Audio-Contrastive Preference Optimization (ACPO), a novel dual-axis preference learning framework.
  • Uses an output-contrastive objective to penalize visual descriptions masquerading as audio facts.
  • Employs an input-contrastive objective by swapping audio tracks to penalize invariant generation.
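The two objectives above can be illustrated with a small sketch. The summary does not give ACPO's actual loss, so everything here is an assumption: it borrows the standard DPO form (negative log-sigmoid of a reference-adjusted log-probability margin) for both axes, and the names `dpo_term`, `dual_axis_loss`, `beta`, `lam`, and the dictionary keys are hypothetical. The output-contrastive term prefers an audio-faithful response over a visually driven one given the true audio; the input-contrastive term prefers the same faithful response under the true audio over the swapped audio.

```python
import math

# Hypothetical keys: sequence log-probabilities of a response given (video, audio).
FAITHFUL_TRUE = "faithful_response|true_audio"
VISUAL_TRUE = "visual_response|true_audio"        # visual description posing as audio fact
FAITHFUL_SWAPPED = "faithful_response|swapped_audio"


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_term(pol_pref, pol_rej, ref_pref, ref_rej, beta):
    """Standard DPO loss for one preferred/rejected pair (assumed form, not from the paper)."""
    margin = (pol_pref - ref_pref) - (pol_rej - ref_rej)
    return -log_sigmoid(beta * margin)


def dual_axis_loss(policy, ref, beta=0.1, lam=1.0):
    """Sum of an output-contrastive and an input-contrastive term (illustrative only)."""
    # Output axis: same inputs, faithful response preferred over visually driven one.
    output_term = dpo_term(policy[FAITHFUL_TRUE], policy[VISUAL_TRUE],
                           ref[FAITHFUL_TRUE], ref[VISUAL_TRUE], beta)
    # Input axis: same response, true audio preferred over a swapped audio track.
    input_term = dpo_term(policy[FAITHFUL_TRUE], policy[FAITHFUL_SWAPPED],
                          ref[FAITHFUL_TRUE], ref[FAITHFUL_SWAPPED], beta)
    return output_term + lam * input_term


# Made-up log-probabilities, for illustration only.
ref = {FAITHFUL_TRUE: -12.0, VISUAL_TRUE: -10.0, FAITHFUL_SWAPPED: -12.0}
audio_aware = {FAITHFUL_TRUE: -8.0, VISUAL_TRUE: -14.0, FAITHFUL_SWAPPED: -15.0}

print(dual_axis_loss(ref, ref))          # policy identical to reference
print(dual_axis_loss(audio_aware, ref))  # policy that responds to the audio
```

Under these assumptions, a policy that ignores the audio track gets no margin on the input axis (its log-probability for the response is the same under true and swapped audio), so that term stays at its neutral value no matter how well the output axis is satisfied.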

Why it matters

Audio-Visual Language Models are bottlenecked by cross-modal hallucination, particularly video-driven audio hallucination, where a model reports the sounds it expects from the visuals instead of what the audio track actually contains. ACPO targets this failure directly, aiming for faithful audio grounding without compromising the model's broader multimodal capabilities.

Original Abstract

While Audio-Visual Language Models (AVLMs) have achieved remarkable progress over recent years, their reliability is bottlenecked by cross-modal hallucination. A particularly pervasive manifestation is video-driven audio hallucination: models routinely exploit visual shortcuts to hallucinate expected sounds, discarding true auditory evidence. To counteract this deeply ingrained visual dominance, we propose Audio-Contrastive Preference Optimization (ACPO). This dual-axis preference learning framework introduces an output-contrastive objective to penalize visual descriptions masquerading as audio facts, alongside an input-contrastive objective that swaps audio tracks to explicitly penalize generation invariant to the true auditory signal. Extensive experiments demonstrate that ACPO establishes highly faithful audio grounding and mitigates audio hallucination without compromising overarching multimodal capabilities.
