Don't Let the Video Speak: Audio-Contrastive Preference Optimization for Audio-Visual Language Models
Ami Baid, Zihui Xue, Kristen Grauman
TLDR
ACPO mitigates video-driven audio hallucination in AVLMs by using a dual-axis preference learning framework for faithful audio grounding.
Key contributions
- Addresses pervasive video-driven audio hallucination in Audio-Visual Language Models (AVLMs).
- Introduces Audio-Contrastive Preference Optimization (ACPO), a novel dual-axis preference learning framework.
- Uses an output-contrastive objective to penalize visual descriptions masquerading as audio facts.
- Employs an input-contrastive objective that swaps audio tracks to penalize generation that stays invariant to the true auditory signal.
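To make the two preference axes concrete, here is a minimal sketch that assumes each axis is scored with a DPO-style log-sigmoid preference term. This summary does not give the paper's actual loss, so the functional form, the function names (`preference_term`, `dual_axis_loss`), and the `beta` and `input_weight` parameters are all illustrative assumptions, not the authors' implementation. Inputs are sequence log-probabilities that a policy model and a frozen reference model would assign.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def preference_term(lp_pref, lp_rej, ref_pref, ref_rej, beta=0.1):
    """DPO-style loss for one (preferred, rejected) pair of log-probs.

    ASSUMPTION: a DPO-style form is used here for illustration only.
    """
    margin = beta * ((lp_pref - ref_pref) - (lp_rej - ref_rej))
    return -log_sigmoid(margin)


def dual_axis_loss(out_pair, in_pair, beta=0.1, input_weight=1.0):
    """Combine the two contrastive axes (hypothetical weighting).

    out_pair: same (video, audio) input, two responses.
        preferred = audio-faithful response,
        rejected  = visual description presented as an audio fact.
    in_pair: same audio-faithful response, two inputs.
        preferred = (video, true audio),
        rejected  = (video, swapped audio).
    Each pair is (policy_pref, policy_rej, ref_pref, ref_rej).
    """
    output_term = preference_term(*out_pair, beta=beta)
    input_term = preference_term(*in_pair, beta=beta)
    return output_term + input_weight * input_term


# Toy log-probs: the policy already favors the preferred side on both axes.
loss = dual_axis_loss(
    out_pair=(-4.0, -6.0, -5.0, -5.0),
    in_pair=(-4.0, -6.0, -5.0, -5.0),
)
print(f"dual-axis loss: {loss:.4f}")
```

Under this assumed form, the input-contrastive term is what discourages a model from assigning the same likelihood to a response regardless of which audio track it hears: if swapping the audio leaves the log-probability unchanged, the margin is zero and the term stays at its neutral value instead of decreasing.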
Why it matters
The reliability of Audio-Visual Language Models is bottlenecked by cross-modal hallucination, particularly video-driven audio hallucination, where a model reports the sounds it expects from the visuals and discards the actual auditory evidence. ACPO targets this failure directly, improving audio grounding without compromising broader multimodal capabilities.
Original Abstract
While Audio-Visual Language Models (AVLMs) have achieved remarkable progress over recent years, their reliability is bottlenecked by cross-modal hallucination. A particularly pervasive manifestation is video-driven audio hallucination: models routinely exploit visual shortcuts to hallucinate expected sounds, discarding true auditory evidence. To counteract this deeply ingrained visual dominance, we propose Audio-Contrastive Preference Optimization (ACPO). This dual-axis preference learning framework introduces an output-contrastive objective to penalize visual descriptions masquerading as audio facts, alongside an input-contrastive objective that swaps audio tracks to explicitly penalize generation invariant to the true auditory signal. Extensive experiments demonstrate that ACPO establishes highly faithful audio grounding and mitigates audio hallucination without compromising overarching multimodal capabilities.