AUDITA: A New Dataset to Audit Humans vs. AI Skill at Audio QA
Tasnim Kabir, Dmytro Kurdydyk, Aadi Palnitkar, Liam Dorn, Ahmed Haj Ahmed, et al.
TLDR
AUDITA is a new audio QA dataset designed to rigorously test AI's auditory reasoning against human skill, revealing significant performance gaps.
Key contributions
- Introduces AUDITA, a large-scale, real-world audio QA benchmark for robust auditory reasoning.
- Features human-authored trivia questions with challenging distractors and long-range temporal dependencies.
- Humans achieve 32.13% average accuracy, while state-of-the-art models average below 8.86%.
- Applies Item Response Theory (IRT) to analyze model deficiencies and question difficulty.
Why it matters
Existing audio QA benchmarks often allow models to succeed via shortcuts. AUDITA provides a challenging, real-world dataset to truly evaluate AI's auditory reasoning, exposing current model limitations. This highlights the need for more robust AI development in audio understanding.
Original Abstract
Existing audio question answering benchmarks largely emphasize sound event classification or caption-grounded queries, often enabling models to succeed through shortcut strategies, short-duration cues, lexical priors, dataset-specific biases, or even bypassing audio via metadata and captions rather than genuine reasoning. Thus, we present AUDITA (Audio Understanding from Diverse Internet Trivia Authors), a large-scale, real-world benchmark to rigorously evaluate audio reasoning beyond surface-level acoustic recognition. AUDITA comprises carefully curated, human-authored trivia questions grounded in real-world audio, designed to stress robust auditory reasoning through challenging distractors and long-range temporal dependencies, using probing queries that cannot be answered from isolated text or sound cues alone. Human average accuracy of 32.13% shows both the challenge of the task and meaningful comprehension of the audio. In stark contrast, state-of-the-art audio question answering models perform poorly, with average accuracy below 8.86%. Beyond raw accuracy, we apply Item Response Theory (IRT) to estimate latent proficiency and question difficulty, and to expose systematic deficiencies of the models and data.
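To make the IRT analysis concrete, here is a minimal sketch of a two-parameter-logistic (2PL) fit, the standard IRT formulation in which the probability of a correct response is a logistic function of latent proficiency minus question difficulty, scaled by a discrimination parameter. The synthetic response matrix, parameter names, and plain gradient-ascent fit below are illustrative assumptions for exposition; the paper does not specify its exact IRT model or estimation procedure.

```python
# Minimal 2PL IRT sketch (assumed setup, not AUDITA's actual pipeline):
# rows = respondents (humans or models), cols = questions; entries are 0/1 correctness.
import numpy as np

rng = np.random.default_rng(0)
n_resp, n_items = 50, 100
true_theta = rng.normal(0, 1, n_resp)      # latent proficiency per respondent
true_b = rng.normal(0, 1, n_items)         # difficulty per question
true_a = rng.lognormal(0, 0.3, n_items)    # discrimination per question

def p_correct(theta, a, b):
    """2PL probability of a correct response: sigmoid(a * (theta - b))."""
    return 1.0 / (1.0 + np.exp(-a[None, :] * (theta[:, None] - b[None, :])))

# Simulate a binary response matrix from the true parameters.
responses = (rng.random((n_resp, n_items))
             < p_correct(true_theta, true_a, true_b)).astype(float)

# Joint maximum-likelihood fit by gradient ascent on the Bernoulli log-likelihood.
theta = np.zeros(n_resp)
a = np.ones(n_items)
b = np.zeros(n_items)
lr = 0.05

for _ in range(2000):
    p = p_correct(theta, a, b)
    err = responses - p                                   # d(log-lik)/d(logit)
    theta += lr * (err * a[None, :]).sum(axis=1) / n_items
    b     -= lr * (err * a[None, :]).sum(axis=0) / n_resp
    a     += lr * (err * (theta[:, None] - b[None, :])).sum(axis=0) / n_resp
    theta -= theta.mean()                                  # fix location indeterminacy

print("correlation(recovered difficulty, true difficulty):",
      np.corrcoef(b, true_b)[0, 1].round(3))
```

Fitting proficiency and difficulty jointly like this is what lets the authors compare humans and models on a common latent scale rather than by raw accuracy alone.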