AUDITA: A New Dataset to Audit Humans vs. AI Skill at Audio QA
Tasnim Kabir, Dmytro Kurdydyk, Aadi Palnitkar, Liam Dorn, Ahmed Haj Ahmed, et al.
TLDR
AUDITA is a new audio QA dataset designed to rigorously test AI's auditory reasoning against human skill, revealing significant performance gaps.
Key contributions
- Introduces AUDITA, a large-scale, real-world audio QA benchmark for robust auditory reasoning.
- Features human-authored trivia questions with challenging distractors and long-range temporal dependencies.
- Humans achieve 32.13% average accuracy, while state-of-the-art models average below 8.86%.
- Applies Item Response Theory (IRT) to analyze model deficiencies and question difficulty.
Why it matters
Existing audio QA benchmarks often allow models to succeed via shortcuts. AUDITA provides a challenging, real-world dataset to truly evaluate AI's auditory reasoning, exposing current model limitations. This highlights the need for more robust AI development in audio understanding.
Original Abstract
Existing audio question answering benchmarks largely emphasize sound event classification or caption-grounded queries, often enabling models to succeed through shortcut strategies, short-duration cues, lexical priors, dataset-specific biases, or even bypassing audio via metadata and captions rather than genuine reasoning. Thus, we present AUDITA (Audio Understanding from Diverse Internet Trivia Authors), a large-scale, real-world benchmark to rigorously evaluate audio reasoning beyond surface-level acoustic recognition. AUDITA comprises carefully curated, human-authored trivia questions grounded in real-world audio, designed to stress robust auditory reasoning through challenging distractors and long-range temporal dependencies, using probing queries that cannot be answered from isolated text or sound cues alone. Human average accuracy of 32.13% shows both the challenge of the task and meaningful comprehension of the audio. In stark contrast, state-of-the-art audio question answering models perform poorly, with average accuracy below 8.86%. Beyond raw accuracy, we apply Item Response Theory (IRT) to estimate latent proficiency and question difficulty, and to expose systematic deficiencies of the models and data.
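To make the IRT analysis concrete, here is a minimal sketch of a two-parameter-logistic (2PL) fit, the standard IRT formulation in which the probability of a correct response is a logistic function of latent proficiency minus question difficulty, scaled by a discrimination parameter. The synthetic response matrix, parameter names, and plain gradient-ascent fit below are illustrative assumptions for exposition; the paper does not specify its exact IRT model or estimation procedure.

```python
# Minimal 2PL IRT sketch (assumed setup, not AUDITA's actual pipeline):
# rows = respondents (humans or models), cols = questions; entries are 0/1 correctness.
import numpy as np

rng = np.random.default_rng(0)
n_resp, n_items = 50, 100
true_theta = rng.normal(0, 1, n_resp)      # latent proficiency per respondent
true_b = rng.normal(0, 1, n_items)         # difficulty per question
true_a = rng.lognormal(0, 0.3, n_items)    # discrimination per question

def p_correct(theta, a, b):
    """2PL probability of a correct response: sigmoid(a * (theta - b))."""
    return 1.0 / (1.0 + np.exp(-a[None, :] * (theta[:, None] - b[None, :])))

# Simulate a binary response matrix from the true parameters.
responses = (rng.random((n_resp, n_items))
             < p_correct(true_theta, true_a, true_b)).astype(float)

# Joint maximum-likelihood fit by gradient ascent on the Bernoulli log-likelihood.
theta = np.zeros(n_resp)
a = np.ones(n_items)
b = np.zeros(n_items)
lr = 0.05

for _ in range(2000):
    p = p_correct(theta, a, b)
    err = responses - p                                   # d(log-lik)/d(logit)
    theta += lr * (err * a[None, :]).sum(axis=1) / n_items
    b     -= lr * (err * a[None, :]).sum(axis=0) / n_resp
    a     += lr * (err * (theta[:, None] - b[None, :])).sum(axis=0) / n_resp
    theta -= theta.mean()                                  # fix location indeterminacy

print("correlation(recovered difficulty, true difficulty):",
      np.corrcoef(b, true_b)[0, 1].round(3))
```

Fitting proficiency and difficulty jointly like this is what lets the authors compare humans and models on a common latent scale rather than by raw accuracy alone.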