BAGEL: Benchmarking Animal Knowledge Expertise in Language Models
Jiacheng Shen, Masato Hagiwara, Milad Alizadeh, Ellen Gilsenan-McMahon, Marius Miron + 7 more
TLDR
BAGEL is a new closed-book benchmark evaluating language models' specialized animal knowledge across diverse categories and scientific sources.
Key contributions
- Introduces BAGEL, a closed-book benchmark to evaluate language models' specialized animal knowledge.
- Constructed from diverse scientific and reference sources, including bioRxiv, Xeno-canto, and Wikipedia.
- Covers multiple aspects: taxonomy, morphology, habitat, behavior, vocalization, geographic distribution, and species interactions.
- Supports fine-grained analysis across source domains, taxonomic groups, and knowledge categories.
Why it matters
This paper addresses a gap in evaluating LLM performance on specialized animal knowledge. BAGEL provides a testbed for improving LLM reliability in biodiversity applications, helping characterize model strengths and systematic failure modes.
Original Abstract
Large language models have shown strong performance on broad-domain knowledge and reasoning benchmarks, but it remains unclear how well language models handle specialized animal-related knowledge under a unified closed-book evaluation protocol. We introduce BAGEL, a benchmark for evaluating animal knowledge expertise in language models. BAGEL is constructed from diverse scientific and reference sources, including bioRxiv, Global Biotic Interactions, Xeno-canto, and Wikipedia, using a combination of curated examples and automatically generated closed-book question-answer pairs. The benchmark covers multiple aspects of animal knowledge, including taxonomy, morphology, habitat, behavior, vocalization, geographic distribution, and species interactions. By focusing on closed-book evaluation, BAGEL measures animal-related knowledge of models without external retrieval at inference time. BAGEL further supports fine-grained analysis across source domains, taxonomic groups, and knowledge categories, enabling a more precise characterization of model strengths and systematic failure modes. Our benchmark provides a new testbed for studying domain-specific knowledge generalization in language models and for improving their reliability in biodiversity-related applications.
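To make the closed-book protocol concrete, here is a minimal evaluation sketch. The record schema, the exact-match metric, and the toy model are all assumptions for illustration; the paper does not specify BAGEL's actual data format or scoring. The key property shown is that the model function receives only the question text, with no retrieved context, and that scores are broken down per knowledge category.

```python
from collections import defaultdict

# Hypothetical closed-book QA records; this schema is an assumption,
# not BAGEL's actual format.
sample = [
    {"category": "taxonomy",
     "question": "What is the scientific name of the lion?",
     "answer": "Panthera leo"},
    {"category": "habitat",
     "question": "On which continent are wild kangaroos found?",
     "answer": "Australia"},
]

def exact_match(prediction: str, reference: str) -> bool:
    """Case-insensitive exact-match scoring (one plausible metric)."""
    return prediction.strip().lower() == reference.strip().lower()

def evaluate(model_fn, dataset):
    """Closed-book protocol: model_fn sees only the question text,
    with no external retrieval at inference time."""
    tally = defaultdict(lambda: [0, 0])  # category -> [correct, total]
    for item in dataset:
        pred = model_fn(item["question"])
        tally[item["category"]][0] += int(exact_match(pred, item["answer"]))
        tally[item["category"]][1] += 1
    # Per-category accuracy supports the fine-grained analysis
    # described in the abstract.
    return {cat: correct / total for cat, (correct, total) in tally.items()}

# Stub standing in for an actual LLM call.
def toy_model(question: str) -> str:
    return "Panthera leo" if "lion" in question else "Africa"

scores = evaluate(toy_model, sample)
# scores: {"taxonomy": 1.0, "habitat": 0.0}
```

The same loop extends to breakdowns by source domain or taxonomic group by keying the tally on any other record field.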