ArXiv TLDR

RespondeoQA: a Benchmark for Bilingual Latin-English Question Answering

2604.20738

Marisa Hudspeth, Patrick J. Burns, Brendan O'Connor

cs.CL

TLDR

RespondeoQA introduces the first bilingual Latin-English QA benchmark with 7,800 pairs, revealing LLM weaknesses in skill-based Latin tasks.

Key contributions

  • Introduces RespondeoQA, the first bilingual Latin-English QA benchmark with ~7,800 diverse question-answer pairs.
  • Questions sourced from historical and modern Latin pedagogical materials, covering knowledge- and skill-based questions, multihop reasoning, constrained translation, and mixed language pairs.
  • Evaluates LLaMa 3, Qwen QwQ, and OpenAI's o3-mini, finding that all perform worse on skill-oriented Latin questions.
  • Highlights LLM limitations in specialized linguistic domains, with potential for adaptation to other languages.

Why it matters

This paper provides the first dedicated benchmark for Latin-English question answering, a crucial step for evaluating LLMs in specialized linguistic and cultural contexts. It reveals that current models struggle with skill-based Latin tasks, which can guide future research in multilingual AI development. The dataset-creation method is also adaptable to other low-resource languages.

Original Abstract

We introduce a benchmark dataset for question answering and translation in bilingual Latin and English settings, containing about 7,800 question-answer pairs. The questions are drawn from Latin pedagogical sources, including exams, quizbowl-style trivia, and textbooks ranging from the 1800s to the present. After automated extraction, cleaning, and manual review, the dataset covers a diverse range of question types: knowledge- and skill-based, multihop reasoning, constrained translation, and mixed language pairs. To our knowledge, this is the first QA benchmark centered on Latin. As a case study, we evaluate three large language models -- LLaMa 3, Qwen QwQ, and OpenAI's o3-mini -- finding that all perform worse on skill-oriented questions. Although the reasoning models perform better on scansion and literary-device tasks, they offer limited improvement overall. QwQ performs slightly better on questions asked in Latin, but LLaMa3 and o3-mini are more task dependent. This dataset provides a new resource for assessing model capabilities in a specialized linguistic and cultural domain, and the creation process can be easily adapted for other languages. The dataset is available at: https://github.com/slanglab/RespondeoQA
