ArXiv TLDR

RespondeoQA: a Benchmark for Bilingual Latin-English Question Answering

2604.20738

Marisa Hudspeth, Patrick J. Burns, Brendan O'Connor

cs.CL

TLDR

RespondeoQA introduces the first bilingual Latin-English QA benchmark with 7,800 pairs, revealing LLM weaknesses in skill-based Latin tasks.

Key contributions

  • Introduces RespondeoQA, the first bilingual Latin-English QA benchmark with ~7,800 diverse question-answer pairs.
  • Questions sourced from historical and modern Latin pedagogical materials, covering knowledge- and skill-based questions, multihop reasoning, constrained translation, and mixed language pairs.
  • Evaluates LLaMa 3, Qwen QwQ, and OpenAI's o3-mini, finding that all perform worse on skill-oriented Latin questions.
  • Highlights LLM limitations in specialized linguistic domains, with potential for adaptation to other languages.

Why it matters

This paper provides the first dedicated benchmark for Latin-English question answering, a crucial step for evaluating LLMs in specialized linguistic and cultural contexts. It reveals that current models struggle with skill-based Latin tasks, which can guide future research in multilingual AI development. The dataset-creation method is also adaptable to other low-resource languages.

Original Abstract

We introduce a benchmark dataset for question answering and translation in bilingual Latin and English settings, containing about 7,800 question-answer pairs. The questions are drawn from Latin pedagogical sources, including exams, quizbowl-style trivia, and textbooks ranging from the 1800s to the present. After automated extraction, cleaning, and manual review, the dataset covers a diverse range of question types: knowledge- and skill-based, multihop reasoning, constrained translation, and mixed language pairs. To our knowledge, this is the first QA benchmark centered on Latin. As a case study, we evaluate three large language models -- LLaMa 3, Qwen QwQ, and OpenAI's o3-mini -- finding that all perform worse on skill-oriented questions. Although the reasoning models perform better on scansion and literary-device tasks, they offer limited improvement overall. QwQ performs slightly better on questions asked in Latin, but LLaMa3 and o3-mini are more task dependent. This dataset provides a new resource for assessing model capabilities in a specialized linguistic and cultural domain, and the creation process can be easily adapted for other languages. The dataset is available at: https://github.com/slanglab/RespondeoQA
