ArXiv TLDR

DoGMaTiQ: Automated Generation of Question-and-Answer Nuggets for Report Evaluation

arXiv: 2605.04458

Bryan Li, William Walden, Yu Hou, Gabrielle Kaili-May Liu, Dawn Lawrie + 4 more

cs.CL, cs.IR

TLDR

DoGMaTiQ automates the generation of high-quality, QA-based "nuggets" for evaluating RAG-generated reports, achieving strong rank correlations with human judgments.

Key contributions

  • Introduces DoGMaTiQ, a three-stage pipeline for automated QA nugget generation (a minimal sketch follows this list).
  • Integrates with AutoArgue for fully automatic report evaluation.
  • Achieves strong rank correlations with human judgments on cross-lingual TREC tasks.
  • Analysis shows that a strong LLM nugget generator is key and that the induced system rankings are robust to outlier systems.
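
To make the three stages concrete, here is a minimal, self-contained Python sketch. Everything in it (the Nugget structure, the function names, the duplicate-merging stand-in for paraphrase clustering, and the answer-count stand-in for the quality criteria) is an illustrative assumption rather than the paper's implementation; the released code lives at https://github.com/manestay/dogmatiq.

```python
# Illustrative sketch of a three-stage QA-nugget pipeline in the spirit of
# DoGMaTiQ. All names, placeholders, and heuristics below are assumptions,
# not the paper's actual implementation.
from dataclasses import dataclass


@dataclass
class Nugget:
    question: str       # the information need, decoupled from its answers
    answers: list[str]  # attested content that satisfies the question
    source_doc: str     # document the nugget is grounded in


def generate_nuggets(docs: list[str], topic: str) -> list[Nugget]:
    """Stage 1: document-grounded generation. A real system prompts an LLM
    per document; this placeholder just derives one trivial nugget each."""
    return [Nugget(question=f"What does the document report about {topic}?",
                   answers=[doc[:80]], source_doc=doc)
            for doc in docs]


def cluster_paraphrases(nuggets: list[Nugget]) -> list[Nugget]:
    """Stage 2: merge nuggets whose questions are paraphrases. A real system
    would use embedding similarity; this sketch merges exact duplicates only."""
    merged: dict[str, Nugget] = {}
    for n in nuggets:
        key = n.question.lower().strip()
        if key in merged:
            merged[key].answers.extend(
                a for a in n.answers if a not in merged[key].answers)
        else:
            merged[key] = Nugget(n.question, list(n.answers), n.source_doc)
    return list(merged.values())


def subselect(nuggets: list[Nugget], k: int = 20) -> list[Nugget]:
    """Stage 3: keep the k nuggets scoring highest on quality criteria.
    Ranking by answer count is a stand-in for the paper's principled criteria."""
    return sorted(nuggets, key=lambda n: len(n.answers), reverse=True)[:k]


if __name__ == "__main__":
    docs = ["Doc A text ...", "Doc B text ..."]
    for n in subselect(cluster_paraphrases(generate_nuggets(docs, "flood relief"))):
        print(n.question, "->", n.answers)
```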

Why it matters

Evaluating RAG-generated reports is challenging, especially in cross-lingual settings, because evaluation nuggets have traditionally required laborious manual curation for each topic. DoGMaTiQ offers an automated, scalable alternative for generating these QA-based nuggets, significantly reducing manual effort and enabling more efficient and reliable assessment of RAG systems.

Original Abstract

Evaluation of long-form, citation-backed reports has lately received significant attention due to the wide-scale adoption of retrieval-augmented generation (RAG) systems. Core to many evaluation frameworks is the use of atomic facts, or nuggets, to assess a report's coverage of query-relevant information attested in the underlying collection. While nuggets have traditionally been represented as short statements, recent work has used question-answer (QA) representations, enabling fine-grained evaluations that decouple the information need (i.e. the question) from the potentially diverse content that satisfies it (i.e. its answers). A persistent challenge for nugget-based evaluation is the need to manually curate sets of nuggets for each topic in a test collection -- a laborious process that scales poorly to novel information needs. This challenge is acute in cross-lingual settings, where information is found in multilingual source documents. Accordingly, we introduce DoGMaTiQ, a pipeline for generating high-quality QA-based nugget sets in three stages: (1) document-grounded nugget generation, (2) paraphrase clustering, and (3) nugget subselection based on principled quality criteria. We integrate DoGMaTiQ nuggets with AutoArgue -- a recent nugget-based evaluation framework -- to enable fully automatic evaluation of generated reports. We conduct extensive experiments on two cross-lingual TREC shared tasks, NeuCLIR and RAGTIME, showing strong rank correlations with both human-in-the-loop and fully manual judgments. Finally, detailed analysis of our pipeline reveals that a strong LLM nugget generator is key, and that the system rankings induced by DoGMaTiQ are robust to outlier systems. We facilitate future research in report evaluation by publicly releasing our code and artifacts at https://github.com/manestay/dogmatiq.
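
For readers unfamiliar with rank-correlation meta-evaluation, the sketch below shows how agreement between automatic and human system rankings is typically computed. Kendall's tau via scipy is a common choice in TREC-style evaluations, but the specific statistic the paper uses and all scores shown here are assumptions for illustration.

```python
# Minimal sketch of the meta-evaluation step: comparing the system ranking
# induced by automatic nugget-based scores against a human ranking.
# Kendall's tau is one common rank-correlation statistic; the scores below
# are made up for illustration. Requires scipy.
from scipy.stats import kendalltau

systems = ["sys_a", "sys_b", "sys_c", "sys_d"]
auto_scores = {"sys_a": 0.71, "sys_b": 0.55, "sys_c": 0.62, "sys_d": 0.40}
human_scores = {"sys_a": 0.68, "sys_b": 0.59, "sys_c": 0.57, "sys_d": 0.35}

tau, p_value = kendalltau([auto_scores[s] for s in systems],
                          [human_scores[s] for s in systems])
print(f"Kendall's tau = {tau:.3f} (p = {p_value:.3f})")
```

A tau near 1.0 means the automatic evaluation orders systems almost exactly as humans do, which is the property the paper's "strong rank correlations" claim refers to.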
