MasterSet: A Large-Scale Benchmark for Must-Cite Citation Recommendation in the AI/ML Literature
Md Toyaha Rahman Ratul, Zhiqian Chen, Kaiqun Fu, Taoran Ji, Lei Zhang
TLDR
MasterSet is a new large-scale benchmark for identifying critical 'must-cite' papers in AI/ML literature, addressing a gap in existing citation recommendation systems.
Key contributions
- Introduces MasterSet, a large-scale benchmark for "must-cite" citation recommendation in AI/ML.
- Comprises over 150,000 papers from 15 top AI/ML venues as a comprehensive candidate pool.
- Features a three-tier citation annotation scheme covering experimental baseline status, core relevance (rated 1-5), and intra-paper mention frequency (see the sketch after this list).
- Leverages an LLM-based judge for annotation, validated by human experts on a stratified sample, and establishes retrieval baselines using sparse, dense, and graph-based methods.
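To make the three-tier scheme concrete, here is a minimal sketch of what a single annotated citation record might look like; the field names and types are illustrative assumptions, not taken from the MasterSet release.

```python
from dataclasses import dataclass

@dataclass
class CitationLabel:
    """One cited paper's annotation under the three-tier scheme (hypothetical field names)."""
    cited_paper_id: str
    is_experimental_baseline: bool  # Tier I: is the cited paper used as a direct experimental baseline?
    core_relevance: int             # Tier II: core relevance on a 1-5 scale
    mention_count: int              # Tier III: how often the citation is mentioned within the paper
```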
Why it matters
The rapid growth of AI/ML literature makes it hard for researchers to identify essential prior work. MasterSet directly tackles this by focusing on "must-cite" papers, which are crucial for accurate novelty assessment and reproducibility. This benchmark will drive the development of more effective and precise citation recommendation systems.
Original Abstract
The explosive growth of AI and machine learning literature (with venues like NeurIPS and ICLR now accepting thousands of papers annually) has made comprehensive citation coverage increasingly difficult for researchers. While citation recommendation has been studied for over a decade, existing systems primarily focus on broad relevance rather than identifying the critical set of "must-cite" papers: direct experimental baselines, foundational methods, and core dependencies whose omission would misrepresent a contribution's novelty or undermine reproducibility. We introduce MasterSet, a large-scale benchmark specifically designed to evaluate must-cite recommendation in the AI/ML domain. MasterSet incorporates over 150,000 papers collected from official conference proceedings/websites of 15 leading venues, serving as a comprehensive candidate pool for retrieval. We annotate citations with a three-tier labeling scheme: (I) experimental baseline status, (II) core relevance (1-5 scale), and (III) intra-paper mention frequency. Our annotation pipeline leverages an LLM-based judge, validated by human experts on a stratified sample. The benchmark task requires retrieving must-cite papers from the candidate pool given only a query paper's title and abstract, evaluated by Recall@K. We establish baselines using sparse retrieval, dense scientific embeddings, and graph-based methods, demonstrating that must-cite retrieval remains a challenging open problem.
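Recall@K is the standard retrieval metric the benchmark uses: for each query paper, it measures the fraction of that paper's must-cite references found among the top K retrieved candidates. A minimal sketch of how one query would be scored (paper IDs are hypothetical):

```python
def recall_at_k(retrieved_ids, must_cite_ids, k):
    """Fraction of a query's must-cite papers appearing in the top-k retrieved candidates."""
    top_k = set(retrieved_ids[:k])
    relevant = set(must_cite_ids)
    if not relevant:
        return 0.0
    return len(top_k & relevant) / len(relevant)

# Hypothetical usage: ranked candidate IDs from a retriever vs. the gold must-cite set.
ranked = ["p42", "p7", "p3", "p99", "p12"]
gold = {"p7", "p99", "p55"}
print(recall_at_k(ranked, gold, k=5))  # 2 of 3 must-cite papers retrieved -> ~0.667
```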