Multilingual and Domain-Agnostic Tip-of-the-Tongue Query Generation for Simulated Evaluation

April 22, 2026

arXiv:2604.21096

Xuhong He, To Eun Kim, Maik Fröbe, Jaime Arguello, Bhaskar Mitra + 1 more

cs.IR, cs.CL

TLDR

This paper introduces an LLM-based framework for generating multilingual Tip-of-the-Tongue (ToT) queries and uses it to build the first large-scale multilingual ToT benchmark, covering Chinese, Japanese, Korean, and English.

Key contributions

Developed an LLM-based framework to simulate Tip-of-the-Tongue (ToT) queries across multiple languages.
Constructed the first large-scale multilingual ToT test collections for Chinese, Japanese, Korean, and English.
Studied how prompt and source document languages affect simulated query fidelity, offering language-aware design guidance.
Validated the synthetic queries by measuring how closely the system rankings they produce correlate with the rankings produced by real user queries.
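
The rank-correlation validation in the last contribution can be illustrated with a minimal sketch. It assumes Kendall's tau as the correlation measure; the summary here does not say which coefficient the authors use, so that choice is an assumption. The system names and effectiveness scores are hypothetical values made up for illustration, not results from the paper.

```python
from itertools import combinations


def kendall_tau(scores_a, scores_b):
    """Kendall's tau-a between two {system: score} mappings.

    Each mapping holds one effectiveness score per retrieval system,
    measured on one query set (e.g. real vs. simulated queries).
    """
    systems = sorted(set(scores_a) & set(scores_b))
    if len(systems) < 2:
        raise ValueError("need at least two systems in common")

    concordant = discordant = 0
    for s, t in combinations(systems, 2):
        # A pair is concordant when both query sets order s and t the same way.
        product = (scores_a[s] - scores_a[t]) * (scores_b[s] - scores_b[t])
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1

    n_pairs = len(systems) * (len(systems) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical scores for four hypothetical systems (illustration only).
real_scores = {"bm25": 0.12, "dense": 0.31, "hybrid": 0.35, "reranker": 0.42}
synthetic_scores = {"bm25": 0.20, "dense": 0.38, "hybrid": 0.36, "reranker": 0.51}

tau = kendall_tau(real_scores, synthetic_scores)
print(f"Kendall's tau: {tau:.3f}")
```

A value near 1 would mean the simulated queries rank systems in nearly the same order as the real queries do, which is the sense in which synthetic queries are treated as a reliable stand-in.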

Why it matters

Tip-of-the-Tongue retrieval benchmarks have been largely English-centric, which limits their use for multilingual information access. This work addresses that gap: it provides the first large-scale multilingual ToT dataset and practical guidance for building realistic ToT collections beyond English, enabling more inclusive and robust information access research.

Original Abstract

Tip-of-the-Tongue (ToT) retrieval benchmarks have largely focused on English, limiting their applicability to multilingual information access. In this work, we construct multilingual ToT test collections for Chinese, Japanese, Korean, and English, using an LLM-based query simulation framework. We systematically study how prompt language and source document language affect the fidelity of simulated ToT queries, validating synthetic queries through system rank correlation against real user queries. Our results show that effective ToT simulation requires language-aware design choices: non-English language sources are generally important, while English Wikipedia can be beneficial when non-English sources provide insufficient information for query generation. Based on these findings, we release four ToT test collections with 5,000 queries per language across multiple domains. This work provides the first large-scale multilingual ToT benchmark and offers practical guidance for constructing realistic ToT datasets beyond English.
