ArXiv TLDR

TRACE: Tourism Recommendation with Accountable Citation Evidence

arXiv: 2605.07677

Zixu Zhao, Sijin Wang, Yu Hou, Yuanyuan Xu, Yufan Sheng + 4 more

cs.IR · cs.AI · cs.CL

TLDR

TRACE introduces a new dataset and benchmark for conversational tourism recommender systems, focusing on verifiable evidence and rejection recovery.

Key contributions

  • New TRACE dataset: 10,000 multi-turn tourism dialogues with review-span citations and explicit rejection turns, covering 2,400 Yelp POIs and 34,208 reviews across eight U.S. cities.
  • Benchmarks 14 retrieval, planning, and LLM baselines using 25 metrics organized under Accuracy, Grounding, and Recovery.
  • Identifies a "Three-Competency Gap": no current baseline is simultaneously accurate, well-grounded, and able to recover from mid-dialogue rejection.
  • Proposes a Grounding Score that agrees strongly with human citation precision (Spearman rho=+0.80, p<10^-20); a sketch of this check follows the list.
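
This validation is, at its core, a rank correlation between the automatic score and human judgments. Below is a minimal sketch of that check using scipy; the per-baseline numbers are invented for illustration and do not come from the paper:

```python
# Hypothetical illustration: comparing an automatic grounding score
# against human citation-precision judgments via Spearman rank correlation.
from scipy.stats import spearmanr

# Automatic Grounding Scores for five baselines (made-up values)
grounding_scores = [0.81, 0.42, 0.67, 0.55, 0.73]
# Human citation-precision judgments for the same baselines (made-up values)
human_precision = [0.78, 0.39, 0.70, 0.51, 0.69]

rho, p_value = spearmanr(grounding_scores, human_precision)
print(f"Spearman rho = {rho:+.2f}, p = {p_value:.3g}")
# TRACE reports rho = +0.80 (p < 10^-20) over its full annotation set.
```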

Why it matters

Tourism recommendations are high-stakes, requiring trustworthy, verifiable, and adaptive systems. Existing benchmarks lack multi-turn evaluation with evidence and rejection recovery. TRACE fills this gap by reframing accountable tourism recommendation as a joint target of accuracy, verifiable evidence, and adaptive repair.

Original Abstract

Tourism is a high-stakes setting for conversational recommender systems (CRS): a plausible-sounding suggestion can waste real money and trip time once a traveler acts on it. Existing CRS benchmarks primarily evaluate systems with a single Recall@k score over entity mentions, and tourism-specific resources add spatial or knowledge-graph context, yet none of them couple multi-turn recommendation with verbatim review-span evidence and rejection recovery. This leaves an evaluation gap for tourism recommendation that is simultaneously trustworthy, verifiable, and adaptive: recommend the right point of interest (POI) for multi-aspect preferences (such as cuisine, price, atmosphere, walking distance), justify each suggestion with verifiable evidence from prior visitors so the traveler can act without trial and error, and recover when the first recommendation is rejected mid-dialogue. We introduce TRACE, where each item is a multi-turn tourism recommendation dialogue with review-span citations and explicit rejection turns: 10,000 dialogues over 2,400 Yelp POIs and 34,208 reviews across eight U.S. cities, paired with 14 retrieval, planning, and LLM baselines, along with 25 metrics organized under Accuracy, Grounding, and Recovery. Across these baselines, TRACE reveals the Three-Competency Gap: LLM Zero-Shot leads in closed-set Recall@1 and rejection recovery but cites less densely than retrievers; non-LLM retrievers achieve surface-verbatim grounding but with low accuracy; Multi-Review Synthesis fails at recovery. The Grounding Score agrees with human citation precision (Spearman rho=+0.80, p<10^-20), and paired t-tests reproduce the per-baseline ranking (p<0.01 on the dominant contrasts). TRACE reframes accountable tourism recommendation as a joint target (right POI, verifiable evidence, adaptive repair) rather than a single-axis leaderboard.
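
For context on the single-axis evaluation the abstract critiques, here is a generic Recall@k sketch over entity mentions. This is not the paper's evaluation code, and the toy predictions are invented:

```python
# Generic Recall@k: fraction of dialogues where the ground-truth POI
# appears among the system's top-k recommendations.
def recall_at_k(ranked_lists, gold_items, k=1):
    hits = sum(gold in ranked[:k] for ranked, gold in zip(ranked_lists, gold_items))
    return hits / len(gold_items)

# Invented toy data: top-3 POI predictions for three dialogues.
preds = [["cafe_a", "bar_b", "museum_c"],
         ["bar_b", "cafe_a", "park_d"],
         ["park_d", "museum_c", "cafe_a"]]
gold = ["cafe_a", "cafe_a", "park_d"]

print(recall_at_k(preds, gold, k=1))  # 2/3 ≈ 0.67
```

TRACE's argument is that this single number hides both grounding quality and recovery behavior, which is why it pairs Accuracy metrics like this one with separate Grounding and Recovery metric families.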
