TopBench: A Benchmark for Implicit Prediction and Reasoning over Tabular Question Answering

2604.28076

An-Yang Ji, Jun-Peng Jiang, De-Chuan Zhan, Han-Jia Ye

cs.CL, cs.AI, cs.LG

TLDR

TopBench is a new benchmark for evaluating LLMs on implicitly predictive tabular question answering. It shows that current models struggle to recognize predictive intent and often fall back on simple lookups.

Key contributions

  • Introduces TopBench, a benchmark for implicit prediction and reasoning in Tabular QA.
  • Comprises 779 samples across four sub-tasks: single-point prediction, decision making, treatment effect analysis, and complex filtering.
  • Evaluates LLMs on recognizing latent intent and performing predictive reasoning over tables.
  • Shows current LLMs struggle with intent recognition, often defaulting to simple data lookups.

Why it matters

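Many real-world table queries are implicitly predictive: the answer is not stored in the table and must be inferred from historical patterns. Most Table QA queries, by contrast, can be answered by extraction or simple aggregation, so TopBench fills this gap with a benchmark for intent recognition and predictive reasoning. Its results show that current LLMs often default to lookups, which points future research toward better intent disambiguation and stronger predictive modeling.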
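To make the lookup-versus-prediction distinction concrete, here is a minimal Python sketch. Everything in it is an illustrative assumption: the toy table, the function names (`lookup`, `fit_line`, `predict`, `answer`), and the choice of a least-squares line are hypothetical and are not taken from TopBench or from the paper's method.

```python
# Hypothetical toy table of (month, units_sold). Not TopBench data.
TABLE = [(1, 100.0), (2, 110.0), (3, 120.0), (4, 130.0), (5, 140.0), (6, 150.0)]


def lookup(month):
    """Retrieval: return the recorded value, or None if no row matches."""
    for m, units in TABLE:
        if m == month:
            return units
    return None


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def predict(month):
    """Prediction: extrapolate the historical trend to an unobserved month."""
    xs = [m for m, _ in TABLE]
    ys = [u for _, u in TABLE]
    slope, intercept = fit_line(xs, ys)
    return slope * month + intercept


def answer(month):
    """Use the stored value when it exists; otherwise treat the query as predictive."""
    found = lookup(month)
    return found if found is not None else predict(month)
```

A question about month 3 is answered by retrieval, while a question about month 8 has no matching row: a lookup-only system returns nothing useful, whereas `answer(8)` extrapolates from the recorded months. A linear trend is only the simplest stand-in here for the predictive reasoning the benchmark targets.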

Original Abstract

Large Language Models (LLMs) have advanced Table Question Answering, where most queries can be answered by extracting information or simple aggregation. However, a common class of real-world queries is implicitly predictive, requiring the inference of unobserved answers from historical patterns rather than mere retrieval. These queries introduce two challenges: recognizing latent intent and reliable predictive reasoning over massive tables. To assess LLMs in such Tabular questiOn answering with implicit Prediction tasks, we introduce TopBench, a benchmark consisting of 779 samples across four sub-tasks, ranging from single-point prediction to decision making, treatment effect analysis, and complex filtering, requiring models to generate outputs spanning reasoning text and structured tables. We evaluate diverse models under both text-based and agentic workflows. Experiments reveal that current models often struggle with intent recognition, defaulting to just lookups. Deeper analysis identifies that accurate intent disambiguation serves as the prerequisite for leading these predictive behaviors. Furthermore, elevating the upper bound of prediction precision requires the integration of more sophisticated modeling or reasoning capabilities.
