
FollowTable: A Benchmark for Instruction-Following Table Retrieval

2605.00400

Rihui Jin, Yuchen Lu, Ting Zhang, Jun Wang, Kuicai Dong + 5 more

cs.IR, cs.CL

TLDR

FollowTable introduces a new benchmark and metric for Instruction-Following Table Retrieval (IFTR), revealing that existing retrieval models struggle with fine-grained instructions.

Key contributions

  • Formalizes Instruction-Following Table Retrieval (IFTR) for LLM agents, beyond topical similarity.
  • Introduces FollowTable, the first large-scale benchmark for evaluating IFTR capabilities.
  • Proposes the Instruction Responsiveness Score, a new metric for instruction adherence in table retrieval (TR).
  • Reveals existing models struggle with fine-grained and schema-grounded table instructions.
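
To make the gap between topical similarity and instruction following concrete, here is a minimal toy sketch in Python. It is not the paper's method or data: the `Table` and `Instruction` classes, the token-overlap scorer, and the three example tables are all hypothetical, chosen only to illustrate one content-exclusion constraint and one schema constraint.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Table:
    title: str
    columns: list[str]


@dataclass
class Instruction:
    # Hypothetical constraint types, loosely mirroring the two challenge
    # families named in the summary (content scope and schema).
    required_columns: set[str] = field(default_factory=set)  # schema constraint
    excluded_terms: set[str] = field(default_factory=set)    # content exclusion


def topical_score(query: str, table: Table) -> float:
    """Crude token-overlap stand-in for a topic-only retriever."""
    q = set(query.lower().split())
    t = set(table.title.lower().split())
    return len(q & t) / len(q) if q else 0.0


def satisfies(instr: Instruction, table: Table) -> bool:
    """True if the table meets every constraint in the instruction."""
    cols = {c.lower() for c in table.columns}
    if not {c.lower() for c in instr.required_columns} <= cols:
        return False
    title = table.title.lower()
    return not any(term.lower() in title for term in instr.excluded_terms)


def retrieve(query: str, instr: Instruction, tables: list[Table]) -> list[Table]:
    """Keep only instruction-compliant tables, then rank them by topic."""
    compliant = [t for t in tables if satisfies(instr, t)]
    return sorted(compliant, key=lambda t: topical_score(query, t), reverse=True)


TABLES = [
    Table("Olympic medal counts by country", ["country", "gold", "silver", "bronze"]),
    Table("Olympic medal counts by athlete", ["athlete", "country", "gold"]),
    Table("Winter Olympic medal counts by country", ["country", "total"]),
]
QUERY = "olympic medal counts"
INSTR = Instruction(required_columns={"gold"}, excluded_terms={"athlete"})

topic_only = [topical_score(QUERY, t) for t in TABLES]
followed = [t.title for t in retrieve(QUERY, INSTR, TABLES)]
```

In this toy setup all three tables receive the same topical score, so a topic-only ranker cannot tell them apart, while only the first table satisfies both constraints.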

Why it matters

As LLM-based agents increasingly interact with structured data, instruction-following table retrieval becomes crucial. This paper defines the task, provides its first benchmark, and proposes a new metric. It also highlights significant limitations in current models, pointing future research toward more robust, instruction-aware table retrieval systems.

Original Abstract

Table Retrieval (TR) has traditionally been formulated as an ad-hoc retrieval problem, where relevance is primarily determined by topical semantic similarity. With the growing adoption of LLM-based agentic systems, access to structured data is increasingly instruction-driven, where relevance is conditional on explicit content and schema constraints rather than topical similarity alone. We therefore formalize Instruction-Following Table Retrieval (IFTR), a new task that requires models to jointly satisfy topical relevance and fine-grained instruction constraints. We identify two core challenges in IFTR: (i) sensitivity to content scope, such as inclusion and exclusion constraints, and (ii) awareness of schema-grounded requirements, including column semantics and representation granularity--capabilities largely absent in existing retrievers. To support systematic evaluation, we introduce FollowTable, the first large-scale benchmark for IFTR, constructed via a taxonomy-driven annotation pipeline. We further propose a new metric, termed the Instruction Responsiveness Score, to evaluate whether retrieval rankings consistently adapt to user instructions relative to a topic-only baseline. Our results indicate that existing retrieval models struggle to follow fine-grained instructions over tabular data. In particular, they exhibit systematic biases toward surface-level semantic cues and remain limited in handling schema-grounded constraints, highlighting substantial room for future improvements.
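
The abstract describes the Instruction Responsiveness Score only at a high level: it checks whether rankings adapt to instructions relative to a topic-only baseline. The paper's actual formula is not given here, so the sketch below is an assumed stand-in rather than the metric itself. The hypothetical function `toy_responsiveness` simply measures how far instruction-compliant tables move up between a topic-only ranking and an instructed ranking.

```python
from __future__ import annotations


def toy_responsiveness(
    baseline: list[str], instructed: list[str], compliant: set[str]
) -> float:
    """Mean normalized rank improvement of compliant tables.

    An illustrative assumption, NOT the paper's Instruction
    Responsiveness Score. Returns a value in [-1, 1]: positive when
    compliant tables move up under the instruction, 0 when nothing moves.
    """
    if set(baseline) != set(instructed) or len(baseline) != len(instructed):
        raise ValueError("both rankings must contain the same table ids")
    if not compliant <= set(baseline):
        raise ValueError("compliant ids must appear in the rankings")
    n = len(baseline)
    if n < 2 or not compliant:
        return 0.0
    base_pos = {doc: i for i, doc in enumerate(baseline)}
    inst_pos = {doc: i for i, doc in enumerate(instructed)}
    # Positive shift means the table sits closer to the top with the instruction.
    shifts = [(base_pos[d] - inst_pos[d]) / (n - 1) for d in compliant]
    return sum(shifts) / len(shifts)


baseline = ["t1", "t2", "t3", "t4"]    # ranking from the topic-only query
instructed = ["t3", "t1", "t2", "t4"]  # ranking from query plus instruction
moved = toy_responsiveness(baseline, instructed, {"t3"})
unmoved = toy_responsiveness(baseline, baseline, {"t3"})
```

A retriever that ignores the instruction returns the same ranking both times and scores 0 under this toy measure, which is the kind of non-adaptive behavior a responsiveness metric is meant to expose.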
