CA-SQL: Complexity-Aware Inference Time Reasoning for Text-to-SQL via Exploration and Compute Budget Allocation
TLDR
CA-SQL improves Text-to-SQL performance on challenging tasks by scaling solution exploration with estimated task difficulty, seeding prompts with an evolutionary-search-inspired method, and selecting the final query with a novel voting scheme.
Key contributions
- Dynamically scales solution exploration based on estimated task difficulty.
- Utilizes custom prompt seeding, inspired by evolutionary search, to enhance LLM exploration.
- Introduces a novel voting method to select the optimal SQL query from generated candidates.
- Achieves a state-of-the-art score of 51.72% on the "challenging" tier of the BIRD development set using only GPT-4o-mini.
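The first and third contributions can be illustrated with a minimal sketch. The budget function, scale factors, and execution-based voting below are illustrative assumptions, not the paper's actual algorithm: CA-SQL's voting method is described as novel, so a standard self-consistency baseline (grouping candidates by execution result and picking the largest group) stands in for it here, and the candidate list is hard-coded where a real pipeline would sample it from an LLM.

```python
import sqlite3


def allocate_budget(difficulty: str, base: int = 4) -> int:
    """Scale the number of candidate queries with estimated task
    difficulty (a stand-in for CA-SQL's dynamic exploration breadth;
    the tier names and multipliers are illustrative)."""
    scale = {"simple": 1, "moderate": 2, "challenging": 4}
    return base * scale.get(difficulty, 1)


def execution_vote(candidates, conn):
    """Group candidate SQL strings by their execution result and return
    one query from the most common group. This is a generic
    self-consistency baseline, not the paper's voting method."""
    buckets = {}
    for sql in candidates:
        try:
            rows = tuple(map(tuple, conn.execute(sql).fetchall()))
        except sqlite3.Error:
            continue  # candidates that fail to execute get no vote
        buckets.setdefault(rows, []).append(sql)
    if not buckets:
        return None
    return max(buckets.values(), key=len)[0]


# Toy database and hard-coded "LLM candidates" for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t(x INT)")
conn.executemany("INSERT INTO t VALUES (?)", [(1,), (2,), (3,)])
candidates = [
    "SELECT COUNT(*) FROM t",
    "SELECT COUNT(*) FROM t",
    "SELECT x FROM t WHERE x = 1",
    "SELECT bogus FROM nope",  # invalid: silently discarded
]
print(allocate_budget("challenging"))     # 16 candidates for hard tasks
print(execution_vote(candidates, conn))   # SELECT COUNT(*) FROM t
```

The key design point is that both error-prone candidates (discarded on execution failure) and semantically equivalent ones (bucketed by result) are handled before selection, so spending a larger budget on hard tasks raises the chance that the correct result forms the largest group.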
Why it matters
Current LLMs struggle with complex Text-to-SQL tasks due to limited solution exploration. CA-SQL addresses this by intelligently expanding the search space. Its state-of-the-art performance on challenging benchmarks, even with smaller models, demonstrates a significant leap in efficiency and accuracy for database interaction.
Original Abstract
While recent advancements in inference-time learning have improved LLM reasoning on Text-to-SQL tasks, current solutions still struggle to perform well on the most challenging tasks in the Bird-Bench (BIRD) benchmark. This is due to inadequate solution space exploration, which is necessary to uncover promising candidate queries that can be further refined to produce the correct output. To address this challenge, we introduce CA-SQL, a novel Text-to-SQL pipeline that utilizes the estimated difficulty of a task to dynamically scale the breadth of the exploration for generating solution candidates. In addition, we use a custom prompt seeding method, based on principles of evolutionary search, to further elicit exploratory behavior from the base LLM and a novel voting method to select the best candidate solution at the end of the search. Experiments demonstrate that our solution achieves a state-of-the-art score of 51.72% on the "challenging" tier of BIRD development set problems, using only GPT-4o-mini, out-performing other in-context learning approaches, even those that leverage larger models. Overall, our method attains a competitive 61.06% execution accuracy and 68.77% Soft F1 score on the BIRD development dataset.