Agentic Jackal: Live Execution and Semantic Value Grounding for Text-to-JQL
Vishnu Murali, Anmol Gulati, Elias Lumer, Kevin Frank, Sindy Campagna, et al.
TLDR
This paper introduces Jackal, an execution-based text-to-JQL benchmark, and Agentic Jackal, a tool-augmented LLM agent that improves query accuracy by up to 9.0% (relative).
Key contributions
- Introduces Jackal, the first large-scale, execution-based text-to-JQL benchmark with 100K NL-JQL pairs.
- Proposes Agentic Jackal, a tool-augmented LLM agent for live JQL execution and semantic value grounding.
- JiraAnchor, a semantic retrieval tool, boosts categorical-value accuracy from 48.7% to 71.7%.
Why it matters
Translating natural language to JQL is hard: field references are ambiguous and valid categorical values vary by Jira instance. This paper introduces the first execution-based benchmark, Jackal, and an agentic solution that substantially improves LLM accuracy. It identifies inherent semantic ambiguities, rather than value-resolution errors, as the dominant failure modes, guiding future research in robust NL-to-code generation.
Original Abstract
Translating natural language into Jira Query Language (JQL) requires resolving ambiguous field references, instance-specific categorical values, and complex Boolean predicates. Single-pass LLMs cannot discover which categorical values (e.g., component names or fix versions) actually exist in a given Jira instance, nor can they verify generated queries against a live data source, limiting accuracy on paraphrased or ambiguous requests. No open, execution-based benchmark exists for mapping natural language to JQL. We introduce Jackal, the first large-scale, execution-based text-to-JQL benchmark comprising 100,000 validated NL-JQL pairs on a live Jira instance with over 200,000 issues. To establish baselines on Jackal, we propose Agentic Jackal, a tool-augmented agent that equips LLMs with live query execution via the Jira MCP server and JiraAnchor, a semantic retrieval tool that resolves natural-language mentions of categorical values through embedding-based similarity search. Among 9 frontier LLMs evaluated, single-pass models average only 43.4% execution accuracy on short natural-language queries, highlighting that text-to-JQL remains an open challenge. The agentic approach improves 7 of 9 models, with a 9.0% relative gain on the most linguistically challenging variant; in a controlled ablation isolating JiraAnchor, categorical-value accuracy rises from 48.7% to 71.7%, with component-field accuracy jumping from 16.9% to 66.2%. Our analysis identifies inherent semantic ambiguities, such as issue-type disambiguation and text-field selection, as the dominant failure modes rather than value-resolution errors, pointing to concrete directions for future work. We publicly release the benchmark, all agent transcripts, and evaluation code to support reproducibility.
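The value-grounding idea behind JiraAnchor can be illustrated with a minimal sketch: embed both the user's mention and the categorical values that actually exist in the Jira instance, then pick the nearest value by cosine similarity. The toy character-trigram "embedding" below is a stand-in for the learned embedding model the paper describes, and the component names are hypothetical; this is an illustration of the retrieval pattern, not the paper's implementation.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy character-trigram embedding; a real system would use a
    # learned sentence-embedding model instead.
    t = f"  {text.lower()}  "
    return Counter(t[i:i + 3] for i in range(len(t) - 2))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[k] * b[k] for k in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def resolve_value(mention: str, known_values: list[str]) -> str:
    # Ground a natural-language mention to the closest categorical
    # value that actually exists in the Jira instance.
    m = embed(mention)
    return max(known_values, key=lambda v: cosine(m, embed(v)))

# Hypothetical component names for illustration.
components = ["Mobile App", "Payments Backend", "Auth Service"]
print(resolve_value("the payment backend", components))  # Payments Backend
```

The resolved value can then be substituted into the generated JQL (e.g. `component = "Payments Backend"`), so the query only ever references values the live instance will accept.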