ArXiv TLDR

Evaluating the Formal Reasoning Capabilities of Large Language Models through Chomsky Hierarchy

arXiv:2604.02709

Yihong Dong, Xiaoha Jian, Xue Jiang, Xuyuan Guo, Zhiyuan Fan + 5 more

cs.CL · cs.AI · cs.LG · cs.SE

TLDR

ChomskyBench evaluates LLM formal reasoning across the Chomsky Hierarchy, revealing performance stratification and severe efficiency barriers for complex tasks.

Key contributions

  • Introduces ChomskyBench, a new benchmark for LLM formal reasoning across the full Chomsky Hierarchy.
  • ChomskyBench uses process-trace evaluation via natural language and deterministic symbolic verifiability.
  • Experiments show LLM performance stratifies with task complexity, revealing clear limitations at higher levels.
  • Finds that LLMs face severe efficiency barriers on complex formal tasks: reaching practical reliability would require prohibitive computational cost.
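To make the hierarchy concrete, here is a minimal sketch of membership tests for one canonical language at each of the lower Chomsky levels — the kind of recognition task the benchmark stratifies by. These specific languages and function names are illustrative assumptions, not taken from ChomskyBench itself.

```python
# Illustrative membership tests for canonical languages at three
# Chomsky levels (example languages are assumptions, not ChomskyBench tasks).
import re

def regular_ok(s: str) -> bool:
    # Type-3 (regular): a*b* — recognizable by a finite automaton.
    return re.fullmatch(r"a*b*", s) is not None

def context_free_ok(s: str) -> bool:
    # Type-2 (context-free): a^n b^n — requires a stack (pushdown automaton).
    n = len(s) // 2
    return len(s) % 2 == 0 and s == "a" * n + "b" * n

def context_sensitive_ok(s: str) -> bool:
    # Type-1 (context-sensitive): a^n b^n c^n — provably not context-free.
    n = len(s) // 3
    return len(s) % 3 == 0 and s == "a" * n + "b" * n + "c" * n
```

Each recognizer runs in linear time, which is the baseline the paper's time-complexity analysis compares LLM inference against.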

Why it matters

This paper fills a critical gap by systematically evaluating LLM formal reasoning through computation theory. It delineates the practical limits of current LLMs and highlights their inefficiency relative to traditional algorithms. These insights help guide future LLM development and explain why traditional software tools remain indispensable.

Original Abstract

The formal reasoning capabilities of LLMs are crucial for advancing automated software engineering. However, existing benchmarks for LLMs lack systematic evaluation based on computation and complexity, leaving a critical gap in understanding their formal reasoning capabilities. Therefore, it is still unknown whether SOTA LLMs can grasp the structured, hierarchical complexity of formal languages as defined by Computation Theory. To address this, we introduce ChomskyBench, a benchmark for systematically evaluating LLMs through the lens of Chomsky Hierarchy. Unlike prior work that uses vectorized classification for neural networks, ChomskyBench is the first to combine full Chomsky Hierarchy coverage, process-trace evaluation via natural language, and deterministic symbolic verifiability. ChomskyBench is composed of a comprehensive suite of language recognition and generation tasks designed to test capabilities at each level. Extensive experiments indicate a clear performance stratification that correlates with the hierarchy's levels of complexity. Our analysis reveals a direct relationship where increasing task difficulty substantially impacts both inference length and performance. Furthermore, we find that while larger models and advanced inference methods offer notable relative gains, they face severe efficiency barriers: achieving practical reliability would require prohibitive computational costs, revealing that current limitations stem from inefficiency rather than absolute capability bounds. A time complexity analysis further indicates that LLMs are significantly less efficient than traditional algorithmic programs for these formal tasks. These results delineate the practical limits of current LLMs, highlight the indispensability of traditional software tools, and provide insights to guide the development of future LLMs with more powerful formal reasoning capabilities.
