ArXiv TLDR

LongCoT: Benchmarking Long-Horizon Chain-of-Thought Reasoning

2604.14140

Sumeet Ramesh Motwani, Daniel Nichols, Charles London, Peggy Li, Fabio Pizzati + 15 more

cs.LG, cs.AI

TLDR

LongCoT is a new benchmark with 2,500 expert-designed problems to measure long-horizon chain-of-thought reasoning in frontier language models.

Key contributions

  • Introduces LongCoT, a benchmark of 2,500 expert-designed problems across diverse domains.
  • Measures long-horizon CoT reasoning, requiring tens to hundreds of thousands of reasoning tokens.
  • Problems have tractable local steps, isolating failures to long-horizon reasoning limitations.
  • Frontier models achieve <10% accuracy, highlighting a significant gap in current capabilities.

Why it matters

This benchmark targets a core capability for autonomous AI: sustaining accurate reasoning across many interdependent steps. Even frontier models achieve less than 10% accuracy on LongCoT, exposing a major limitation of current systems and providing a clear, verifiable target for future research and development.

Original Abstract

As language models are increasingly deployed for complex autonomous tasks, their ability to reason accurately over longer horizons becomes critical. An essential component of this ability is planning and managing a long, complex chain-of-thought (CoT). We introduce LongCoT, a scalable benchmark of 2,500 expert-designed problems spanning chemistry, mathematics, computer science, chess, and logic to isolate and directly measure the long-horizon CoT reasoning capabilities of frontier models. Problems consist of a short input with a verifiable answer; solving them requires navigating a graph of interdependent steps that span tens to hundreds of thousands of reasoning tokens. Each local step is individually tractable for frontier models, so failures reflect long-horizon reasoning limitations. At release, the best models achieve <10% accuracy (GPT 5.2: 9.8%; Gemini 3 Pro: 6.1%) on LongCoT, revealing a substantial gap in current capabilities. Overall, LongCoT provides a rigorous measure of long-horizon reasoning, tracking the ability of frontier models to reason reliably over extended periods.
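Because every LongCoT problem pairs a short prompt with a verifiable answer, scoring reduces to exact-match accuracy over model outputs. The paper does not specify a harness, so the sketch below is a minimal illustration of that scoring pattern; the `Problem` class, `evaluate` function, and the stub model are hypothetical names, not the benchmark's actual API.

```python
# Hypothetical sketch of a verifiable-answer evaluation loop,
# in the spirit of LongCoT's accuracy metric (not the official harness).
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class Problem:
    prompt: str  # short input
    answer: str  # verifiable ground-truth answer


def evaluate(problems: Iterable[Problem],
             run_model: Callable[[str], str]) -> float:
    """Exact-match accuracy: fraction of problems answered correctly."""
    problems = list(problems)
    correct = sum(
        run_model(p.prompt).strip() == p.answer.strip()
        for p in problems
    )
    return correct / len(problems)


# Toy usage with a stub "model" that always answers "4".
probs = [Problem("2+2?", "4"), Problem("3*3?", "9")]
acc = evaluate(probs, run_model=lambda prompt: "4")
print(acc)  # 0.5 with this stub
```

In practice the `run_model` call would wrap a long chain-of-thought generation spanning tens to hundreds of thousands of tokens, with only the final answer extracted for comparison.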
