ArXiv TLDR

Generalization in LLM Problem Solving: The Case of the Shortest Path

arXiv: 2604.15306

Yao Tong, Jiayuan Ye, Anastasia Borovykh, Reza Shokri

cs.AI · cs.LG

TLDR

LLMs trained on shortest-path problems transfer well to unseen maps but fail to scale to longer horizons due to recursive instability, and neither broader data coverage, RL, nor inference-time scaling resolves this.

Key contributions

  • Introduces a synthetic shortest-path environment to study LLM generalization.
  • LLMs show strong spatial transfer to unseen maps but fail to scale to longer problem horizons.
  • Failure in length scaling is attributed to recursive instability in LLMs.
  • Data coverage sets capability limits; RL and inference-time scaling help stability and performance but do not rescue length scaling.
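The paper's exact environment is not specified in this digest, but a minimal sketch of such a synthetic shortest-path setup might look like the following, assuming a 4-connected grid world with random walls and a BFS oracle for ground-truth paths (the names `make_map` and `shortest_path` are hypothetical, not from the paper):

```python
from collections import deque
import random

def make_map(width, height, wall_prob, rng):
    """Random grid map: True = free cell, False = wall."""
    return [[rng.random() > wall_prob for _ in range(width)]
            for _ in range(height)]

def shortest_path(grid, start, goal):
    """BFS on a 4-connected grid; returns the cell sequence or None."""
    h, w = len(grid), len(grid[0])
    prev = {start: None}
    queue = deque([start])
    while queue:
        cell = queue.popleft()
        if cell == goal:
            path = []
            while cell is not None:  # walk back through predecessors
                path.append(cell)
                cell = prev[cell]
            return path[::-1]
        r, c = cell
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if 0 <= nr < h and 0 <= nc < w and grid[nr][nc] \
                    and (nr, nc) not in prev:
                prev[(nr, nc)] = (r, c)
                queue.append((nr, nc))
    return None  # goal unreachable
```

In a setup like this, the two generalization axes correspond to evaluating a model on maps drawn from `make_map` with fresh seeds (spatial transfer) versus on start/goal pairs whose oracle path is longer than any seen during training (length scaling).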

Why it matters

This paper clarifies LLM generalization limits by isolating factors in a controlled setting. It highlights a critical failure mode (length scaling) that persists even with advanced training, suggesting fundamental architectural or learning limitations. Understanding these limits is crucial for developing more robust and truly generalizable LLMs.

Original Abstract

Whether language models can systematically generalize remains actively debated. Yet empirical performance is jointly shaped by multiple factors such as training data, training paradigms, and inference-time strategies, making failures difficult to interpret. We introduce a controlled synthetic environment based on shortest-path planning, a canonical composable sequential optimization problem. The setup enables clean separation of these factors and supports two orthogonal axes of generalization: spatial transfer to unseen maps and length scaling to longer-horizon problems. We find that models exhibit strong spatial transfer but consistently fail under length scaling due to recursive instability. We further analyze how distinct stages of the learning pipeline influence systematic problem-solving: for example, data coverage sets capability limits; reinforcement learning improves training stability but does not expand those limits; and inference-time scaling enhances performance but cannot rescue length-scaling failures.
