Barriers to Universal Reasoning With Transformers (And How to Overcome Them)
Oliver Kraus, Yash Sarrof, Yuekun Yao, Alexander Koller, Michael Hahn
TLDR
With standard positional encodings, Transformers with Chain-of-Thought cannot length-generalize on problems beyond TC^0, but a new signpost-token method with a growing vocabulary restores length-generalizable, Turing-complete reasoning.
Key contributions
- Under standard positional encodings and a finite alphabet, CoT Transformers cannot length-generalize on problems beyond TC^0.
- Introduces a growing-vocabulary method that assigns each tape position a unique "signpost token" and logs only value changes.
- Enables length-generalizable simulation of Turing machines with CoT trace length linear in the simulated runtime.
- Empirically demonstrates improved length generalization on complex reasoning tasks.
Why it matters
This research identifies a fundamental barrier to universal reasoning with Chain-of-Thought: under standard positional encodings, Transformers cannot generalize to reasoning traces longer than those seen during training. By proposing and validating a signpost-token encoding with a growing vocabulary, the paper offers a concrete pathway toward models capable of complex, length-generalizable computation.
Original Abstract
Chain-of-Thought (CoT) has been shown to empirically improve Transformers' performance, and theoretically increase their expressivity to Turing completeness. However, whether Transformers can learn to generalize to CoT traces longer than those seen during training is understudied. We use recent theoretical frameworks for Transformer length generalization and find that -- under standard positional encodings and a finite alphabet -- Transformers with CoT cannot solve problems beyond $TC^0$, i.e. the expressivity benefits do not hold under the stricter requirement of length-generalizable learnability. However, if we allow the vocabulary to grow with problem size, we attain a length-generalizable simulation of Turing machines where the CoT trace length is linear in the simulated runtime up to a constant. Our construction overcomes two core obstacles to reliable length generalization: repeated copying and last-occurrence retrieval. We assign each tape position a unique signpost token, and log only value changes to enable recovery of the current tape symbol through counts, circumventing both barriers. Further, we empirically show that the use of such signpost tokens and value change encodings provides actionable guidance to improve length generalization on hard problems.
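To make the construction concrete, here is a minimal Python sketch of the signpost idea. It makes two simplifying assumptions not from the paper: a binary tape alphabet (which suffices for Turing completeness) and plain-Python bookkeeping in place of an actual Transformer; the names `run_tm` and `recover_bit` and the token format `s<pos>` are illustrative only. What it demonstrates is the abstract's mechanism: logging only value changes under unique per-position signposts lets the current tape symbol be recovered from an occurrence count (here, flip parity), sidestepping both repeated copying and last-occurrence retrieval.

```python
# Illustrative sketch (not the paper's exact construction): simulate a
# Turing machine over a binary tape alphabet, logging only value CHANGES
# as per-position "signpost" tokens, then recover any current tape symbol
# purely by counting signpost occurrences in the trace.

def run_tm(transitions, input_bits, max_steps=100):
    """Run a TM with tape alphabet {0, 1}; emit a CoT-style change log.

    transitions: (state, read_bit) -> (new_state, write_bit, move),
                 move in {-1, 0, +1}; "halt" is the terminal state.
    Returns (trace, tape). The trace records one signpost token "s<pos>"
    each time the bit at <pos> flips -- value changes only, never copies.
    """
    tape = dict(enumerate(input_bits))      # sparse tape, default bit 0
    state, head, trace = "q0", 0, []
    for _ in range(max_steps):
        if state == "halt":
            break
        read = tape.get(head, 0)
        state, write, move = transitions[(state, read)]
        if write != read:                   # log only changes
            tape[head] = write
            trace.append(f"s{head}")        # unique signpost per position
        head += move
    return trace, tape

def recover_bit(trace, pos, input_bits):
    """Current bit at pos = initial bit XOR (number of logged flips mod 2).

    Count-based retrieval: no re-copying of earlier trace content and no
    search for a *last* occurrence -- an occurrence count suffices.
    """
    initial = input_bits[pos] if 0 <= pos < len(input_bits) else 0
    return initial ^ (trace.count(f"s{pos}") % 2)

# Example: a TM that zeroes out a run of 1s, then halts on the first 0.
transitions = {
    ("q0", 1): ("q0", 0, +1),    # flip 1 -> 0, move right
    ("q0", 0): ("halt", 0, 0),   # read 0: stop
}
input_bits = [1, 1, 1, 0]
trace, tape = run_tm(transitions, input_bits)
print(trace)                     # ['s0', 's1', 's2']
for p in range(len(input_bits)):
    assert recover_bit(trace, p, input_bits) == tape.get(p, 0)
```

For larger tape alphabets the paper's value-change encoding is richer than this parity trick, but the design choice carries over: a growing vocabulary of position-specific tokens turns symbol retrieval into counting, avoiding exactly the two operations the paper identifies as barriers to length generalization.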