Learning to Think from Multiple Thinkers
Nirmit Joshi, Roey Magen, Nathan Srebro, Nikolaos Tsilivis, Gal Vardi
TLDR
This paper studies learning with Chain-of-Thought supervision from multiple thinkers, showing that passive learning can be computationally hard while active learning admits an efficient algorithm.
Key contributions
- Studies learning with Chain-of-Thought (CoT) supervision from multiple thinkers, each providing correct but possibly systematically different step-by-step solutions.
- Shows that, under cryptographic assumptions, passive learning can be computationally hard even with CoT supervision from as few as two thinkers.
- Introduces a generic, efficient active learning algorithm for CoT supervision from multiple thinkers.
- The active learning algorithm needs only a small, accuracy-independent amount of CoT data per thinker and a moderate number of thinkers; the resource scaling is summarized below.
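To make the scaling concrete, here are the resource bounds stated in the abstract, rendered in LaTeX. The symbols $m_{\mathrm{CoT}}$, $T$, and $m_{\mathrm{end}}$ are introduced here for readability and do not appear in the abstract itself.

```latex
% Resource requirements of the active learner, as stated in the abstract.
% \varepsilon is the target accuracy; the symbol names are ours, not the paper's.
\[
\underbrace{m_{\mathrm{CoT}} = O(1)}_{\text{CoT samples per thinker, independent of } \varepsilon}
\qquad
\underbrace{T = O\!\left(\log\tfrac{1}{\varepsilon}\,\log\log\tfrac{1}{\varepsilon}\right)}_{\text{number of thinkers}}
\qquad
\underbrace{m_{\mathrm{end}} = \tfrac{1}{\varepsilon}\cdot \mathrm{poly}\log\tfrac{1}{\varepsilon}}_{\text{passive end-result samples}}
\]
```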
Why it matters
This research highlights a subtle challenge in learning from diverse expert explanations: classes that are easy to learn from a single thinker's Chain-of-Thought supervision can become computationally hard when the CoT data comes passively from several thinkers. It also offers a practical active learning strategy that efficiently leverages multiple thinkers, informing how step-by-step training data should be collected in complex problem domains.
Original Abstract
We study learning with Chain-of-Thought (CoT) supervision from multiple thinkers, all of whom provide correct but possibly systematically different solutions, e.g., step-by-step solutions to math problems written by different thinkers, or step-by-step execution traces of different programs solving the same problem. We consider classes that are computationally easy to learn using CoT supervision from a single thinker, but hard to learn with only end-result supervision, i.e., without CoT (Joshi et al. 2025). We establish that, under cryptographic assumptions, learning can be hard from CoT supervision provided by two or a few different thinkers, in passive data-collection settings. On the other hand, we provide a generic computationally efficient active learning algorithm that learns with a small amount of CoT data per thinker that is completely independent of the target accuracy $\varepsilon$, a moderate number of thinkers that scales as $\log \frac{1}{\varepsilon}\log \log \frac{1}{\varepsilon}$, and sufficient passive end-result data that scales as $\frac{1}{\varepsilon}\cdot \mathrm{poly}\log\frac{1}{\varepsilon}$.
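For intuition, here is a minimal Python sketch of what an active-learning loop with this resource profile might look like. This is not the paper's algorithm, which the abstract does not spell out; the callables `query_cot`, `fit_from_cot`, and `end_result_error`, the select-by-validation strategy, and all constants are our assumptions, chosen only to mirror the stated sample-complexity scaling.

```python
import math
from typing import Any, Callable, Sequence


def active_learn(
    thinkers: Sequence[Any],
    end_result_data: Sequence[Any],
    eps: float,
    query_cot: Callable[[Any, int], list],      # (thinker, n) -> n CoT demonstrations (hypothetical)
    fit_from_cot: Callable[[list], Any],        # CoT demonstrations -> hypothesis (hypothetical)
    end_result_error: Callable[[Any, Sequence[Any]], float],  # hypothesis, data -> error (hypothetical)
    cot_budget_per_thinker: int = 5,            # constant, independent of eps
) -> Any:
    """Schematic loop mirroring the abstract's resource scaling; not the authors' algorithm."""
    # Consult about log(1/eps) * log log(1/eps) thinkers, per the abstract;
    # the rounding and the constant factors here are purely illustrative.
    log_inv_eps = math.log(1.0 / eps)
    num_thinkers = max(1, math.ceil(log_inv_eps * math.log(max(2.0, log_inv_eps))))

    best_hypothesis, best_error = None, float("inf")
    for thinker in thinkers[:num_thinkers]:
        # Actively request a small, eps-independent batch of CoT demonstrations.
        cot_examples = query_cot(thinker, cot_budget_per_thinker)
        hypothesis = fit_from_cot(cot_examples)

        # Validate each candidate on passive end-result data (inputs and final
        # answers only, no intermediate steps); the abstract takes
        # |end_result_data| ~ (1/eps) * polylog(1/eps) so the comparison resolves eps.
        error = end_result_error(hypothesis, end_result_data)
        if error < best_error:
            best_hypothesis, best_error = hypothesis, error

    return best_hypothesis
```

The structural point the sketch is meant to capture is the division of labor: expensive, actively chosen CoT queries stay constant per thinker, while the target accuracy is driven down by cheap passive end-result data.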