arXiv TLDR

$π$-Play: Multi-Agent Self-Play via Privileged Self-Distillation without External Data

arXiv: 2604.14054

Yaocheng Zhang, Yuanheng Zhu, Wenyue Chong, Songjun Tu, Qichao Zhang + 5 more

cs.LG, cs.CL

TLDR

$π$-Play enhances self-play for search agents by using internally generated "question construction paths" as privileged information for dense self-distillation.

Key contributions

  • Introduces $π$-Play, a multi-agent self-evolution framework for training deep search agents.
  • Leverages "question construction paths" (QCPs) as privileged information, naturally generated during self-play.
  • Uses QCPs for dense self-distillation, transforming sparse-reward self-play into a dense-feedback loop.
  • Achieves 2-3x higher evolutionary efficiency and outperforms fully supervised search agents.

Why it matters

This paper addresses the challenge of training deep search agents with sparse rewards and limited data. By using internally generated privileged information, $π$-Play significantly improves training efficiency and performance over existing methods. This approach offers a scalable, data-free solution for complex information-seeking tasks.

Original Abstract

Deep search agents have emerged as a promising paradigm for addressing complex information-seeking tasks, but their training remains challenging due to sparse rewards, weak credit assignment, and limited labeled data. Self-play offers a scalable route to reduce data dependence, but conventional self-play optimizes students only through sparse outcome rewards, leading to low learning efficiency. In this work, we observe that self-play naturally produces a question construction path (QCP) during task generation, an intermediate artifact that captures the reverse solution process. This reveals a new source of privileged information for self-distillation: self-play can itself provide high-quality privileged context for the teacher model in a low-cost and scalable manner, without relying on human feedback or curated privileged information. Leveraging this insight, we propose Privileged Information Self-Play ($π$-Play), a multi-agent self-evolution framework. In $π$-Play, an examiner generates tasks together with their QCPs, and a teacher model leverages QCP as privileged context to densely supervise a student via self-distillation. This design transforms conventional sparse-reward self-play into a dense-feedback self-evolution loop. Extensive experiments show that data-free $π$-Play surpasses fully supervised search agents and improves evolutionary efficiency by 2-3$\times$ over conventional self-play.
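To make the loop concrete, here is a minimal toy sketch of the idea described in the abstract: an examiner composes a question and, as a by-product, emits its question construction path (QCP); a teacher conditioned on the QCP produces a per-step target distribution; and the student is distilled against those dense targets. Everything here (`ToyPolicy`, the logit-bonus stand-in for "privileged context", the gradient-free update) is a hypothetical simplification for illustration, not the paper's actual models or training code.

```python
import math
import random

random.seed(0)

# Toy action vocabulary standing in for a search agent's step-level actions.
VOCAB = ["hop1", "hop2", "answer", "stop"]

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def kl(p, q):
    """KL(p || q) for two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

class ToyPolicy:
    """A toy 'model': one logit vector per step (stand-in for an LLM)."""
    def __init__(self, steps):
        self.logits = [[random.gauss(0, 1) for _ in VOCAB] for _ in range(steps)]

    def step_dist(self, t, privileged_action=None):
        logits = list(self.logits[t])
        if privileged_action is not None:
            # Privileged context (the QCP) biases the teacher toward the
            # construction path's action at this step.
            logits[privileged_action] += 3.0
        return softmax(logits)

def examiner_generate(steps=3):
    """Examiner composes a task; the sequence of composition actions
    it took is the question construction path (QCP)."""
    qcp = [random.randrange(len(VOCAB)) for _ in range(steps)]
    question = "->".join(VOCAB[a] for a in qcp)
    return question, qcp

def distill_step(teacher, student, qcp, lr=0.5):
    """Dense self-distillation: per-step KL(teacher_with_QCP || student),
    giving feedback at every step instead of one sparse outcome reward."""
    total_kl = 0.0
    for t, action in enumerate(qcp):
        p = teacher.step_dist(t, privileged_action=action)
        q = student.step_dist(t)
        total_kl += kl(p, q)
        # Gradient-free toy update: nudge student logits toward log-teacher.
        for i in range(len(VOCAB)):
            student.logits[t][i] += lr * (math.log(p[i]) - student.logits[t][i])
    return total_kl

question, qcp = examiner_generate()
teacher = ToyPolicy(len(qcp))
student = ToyPolicy(len(qcp))
losses = [distill_step(teacher, student, qcp) for _ in range(20)]
```

Because the student receives a full target distribution at every step of the path, the per-step KL shrinks monotonically in this toy, which is the "dense-feedback" property the paper contrasts with sparse-outcome self-play.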
