Co-Evolution of Policy and Internal Reward for Language Agents
Xinyu Wang, Hanwei Wu, Jingwei Song, Shuyuan Zhang, Jiayi Zhang, et al.
TLDR
Self-Guide enables language agents to generate and refine internal rewards, improving policy optimization and inference-time guidance for long-horizon tasks.
Key contributions
- Introduces Self-Guide, a novel internal reward mechanism for language agents.
- Uses self-generated signals both for inference-time action guidance and as dense training rewards (see the sketch after this list).
- Establishes a co-evolving loop where better policy improves guidance, which in turn enhances the policy.
- Achieves up to 8% performance gains over baselines trained solely with environment reward.
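To make the inference-time side concrete, here is a minimal sketch of self-guidance during acting. It assumes a generic `llm.generate` text-completion interface and a Gym-style `env` with `reset()`/`step()`; the prompt wording and function names are illustrative, not the paper's actual implementation.

```python
def run_episode(llm, env, max_steps=30):
    """Roll out one episode with self-guidance before each action.

    Assumes: llm.generate(prompt) -> str, and
    env.step(action) -> (obs, env_reward, done).  Both are hypothetical
    interfaces standing in for whatever the agent framework provides.
    """
    obs = env.reset()
    history = []
    env_reward = 0.0
    for _ in range(max_steps):
        # 1) The agent first writes a short self-guidance note for this step.
        guidance = llm.generate(
            f"Observation: {obs}\nHistory: {history}\n"
            "In one sentence, what should the next action accomplish?"
        )
        # 2) The same guidance is prepended when sampling the next action,
        #    steering generation toward the stated sub-goal.
        action = llm.generate(
            f"Observation: {obs}\nGuidance: {guidance}\nNext action:"
        )
        obs, env_reward, done = env.step(action)
        # Keep (guidance, action) pairs so guidance can later be reused
        # as a step-level training signal.
        history.append((guidance, action))
        if done:
            break
    return history, env_reward
```

The key design point, per the paper, is that the guidance string is not discarded after acting: the same per-step signal is later converted into a dense internal reward for training.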
Why it matters
This paper addresses the critical challenge of sparse rewards in LLM agent training by proposing a self-generated internal reward. By deriving both inference-time guidance and training-time supervision from the same signal, it offers a more efficient and robust way for agents to learn complex, long-horizon tasks. The co-evolutionary approach could significantly advance autonomous agent capabilities.
Original Abstract
Large language model (LLM) agents learn by interacting with environments, but long-horizon training remains fundamentally bottlenecked by sparse and delayed rewards. Existing methods typically address this challenge through post-hoc credit assignment or external reward models, which provide limited guidance at inference time and often separate reward improvement from policy improvement. We propose Self-Guide, a self-generated internal reward for language agents that supports both inference-time guidance and training-time supervision. Specifically, the agent uses Self-Guide as a short self-guidance signal to steer the next action during inference, and converts the same signal into step-level internal reward for denser policy optimization during training. This creates a co-evolving loop: better policy produces better guidance, and better guidance further improves policy as internal reward. Across three agent benchmarks, inference-time self-guidance already yields clear gains, while jointly evolving policy and internal reward with GRPO brings further improvements (8%) over baselines trained solely with environment reward. Overall, our results suggest that language agents can improve not only by collecting more experience, but also by learning to generate and refine their own internal reward during acting and learning.
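The training side of the loop can be sketched as follows. This is a minimal illustration, not the paper's code: it assumes step-level internal rewards come from scoring each step against its own guidance (e.g., with the policy model acting as judge), blended with the sparse terminal environment reward via a hypothetical mixing weight `beta`; `score_step_against_guidance` is likewise an assumed callable. The group-relative advantage normalization reflects the standard GRPO recipe.

```python
import numpy as np

def step_rewards(trajectory, env_reward, score_step_against_guidance, beta=0.5):
    """Densify the sparse episode reward with per-step internal rewards.

    trajectory: list of (guidance, action) pairs from the rollout.
    score_step_against_guidance: assumed scorer mapping a (guidance, action)
    pair to a scalar; in practice this role is played by the evolving policy
    itself, which is what makes the loop co-evolving.
    """
    internal = [score_step_against_guidance(g, a) for g, a in trajectory]
    # The terminal environment reward reaches every step; the internal
    # reward adds the dense, step-level signal on top of it.
    return [env_reward + beta * r for r in internal]

def grpo_advantages(group_returns):
    """Group-relative advantages in the GRPO style: each trajectory's
    return is normalized against the mean and std of its sampling group."""
    returns = np.asarray(group_returns, dtype=float)
    return (returns - returns.mean()) / (returns.std() + 1e-8)
```

Under this reading, improving the policy also improves the judge that scores guidance, which in turn sharpens the internal reward used to train the policy, matching the co-evolving loop the abstract describes.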