ArXiv TLDR

Self-Induced Outcome Potential: Turn-Level Credit Assignment for Agents without Verifiers

arXiv: 2605.04984

Senkang Hu, Yong Dai, Xudong Han, Zhengru Fang, Yuzhi Zhao, et al.

cs.LG, cs.CL

TLDR

SIOP provides turn-level credit assignment for LLM agents without verifiers by clustering final answers into latent outcome states.

Key contributions

  • Proposes Self-Induced Outcome Potential (SIOP) for turn-level credit assignment in LLM agents.
  • Clusters the final answers of multiple rollouts into semantic outcome modes that serve as latent future states.
  • Rewards turns that increase posterior support for reliable future states, using a tractable cluster-level approximation (sketched in the code below).
  • Improves average performance on seven search-augmented reasoning benchmarks, approaching a gold-supervised outcome baseline without any verifier.
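
The bullets above compress the whole pipeline, so here is a minimal Python sketch of the verifier-free loop: sample rollouts, cluster their final answers into outcome modes, weight modes by reliability, and pay each turn for the posterior support it adds. It is illustrative only: the function names are hypothetical, and the frequency-based reliability weighting and dot-product potential are assumptions rather than the paper's exact definitions.

# Illustrative sketch only: names, the frequency-based reliability
# weighting, and the dot-product potential are assumptions, not the
# paper's exact definitions.
from collections import Counter
from typing import Callable

def cluster_answers(answers: list[str],
                    same_mode: Callable[[str, str], bool]) -> list[int]:
    """Greedily cluster final answers into semantic outcome modes.
    `same_mode` stands in for the paper's semantic-equivalence check
    (e.g., embedding similarity or an LLM judge); it is assumed given."""
    reps: list[str] = []       # one representative answer per cluster
    labels: list[int] = []
    for ans in answers:
        for cid, rep in enumerate(reps):
            if same_mode(ans, rep):
                labels.append(cid)
                break
        else:                  # no existing mode matched: open a new one
            reps.append(ans)
            labels.append(len(reps) - 1)
    return labels

def target_distribution(labels: list[int]) -> dict[int, float]:
    """Reliability-aware target over outcome clusters. As a stand-in for
    the paper's reliability weighting, use cluster frequency: modes that
    more rollouts agree on receive more target mass."""
    counts = Counter(labels)
    return {cid: n / len(labels) for cid, n in counts.items()}

def turn_rewards(posteriors: list[dict[int, float]],
                 target: dict[int, float]) -> list[float]:
    """Potential-based turn rewards. posteriors[t] is a cluster-level
    approximation of the outcome distribution after turn t; each turn
    is credited with the target-weighted support it adds."""
    def potential(p: dict[int, float]) -> float:
        return sum(w * p.get(cid, 0.0) for cid, w in target.items())
    phi = [potential(p) for p in posteriors]
    # The rewards telescope: their sum equals phi[-1] - phi[0].
    return [phi[t + 1] - phi[t] for t in range(len(phi) - 1)]

# Toy usage: four rollouts, two outcome modes, three turn checkpoints.
labels = cluster_answers(["Paris", "paris", "Lyon", "Paris"],
                         lambda a, b: a.lower() == b.lower())
target = target_distribution(labels)             # {0: 0.75, 1: 0.25}
print(turn_rewards([{0: 0.3, 1: 0.3},
                    {0: 0.6, 1: 0.2},
                    {0: 0.9, 1: 0.1}], target))  # [0.2, 0.2] (approx.)

Frequency is the simplest reliability proxy (self-consistency); any other score of how trustworthy an outcome mode is could be swapped into target_distribution without touching the rest of the sketch.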

Why it matters

This paper tackles the challenge of providing turn-level feedback for long-horizon LLM agents without costly human annotation or task-specific verifiers. SIOP offers a novel, self-supervised approach that learns from diverse outcomes, making agent training more efficient and scalable. It significantly advances agentic AI by generalizing information-potential shaping without gold supervision.

Original Abstract

Long-horizon LLM agents depend on intermediate information-gathering turns, yet training feedback is usually observed only at the final answer, because process-level rewards require high-quality human annotation. Existing turn-level shaping methods reward turns that increase the likelihood of a gold answer, but they require answer supervision or stable task-specific verifiers. Conversely, label-free RL methods extract self-signals from output distributions, but mainly at the answer or trajectory level and therefore cannot assign credit to intermediate turns. We propose Self-Induced Outcome Potential (SIOP), which treats semantic clusters of final answers as latent future outcome states for potential-based turn-level credit assignment. For each query, SIOP samples multiple rollouts, clusters final answers into semantic outcome modes, and builds a reliability-aware target distribution over these states. It then rewards turns for increasing posterior support for reliable future states using a tractable cluster-level approximation. The objective generalizes information-potential shaping from gold-answer supervision to settings without task-specific gold verifiers while avoiding the broadcasted rollout-level advantages used by standard GRPO. We formalize the framework, characterize its supervised gold-answer limit, and show that SIOP improves average performance over verifier-free outcome-level baselines on seven search-augmented agentic reasoning benchmarks while approaching a gold-supervised outcome baseline. Code is available at https://github.com/dl-m9/SIOP.git.
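
For context on the "potential-based turn-level credit assignment" the abstract invokes: in potential-based reward shaping (Ng et al., 1999), adding a term of the form gamma * Phi(next state) - Phi(state) to the reward provably leaves optimal policies unchanged. A plausible (not verbatim) instantiation of SIOP's potential, writing h_t for the trajectory prefix after turn t, c for an induced outcome cluster, w(c) for its reliability-aware target weight, and p(c | h_t) for the cluster-level posterior:

\[
r'_t \;=\; r_t + \gamma\,\Phi(h_{t+1}) - \Phi(h_t),
\qquad
\Phi(h_t) \;=\; \sum_{c} w(c)\, p(c \mid h_t).
\]

Because the shaping terms telescope, the shaped return differs from the original only by the endpoint potentials, so credit is redistributed across intermediate turns without changing which final behaviors are optimal.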
