ArXiv TLDR

Curiosity-Critic: Cumulative Prediction Error Improvement as a Tractable Intrinsic Reward for World Model Training

arXiv:2604.18701

Vin Bhaskara, Haicheng Wang

cs.LG cs.AI stat.ML

TLDR

Curiosity-Critic uses cumulative prediction error improvement as a tractable intrinsic reward for training world models, outperforming prediction-error and visitation-count baselines on stochastic grid worlds.

Key contributions

  • Introduces Curiosity-Critic, an intrinsic reward based on cumulative prediction error improvement for world models.
  • Reduces to a tractable per-step form: current error minus an asymptotic error baseline.
  • A learned critic estimates the baseline online, separating epistemic from aleatoric error.
  • Outperforms baselines in world model convergence speed and final accuracy on stochastic grid worlds.
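The per-step reduction in the second bullet can be sketched as a simple function. All names here are illustrative, not from the paper; the squared-error metric is an assumption:

```python
import numpy as np

def curiosity_reward(pred_next, true_next, baseline_estimate):
    """Intrinsic reward: current prediction error minus the critic's
    estimate of the asymptotic (irreducible) error for this transition.

    For learnable transitions the error exceeds the baseline, so the
    reward is positive; for purely stochastic ones the error collapses
    toward the baseline and the reward vanishes.
    """
    error = float(np.sum((pred_next - true_next) ** 2))  # squared prediction error
    return error - baseline_estimate

# Hypothetical learnable transition: high error, low noise floor -> large reward
r_learnable = curiosity_reward(np.array([0.0, 0.0]), np.array([1.0, 0.0]),
                               baseline_estimate=0.1)
# Hypothetical stochastic transition: error already at the noise floor -> ~0 reward
r_stochastic = curiosity_reward(np.array([0.5, 0.5]), np.array([0.6, 0.4]),
                                baseline_estimate=0.02)
```

With an oracle baseline this reward is exactly the epistemic (reducible) part of the error; the paper's point is that a learned critic can estimate that baseline online.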

Why it matters

This paper introduces an intrinsic reward that improves world model training by grounding exploration in the improvement of the cumulative prediction error across visited transitions, rather than the local error of the current one. Because the learned baseline separates reducible (epistemic) from irreducible (aleatoric) error, exploration is redirected toward transitions the model can actually learn, yielding faster convergence and more accurate world models in stochastic environments.

Original Abstract

Local prediction-error-based curiosity rewards focus on the current transition without considering the world model's cumulative prediction error across all visited transitions. We introduce Curiosity-Critic, which grounds its intrinsic reward in the improvement of this cumulative objective, and show that it reduces to a tractable per-step form: the difference between the current prediction error and the asymptotic error baseline of the current state transition. We estimate this baseline online with a learned critic co-trained alongside the world model; regressing a single scalar, the critic converges well before the world model saturates, redirecting exploration toward learnable transitions without oracle knowledge of the noise floor. The reward is higher for learnable transitions and collapses toward the baseline for stochastic ones, effectively separating epistemic (reducible) from aleatoric (irreducible) prediction error online. Prior prediction-error curiosity formulations, from Schmidhuber (1991) to learned-feature-space variants, emerge as special cases corresponding to specific approximations of this baseline. Experiments on a stochastic grid world show that Curiosity-Critic outperforms prediction-error and visitation-count baselines in convergence speed and final world model accuracy.
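A minimal sketch of the co-training loop the abstract describes, on a toy stochastic chain with tabular components. The environment, learning rates, and all names are hypothetical, not from the paper; the critic here is a running regression of a single scalar (the observed error) per transition, matching the abstract's "regressing a single scalar":

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 5, 2
lr_model, lr_critic = 0.1, 0.1

# Tabular world model: predicted next-state value per (state, action)
model = np.zeros((n_states, n_actions))
# Learned critic: running estimate of the asymptotic prediction error per (state, action)
critic = np.zeros((n_states, n_actions))

def step_env(s, a):
    """Toy dynamics: action 0 is deterministic (learnable),
    action 1 adds irreducible Gaussian noise (aleatoric)."""
    nxt = float((s + 1) % n_states)
    if a == 1:
        nxt += rng.normal(0.0, 1.0)
    return nxt

for t in range(5000):
    s, a = rng.integers(n_states), rng.integers(n_actions)
    nxt = step_env(s, a)
    error = (model[s, a] - nxt) ** 2       # current prediction error
    intrinsic = error - critic[s, a]       # reward = error minus learned baseline
    # Critic regresses its scalar baseline toward the observed error
    critic[s, a] += lr_critic * (error - critic[s, a])
    # World model update (here, a running mean of observed next states)
    model[s, a] += lr_model * (nxt - model[s, a])
```

After training, the deterministic action's error and baseline both approach zero, while the stochastic action's baseline approaches the noise variance, so its intrinsic reward collapses even though its raw prediction error stays high. This illustrates the online epistemic/aleatoric separation; the paper's actual critic is co-trained with a learned world model rather than tables.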
