ArXiv TLDR

Generalized Distributional Alignment Games for Unbiased Answer-Level Fine-Tuning

arXiv:2605.02435

Mehryar Mohri, Jon Schneider, Yutao Zhong

cs.LG, stat.ML

TLDR

This paper resolves a systematic, Jensen-induced estimation bias in Distributional Alignment Games for Answer-Level Fine-Tuning, yielding more stable and efficient training.

Key contributions

  • Generalizes alignment games to arbitrary Bregman divergences, enabling exact, unbiased estimators for polynomial rewards via U-statistics (see the numerical sketch after this list).
  • Develops a provably optimal minimax polynomial estimator for KL divergence games, achieving fundamental statistical error limits.
  • Introduces the AQP Estimator, which combines the two approaches to reduce variance while preserving optimal bias, provably accelerating game convergence.
  • Enables more efficient and stable Answer-Level Fine-Tuning (ALFT) with zero online computational overhead.
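The first two bullets rest on a basic asymmetry. A Bregman divergence $D_\varphi(p, q) = \varphi(p) - \varphi(q) - \langle \nabla\varphi(q),\, p - q \rangle$ reduces to KL when $\varphi$ is the negative entropy, whose logarithmic reward is concave; Jensen's inequality then forces any plug-in batch estimate of $\log E[X]$ to be biased low. Geometries whose rewards are polynomial in the mean, by contrast, admit exactly unbiased U-statistics. The following is a minimal numerical sketch of both effects, not the paper's code; the exponential reward distribution, batch size $K = 4$, and squared-mean target are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def plugin_log_mean(x):
    """Plug-in estimate of log E[X]: the log of the batch mean. Since log is
    concave, Jensen's inequality gives E[log x_bar] <= log E[X], so this
    estimator is biased downward, and more so for small batches."""
    return np.log(np.mean(x))

def u_statistic_square(x):
    """Exactly unbiased estimator of (E[X])^2: the U-statistic averaging
    x_i * x_j over distinct pairs i != j, computed in O(K) via the identity
    sum_{i != j} x_i x_j = (sum x)^2 - sum x^2."""
    k = len(x)
    s, s2 = x.sum(), (x ** 2).sum()
    return (s * s - s2) / (k * (k - 1))

# Monte Carlo check: exponential rewards with true mean mu, batches of size K.
mu, K, trials = 2.0, 4, 100_000
batches = rng.exponential(mu, size=(trials, K))

log_hat = np.mean([plugin_log_mean(b) for b in batches])
sq_hat = np.mean([u_statistic_square(b) for b in batches])

print(f"log E[X] = {np.log(mu):.4f}   plug-in average = {log_hat:.4f} (biased low)")
print(f"(E[X])^2 = {mu**2:.4f}   U-statistic average = {sq_hat:.4f} (unbiased)")
```

Run it and the plug-in log average sits visibly below $\log E[X]$ (its Jensen gap), while the U-statistic matches $(E[X])^2$ up to Monte Carlo noise.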

Why it matters

This paper addresses a critical bias issue in Answer-Level Fine-Tuning, a powerful framework for fine-tuning models. By providing provably unbiased, minimax-optimal estimators, it improves training stability and efficiency with zero online computational overhead, a key step toward more reliable and robust AI systems.

Original Abstract

The Distributional Alignment Game framework provides a powerful variational perspective on Answer-Level Fine-Tuning (ALFT). However, standard algorithms for these games rely on estimating logarithmic rewards from small batches, introducing a systematic bias due to Jensen's inequality that can destabilize training. In this paper, we systematically resolve this structural estimation bias. First, we generalize the alignment game to arbitrary Bregman divergences, showing that for a family of geometries inducing polynomial rewards, we can construct provably exact and unbiased estimators using U-statistics. Second, for the canonical KL divergence game where an exact solution is impossible, we derive a globally robust minimax polynomial estimator that is provably optimal, achieving the fundamental statistical error limit of $\Theta(1/K^2)$, which we establish via the Ditzian-Totik theorem. Finally, we synthesize these two approaches to propose a novel Variance-Optimal Augmented Polynomial Optimization Program (AQP) Estimator, proving that by systematically reducing variance, our method achieves not only optimal bias but also provably accelerated game convergence, leading to more efficient and stable training with zero online computational overhead.
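Since, as the abstract notes, no exact unbiased estimator of the logarithmic reward exists, the KL game is handled by a polynomial surrogate: replace $\log$ with a polynomial $p$ near-optimal in sup norm on the reward range, estimate each power $(E[X])^d$ without bias, and the remaining bias is exactly the approximation error of $p$. Below is a minimal sketch of that pipeline under stated assumptions: rewards confined to a hypothetical interval $[\epsilon, 1]$, Chebyshev interpolation (which is near-minimax) standing in for the paper's exact minimax construction, and an illustrative `unbiased_power` helper, degree, and batch size.

```python
import numpy as np
from itertools import combinations
from numpy.polynomial import Polynomial
from numpy.polynomial.chebyshev import Chebyshev, chebinterpolate

def unbiased_power(x, d):
    """Exactly unbiased U-statistic for (E[X])^d: average the product
    x_{i1} * ... * x_{id} over all size-d subsets; requires d <= len(x)."""
    return float(np.mean([np.prod(c) for c in combinations(x, d)]))

eps, deg = 0.05, 6  # assumed reward range [eps, 1] and surrogate degree

# Interpolate log at the Chebyshev nodes of [-1, 1], pre-mapped onto [eps, 1];
# Chebyshev interpolation is near-minimax in sup norm.
to_x = lambda t: eps + (t + 1.0) * (1.0 - eps) / 2.0
p = Chebyshev(chebinterpolate(lambda t: np.log(to_x(t)), deg), domain=[eps, 1.0])
coefs = p.convert(kind=Polynomial).coef  # power-basis coefficients c_0..c_deg

xs = np.linspace(eps, 1.0, 2001)
print(f"sup-norm error of the surrogate: {np.max(np.abs(p(xs) - np.log(xs))):.3e}")

# Unbiased estimate of p(E[X]) = sum_d c_d (E[X])^d from one small batch:
# plug the U-statistic for each power into the polynomial. Its bias w.r.t.
# log E[X] is at most the sup-norm error printed above.
rng = np.random.default_rng(1)
mu, K = 0.4, 8                         # illustrative true mean and batch size (K >= deg)
batch = rng.uniform(0.2, 0.6, size=K)  # toy rewards in [eps, 1] with mean mu
estimate = sum(c * unbiased_power(batch, d) for d, c in enumerate(coefs))
print(f"polynomial-reward estimate: {estimate:.4f}   log(mu) = {np.log(mu):.4f}")
```

The paper's AQP estimator goes a step further, choosing among such unbiased polynomial estimators to minimize variance, which is what underlies the accelerated game convergence claimed above.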
