On the Rejection Criterion for Proxy-based Test-time Alignment
Ayoub Hammal, Pierre Zweigenbaum, Caio Corro
TLDR
This paper unifies proxy-based test-time alignment methods under a common graphical-model framework, shows they differ only in their rejection criterion, and proposes a conservative confidence bet that outperforms prior approaches.
Key contributions
- Unifies existing test-time alignment methods (implicit reward, nudging) under a common graphical model framework.
- Identifies the rejection criterion (or distribution) as the key differentiator between these approaches; a minimal sketch of the shared loop follows this list.
- Argues that the common "confidence criterion" is ill-motivated, since linguistic phenomena such as ambiguous phrasing can legitimately lower a model's confidence.
- Proposes a novel "conservative confidence bet" rejection criterion that outperforms prior methods on several datasets.
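To make the unified view concrete, here is a minimal sketch of the shared per-token loop with a pluggable rejection criterion. It assumes Hugging Face-style causal LMs; the function names, the threshold value, and the specific criterion shown are illustrative assumptions, not the paper's exact formulation.

```python
import torch

def proxy_guided_generate(base, proxy, tok, prompt, reject, max_new_tokens=64):
    """Shared per-token loop: sample from the large base model unless the
    rejection criterion fires, in which case the small aligned proxy takes
    over for that token. `reject` is the pluggable differentiator."""
    ids = tok(prompt, return_tensors="pt").input_ids
    for _ in range(max_new_tokens):
        base_probs = torch.softmax(base(ids).logits[0, -1], dim=-1)
        if reject(base_probs):
            # Defer this token to the small aligned proxy.
            probs = torch.softmax(proxy(ids).logits[0, -1], dim=-1)
        else:
            probs = base_probs
        next_id = torch.multinomial(probs, 1).view(1, 1)
        ids = torch.cat([ids, next_id], dim=1)
        if next_id.item() == tok.eos_token_id:
            break
    return tok.decode(ids[0], skip_special_tokens=True)

# Nudging-style confidence criterion: reject when the base model's
# top-token probability falls below a threshold (0.4 is illustrative).
nudging_reject = lambda p: p.max().item() < 0.4
```

Swapping `reject` is all that distinguishes nudging from other instantiations; the paper's contribution is a better-motivated choice of this function.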
Why it matters
By reducing the implicit reward and nudging approaches to sampling from similar graphical models, the paper isolates the rejection criterion as the design choice that actually distinguishes them. It then argues that rejecting on low confidence alone is ill-motivated, because ambiguous phrasing can depress a model's confidence for purely linguistic reasons, and shows that the proposed conservative confidence bet yields a more robust criterion that outperforms prior methods on several datasets.
Original Abstract
Recent works proposed test-time alignment methods that rely on a small aligned model as a proxy that guides the generation of a larger base (unaligned) model. The implicit reward approach skews the large model distribution, whereas the nudging approach defers the generation of the next token to the small aligned model when the large base one is unconfident about its outcome. In this work, we first show that both approaches can be reduced to sampling from similar graphical models, where they differ only in the definition of a rejection criterion (or distribution). Moreover, we argue that the confidence criterion is ill-motivated due to linguistic phenomena like ambiguous phrasing. We propose a novel rejection criterion based on a conservative confidence bet. Experimentally, our novel approach outperforms previous work on several datasets.
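As a complement to the nudging sketch above, the implicit reward approach can be read as a soft rejection distribution that reweights, rather than replaces, the base model's next-token distribution. The log-ratio form below is one common instantiation and an assumption on our part; the abstract does not pin down the exact reward.

```python
import torch

def implicit_reward_skew(base_logits: torch.Tensor,
                         small_aligned_logits: torch.Tensor,
                         small_base_logits: torch.Tensor,
                         beta: float = 1.0) -> torch.Tensor:
    """Skew the large base model's next-token distribution by the proxy's
    implicit reward, taken here as a scaled log-ratio between a small
    aligned model and its unaligned counterpart (assumed form)."""
    reward = (small_aligned_logits.log_softmax(-1)
              - small_base_logits.log_softmax(-1))
    # p(token) is proportional to p_base(token) * exp(beta * reward(token))
    return torch.softmax(base_logits.log_softmax(-1) + beta * reward, dim=-1)
```

Under the paper's unified view, this soft reweighting and nudging's hard confidence check are instances of the same graphical model, differing only in the rejection criterion (or distribution).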