On the Rejection Criterion for Proxy-based Test-time Alignment
Ayoub Hammal, Pierre Zweigenbaum, Caio Corro
TLDR
This paper unifies proxy-based test-time alignment methods under a common graphical-model framework, shows they differ only in their rejection criterion, and proposes a conservative confidence bet that outperforms prior approaches.
Key contributions
- Unifies existing test-time alignment methods (implicit reward, nudging) under a common graphical model framework.
- Identifies the rejection criterion (or distribution) as the key differentiator between these approaches; a minimal sketch of the shared loop follows this list.
- Argues that the common "confidence criterion" is ill-motivated, since linguistic phenomena such as ambiguous phrasing can legitimately lower a model's confidence.
- Proposes a novel "conservative confidence bet" rejection criterion that outperforms prior methods on several datasets.
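To make the unified view concrete, here is a minimal sketch of the shared per-token loop with a pluggable rejection criterion. It assumes Hugging Face-style causal LMs; the function names, the threshold value, and the specific criterion shown are illustrative assumptions, not the paper's exact formulation.

```python
import torch

def proxy_guided_generate(base, proxy, tok, prompt, reject, max_new_tokens=64):
    """Shared per-token loop: sample from the large base model unless the
    rejection criterion fires, in which case the small aligned proxy takes
    over for that token. `reject` is the pluggable differentiator."""
    ids = tok(prompt, return_tensors="pt").input_ids
    for _ in range(max_new_tokens):
        base_probs = torch.softmax(base(ids).logits[0, -1], dim=-1)
        if reject(base_probs):
            # Defer this token to the small aligned proxy.
            probs = torch.softmax(proxy(ids).logits[0, -1], dim=-1)
        else:
            probs = base_probs
        next_id = torch.multinomial(probs, 1).view(1, 1)
        ids = torch.cat([ids, next_id], dim=1)
        if next_id.item() == tok.eos_token_id:
            break
    return tok.decode(ids[0], skip_special_tokens=True)

# Nudging-style confidence criterion: reject when the base model's
# top-token probability falls below a threshold (0.4 is illustrative).
nudging_reject = lambda p: p.max().item() < 0.4
```

Swapping `reject` is all that distinguishes nudging from other instantiations; the paper's contribution is a better-motivated choice of this function.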
Why it matters
By reducing the implicit reward and nudging approaches to sampling from similar graphical models, the paper isolates the rejection criterion as the design choice that actually distinguishes them. It then argues that rejecting on low confidence alone is ill-motivated, because ambiguous phrasing can depress a model's confidence for purely linguistic reasons, and shows that the proposed conservative confidence bet yields a more robust criterion that outperforms prior methods on several datasets.
Original Abstract
Recent works proposed test-time alignment methods that rely on a small aligned model as a proxy that guides the generation of a larger base (unaligned) model. The implicit reward approach skews the large model distribution, whereas the nudging approach defers the generation of the next token to the small aligned model when the large base one is unconfident about its outcome. In this work, we first show that both approaches can be reduced to sampling from similar graphical models, where they differ only in the definition of a rejection criterion (or distribution). Moreover, we argue that the confidence criterion is ill-motivated due to linguistic phenomena like ambiguous phrasing. We propose a novel rejection criterion based on a conservative confidence bet. Experimentally, our novel approach outperforms previous work on several datasets.
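As a complement to the nudging sketch above, the implicit reward approach can be read as a soft rejection distribution that reweights, rather than replaces, the base model's next-token distribution. The log-ratio form below is one common instantiation and an assumption on our part; the abstract does not pin down the exact reward.

```python
import torch

def implicit_reward_skew(base_logits: torch.Tensor,
                         small_aligned_logits: torch.Tensor,
                         small_base_logits: torch.Tensor,
                         beta: float = 1.0) -> torch.Tensor:
    """Skew the large base model's next-token distribution by the proxy's
    implicit reward, taken here as a scaled log-ratio between a small
    aligned model and its unaligned counterpart (assumed form)."""
    reward = (small_aligned_logits.log_softmax(-1)
              - small_base_logits.log_softmax(-1))
    # p(token) is proportional to p_base(token) * exp(beta * reward(token))
    return torch.softmax(base_logits.log_softmax(-1) + beta * reward, dim=-1)
```

Under the paper's unified view, this soft reweighting and nudging's hard confidence check are instances of the same graphical model, differing only in the rejection criterion (or distribution).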