Mind the Gap: Structure-Aware Consistency in Preference Learning
TLDR
This paper introduces SA-DPO, a preference learning method for LLMs that restores theoretical consistency through margin-shifted surrogate losses and adapts the margin to the semantic distance between responses.
Key contributions
- Reveals theoretical inconsistency of standard surrogate losses (e.g., DPO) in LLM preference learning.
- Develops a margin-shifted ranking framework with rigorous H-consistency bounds for LLM alignment.
- Introduces SA-DPO, which adapts the margin to the semantic distance between responses, yielding structure-aware consistency.
- Proves that heavy-tailed surrogates (e.g., the Polynomial Hinge family) offer stronger consistency guarantees than the logistic loss for capacity-bounded models.
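The paper's exact objective is not reproduced here, so the following sketch only illustrates the idea behind a structure-aware, margin-shifted DPO loss: the function name, the form of the semantic distance, and the hyperparameters `beta` and `gamma` are all assumptions, not the paper's definitions.

```python
import math

def sa_dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l,
                semantic_dist, beta=0.1, gamma=1.0):
    """Illustrative SA-DPO-style loss for one preference pair (hypothetical form).

    logp_w / logp_l         : policy log-probs of the chosen / rejected response
    ref_logp_w / ref_logp_l : reference-model log-probs of the same responses
    semantic_dist           : distance in [0, 1] between the two responses,
                              e.g. 1 - cosine similarity of their embeddings
    """
    # Implicit reward difference, as in standard DPO
    reward_margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # Structure-aware margin: small for near-synonyms, larger for distinct pairs
    margin = gamma * semantic_dist
    # Logistic surrogate applied to the shifted margin
    return math.log1p(math.exp(-(reward_margin - margin)))
```

With `semantic_dist = 0` this reduces to the standard DPO logistic loss; as the distance grows, a larger reward gap is required before the loss becomes small, which is how hard (clearly distinct) pairs are penalized more than near-synonyms.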
Why it matters
This paper resolves a theoretical inconsistency at the heart of current LLM preference learning: for the hypothesis sets typical of neural networks, standard surrogate losses such as DPO's yield vacuous generalization guarantees. The margin-shifted, structure-aware framework (SA-DPO) puts alignment with human intent on a rigorous footing, and the accompanying analysis guides the choice of more effective loss functions for capacity-bounded models.
Original Abstract
Preference learning has become the foundation of aligning Large Language Models (LLMs) with human intent. Popular methods, such as Direct Preference Optimization (DPO), minimize surrogate losses as proxies for the intractable pairwise ranking loss. However, we demonstrate that for the equicontinuous hypothesis sets typical of neural networks, these standard surrogates are theoretically inconsistent, yielding vacuous generalization guarantees. To resolve this, we formulate LLM alignment within a margin-shifted ranking framework. We derive rigorous $H$-consistency bounds that depend on enforcing a separation margin $\gamma$. Crucially, we extend this to Structure-Aware $H$-consistency, introducing a novel objective (SA-DPO) that adapts the margin based on the semantic distance between responses to handle synonyms and hard pairs. Finally, we analyze the trade-off between consistency and model limitations via the Margin-Capacity Profile, proving that heavy-tailed surrogates (such as the Polynomial Hinge family) offer superior consistency guarantees for capacity-bounded models compared to the standard logistic loss used in DPO.
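To make the tail comparison concrete, here is a small sketch contrasting the logistic surrogate used by DPO with a heavy-tailed, polynomially decaying surrogate. The abstract names the Polynomial Hinge family without giving its formula, so `poly_tail_surrogate` below is an illustrative stand-in (its exact form and the exponent `p` are assumptions), not the paper's definition.

```python
import math

def logistic_surrogate(m):
    # DPO's surrogate: -log(sigmoid(m)) = log(1 + exp(-m)),
    # whose tail decays exponentially as the margin m grows.
    return math.log1p(math.exp(-m))

def poly_tail_surrogate(m, p=2.0):
    # Illustrative heavy-tailed surrogate (assumed form): hinge-like
    # penalty for violated pairs, polynomial 1/(1+m)^p decay otherwise.
    if m <= 0.0:
        return 1.0 - m
    return 1.0 / (1.0 + m) ** p
```

At a comfortable margin of m = 5 the logistic loss is already about 0.007 while the polynomial tail is still about 0.028, so the heavy-tailed surrogate keeps a non-negligible penalty (and gradient) deeper into the margin region, the regime where the paper argues such surrogates behave better for capacity-bounded models.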