DDO-RM for LLM Preference Optimization: A Minimal Held-Out Benchmark against DPO
Tiantian Zhang, Jierui Zuo, Wenping Wang
TLDR
DDO-RM, a new reward-guided method for LLM preference optimization, outperforms DPO on a minimal held-out benchmark.
Key contributions
- DDO-RM uses a reward-guided decision-distribution update, forming a policy over candidate responses.
- It distills reward-model scores into the policy, rather than optimizing only a binary chosen-rejected relation.
- DDO-RM improves mean pair accuracy, AUC, and margin over DPO on Pythia-410m in preliminary benchmarks.
Why it matters
This paper introduces DDO-RM, an alternative to DPO for LLM preference optimization, showing that a reward-guided decision-distribution update can outperform a direct pairwise objective even in a minimal pairwise setting. The results are preliminary (one model family, one dataset, three seeds), but they point to reward-guided distillation as a promising direction for LLM alignment and preference learning.
Original Abstract
This paper reorganizes the current manuscript around the DPO versus DDO-RM preference-optimization project and focuses on two parts: the algorithmic view and the preliminary held-out benchmark. The benchmark asks a narrow question: even in a minimal pairwise chosen-versus-rejected setting, can a reward-guided decision-distribution update outperform a direct pairwise objective? We compare Direct Preference Optimization (DPO) against DDO-RM on EleutherAI/pythia-410m using HuggingFaceH4/ultrafeedback_binarized, evaluate on the held-out test_prefs split, and report results for seeds 42, 13, and 3407. Algorithmically, DDO-RM treats each prompt as a finite decision problem over candidate responses. Instead of optimizing only a binary chosen-rejected relation, it forms a policy distribution over candidates, centers reward-model scores under that distribution, and distills a reward-guided target distribution back into the policy. In the current public benchmark, DDO-RM improves mean pair accuracy from 0.5238 to 0.5602, AUC from 0.5315 to 0.5382, and mean margin from 0.1377 to 0.5353 relative to DPO. These are encouraging but still preliminary results: the study covers one model family, one dataset, one held-out evaluation split, and three seeds.
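The abstract's algorithmic description — form a policy distribution over candidates, center reward-model scores under that distribution, and distill a reward-guided target back into the policy — can be sketched in a few lines. The following is a minimal illustration of the idea, not the authors' implementation: the function names and the temperature `beta` are assumptions, and real training would operate on per-token log-probabilities from a language model rather than scalar inputs.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    z = sum(es)
    return [e / z for e in es]

def ddo_rm_target(policy_logps, rewards, beta=1.0):
    """Reward-guided target distribution over K candidate responses.

    policy_logps: unnormalized log-probabilities the policy assigns
        to each candidate for one prompt.
    rewards: reward-model scores for the same candidates.
    beta: temperature controlling how sharply the target follows
        the centered rewards (an assumed knob, not from the paper).
    """
    pi = softmax(policy_logps)
    # Center reward scores under the current policy distribution.
    baseline = sum(p * r for p, r in zip(pi, rewards))
    centered = [r - baseline for r in rewards]
    # Reward-guided target: softmax of the centered scores.
    return softmax([beta * c for c in centered])

def ddo_rm_loss(policy_logps, rewards, beta=1.0):
    """Cross-entropy that distills the target back into the policy."""
    target = ddo_rm_target(policy_logps, rewards, beta)
    log_z = math.log(sum(math.exp(l) for l in policy_logps))
    log_pi = [l - log_z for l in policy_logps]
    return -sum(t * lp for t, lp in zip(target, log_pi))
```

In this sketch the target is treated as fixed (no gradient flows through it), so minimizing the loss pulls the policy distribution toward candidates the reward model scores above the policy-weighted average — a soft generalization of the binary chosen-versus-rejected signal that DPO optimizes.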