ArXiv TLDR

UDM-GRPO: Stable and Efficient Group Relative Policy Optimization for Uniform Discrete Diffusion Models

arXiv:2604.18518

Jiaqi Wang, Haoge Deng, Ting Pan, Yang Liu, Chengyuan Wang + 3 more

cs.CV · cs.LG

TLDR

UDM-GRPO is the first framework to integrate Uniform Discrete Diffusion Models with RL: it stabilizes policy optimization by treating the final clean sample as the action and by reconstructing trajectories via the diffusion forward process, achieving SOTA results on T2I and OCR benchmarks.

Key contributions

  • Introduces UDM-GRPO, the first framework to integrate Uniform Discrete Diffusion Models with RL.
  • Optimizes by treating the final clean sample as the action for stable signals.
  • Reconstructs trajectories via diffusion forward process for better path alignment.
  • Achieves SOTA performance on T2I tasks (GenEval 96%, PickScore 23.81) and lifts OCR accuracy from 8% to 57%.
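To make the two insights above concrete, here is a minimal sketch of the pieces they describe: GRPO's group-relative reward normalization (applied to rewards scored on the final clean samples) and a uniform-discrete-diffusion forward process that re-noises a clean sample to reconstruct a trajectory. This is an illustrative approximation, not the paper's implementation; the function names and the token-level uniform noising assumption are ours.

```python
import math
import random

def group_relative_advantages(rewards):
    """GRPO-style advantages: normalize each sample's reward against
    the mean and std of its group (rewards are scored on final clean
    samples, per insight (i))."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var) + 1e-8  # epsilon guards against zero variance
    return [(r - mean) / std for r in rewards]

def forward_noise_uniform(x0, t, vocab_size, rng=random):
    """Uniform discrete diffusion forward process: each token of the
    clean sample x0 is independently replaced by a uniformly random
    vocab token with probability t (the noise level). Re-noising clean
    samples like this is one way to reconstruct trajectories whose
    probability paths match the pretraining distribution (insight (ii))."""
    return [rng.randrange(vocab_size) if rng.random() < t else tok
            for tok in x0]
```

For example, a group of four sampled images with rewards `[1.0, 2.0, 3.0, 4.0]` yields zero-mean advantages, and `forward_noise_uniform(x0, 0.0, V)` returns `x0` unchanged while `t = 1.0` fully randomizes it.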

Why it matters

Naively applying GRPO to Uniform Discrete Diffusion Models (UDM) leads to instability. UDM-GRPO provides the first stable and efficient RL integration for UDM, achieving state-of-the-art performance in T2I and OCR tasks. This advancement unlocks new potential for discrete generative modeling.

Original Abstract

Uniform Discrete Diffusion Model (UDM) has recently emerged as a promising paradigm for discrete generative modeling; however, its integration with reinforcement learning remains largely unexplored. We observe that naively applying GRPO to UDM leads to training instability and marginal performance gains. To address this, we propose UDM-GRPO, the first framework to integrate UDM with RL. Our method is guided by two key insights: (i) treating the final clean sample as the action provides more accurate and stable optimization signals; and (ii) reconstructing trajectories via the diffusion forward process better aligns probability paths with the pretraining distribution. Additionally, we introduce two strategies, Reduced-Step and CFG-Free, to further improve training efficiency. UDM-GRPO significantly improves base model performance across multiple T2I tasks. Notably, GenEval accuracy improves from 69% to 96% and PickScore increases from 20.46 to 23.81, achieving state-of-the-art performance in both continuous and discrete settings. On the OCR benchmark, accuracy rises from 8% to 57%, further validating the generalization ability of our method. Code is available at https://github.com/Yovecent/UDM-GRPO.
