ArXiv TLDR

Discrete Diffusion Modeling by Estimating the Ratios of the Data Distribution

arXiv:2310.16834

Aaron Lou, Chenlin Meng, Stefano Ermon

stat.ML · cs.CL · cs.LG

TLDR

This paper introduces Score Entropy, a novel loss function that extends score matching to discrete data, enabling discrete diffusion models that outperform existing language diffusion methods and rival autoregressive models like GPT-2.

Key contributions

  • Proposes the Score Entropy loss, which generalizes score matching to discrete data domains (see the sketch after this list).
  • Develops Score Entropy Discrete Diffusion (SEDD) models that reduce perplexity on language modeling tasks by 25-75% relative to prior language diffusion paradigms.
  • Demonstrates SEDD's advantages over autoregressive models: better generation quality without annealing, comparable quality with up to 32× fewer network evaluations, and flexible, controllable infilling.
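
To make the loss concrete, below is a minimal PyTorch sketch of the per-pair score entropy term: for a corrupted token x and each candidate y ≠ x, the model's ratio estimate s = s_θ(x)_y is scored against the true ratio r = p(y)/p(x) via s − r·log(s) + r·(log(r) − 1), which is nonnegative and minimized exactly at s = r. The function name, tensor layout, and optional weighting here are illustrative assumptions, not the authors' implementation; in practice the paper trains a tractable denoising variant that targets transition ratios derived from the forward diffusion process.

```python
import torch

def score_entropy_loss(pred_ratios, true_ratios, weights=None):
    """Sketch of a score-entropy-style objective (assumed form).

    pred_ratios: positive tensor [batch, candidates], model estimates s_theta(x)_y.
    true_ratios: positive tensor [batch, candidates], target ratios p(y)/p(x).
    weights:     optional per-pair weights (e.g. from the forward process).
    """
    # Clamp for numerical safety before taking logs.
    s = pred_ratios.clamp_min(1e-12)
    r = true_ratios.clamp_min(1e-12)
    # Per-pair term: s - r*log(s) + r*(log(r) - 1).
    # Nonnegative, and zero exactly when s == r.
    per_pair = s - r * torch.log(s) + r * (torch.log(r) - 1.0)
    if weights is not None:
        per_pair = weights * per_pair
    # Sum over candidate tokens y != x, then average over the batch.
    return per_pair.sum(dim=-1).mean()
```

A toy call would look like `loss = score_entropy_loss(model(x), ratios)`, with both tensors shaped `[batch, num_candidates]`; the denoising form used for training swaps the intractable data ratios for transition ratios that can be computed in closed form.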

Why it matters

Diffusion models have revolutionized continuous data generation but have struggled with discrete data such as natural language, largely because score matching theory does not transfer directly to discrete spaces. By introducing Score Entropy, this work overcomes that obstacle, enabling discrete diffusion models that not only outperform prior diffusion approaches but also compete with popular autoregressive models, even surpassing GPT-2. This advancement broadens the applicability of diffusion models to discrete domains, offering more efficient, controllable, and high-quality generative modeling for language and other discrete data.

Original Abstract

Despite their groundbreaking performance for many generative modeling tasks, diffusion models have fallen short on discrete data domains such as natural language. Crucially, standard diffusion models rely on the well-established theory of score matching, but efforts to generalize this to discrete structures have not yielded the same empirical gains. In this work, we bridge this gap by proposing score entropy, a novel loss that naturally extends score matching to discrete spaces, integrates seamlessly to build discrete diffusion models, and significantly boosts performance. Experimentally, we test our Score Entropy Discrete Diffusion models (SEDD) on standard language modeling tasks. For comparable model sizes, SEDD beats existing language diffusion paradigms (reducing perplexity by $25$-$75$\%) and is competitive with autoregressive models, in particular outperforming GPT-2. Furthermore, compared to autoregressive models, SEDD generates faithful text without requiring distribution annealing techniques like temperature scaling (around $6$-$8\times$ better generative perplexity than un-annealed GPT-2), can trade compute for quality (similar quality with $32\times$ fewer network evaluations), and enables controllable infilling (matching nucleus sampling quality while enabling other strategies besides left-to-right prompting).
