ArXiv TLDR

The Interplay of Data Structure and Imbalance in the Learning Dynamics of Diffusion Models

arXiv:2605.06367

Flavio Nicoletti, Chenxiao Ma, Enrico Ventura, Luca Saglietti, Stefano Sarao Mannelli

stat.ML · cond-mat.dis-nn · cs.LG

TLDR

This paper analyzes how data structure and sampling imbalance shape the learning dynamics of diffusion models, showing that class variance and sampling bias dictate the order in which classes are generalized and memorized.

Key contributions

  • Introduces a high-dimensional analytical framework to study class-dependent learning in score-based diffusion models.
  • Identifies class variance as the primary determinant of learning order, consistently favoring higher-variance classes.
  • Shows sampling imbalance can reverse learning order, delaying memorization for minority classes during backward diffusion.
  • Suggests diffusion models can memorize some classes while others remain insufficiently learned, impacting fairness.
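As a rough numerical illustration of the setting these contributions describe (a toy setup assumed for exposition, not the paper's actual experiments), one can sample an imbalanced two-class Gaussian mixture whose classes differ in variance, the kind of heterogeneity the analysis is built around:

```python
import numpy as np

rng = np.random.default_rng(0)

d = 100    # data dimension
n = 1000   # total number of samples
rho = 0.9  # sampling imbalance: fraction of samples drawn from class 0

# Two classes with distinct centroids and class-dependent variances;
# class 0 (the majority) is also the higher-variance class here.
mu0, mu1 = np.ones(d), -np.ones(d)
sigma0, sigma1 = 2.0, 0.5

n0 = int(rho * n)
X0 = mu0 + sigma0 * rng.standard_normal((n0, d))
X1 = mu1 + sigma1 * rng.standard_normal((n - n0, d))
X = np.vstack([X0, X1])
y = np.concatenate([np.zeros(n0), np.ones(n - n0)])

print(X.shape)   # (1000, 100)
print(y.mean())  # 0.1 -- minority-class fraction
```

Varying `sigma0`, `sigma1`, and `rho` independently is what lets the theory separate the effect of class variance from that of sampling imbalance.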

Why it matters

This work sheds light on how real-world data heterogeneity impacts diffusion model training, revealing potential disparities. Understanding these dynamics is crucial for developing more robust and fair generative AI systems, especially when dealing with imbalanced datasets.

Original Abstract

Real-world datasets are inherently heterogeneous, yet how per-class structural differences and sampling imbalance shape the training dynamics of diffusion models, and potentially exacerbate disparities, remains poorly understood. While models typically transition from an initial phase of generalization to memorizing the training set, existing theory assumes homogeneous data, leaving open how class imbalance and heterogeneity reshape these dynamics. In this work, we develop a high-dimensional analytical framework to study class-dependent learning in score-based diffusion models. Analyzing a random-features model trained on Gaussian mixtures, we derive the feature-covariance spectrum to characterize per-class generalization and memorization times. We reveal the explicit hierarchy governing these dynamics: class variance is the primary determinant of learning order, consistently favoring higher-variance classes, while centroid geometry plays a secondary role. Sampling imbalance acts as a modulator that can reverse this ordering and, under strong imbalance, forces minority classes to acquire distinct, delayed speciation times during backward diffusion. Together, these results suggest that diffusion models can memorize some classes while others remain insufficiently learned. We validate our theoretical predictions empirically using U-Net models trained on Fashion-MNIST.
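The abstract's central object is the spectrum of the feature covariance of a random-features map applied to Gaussian-mixture data. A minimal empirical sketch of that quantity (an assumed toy construction, not the paper's analytical derivation) looks like this:

```python
import numpy as np

rng = np.random.default_rng(1)

d, p, n = 50, 200, 2000  # input dim, number of random features, samples

# Imbalanced Gaussian mixture with class-dependent variance.
mu = np.ones(d)
n0 = int(0.9 * n)
X = np.vstack([
    mu + 1.5 * rng.standard_normal((n0, d)),       # majority, high variance
    -mu + 0.5 * rng.standard_normal((n - n0, d)),  # minority, low variance
])

# Random-features map: phi(x) = tanh(W x / sqrt(d)) with fixed random W.
W = rng.standard_normal((p, d))
Phi = np.tanh(X @ W.T / np.sqrt(d))

# Empirical feature covariance and its eigenvalue spectrum
# (eigvalsh returns eigenvalues in ascending order).
C = Phi.T @ Phi / n
eigvals = np.linalg.eigvalsh(C)

print(eigvals.shape)  # (200,)
```

In the paper's framework, the per-class contributions to this spectrum are what set the class-dependent generalization and memorization times; this snippet only shows how the empirical spectrum is formed.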
