ArXiv TLDR

Trajectory as the Teacher: Few-Step Discrete Flow Matching via Energy-Navigated Distillation

arXiv:2605.07924

Amin Karimi Monsefi, Dominic Culver, Nikhil Bhendawade, Manuel R. Ciosici, Yizhe Zhang + 1 more

cs.LG · cs.AI · cs.CL

TLDR

TS-DFM improves discrete flow matching distillation by steering training trajectories with an energy compass; its 8-step student generates text 128x faster than the 1,024-step teacher at 32% lower perplexity.

Key contributions

  • Identifies trajectory quality, not student capacity, as the bottleneck in discrete flow matching distillation.
  • Introduces Trajectory-Shaped Discrete Flow Matching (TS-DFM), which replaces blind stochastic jumps with an energy compass that guides navigation at each midpoint (see the sketch after this list).
  • TS-DFM's 8-step student achieves 32% lower perplexity than its 1,024-step teacher while running 128x faster.
  • Achieves the best perplexity among the compared discrete-generation baselines, including methods trained on 6x more data or using 5x larger models.
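
To make the mechanism concrete, here is a minimal sketch of the energy-guided navigation described in the abstract. It is an illustration under stated assumptions, not the authors' released code: `teacher.step`, `energy_model`, and `num_candidates` are hypothetical interfaces standing in for the paper's teacher sampler and energy compass.

```python
# Hypothetical sketch of TS-DFM trajectory shaping; interface names are
# assumptions, not the paper's actual API.
import torch

@torch.no_grad()
def shaped_trajectory(teacher, energy_model, x_noise, timesteps, num_candidates=4):
    """Build one training trajectory with energy-guided midpoints.

    teacher:      multi-step discrete flow-matching model; teacher.step(x, t)
                  is assumed to sample one stochastic jump from x at time t.
    energy_model: lightweight scorer; lower energy = more coherent sequence.
    """
    x = x_noise
    trajectory = [x]
    for t in timesteps:
        # Instead of one blind stochastic jump, sample several candidate
        # continuations at this midpoint ...
        candidates = [teacher.step(x, t) for _ in range(num_candidates)]
        # ... score each with the energy compass ...
        energies = torch.stack([energy_model(c) for c in candidates])
        # ... and keep the most coherent one, so a single bad early decision
        # cannot propagate through the rest of the trajectory.
        x = candidates[energies.argmin().item()]
        trajectory.append(x)
    # Used only to build the distillation dataset; nothing here runs at
    # inference time, matching the abstract's "training-only" claim.
    return trajectory
```

All of this runs only while building training data for the student, which is why the abstract can state that inference cost is unchanged.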

Why it matters

This paper addresses a core limitation in discrete flow matching, making text generation significantly more efficient and accurate. By improving the training trajectory itself, it enables much faster models to surpass the performance of their slower teachers. This advancement could accelerate the development of high-quality generative AI.

Original Abstract

Discrete flow matching generates text by iteratively transforming noise tokens into coherent language, but may require hundreds of forward passes. Distillation uses the multi-step trajectory to train a student to reproduce the process in a few steps. When the student underperforms, the usual explanation is insufficient capacity. We argue the opposite: the trajectory is the bottleneck, not the student. Each training trajectory is built through a chain of blind stochastic jumps with no evaluation of sequence quality; a single bad decision at an early midpoint propagates through subsequent steps, yet the student must imitate the result. Trajectory-Shaped Discrete Flow Matching (TS-DFM) replaces these blind jumps with guided navigation: a lightweight energy compass evaluates candidate continuations at each midpoint, selecting the most coherent. All shaping is training-only; inference cost is unchanged. On 170M-parameter language modeling, the shaped student at 8 steps achieves 32% lower perplexity than the 1,024-step teacher while being 128x faster, with gains consistent across source distributions and three evaluators of increasing scale. TS-DFM achieves the best perplexity of any discrete-generation baseline we compare against, including methods trained on 6x more data or using 5x larger models.
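
The abstract's headline numbers (8 student steps vs. 1,024 teacher steps, 128x faster) suggest the student learns to compress long stretches of the shaped trajectory into single jumps. The sketch below shows one plausible distillation update under that reading; the `student` interface, the stride of 128, and the token-level cross-entropy objective are assumptions for illustration, not details given in the abstract.

```python
# Hedged sketch of few-step distillation on a shaped trajectory; this is one
# reading of the abstract, not the released method.
import torch
import torch.nn.functional as F

def distill_step(student, optimizer, trajectory, stride=128):
    """One distillation update: the student learns to cover `stride` teacher
    steps in a single jump (1,024 teacher steps / stride 128 = 8 student steps)."""
    loss = torch.tensor(0.0)
    # Pair up distant points on the energy-shaped trajectory ...
    for i in range(0, len(trajectory) - stride, stride):
        x_start, x_target = trajectory[i], trajectory[i + stride]
        # ... and train the student to reproduce the long-range transition
        # in one forward pass. Assumed shapes: logits (seq_len, vocab),
        # x_target a LongTensor of token ids (seq_len,).
        logits = student(x_start, step=i)
        loss = loss + F.cross_entropy(logits, x_target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Under this framing, the paper's argument is that the quality of `trajectory` (not the student's capacity) determines how well such an 8-step student can perform, which is exactly what the energy compass is meant to fix.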

📬 Weekly AI Paper Digest

Get the top 10 AI/ML arXiv papers from the week — summarized, scored, and delivered to your inbox every Monday.