ArXiv TLDR

ReflectDrive-2: Reinforcement-Learning-Aligned Self-Editing for Discrete Diffusion Driving

arXiv:2605.04647

Huimin Wang, Yue Wang, Bihao Cui, Pengxiang Li, Ben Lu, et al.

cs.RO

TLDR

ReflectDrive-2 is a masked discrete diffusion planner for autonomous driving that edits its own trajectories in place; reinforcement-learning fine-tuning is what turns this self-editing into a meaningful performance gain.

Key contributions

  • Proposes ReflectDrive-2, a masked discrete diffusion planner for autonomous driving using discrete trajectory tokens.
  • Enables in-place trajectory self-editing (AutoEdit) with the same model; no auxiliary refinement network is needed (see the sketch after this list).
  • Trains in two stages: supervised recovery of expert trajectories from structure-aware perturbations, then RL fine-tuning of the full decision–draft–reflect rollout, which proves crucial for making editing pay off.
  • Achieves 91.0 PDMS on NAVSIM with camera-only input, at 31.8 ms average latency on NVIDIA Thor.
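
The self-editing loop is easy to picture in token space: find the draft tokens the model is least confident in, re-mask them, and let the same model refill them in one parallel decoding pass. The sketch below is a minimal illustration under that assumption; `planner`, `MASK_ID`, and the confidence-based selection rule are hypothetical stand-ins, not the paper's actual AutoEdit interface.

```python
import torch

MASK_ID = 0  # hypothetical id of the [MASK] trajectory token

@torch.no_grad()
def auto_edit(planner, tokens: torch.Tensor,
              n_rounds: int = 2, edit_frac: float = 0.2) -> torch.Tensor:
    """Re-mask the least-confident trajectory tokens and re-decode them in
    place with the same masked diffusion model that drafted them (sketch)."""
    for _ in range(n_rounds):
        logits = planner(tokens)                    # (T, vocab) per-token logits
        conf = logits.softmax(-1).gather(-1, tokens[:, None])[:, 0]
        k = max(1, int(edit_frac * tokens.numel()))
        low = conf.argsort()[:k]                    # least-confident positions
        tokens[low] = MASK_ID                       # in-place re-masking
        refreshed = planner(tokens)                 # one parallel masked-decoding pass
        tokens[low] = refreshed[low].argmax(-1)     # rewrite only the masked slots
    return tokens
```

Because the editor is the drafting model itself, there is no second network to train, deploy, or keep in sync with the planner.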

Why it matters

ReflectDrive-2 shows that a planner can revise its own trajectories with the same model that drafts them, cheaply and without an auxiliary refinement network. Reinforcement learning is the key ingredient: under supervised training alone, inference-time AutoEdit improves PDMS by at most 0.3, whereas RL fine-tuning raises the gain to 1.9. This offers a path to driving systems that can correct their own plans at inference time.

Original Abstract

We introduce ReflectDrive-2, a masked discrete diffusion planner with a separate action expert for autonomous driving that represents plans as discrete trajectory tokens and generates them through parallel masked decoding. This discrete token space enables in-place trajectory revision: AutoEdit rewrites selected tokens using the same model, without requiring an auxiliary refinement network. To train this capability, we use a two-stage procedure. First, we construct structure-aware perturbations of expert trajectories along longitudinal progress and lateral heading directions and supervise the model to recover the original expert trajectory. We then fine-tune the full decision–draft–reflect rollout with reinforcement learning (RL), assigning terminal driving reward to the final post-edit trajectory and propagating policy-gradient credit through full-rollout transitions. Full-rollout RL proves crucial for coupling drafting and editing: under supervised training alone, inference-time AutoEdit improves PDMS by at most 0.3, whereas RL increases its gain to 1.9. We also co-design an efficient reflective decoding stack for the decision–draft–reflect pipeline, combining shared-prefix KV reuse, Alternating Step Decode, and fused on-device unmasking. On NAVSIM, ReflectDrive-2 achieves 91.0 PDMS with camera-only input and 94.8 PDMS in a best-of-6 oracle setting, while running at 31.8 ms average latency on NVIDIA Thor.
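
The first training stage is concrete enough to sketch: shift the expert trajectory along its local longitudinal (progress) and lateral (heading-normal) directions, then supervise the model to recover the original. A rough NumPy illustration, where the perturbation magnitudes and the single global offset per trajectory are assumptions rather than the paper's exact recipe:

```python
import numpy as np

def perturb_expert(traj: np.ndarray, rng: np.random.Generator,
                   max_long: float = 2.0, max_lat: float = 0.5) -> np.ndarray:
    """Shift (T, 2) expert waypoints along local longitudinal and lateral
    directions; stage-1 training asks the model to undo the shift.
    Magnitudes here are illustrative, not taken from the paper."""
    d = np.diff(traj, axis=0)
    heading = np.arctan2(d[:, 1], d[:, 0])
    heading = np.append(heading, heading[-1])      # reuse final segment heading
    fwd = np.stack([np.cos(heading), np.sin(heading)], axis=-1)   # progress dir
    lat = np.stack([-np.sin(heading), np.cos(heading)], axis=-1)  # heading-normal
    return traj + rng.uniform(-max_long, max_long) * fwd \
                + rng.uniform(-max_lat, max_lat) * lat
```

The perturbed trajectory would be tokenized as the model input and the unperturbed expert trajectory as the recovery target; the RL stage then rewards only the final post-edit trajectory rather than any intermediate draft.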
