Multi-ORFT: Stable Online Reinforcement Fine-Tuning for Multi-Agent Diffusion Planning in Cooperative Driving

April 13, 2026

arXiv:2604.11734

Haojie Bai, Aimin Li, Ruoyu Yao, Xiongwei Zhao, Tingting Zhang + 3 more

cs.RO, cs.AI

TLDR

Multi-ORFT improves cooperative driving safety and efficiency by combining scene-conditioned diffusion pre-training with stable online reinforcement fine-tuning.

Key contributions

Combines scene-conditioned diffusion pre-training with stable online reinforcement fine-tuning.
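Uses inter-agent self-attention, cross-attention, and AdaLN-Zero-based scene conditioning to generate scene-consistent joint trajectories.
Formulates a two-level MDP and applies variance-gated group-relative policy optimization (VG-GRPO) for stable online diffusion policy optimization.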
Demonstrates improved safety (reduced collisions/off-road) and efficiency in cooperative driving.
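
As a rough illustration of the AdaLN-Zero conditioning named above, the sketch below follows the general adaptive layer norm pattern with a zero-initialized gate. It is not the paper's implementation: the function names (`adaln_zero_block`, `zero_init`), the toy sublayer, and the two-element scene vector are all hypothetical, and plain Python lists stand in for tensors.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize a feature vector to zero mean and unit variance (no learned affine)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def linear(weights, bias, c):
    """Dense layer: one output per row of `weights`."""
    return [sum(w * ci for w, ci in zip(row, c)) + b for row, b in zip(weights, bias)]

def zero_init(d, cond_dim):
    """Zero-initialized modulation layer producing 3 * d values (shift, scale, gate)."""
    return [[0.0] * cond_dim for _ in range(3 * d)], [0.0] * (3 * d)

def adaln_zero_block(x, cond, weights, bias, sublayer):
    """Residual block whose shift, scale, and gate are regressed from `cond`."""
    d = len(x)
    params = linear(weights, bias, cond)
    shift, scale, gate = params[:d], params[d:2 * d], params[2 * d:]
    h = layer_norm(x)
    h = [hv * (1.0 + s) + b for hv, s, b in zip(h, scale, shift)]  # modulate
    h = sublayer(h)                                                # e.g. attention or MLP
    return [xv + g * hv for xv, g, hv in zip(x, gate, h)]          # gated residual

# Hypothetical inputs: a 4-d agent feature and a 2-d scene embedding.
x = [0.5, -1.0, 2.0, 0.0]
scene = [1.0, 0.3]
W, b = zero_init(len(x), len(scene))
out_at_init = adaln_zero_block(x, scene, W, b, sublayer=lambda h: [2.0 * v for v in h])

# Opening the gate (bias of 1.0 on the gate slice) lets the sublayer contribute.
b_open = b[:2 * len(x)] + [1.0] * len(x)
out_open = adaln_zero_block(x, scene, W, b_open, sublayer=lambda h: [2.0 * v for v in h])
```

With the modulation layer initialized to zero, the gate is zero and the block returns its input unchanged, so scene conditioning is phased in gradually as training moves the weights away from zero.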

Why it matters

This paper stabilizes online fine-tuning of diffusion planners for multi-agent cooperative driving. Relative to the pre-trained planner, closed-loop evaluation on WOMD shows lower collision (2.04% to 1.89%) and off-road (1.68% to 1.36%) rates and a higher average speed (8.36 to 8.61 m/s), gains that bear directly on the reliability of autonomous systems.
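
To put the size of these gains in perspective, the short calculation below converts the before and after figures reported in the abstract into relative changes. The helper name `relative_change` is illustrative; the numbers are taken directly from the abstract.

```python
def relative_change(before, after):
    """Percentage change from `before` to `after`."""
    return (after - before) / before * 100.0

# (pre-trained planner, Multi-ORFT) pairs as reported in the abstract.
reported = {
    "collision rate (%)": (2.04, 1.89),
    "off-road rate (%)": (1.68, 1.36),
    "average speed (m/s)": (8.36, 8.61),
}

changes = {name: relative_change(b, a) for name, (b, a) in reported.items()}
for name, pct in changes.items():
    print(f"{name}: {pct:+.1f}%")
# collision rate (%): -7.4%
# off-road rate (%): -19.0%
# average speed (m/s): +3.0%
```

In relative terms, the off-road rate shows the largest improvement, while the collision rate and average speed change by single-digit percentages.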

Original Abstract

Closed-loop cooperative driving requires planners that generate realistic multimodal multi-agent trajectories while improving safety and traffic efficiency. Existing diffusion planners can model multimodal behaviors from demonstrations, but they often exhibit weak scene consistency and remain poorly aligned with closed-loop objectives; meanwhile, stable online post-training in reactive multi-agent environments remains difficult. We present Multi-ORFT, which couples scene-conditioned diffusion pre-training with stable online reinforcement post-training. In pre-training, the planner uses inter-agent self-attention, cross-attention, and AdaLN-Zero-based scene conditioning to improve scene consistency and road adherence of joint trajectories. In post-training, we formulate a two-level MDP that exposes step-wise reverse-kernel likelihoods for online optimization, and combine dense trajectory-level rewards with variance-gated group-relative policy optimization (VG-GRPO) to stabilize training. On the WOMD closed-loop benchmark, Multi-ORFT reduces collision rate from 2.04% to 1.89% and off-road rate from 1.68% to 1.36%, while increasing average speed from 8.36 to 8.61 m/s relative to the pre-trained planner, and it outperforms strong open-source baselines including SMART-large, SMART-tiny-CLSFT, and VBD on the primary safety and efficiency metrics. These results show that coupling scene-consistent denoising with stable online diffusion-policy optimization improves the reliability of closed-loop cooperative driving.
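
The abstract does not spell out how the variance gate in VG-GRPO works, so the following is only a sketch under a stated assumption: that a group of rollouts is skipped when its rewards are too similar to carry a useful learning signal. The threshold `min_std` and both function names are hypothetical; the group-relative normalization itself is the standard GRPO-style z-score within a group.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: z-score each reward within its rollout group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

def variance_gated_advantages(rewards, min_std=0.05):
    """ASSUMED gate (not from the paper): zero out groups with reward spread below `min_std`."""
    if statistics.pstdev(rewards) < min_std:
        # Dividing by a tiny standard deviation would amplify noise, so skip the group.
        return [0.0] * len(rewards)
    return group_relative_advantages(rewards)

# Trajectory-level rewards for two hypothetical groups of four rollouts each.
informative = variance_gated_advantages([1.0, 0.0, 0.5, 0.5])
near_constant = variance_gated_advantages([0.50, 0.51, 0.50, 0.49])
```

In this sketch the first group yields advantages that sum to zero, with the best rollout pushed up and the worst pushed down, while the second group contributes no gradient because its rewards are nearly identical.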
