ArXiv TLDR

PRISM: Pre-alignment via Black-box On-policy Distillation for Multimodal Reinforcement Learning

arXiv: 2604.28123

Sudong Wang, Weiquan Huang, Xiaomin Yu, Zuhao Yang, Hehai Lin, and 7 others

cs.CV · cs.AI · cs.CL

TLDR

PRISM inserts a black-box on-policy distillation stage between SFT and RLVR that aligns a large multimodal model with its supervision distribution, mitigating SFT-induced distributional drift and improving downstream RL performance.

Key contributions

  • The PRISM pipeline inserts an explicit distribution-alignment stage between SFT and RLVR for LMMs.
  • Aligns via black-box on-policy distillation: an adversarial game against a Mixture-of-Experts (MoE) discriminator whose dedicated perception and reasoning experts provide disentangled corrective signals (sketched after this list).
  • Mitigates distributional drift and improves alignment without needing access to teacher logits.
  • Boosts downstream RLVR performance on Qwen3-VL by +4.4 (4B) and +6.0 (8B) points of average accuracy over the SFT-to-RLVR baseline across diverse benchmarks.
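
The MoE discriminator is the pipeline's distinctive component: instead of one monolithic judge, each response is scored by a dedicated perception expert and a dedicated reasoning expert, softly weighted by a router. Below is a minimal PyTorch sketch of that idea, assuming pooled response embeddings as input; the module names, routing scheme, and dimensions are illustrative assumptions, not the paper's reported architecture.

```python
import torch
import torch.nn as nn

class MoEDiscriminator(nn.Module):
    """Response-level discriminator with a perception expert and a
    reasoning expert; a router softly weights their verdicts."""

    def __init__(self, hidden_dim: int = 1024):
        super().__init__()
        def expert() -> nn.Sequential:
            return nn.Sequential(
                nn.Linear(hidden_dim, hidden_dim),
                nn.GELU(),
                nn.Linear(hidden_dim, 1),
            )
        self.perception_expert = expert()  # judges visual grounding
        self.reasoning_expert = expert()   # judges the reasoning chain
        self.router = nn.Linear(hidden_dim, 2)

    def forward(self, response_emb: torch.Tensor):
        # response_emb: (batch, hidden_dim) pooled embedding of a response
        weights = torch.softmax(self.router(response_emb), dim=-1)  # (batch, 2)
        per_expert = torch.cat(
            [self.perception_expert(response_emb),
             self.reasoning_expert(response_emb)],
            dim=-1,
        )  # (batch, 2): separate real/fake logits per expert
        fused = (weights * per_expert).sum(dim=-1)  # (batch,): overall score
        return fused, per_expert

disc = MoEDiscriminator()
score, expert_signals = disc(torch.randn(4, 1024))
```

Returning the per-expert logits alongside the fused score is what "disentangled corrective signals" would mean under this reading: the policy can be penalized separately for perception failures and for reasoning failures.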

Why it matters

Supervised fine-tuning often causes distributional drift in large multimodal models, hindering subsequent reinforcement learning. PRISM addresses this by explicitly aligning the model's output distribution with the supervision distribution before RL begins, preserving the model's original capabilities and improving multimodal reasoning. The result is more robust, higher-performing LMMs on complex tasks.

Original Abstract

The standard post-training recipe for large multimodal models (LMMs) applies supervised fine-tuning (SFT) on curated demonstrations followed by reinforcement learning with verifiable rewards (RLVR). However, SFT introduces distributional drift that neither preserves the model's original capabilities nor faithfully matches the supervision distribution. This problem is further amplified in multimodal reasoning, where perception errors and reasoning failures follow distinct drift patterns that compound during subsequent RL. We introduce PRISM, a three-stage pipeline that mitigates this drift by inserting an explicit distribution-alignment stage between SFT and RLVR. Building on the principle of on-policy distillation (OPD), PRISM casts alignment as a black-box, response-level adversarial game between the policy and a Mixture-of-Experts (MoE) discriminator with dedicated perception and reasoning experts, providing disentangled corrective signals that steer the policy toward the supervision distribution without requiring access to teacher logits. While 1.26M public demonstrations suffice for broad SFT initialization, distribution alignment demands higher-fidelity supervision; we therefore curate 113K additional demonstrations from Gemini 3 Flash, featuring dense visual grounding and step-by-step reasoning on the hardest unsolved problems. Experiments on Qwen3-VL show that PRISM consistently improves downstream RLVR performance across multiple RL algorithms (GRPO, DAPO, GSPO) and diverse multimodal benchmarks, improving average accuracy by +4.4 and +6.0 points over the SFT-to-RLVR baseline on 4B and 8B, respectively. Our code, data, and model checkpoints are publicly available at https://github.com/XIAO4579/PRISM.
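
To make the abstract's "black-box, response-level adversarial game" concrete, here is a toy sketch of the alignment stage's training loop, assuming a GAN-style objective with a REINFORCE policy update (text sampling is non-differentiable, so the discriminator's score is used as a response-level reward). The categorical toy policy, the embedding-level discriminator, and every hyperparameter are stand-ins chosen only to keep the example runnable; this is not PRISM's implementation.

```python
import torch
import torch.nn.functional as F

vocab, hidden, batch = 100, 64, 32

# Toy stand-ins: in PRISM the policy is an LMM and the judge is the
# perception/reasoning MoE discriminator; random data keeps this runnable.
policy_logits = torch.nn.Parameter(torch.zeros(vocab))  # toy "policy"
embed = torch.nn.Embedding(vocab, hidden)               # toy response encoder
disc = torch.nn.Linear(hidden, 1)                       # toy discriminator
opt_pi = torch.optim.Adam([policy_logits], lr=1e-2)
opt_d = torch.optim.Adam(
    list(disc.parameters()) + list(embed.parameters()), lr=1e-3
)

teacher_tokens = torch.randint(0, 10, (batch,))  # "demonstrations" on tokens 0-9

for step in range(200):
    # 1. On-policy: responses are sampled from the current policy; the
    #    teacher is observed only through demonstrations (no teacher logits).
    dist = torch.distributions.Categorical(logits=policy_logits)
    samples = dist.sample((batch,))

    # 2. Discriminator step: teacher demonstrations are "real", policy
    #    rollouts are "fake" -- a response-level classification.
    d_loss = F.binary_cross_entropy_with_logits(
        disc(embed(teacher_tokens)), torch.ones(batch, 1)
    ) + F.binary_cross_entropy_with_logits(
        disc(embed(samples)), torch.zeros(batch, 1)
    )
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()

    # 3. Policy step: sampling is non-differentiable, so the discriminator's
    #    score is treated as a response-level reward in a REINFORCE update,
    #    steering the policy toward the supervision distribution.
    with torch.no_grad():
        reward = torch.sigmoid(disc(embed(samples))).squeeze(-1)
    pi_loss = -(dist.log_prob(samples) * (reward - reward.mean())).mean()
    opt_pi.zero_grad()
    pi_loss.backward()
    opt_pi.step()
```

The property that matters survives the simplification: the only signal the policy ever receives is the discriminator's response-level score, which is what lets the teacher be treated as a black box.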
