NoiseGate: Learning Per-Latent Timestep Schedules as Information Gating in World Action Models
Wen Huang, Haoran Sun, Yongjian Guo, Yunxuan Ma, Haoran Li + 6 more
TLDR
NoiseGate introduces a learnable per-latent timestep schedule as an information-gating policy for World Action Models, improving robot manipulation.
Key contributions
- Proposes NoiseGate, which replaces the single shared denoising timestep used by current World Action Models with a learnable per-latent timestep schedule.
- Treats per-latent timesteps as an information-gating policy: a latent frame's noise level modulates how reliable its Key/Value contribution to the action tokens is.
- Combines independent latent timestep sampling, a Gating Policy Network, and task-reward optimization.
- Achieves consistent performance gains on diverse RoboTwin random-scene manipulation tasks.
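The difference between a shared timestep and independent per-latent timesteps can be illustrated with a small sketch. This is not the authors' implementation: the function names, the list-based latents, and the linear noising convention (t = 0 is clean, t = 1 is pure noise) are all assumptions made for illustration.

```python
import random


def sample_timesteps(num_frames, independent, rng):
    """Return one timestep in [0, 1) per predicted latent frame.

    independent=False mimics the shared-scalar schedule;
    independent=True draws a separate timestep for every frame.
    """
    if independent:
        return [rng.random() for _ in range(num_frames)]
    t = rng.random()
    return [t] * num_frames


def noise_latents(latents, timesteps, rng):
    """Blend each frame with Gaussian noise according to its own timestep.

    Assumed convention (hypothetical): t = 0 keeps the frame clean,
    t = 1 replaces it with pure noise.
    """
    noisy = []
    for frame, t in zip(latents, timesteps):
        noisy.append([(1.0 - t) * x + t * rng.gauss(0.0, 1.0) for x in frame])
    return noisy


rng = random.Random(0)
latents = [[1.0, -1.0, 0.5] for _ in range(4)]  # four toy latent frames

shared_t = sample_timesteps(4, independent=False, rng=rng)
per_latent_t = sample_timesteps(4, independent=True, rng=rng)
noisy = noise_latents(latents, per_latent_t, rng)
```

With the shared schedule every frame receives the same noise level, whereas the independent schedule exposes the model to mixed noise levels across frames, which is what a per-latent schedule at inference time would require.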
Why it matters
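Current World Action Models apply one shared timestep to all predicted latent frames, which implicitly treats every frame as equally reliable for action generation. NoiseGate replaces that shared scalar with a learned per-latent schedule, so the policy can decide how much each imagined frame informs the action tokens within the coupled perception-prediction-control process.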
Original Abstract
World Action Models (WAMs) are an emerging family of policies that tie robot action generation to future-observation modeling. In this work, we focus on the joint video-action modeling paradigm, where actions and imagined future observations are co-generated along a shared denoising or flow trajectory, so that perception, prediction, and control are coupled within one generative process. Existing WAMs typically realize this paradigm with a Mixture-of-Transformers (MoT), where video and action tokens interact through shared self-attention. This architecture can in principle assign a separate timestep $t_f$ to each predicted latent frame, yet current systems collapse this degree of freedom onto a single shared scalar $t$. Under the noise-as-masking view of Diffusion Forcing, this shared schedule imposes the unjustified prior that every predicted latent is equally reliable for action generation. We instead view the per-latent schedule as a learnable information-gating policy: by changing a latent frame's noise level, the policy modulates the reliability of its Key/Value contribution to the action tokens. We propose NoiseGate, which combines independent per-latent timestep sampling during backbone training, a lightweight Gating Policy Network that emits per-latent time increments during denoising, and task-reward optimization that trains the schedule policy without hand-crafted shape priors. Built on a joint video-action MoT backbone, NoiseGate delivers consistent gains on diverse RoboTwin random-scene manipulation tasks.
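The idea of a policy that emits per-latent time increments during denoising can be sketched as follows. Everything here is hypothetical: `toy_gating_policy` is a fixed sigmoid-of-linear-score stand-in for the Gating Policy Network, the feature vectors and weights are made up, and the task-reward training of the policy is not shown. The sketch assumes noise levels run from 1 (pure noise) down to 0 (clean).

```python
import math


def toy_gating_policy(features, weights, max_step):
    """Hypothetical stand-in for a gating policy.

    Maps each latent's feature vector to a time increment in (0, max_step)
    through a sigmoid of a linear score. A real policy would be learned.
    """
    increments = []
    for f in features:
        score = sum(w * x for w, x in zip(weights, f))
        increments.append(max_step / (1.0 + math.exp(-score)))
    return increments


def advance_noise_levels(noise_levels, increments):
    """Lower each latent's noise level by its own increment, clamped at 0."""
    return [max(0.0, t - dt) for t, dt in zip(noise_levels, increments)]


noise_levels = [1.0, 1.0, 1.0]                    # all latents start as noise
features = [[2.0, 0.0], [0.0, 0.0], [-2.0, 0.0]]  # made-up per-latent features
weights = [1.0, 0.5]                              # made-up policy weights

for _ in range(4):  # four denoising steps
    increments = toy_gating_policy(features, weights, max_step=0.25)
    noise_levels = advance_noise_levels(noise_levels, increments)
```

After the loop, the latents sit at different noise levels: the first has been denoised the most and the last the least. In the noise-as-masking view described in the abstract, the cleaner latents would then contribute more reliable Key/Value information to the action tokens than the noisier ones.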