ArXiv TLDR

PhyCo: Learning Controllable Physical Priors for Generative Motion

arXiv:2604.28169

Sriram Narayanan, Ziyu Jiang, Srinivasa Narasimhan, Manmohan Chandraker

cs.CV cs.AI cs.LG

TLDR

PhyCo introduces a framework for video diffusion models to generate physically consistent and controllable motion by learning from simulated physics data and VLM feedback.

Key contributions

  • Curated a dataset of over 100K photorealistic simulation videos with systematically varied physical properties (friction, restitution, deformation, force) across diverse scenarios.
  • Physics-supervised fine-tuning of diffusion models via ControlNet conditioned on property maps.
  • VLM-guided reward optimization provides differentiable feedback for enhanced physical consistency.
  • Achieves state-of-the-art physical realism and controllable outputs on the Physics-IQ benchmark.
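To make the second contribution concrete, here is a minimal, hypothetical sketch of how pixel-aligned physical property maps could be stacked into a ControlNet-style conditioning input. The function name, map layout, and normalization are assumptions for illustration, not PhyCo's actual implementation.

```python
# Toy sketch (hypothetical names): stacking pixel-aligned physical
# property maps into a channel-first conditioning block, as a
# ControlNet-style branch might consume them.
# Assumes each map is an H x W grid of normalized values in [0, 1].

def stack_property_maps(friction, restitution, deformation):
    """Stack per-pixel property maps into a C x H x W conditioning block."""
    h, w = len(friction), len(friction[0])
    for m in (restitution, deformation):
        assert len(m) == h and len(m[0]) == w, "maps must be pixel-aligned"
    return [friction, restitution, deformation]  # channel-first layout

# Example: uniform friction 0.5, bouncy surface (restitution 0.9),
# rigid material (deformation 0.0) over a 4x4 frame.
H, W = 4, 4
cond = stack_property_maps(
    [[0.5] * W for _ in range(H)],
    [[0.9] * W for _ in range(H)],
    [[0.0] * W for _ in range(H)],
)
print(len(cond), len(cond[0]), len(cond[0][0]))  # 3 4 4
```

Varying a single channel (say, restitution) while holding the others fixed is what gives the continuous, interpretable control the paper describes.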

Why it matters

This paper addresses a critical limitation of current video diffusion models: their lack of physical consistency. By integrating a novel dataset, physics-supervised training, and VLM feedback, PhyCo enables generative models to produce realistic and controllable physical interactions. This work paves a scalable path towards more robust and useful generative AI for video.

Original Abstract

Modern video diffusion models excel at appearance synthesis but still struggle with physical consistency: objects drift, collisions lack realistic rebound, and material responses seldom match their underlying properties. We present PhyCo, a framework that introduces continuous, interpretable, and physically grounded control into video generation. Our approach integrates three key components: (i) a large-scale dataset of over 100K photorealistic simulation videos where friction, restitution, deformation, and force are systematically varied across diverse scenarios; (ii) physics-supervised fine-tuning of a pretrained diffusion model using a ControlNet conditioned on pixel-aligned physical property maps; and (iii) VLM-guided reward optimization, where a fine-tuned vision-language model evaluates generated videos with targeted physics queries and provides differentiable feedback. This combination enables a generative model to produce physically consistent and controllable outputs through variations in physical attributes, without any simulator or geometry reconstruction at inference. On the Physics-IQ benchmark, PhyCo significantly improves physical realism over strong baselines, and human studies confirm clearer and more faithful control over physical attributes. Our results demonstrate a scalable path toward physically consistent, controllable generative video models that generalize beyond synthetic training environments.
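The abstract's third component, VLM-guided reward optimization, can be sketched as folding a physics-consistency score into the training objective. The functions below are stand-in stubs with assumed names and placeholder values, not PhyCo's API; they only illustrate the shape of the objective.

```python
# Hypothetical sketch of VLM-guided reward optimization: a fine-tuned
# VLM scores a generated video against targeted physics queries, and
# the (differentiable) score is combined with the usual diffusion loss.
# vlm_score and diffusion_loss are placeholder stubs for illustration.

def vlm_score(video, query):
    """Stub: return a physics-consistency score in [0, 1]."""
    return 0.8  # placeholder; a real VLM head would produce this

def diffusion_loss(video):
    """Stub: standard denoising objective term."""
    return 0.25  # placeholder value

def total_loss(video, queries, reward_weight=0.5):
    # Average the VLM's answers to targeted physics queries, then
    # penalize low physics scores alongside the diffusion loss.
    reward = sum(vlm_score(video, q) for q in queries) / len(queries)
    return diffusion_loss(video) + reward_weight * (1.0 - reward)

loss = total_loss(video=None,
                  queries=["Does the ball rebound realistically?"])
print(round(loss, 3))  # 0.35
```

In the real system the reward must be differentiable with respect to the generator's outputs so the feedback can be backpropagated; the stubs here only show how the terms combine.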
