PhyCo: Learning Controllable Physical Priors for Generative Motion
Sriram Narayanan, Ziyu Jiang, Srinivasa Narasimhan, Manmohan Chandraker
TLDR
PhyCo introduces a framework for video diffusion models to generate physically consistent and controllable motion by learning from simulated physics data and VLM feedback.
Key contributions
- Curated a 100K video dataset of diverse simulations with systematically varied physical properties.
- Physics-supervised fine-tuning of diffusion models via ControlNet conditioned on property maps.
- VLM-guided reward optimization provides differentiable feedback for enhanced physical consistency.
- Achieves state-of-the-art physical realism and controllable outputs on benchmarks.
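The ControlNet-style conditioning in the second contribution can be illustrated with a toy sketch. This is a hypothetical, minimal illustration (the function and variable names are mine, not the paper's): a frozen base block is augmented by a control branch that encodes a pixel-aligned property map and feeds it through a zero-initialized output projection, so training starts from the unmodified pretrained model.

```python
import numpy as np

rng = np.random.default_rng(0)

def base_block(x, w):
    """Stand-in for a frozen pretrained diffusion block (one linear layer)."""
    return x @ w

def control_branch(prop_map, w_in, w_zero):
    """Control branch: encode the physical property map, then project it
    through a zero-initialized layer so it has no effect at initialization."""
    h = np.tanh(prop_map @ w_in)
    return h @ w_zero

d = 8
x = rng.normal(size=(4, d))           # latent features
prop_map = rng.normal(size=(4, d))    # e.g. per-pixel friction/restitution values
w_base = rng.normal(size=(d, d))
w_in = rng.normal(size=(d, d))
w_zero = np.zeros((d, d))             # zero-init: the ControlNet trick

y = base_block(x, w_base) + control_branch(prop_map, w_in, w_zero)

# At initialization the control branch contributes nothing,
# so the pretrained model's behavior is preserved:
assert np.allclose(y, base_block(x, w_base))
```

During fine-tuning, only `w_in` and `w_zero` would be updated; as `w_zero` moves away from zero, the property map begins to steer generation.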
Why it matters
This paper addresses a critical limitation of current video diffusion models: their lack of physical consistency. By integrating a novel dataset, physics-supervised training, and VLM feedback, PhyCo enables generative models to produce realistic and controllable physical interactions. This work paves a scalable path towards more robust and useful generative AI for video.
Original Abstract
Modern video diffusion models excel at appearance synthesis but still struggle with physical consistency: objects drift, collisions lack realistic rebound, and material responses seldom match their underlying properties. We present PhyCo, a framework that introduces continuous, interpretable, and physically grounded control into video generation. Our approach integrates three key components: (i) a large-scale dataset of over 100K photorealistic simulation videos where friction, restitution, deformation, and force are systematically varied across diverse scenarios; (ii) physics-supervised fine-tuning of a pretrained diffusion model using a ControlNet conditioned on pixel-aligned physical property maps; and (iii) VLM-guided reward optimization, where a fine-tuned vision-language model evaluates generated videos with targeted physics queries and provides differentiable feedback. This combination enables a generative model to produce physically consistent and controllable outputs through variations in physical attributes, without any simulator or geometry reconstruction at inference. On the Physics-IQ benchmark, PhyCo significantly improves physical realism over strong baselines, and human studies confirm clearer and more faithful control over physical attributes. Our results demonstrate a scalable path toward physically consistent, controllable generative video models that generalize beyond synthetic training environments.
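The VLM-guided reward optimization described in (iii) can be sketched in miniature. Everything here is an assumed, illustrative form rather than the paper's actual objective: a scalar "VLM score" rates how well a generated output matches a target physical attribute, and a single generator parameter is updated by gradient ascent on that score (finite differences stand in for the differentiable feedback a real VLM reward would provide via backpropagation).

```python
import numpy as np

def vlm_score(pred, target):
    """Stand-in for a fine-tuned VLM's physics query: higher = more consistent."""
    return -np.mean((pred - target) ** 2)

def generate(theta, z):
    """Toy 'generator': scales a latent by a single parameter theta."""
    return theta * z

rng = np.random.default_rng(1)
z = rng.normal(size=16)
target = 2.0 * z            # the physically consistent output we want
theta, lr, eps = 0.0, 0.5, 1e-4

for _ in range(200):
    # Finite-difference gradient of the reward w.r.t. theta
    # (a real pipeline would backpropagate through the reward model instead).
    g = (vlm_score(generate(theta + eps, z), target)
         - vlm_score(generate(theta - eps, z), target)) / (2 * eps)
    theta += lr * g

# Gradient ascent on the reward drives theta toward the target value of 2.0:
assert abs(theta - 2.0) < 1e-2
```

The key idea this toy preserves is that the reward model, not a simulator, supplies the training signal at fine-tuning time.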