Seeing Fast and Slow: Learning the Flow of Time in Videos

April 23, 2026 · arXiv:2604.21931

Yen-Siang Wu, Rundong Luo, Jingsen Zhu, Tao Tu, Ali Farhadi + 4 more

cs.CV, cs.AI, cs.GR

TLDR

This paper introduces self-supervised models that detect speed changes and estimate playback speed in videos, then uses them to curate a large slow-motion dataset that enables speed-conditioned video generation and temporal super-resolution.

Key contributions

Learns, in a self-supervised manner, to detect speed changes and estimate playback speed from the multimodal cues and temporal structure of videos.
Curates the largest slow-motion video dataset from noisy, in-the-wild sources.
Develops speed-conditioned video generation, producing motion at specified playback speeds.
Achieves temporal super-resolution, transforming low-FPS videos into high-FPS sequences.
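
As a rough illustration of how self-supervised speed labels can be obtained, the sketch below subsamples frames at a fixed stride so that the resulting clip appears sped up by a known factor. This is a common pretext-task setup and an assumption here: the summary does not describe the paper's exact sampling procedure, and the function name `make_speed_example` and its parameters are hypothetical. It covers only the visual temporal structure, not the multimodal cues mentioned above.

```python
import numpy as np

def make_speed_example(frames, speed, clip_len=8, rng=None):
    """Build one (clip, label) pair for a speed-prediction pretext task.

    Hypothetical helper, not the paper's method.
    frames: array of shape (T, H, W, C) holding a normal-speed video.
    speed: positive integer stride; keeping every `speed`-th frame makes
           the clip look `speed` times faster when played back.
    """
    rng = rng or np.random.default_rng()
    # Number of source frames the subsampled clip spans.
    span = (clip_len - 1) * speed + 1
    if span > len(frames):
        raise ValueError("video too short for this speed and clip length")
    # Random start so different clips come from different parts of the video.
    start = int(rng.integers(0, len(frames) - span + 1))
    idx = start + speed * np.arange(clip_len)
    return frames[idx], speed

# Toy video: 64 frames of 1x1 pixels whose value equals the frame index.
video = np.arange(64).reshape(64, 1, 1, 1)
clip, label = make_speed_example(video, speed=4, rng=np.random.default_rng(0))
```

A classifier or regressor trained on such pairs receives the speed label for free, since it comes from the sampling stride rather than from human annotation.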

Why it matters

This work highlights time as a manipulable perceptual dimension in video learning. It opens doors to temporally controllable video generation, temporal forensics detection, and potentially richer world-models that understand how events unfold over time.

Original Abstract

How can we tell whether a video has been sped up or slowed down? How can we generate videos at different speeds? Although videos have been central to modern computer vision research, little attention has been paid to perceiving and controlling the passage of time. In this paper, we study time as a learnable visual concept and develop models for reasoning about and manipulating the flow of time in videos. We first exploit the multimodal cues and temporal structure naturally present in videos to learn, in a self-supervised manner, to detect speed changes and estimate playback speed. We then show that these learned temporal reasoning models enable us to curate the largest slow-motion video dataset to date from noisy in-the-wild sources. Such slow-motion footage, typically filmed by high-speed cameras, contains substantially richer temporal detail than standard videos. Using this data, we further develop models capable of temporal control, including speed-conditioned video generation, which produces motion at specified playback speed, and temporal super-resolution, which transforms low-FPS, blurry videos into high-FPS sequences with fine-grained temporal details. Our findings highlight time as a manipulable, perceptual dimension in video learning, opening doors to temporally controllable video generation, temporal forensics detection, and potentially richer world-models that understand how events unfold over time.

