FreeSpec: Training-Free Long Video Generation via Singular-Spectrum Reconstruction
Fangda Chen, Shanshan Zhao, Longrong Yang, Chuanfu Xu, Zhigang Luo + 1 more
TLDR
FreeSpec introduces a training-free method for long video generation that uses singular-spectrum reconstruction to counter content drift, temporal inconsistency, and over-smoothed dynamics.
Key contributions
- Analyzes long video generation issues through a singular-spectrum lens, identifying spectral concentration (spectral energy dominated by a few low-rank singular directions) as the cause of suppressed details and motion.
- Proposes FreeSpec, a training-free framework using SVD for global low-rank guidance and local high-rank reconstruction.
- Achieves spectrum-level fusion, avoiding rigid feature partitioning and preserving details and dynamics.
- Significantly improves temporal dynamics and visual quality in long video generation on benchmarks.
Why it matters
Current video diffusion models struggle with long videos, leading to content drift and over-smoothed motion. FreeSpec offers a novel, training-free solution by addressing spectral concentration, making long video generation more practical and higher quality. This advances the field by enabling more consistent and dynamic long-form content creation.
Original Abstract
Video diffusion models perform well in short-video synthesis, but their training-free extension to long videos often suffers from content drift, temporal inconsistency, and over-smoothed dynamics. Existing methods improve temporal consistency by combining a global branch with a local branch, but they often further decompose appearance consistency and temporal dynamics within each branch using predefined criteria. This assignment is unreliable when appearance and action progression are tightly coupled, such as in camera motion and sequential motion. We analyze the video temporal extension issue from a singular-spectrum perspective and show that enlarged self-attention windows induce spectral concentration: spectral energy becomes dominated by a few low-rank singular directions, preserving coarse structure but suppressing high-rank spatial details and motion-rich temporal variations. To mitigate this problem, we propose FreeSpec, a training-free spectral reconstruction framework for long-video generation. FreeSpec decomposes global and local features with singular value decomposition, and uses the global branch as low-rank spectral guidance and the local branch as a high-rank reconstruction basis. This spectrum-level fusion avoids the rigid feature partitioning of previous decomposition rules, preserving long-range consistency while better retaining spatial details and temporal dynamics. Experiments on Wan2.1 and LTX-Video demonstrate that FreeSpec improves long-video generation, especially for temporal dynamics, while maintaining strong visual quality and temporal consistency. Project demo: https://fdchen24.github.io/FreeSpec-Website/.
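The abstract describes spectrum-level fusion: decompose the global and local branch features with SVD, keep the low-rank directions of the global branch as consistency guidance, and reconstruct high-rank detail from the local branch. A minimal NumPy sketch of that idea follows; the cutoff rank `k` and the simple additive fusion are assumptions for illustration, not the paper's exact implementation.

```python
import numpy as np

def spectral_fusion(global_feat, local_feat, k):
    """Sketch of FreeSpec-style spectrum-level fusion (illustrative only).

    global_feat, local_feat: (tokens, channels) feature matrices from the
    global and local attention branches. k is an assumed low-rank cutoff.
    """
    # SVD of each branch's feature matrix
    Ug, sg, Vgt = np.linalg.svd(global_feat, full_matrices=False)
    Ul, sl, Vlt = np.linalg.svd(local_feat, full_matrices=False)

    # Low-rank guidance: top-k singular directions of the global branch
    low = (Ug[:, :k] * sg[:k]) @ Vgt[:k, :]
    # High-rank reconstruction: remaining directions of the local branch
    high = (Ul[:, k:] * sl[k:]) @ Vlt[k:, :]

    # Fuse at the spectrum level (no rigid feature partitioning rule)
    return low + high
```

With `k = 0` the output reduces to the local features, and with full rank it reduces to the global features, so `k` interpolates between local detail and global structure.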