Generating Long Videos of Dynamic Scenes

Abstract

We present a video generation model that accurately reproduces object motion, changes in camera viewpoint, and new content that arises over time. Existing video generation methods often fail to produce new content as a function of time while maintaining consistencies expected in real environments, such as plausible dynamics and object persistence. A common failure case is for content to never change due to over-reliance on inductive biases to provide temporal consistency, such as a single latent code that dictates content for the entire video. On the other extreme, without long-term consistency, generated videos may morph unrealistically between different scenes. To address these limitations, we prioritize the time axis by redesigning the temporal latent representation and learning long-term consistency from data by training on longer videos. To this end, we leverage a two-phase training strategy, where we separately train using longer videos at a low resolution and shorter videos at a high resolution. To evaluate the capabilities of our model, we introduce two new benchmark datasets with explicit focus on long-term temporal dynamics.
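To make the two-phase strategy concrete, the minimal sketch below (NumPy, not the released training code) shows only the data split it implies: the low-resolution generator sees long, spatially downsampled clips, while the super-resolution network sees short clips at full resolution. The clip lengths, resolutions, and average-pool downsampling here are illustrative assumptions.

import numpy as np

def sample_clip(video, num_frames, rng):
    """Sample a random temporal crop of num_frames frames from a (T, H, W, C) video."""
    start = rng.integers(0, video.shape[0] - num_frames + 1)
    return video[start:start + num_frames]

def downsample(clip, factor):
    """Spatially downsample a clip by average pooling with an integer factor."""
    t, h, w, c = clip.shape
    return clip.reshape(t, h // factor, factor, w // factor, factor, c).mean(axis=(2, 4))

rng = np.random.default_rng(0)
video = rng.random((300, 256, 256, 3), dtype=np.float32)  # stand-in for one real training video

low_res_long_clip = downsample(sample_clip(video, num_frames=128, rng=rng), factor=4)  # phase 1: long clip at 64x64
high_res_short_clip = sample_clip(video, num_frames=8, rng=rng)                        # phase 2: short clip at 256x256
print(low_res_long_clip.shape, high_res_short_clip.shape)  # (128, 64, 64, 3) (8, 256, 256, 3)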

@inproceedings{brooks2022generating,
  title={Generating Long Videos of Dynamic Scenes},
  author={Brooks, Tim and Hellsten, Janne and Aittala, Miika and Wang, Ting-Chun and Aila, Timo and Lehtinen, Jaakko and Liu, Ming-Yu and Efros, Alexei A and Karras, Tero},
  booktitle={NeurIPS},
  year={2022}
}

Videos

The first set of videos illustrates our model's ability to generate new content that arises over time. The StyleGAN-V baseline method repeats the same content -- for example, the horse ends up in front of the same jump it started at, and the clouds move back and forth. Our model is able to produce new scenery and objects that enter the scene over time, while maintaining long-term temporal consistency.

Video 1: Single videos on horseback riding dataset
Video 2: Single videos on mountain biking dataset
Video 3: Single videos on ACID dataset
Video 4: Single videos on SkyTimelapse dataset at 256x256 resolution

We compare with additional pre-trained video generation models on the SkyTimelapse dataset at 128x128 resolution. The MoCoGAN-HD and TATS models change too rapidly over time, and DIGAN suffers from repetitive patterns. Our model is able to produce a stream of new clouds over time.

Video 5: Single videos on SkyTimelapse dataset at 128x128 resolution

We next show the same comparisons as above on randomly sampled grids of 6 videos per dataset and method. The same effects are visible in these results.

Video 6: Random video grids on horseback riding dataset
Video 7: Random video grids on mountain biking dataset
Video 8: Random video grids on ACID dataset
Video 9: Random video grids on SkyTimelapse dataset at 256x256 resolution
Video 10: Random video grids on SkyTimelapse dataset at 128x128 resolution

Our video generator consists of two modular networks: a low-resolution generator trained on long sequences, and a separate super-resolution network trained on short sequences. Here we show the intermediate low-resolution output as well as the final high-resolution output; a rough sketch of how the two networks compose follows the video below.

Video 11: Low-resolution generated videos and corresponding super-resolution generated videos
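The sketch below gives a rough picture of how the two modular networks compose at inference time: a long low-resolution video is generated first, and the super-resolution network is then applied over short temporal windows. Both networks are placeholders here (random frames and bilinear upsampling), and the 64x64 base resolution, 4x upsampling factor, and 8-frame window are assumptions for illustration, not the released model.

import torch
import torch.nn.functional as F

def low_res_generator(latent: torch.Tensor, num_frames: int) -> torch.Tensor:
    """Placeholder for the low-resolution generator: returns a (T, 3, 64, 64) video.
    A real generator would condition on the latent; this stand-in ignores its content."""
    return torch.rand(num_frames, 3, 64, 64)

def super_resolution(window: torch.Tensor) -> torch.Tensor:
    """Placeholder for the super-resolution network: 4x bilinear upsampling of a short window."""
    return F.interpolate(window, scale_factor=4, mode="bilinear", align_corners=False)

def generate(latent: torch.Tensor, num_frames: int = 128, window: int = 8) -> torch.Tensor:
    low_res = low_res_generator(latent, num_frames)        # long video at 64x64
    chunks = [super_resolution(low_res[i:i + window])      # upsample short windows independently
              for i in range(0, num_frames, window)]
    return torch.cat(chunks, dim=0)                        # (num_frames, 3, 256, 256)

video = generate(torch.randn(512))
print(video.shape)  # torch.Size([128, 3, 256, 256])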

Acknowledgments

We thank William Peebles, Samuli Laine, Axel Sauer and David Luebke for helpful discussion and feedback; Ivan Skorokhodov for providing additional results and insight into the StyleGAN-V baseline; Tero Kuosmanen for maintaining compute infrastructure; Elisa Wallace Eventing and Brian Kennedy for videos used to make the horseback riding and mountain biking datasets. Tim Brooks is supported by the National Science Foundation Graduate Research Fellowship under Grant No. 2020306087.