arXiv:2304.08818 Abstract | arXiv Analytics

arXiv:2304.08818 [cs.CV]

Align your Latents: High-Resolution Video Synthesis with Latent Diffusion Models

Andreas Blattmann, Robin Rombach, Huan Ling, Tim Dockhorn, Seung Wook Kim, Sanja Fidler, Karsten Kreis

Published 2023-04-18, Version 1

Latent Diffusion Models (LDMs) enable high-quality image synthesis while avoiding excessive compute demands by training a diffusion model in a compressed lower-dimensional latent space. Here, we apply the LDM paradigm to high-resolution video generation, a particularly resource-intensive task. We first pre-train an LDM on images only; then, we turn the image generator into a video generator by introducing a temporal dimension to the latent space diffusion model and fine-tuning on encoded image sequences, i.e., videos. Similarly, we temporally align diffusion model upsamplers, turning them into temporally consistent video super resolution models. We focus on two relevant real-world applications: Simulation of in-the-wild driving data and creative content creation with text-to-video modeling. In particular, we validate our Video LDM on real driving videos of resolution 512 x 1024, achieving state-of-the-art performance. Furthermore, our approach can easily leverage off-the-shelf pre-trained image LDMs, as we only need to train a temporal alignment model in that case. Doing so, we turn the publicly available, state-of-the-art text-to-image LDM Stable Diffusion into an efficient and expressive text-to-video model with resolution up to 1280 x 2048. We show that the temporal layers trained in this way generalize to different fine-tuned text-to-image LDMs. Utilizing this property, we show the first results for personalized text-to-video generation, opening exciting directions for future content creation. Project page: https://research.nvidia.com/labs/toronto-ai/VideoLDM/
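The core idea in the abstract, keeping a pre-trained image model fixed and training only added temporal layers, can be illustrated with a toy sketch. This is not the authors' implementation: the names `spatial_layer`, `temporal_layer`, `aligned_block`, the three-tap temporal filter, and the blend factor `alpha` are all assumptions made for illustration, and plain Python lists stand in for latent tensors.

```python
from typing import List

Frame = List[float]   # toy stand-in for one flattened latent frame
Video = List[Frame]   # a sequence of T latent frames


def spatial_layer(frame: Frame) -> Frame:
    """Stand-in for a frozen, pre-trained image-model layer.

    It processes one frame at a time and has no notion of time.
    """
    return [2.0 * x + 1.0 for x in frame]


def temporal_layer(video: Video, weights: List[float]) -> Video:
    """Hypothetical trainable layer that mixes each latent position across frames.

    weights = [w_prev, w_curr, w_next]; at the first and last frame the
    neighbour index is clamped, so the edge frame is reused.
    """
    last = len(video) - 1
    mixed = []
    for t, frame in enumerate(video):
        prev_frame = video[max(t - 1, 0)]
        next_frame = video[min(t + 1, last)]
        mixed.append([
            weights[0] * p + weights[1] * c + weights[2] * n
            for p, c, n in zip(prev_frame, frame, next_frame)
        ])
    return mixed


def aligned_block(video: Video, weights: List[float], alpha: float) -> Video:
    """Apply the frozen spatial layer per frame, then blend in the temporal output.

    alpha = 1.0 reproduces the per-frame image model exactly;
    alpha = 0.0 uses only the temporally mixed features.
    """
    spatial_out = [spatial_layer(f) for f in video]
    temporal_out = temporal_layer(spatial_out, weights)
    return [
        [alpha * s + (1.0 - alpha) * t for s, t in zip(s_frame, t_frame)]
        for s_frame, t_frame in zip(spatial_out, temporal_out)
    ]


if __name__ == "__main__":
    video = [[0.0], [3.0], [6.0]]          # three frames, one latent value each
    smoothing = [0.25, 0.5, 0.25]          # assumed temporal filter weights
    print(aligned_block(video, smoothing, alpha=1.0))  # image-model behaviour only
    print(aligned_block(video, smoothing, alpha=0.0))  # temporally mixed features
```

In this sketch only `weights` and `alpha` would be trained, which mirrors the abstract's point that an off-the-shelf image LDM can be reused when only a temporal alignment model is learned.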

Comments: Conference on Computer Vision and Pattern Recognition (CVPR) 2023. Project page: https://research.nvidia.com/labs/toronto-ai/VideoLDM/

Categories: cs.CV, cs.LG

Keywords: latent diffusion models, high-resolution video synthesis, consistent video super resolution, align diffusion model upsamplers, video super resolution models

Tags: conference paper

Related articles:

arXiv:2404.01367 [cs.CV] (Published 2024-04-01)

Bigger is not Always Better: Scaling Properties of Latent Diffusion Models

Kangfu Mei, Zhengzhong Tu, Mauricio Delbracio, Hossein Talebi, Vishal M. Patel, Peyman Milanfar

arXiv:2312.09792 [cs.CV] (Published 2023-12-15)

Latent Diffusion Models with Image-Derived Annotations for Enhanced AI-Assisted Cancer Diagnosis in Histopathology

Pedro Osorio et al.

arXiv:2211.17084 [cs.CV] (Published 2022-11-30)

High-Fidelity Guided Image Synthesis with Latent Diffusion Models

Jaskirat Singh, Stephen Gould, Liang Zheng
