arXiv:2406.02540 Abstract | arXiv Analytics

arXiv:2406.02540 [cs.CV]

ViDiT-Q: Efficient and Accurate Quantization of Diffusion Transformers for Image and Video Generation

Tianchen Zhao, Tongcheng Fang, Enshu Liu, Rui Wan, Widyadewi Soedarmadji, Shiyao Li, Zinan Lin, Guohao Dai, Shengen Yan, Huazhong Yang, Xuefei Ning, Yu Wang

Published 2024-06-04, updated 2024-06-30, Version 2

Diffusion transformers (DiTs) have exhibited remarkable performance in visual generation tasks, such as generating realistic images or videos based on textual instructions. However, larger model sizes and multi-frame processing for video generation lead to increased computational and memory costs, posing challenges for practical deployment on edge devices. Post-Training Quantization (PTQ) is an effective method for reducing memory costs and computational complexity. When quantizing diffusion transformers, we find that existing diffusion quantization methods designed for U-Net struggle to preserve generation quality. After analyzing the major challenges in quantizing diffusion transformers, we design an improved quantization scheme, ViDiT-Q (Video and Image Diffusion Transformer Quantization), to address these issues. Furthermore, we identify that highly sensitive layers and timesteps hinder quantization at lower bit-widths. To tackle this, we extend ViDiT-Q with a novel metric-decoupled mixed-precision quantization method (ViDiT-Q-MP). We validate the effectiveness of ViDiT-Q across a variety of text-to-image and video models. While baseline quantization methods fail at W8A8 and produce unreadable content at W4A8, ViDiT-Q achieves lossless W8A8 quantization. ViDiT-Q-MP achieves W4A8 with negligible visual quality degradation, resulting in a 2.5x memory optimization and a 1.5x latency speedup.
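For readers unfamiliar with the notation, "W8A8" denotes 8-bit weights with 8-bit activations, and "W4A8" denotes 4-bit weights with 8-bit activations. The sketch below is not the ViDiT-Q scheme. It is a generic symmetric, per-tensor uniform quantizer, included only to illustrate how rounding error grows as the bit-width drops. The function names and the sample values are assumptions made for this illustration.

```python
def fake_quantize(values, bits):
    """Symmetric per-tensor uniform quantization, followed by dequantization.

    Generic illustration only; this is NOT the ViDiT-Q method.
    """
    qmax = 2 ** (bits - 1) - 1              # 127 for 8 bits, 7 for 4 bits
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round to the nearest integer level and clamp to the signed range.
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    dequantized = [c * scale for c in codes]
    return dequantized, codes, scale


def mean_abs_error(a, b):
    """Average absolute difference between two equal-length lists."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


# Hypothetical sample tensors, chosen only for demonstration.
weights = [0.42, -1.30, 0.07, 0.88, -0.51, 1.10, -0.03, 0.25]
activations = [2.1, -0.4, 0.9, 3.7, -1.2, 0.05, 1.6, -2.8]

w8, w8_codes, w8_scale = fake_quantize(weights, bits=8)      # "W8"
w4, w4_codes, w4_scale = fake_quantize(weights, bits=4)      # "W4"
a8, a8_codes, a8_scale = fake_quantize(activations, bits=8)  # "A8"

err_w8 = mean_abs_error(weights, w8)
err_w4 = mean_abs_error(weights, w4)
err_a8 = mean_abs_error(activations, a8)

print(f"W8 mean abs error: {err_w8:.5f}")
print(f"W4 mean abs error: {err_w4:.5f}")
print(f"A8 mean abs error: {err_a8:.5f}")
```

With only 15 signed levels available at 4 bits, the weight error is visibly larger than at 8 bits, which is one intuition for why lower bit-widths are harder to quantize without quality loss.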

Comments: Project Page: https://a-suozhang.xyz/viditq.github.io/

Categories: cs.CV

Keywords: video generation, accurate quantization, existing diffusion quantization methods, metric-decoupled mixed-precision quantization method, achieves lossless w8a8 quantization

Tags: github project

Related articles:

arXiv:2203.14074 [cs.CV] (Published 2022-03-26)

V3GAN: Decomposing Background, Foreground and Motion for Video Generation

Arti Keshari, Sonam Gupta, Sukhendu Das

arXiv:2410.22979 [cs.CV] (Published 2024-10-30)

LumiSculpt: A Consistency Lighting Control Network for Video Generation

Yuxin Zhang, Dandan Zheng, Biao Gong, Jingdong Chen, Ming Yang, Weiming Dong, Changsheng Xu

arXiv:2404.13026 [cs.CV] (Published 2024-04-19)

PhysDreamer: Physics-Based Interaction with 3D Objects via Video Generation

Tianyuan Zhang et al.
