arXiv:2405.20935 [cs.LG]

Effective Interplay between Sparsity and Quantization: From Theory to Practice

Simla Burcu Harma, Ayan Chakraborty, Elizaveta Kostenok, Danila Mishin, Dongho Ha, Babak Falsafi, Martin Jaggi, Ming Liu, Yunho Oh, Suvinay Subramanian, Amir Yazdanbakhsh

Published 2024-05-31, Version 1

The increasing size of deep neural networks necessitates effective model compression to improve computational efficiency and reduce memory footprint. Sparsity and quantization are two prominent compression methods that have each been shown to significantly reduce computational and memory footprints while preserving model accuracy. Yet the interplay between the two remains an open question. In this paper, we investigate the interaction between sparsity and quantization and assess whether their combination impacts final model accuracy. We mathematically prove that applying sparsity before quantization is the optimal ordering of the two operations, minimizing the error introduced into the computation. Our empirical studies across a wide range of models, including the OPT and Llama model families (125M-8B) and ViT, corroborate these theoretical findings. In addition, through rigorous analysis, we demonstrate that sparsity and quantization are not orthogonal: their interaction can significantly harm model accuracy, with quantization error playing a dominant role in this degradation. Our findings extend to the efficient deployment of large models on resource-limited compute platforms, lowering serving costs, and offer insights into best practices for applying these compression methods to maximize efficacy without compromising accuracy.
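The ordering effect described above can be illustrated with a toy experiment. The following is a minimal sketch, not the paper's actual analysis or experimental setup: it assumes unstructured magnitude pruning, symmetric round-to-nearest uniform quantization with a max-abs scale, 4-bit precision, 50% sparsity, and Gaussian random weights, and it compares the Frobenius-norm reconstruction error of the two orderings.

import numpy as np

def magnitude_prune(w, sparsity=0.5):
    # Zero out the k smallest-magnitude entries (unstructured magnitude pruning).
    k = int(sparsity * w.size)
    idx = np.argpartition(np.abs(w).ravel(), k)[:k]
    out = w.ravel().copy()
    out[idx] = 0.0
    return out.reshape(w.shape)

def quantize(w, bits=4):
    # Symmetric, round-to-nearest uniform quantization with a max-abs scale.
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / qmax
    return np.round(w / scale).clip(-qmax, qmax) * scale

rng = np.random.default_rng(0)
w = rng.normal(size=(512, 512)).astype(np.float32)

sq = quantize(magnitude_prune(w))   # sparsify first, then quantize
qs = magnitude_prune(quantize(w))   # quantize first, then sparsify

print("error, sparsify -> quantize:", np.linalg.norm(w - sq))
print("error, quantize -> sparsify:", np.linalg.norm(w - qs))

In this toy setting, quantizing first collapses many small weights onto the same grid value, so the subsequent magnitude pruning can no longer tell them apart and may drop weights that were originally larger; this is one intuition for why the sparsity-before-quantization order tends to incur less error, consistent with (though not a substitute for) the paper's proof.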

Related articles:
arXiv:2308.07209 [cs.LG] (Published 2023-08-14)
Unified Data-Free Compression: Pruning and Quantization without Fine-Tuning
arXiv:2405.07140 [cs.LG] (Published 2024-05-12)
Edge Intelligence Optimization for Large Language Model Inference with Batching and Quantization
arXiv:2407.04803 [cs.LG] (Published 2024-07-05)
The Impact of Quantization and Pruning on Deep Reinforcement Learning Models