arXiv Analytics

arXiv:1712.04432 [cs.LG]

Integrated Model and Data Parallelism in Training Neural Networks

Amir Gholami, Ariful Azad, Kurt Keutzer, Aydin Buluc

Published 2017-12-12 (Version 1)

We propose a new integrated method of exploiting both model and data parallelism for training deep neural networks (DNNs) on large distributed-memory computers using mini-batch stochastic gradient descent (SGD). Our goal is to find an efficient parallelization strategy for a fixed batch size using $P$ processes. Our method is inspired by communication-avoiding algorithms in numerical linear algebra. We view the $P$ processes as logically divided into a $P_r \times P_c$ grid, where the $P_r$ dimension is implicitly responsible for model parallelism and the $P_c$ dimension for data parallelism. In practice, the integrated matrix-based parallel algorithm encapsulates both types of parallelism automatically. We analyze the communication complexity and demonstrate analytically that the lowest communication costs are often achieved with neither pure model parallelism nor pure data parallelism. We also show the positive effect of our approach on the computational performance of SGD-based DNN training, where the reduced number of processes responsible for data parallelism results in "fatter" matrices that enable higher-throughput matrix multiplication.
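The $P_r \times P_c$ grid idea can be sketched as follows. This is a minimal, hypothetical illustration of the partitioning described in the abstract, not the paper's actual algorithm or cost model: it enumerates the possible grid shapes for $P$ processes and shows the per-process shares of a layer's weight matrix (split over the $P_r$ model dimension) and of the mini-batch (split over the $P_c$ data dimension). The function names and dimension choices are the sketch's own.

```python
# Hypothetical sketch of viewing P processes as a P_r x P_c grid,
# as described in the abstract. Partition shapes only; the paper's
# actual communication-avoiding algorithm is not reproduced here.

def grid_factorizations(P):
    """All ways to view P processes as a P_r x P_c grid."""
    return [(pr, P // pr) for pr in range(1, P + 1) if P % pr == 0]

def local_shapes(d_out, d_in, batch, pr, pc):
    """Per-process shares of a d_out x d_in weight matrix and a
    mini-batch of size `batch`: the model (output) dimension is split
    over the P_r rows, the batch over the P_c columns."""
    return (d_out // pr, d_in), batch // pc

P = 8
# (1, 8) is pure data parallelism, (8, 1) pure model parallelism;
# the intermediate grids mix the two.
for pr, pc in grid_factorizations(P):
    w_shape, local_batch = local_shapes(1024, 1024, 512, pr, pc)
    print(f"P_r={pr}, P_c={pc}: local weights {w_shape}, local batch {local_batch}")
```

Note how shrinking $P_c$ (fewer data-parallel processes) enlarges the local batch, giving each process the "fatter" local matrix multiplications the abstract refers to.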

Related articles:
- arXiv:1906.05661 [cs.LG] (Published 2019-06-13): Training Neural Networks for and by Interpolation
- arXiv:1811.03600 [cs.LG] (Published 2018-11-08): Measuring the Effects of Data Parallelism on Neural Network Training
- arXiv:1905.05894 [cs.LG] (Published 2019-05-15): Online Normalization for Training Neural Networks