arXiv:1811.03600 [cs.LG]

Measuring the Effects of Data Parallelism on Neural Network Training

Christopher J. Shallue, Jaehoon Lee, Joe Antognini, Jascha Sohl-Dickstein, Roy Frostig, George E. Dahl

Published 2018-11-08 (Version 1)

Recent hardware developments have made unprecedented amounts of data parallelism available for accelerating neural network training. Among the simplest ways to harness next-generation accelerators is to increase the batch size in standard mini-batch neural network training algorithms. In this work, we aim to experimentally characterize the effects of increasing the batch size on training time, as measured by the number of steps necessary to reach a goal out-of-sample error. Eventually, increasing the batch size will no longer reduce the number of training steps required, but the exact relationship between the batch size and how many training steps are necessary is of critical importance to practitioners, researchers, and hardware designers alike. We study how this relationship varies with the training algorithm, model, and dataset and find extremely large variation between workloads. Along the way, we reconcile disagreements in the literature on whether batch size affects model quality. Finally, we discuss the implications of our results for efforts to train neural networks much faster in the future.
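
The measurement the abstract describes, the number of training steps needed to reach a goal out-of-sample error as a function of batch size, can be illustrated with a small experiment. The sketch below is not the paper's code: the toy logistic-regression workload, goal error, learning rate, and helper names are illustrative assumptions. It runs mini-batch SGD at several batch sizes and records the first step at which held-out error falls below a target.

```python
# Minimal sketch (illustrative, not the paper's experimental framework):
# measure steps-to-goal-error for several batch sizes on a toy workload.
import numpy as np

def make_data(n=5000, d=20, seed=0):
    rng = np.random.default_rng(seed)
    w_true = rng.normal(size=d)
    X = rng.normal(size=(n, d))
    y = (X @ w_true + 0.5 * rng.normal(size=n) > 0).astype(float)
    return X, y

def error_rate(w, X, y):
    preds = (X @ w > 0).astype(float)
    return np.mean(preds != y)

def steps_to_goal(batch_size, goal_error=0.10, lr=0.1, max_steps=50_000, seed=1):
    """Run mini-batch SGD and return the step at which held-out error
    first drops to `goal_error` or below (None if it never does)."""
    X_train, y_train = make_data(seed=0)
    X_val, y_val = make_data(n=2000, seed=2)
    rng = np.random.default_rng(seed)
    w = np.zeros(X_train.shape[1])
    for step in range(1, max_steps + 1):
        idx = rng.integers(0, len(X_train), size=batch_size)
        Xb, yb = X_train[idx], y_train[idx]
        p = 1.0 / (1.0 + np.exp(-Xb @ w))       # sigmoid predictions
        grad = Xb.T @ (p - yb) / batch_size     # logistic-loss gradient
        w -= lr * grad
        if step % 10 == 0 and error_rate(w, X_val, y_val) <= goal_error:
            return step
    return None

if __name__ == "__main__":
    # Steps-to-goal typically shrinks as batch size grows, then flattens out,
    # which is the diminishing-returns regime the paper characterizes.
    for bs in [1, 8, 64, 512, 4096]:
        print(f"batch size {bs:5d} -> steps to goal: {steps_to_goal(bs)}")
```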

Related articles:
arXiv:1712.04432 [cs.LG] (Published 2017-12-12)
Integrated Model and Data Parallelism in Training Neural Networks

arXiv:2003.11316 [cs.LG] (Published 2020-03-25)
Data Parallelism in Training Sparse Neural Networks

arXiv:2010.08899 [cs.LG] (Published 2020-10-18)
Fast Distributed Training of Deep Neural Networks: Dynamic Communication Thresholding for Model and Data Parallelism
Vipul Gupta et al.