arXiv Analytics

arXiv:1711.00489 [cs.LG]

Don't Decay the Learning Rate, Increase the Batch Size

Samuel L. Smith, Pieter-Jan Kindermans, Quoc V. Le

Published: 2017-11-01 (Version 1)

It is common practice to decay the learning rate. Here we show one can usually obtain the same learning curve on both training and test sets by instead increasing the batch size during training. This procedure is successful for stochastic gradient descent (SGD), SGD with momentum, Nesterov momentum, and Adam. It reaches equivalent test accuracies after the same number of training epochs, but with fewer parameter updates, leading to greater parallelism and shorter training times. We can further reduce the number of parameter updates by increasing the learning rate $\epsilon$ and scaling the batch size $B \propto \epsilon$. Finally, one can increase the momentum coefficient $m$ and scale $B \propto 1/(1-m)$, although this tends to slightly reduce the test accuracy. Crucially, our techniques allow us to repurpose existing training schedules for large batch training with no hyper-parameter tuning. We train Inception-ResNet-V2 on ImageNet to $77\%$ validation accuracy in under 2500 parameter updates, efficiently utilizing training batches of 65536 images.
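The schedule swap described in the abstract can be made concrete with a short sketch. The Python snippet below is illustrative only and not from the paper; the milestones, decay factor, base values, and function names are assumptions. It shows how a conventional step decay of the learning rate maps onto an equivalent step increase of the batch size: wherever the learning rate would be divided by a factor, the batch size is instead multiplied by that factor while the learning rate stays fixed.

# Minimal sketch (assumed values, not the authors' code) of trading
# learning-rate decay for batch-size growth at the same milestones.

def lr_step_schedule(epoch, base_lr=0.1, decay_factor=10, milestones=(30, 60, 80)):
    """Conventional recipe: divide the learning rate at each milestone."""
    lr = base_lr
    for m in milestones:
        if epoch >= m:
            lr /= decay_factor
    return lr

def batch_size_schedule(epoch, base_batch=128, growth_factor=10, milestones=(30, 60, 80)):
    """Equivalent recipe from the paper's idea: multiply the batch size instead,
    keeping the learning rate constant."""
    batch = base_batch
    for m in milestones:
        if epoch >= m:
            batch *= growth_factor
    return batch

if __name__ == "__main__":
    for epoch in (0, 30, 60, 80):
        # Prints matching schedules, e.g. epoch 30 -> lr 0.01 vs. batch 1280.
        print(epoch, lr_step_schedule(epoch), batch_size_schedule(epoch))

Both schedules traverse the same number of epochs, but the growing-batch version performs far fewer parameter updates, which is the source of the parallelism gains claimed in the abstract.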

Comments: 11 pages, 7 figures
Categories: cs.LG, cs.CV, cs.DC, stat.ML
Related articles:
arXiv:1711.04623 [cs.LG] (Published 2017-11-13)
Three Factors Influencing Minima in SGD
arXiv:2006.08517 [cs.LG] (Published 2020-06-15)
The Limit of the Batch Size
arXiv:1612.05086 [cs.LG] (Published 2016-12-15)
Coupling Adaptive Batch Sizes with Learning Rates