arXiv:1806.01796 Abstract | arXiv Analytics

arXiv:1806.01796 [stat.ML]Abstract References Reviews Resources

Stochastic Gradient Descent on Separable Data: Exact Convergence with a Fixed Learning Rate

Mor Shpigel Nacson, Nathan Srebro, Daniel Soudry

Published 2018-06-05Version 1

Stochastic Gradient Descent (SGD) is a central tool in machine learning. We prove that SGD converges to zero loss, even with a fixed learning rate --- in the special case of linear classifiers with smooth monotone loss functions, optimized on linearly separable data. Previous proofs with an exact asymptotic convergence of SGD required a learning rate that asymptotically vanishes to zero, or averaging of the SGD iterates. Furthermore, if the loss function has an exponential tail (e.g., logistic regression), then we prove that with SGD the weight vector converges in direction to the $L_2$ max margin vector as $O(1/\log(t))$ for almost all separable datasets, and the loss converges as $O(1/t)$ --- similarly to gradient descent. These results suggest an explanation to the similar behavior observed in deep networks when trained with SGD.

Comments: 7 pages in main paper, 10 pages of proofs in appendix, 2 figures, 1 table

Categories: stat.ML, cs.LG

Keywords: stochastic gradient descent, fixed learning rate, separable data, exact convergence, smooth monotone loss functions

Related articles: Most relevant | Search more

arXiv:1803.01905 [stat.ML] (Published 2018-03-05)

Convergence of Gradient Descent on Separable Data

Mor Shpigel Nacson, Jason Lee, Suriya Gunasekar, Nathan Srebro, Daniel Soudry

arXiv:2408.02839 [stat.ML] (Published 2024-08-05)

Optimizing Cox Models with Stochastic Gradient Descent: Theoretical Foundations and Practical Guidances

Lang Zeng, Weijing Tang, Zhao Ren, Ying Ding

arXiv:2407.07670 [stat.ML] (Published 2024-07-10)

Stochastic Gradient Descent for Two-layer Neural Networks