arXiv Analytics

arXiv:1806.01796 [stat.ML]

Stochastic Gradient Descent on Separable Data: Exact Convergence with a Fixed Learning Rate

Mor Shpigel Nacson, Nathan Srebro, Daniel Soudry

Published 2018-06-05 (Version 1)

Stochastic Gradient Descent (SGD) is a central tool in machine learning. We prove that SGD converges to zero loss, even with a fixed learning rate, in the special case of linear classifiers with smooth monotone loss functions optimized on linearly separable data. Previous proofs of exact asymptotic convergence for SGD required either a learning rate that asymptotically vanishes to zero or averaging of the SGD iterates. Furthermore, if the loss function has an exponential tail (e.g., logistic regression), then we prove that with SGD the weight vector converges in direction to the $L_2$ max margin vector as $O(1/\log(t))$ for almost all separable datasets, and the loss converges as $O(1/t)$, similarly to gradient descent. These results suggest an explanation for the similar behavior observed in deep networks when trained with SGD.
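The setting described above can be illustrated with a small sketch: logistic regression trained by plain SGD with a fixed learning rate on a synthetic linearly separable dataset. The dataset, step size, and iteration count below are illustrative choices, not values from the paper; the sketch simply shows the qualitative behavior the abstract describes, i.e. the loss decreasing toward zero while the weight norm keeps growing (only the direction of the weights stabilizes).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy dataset: 2D Gaussian points, labeled by a ground-truth
# separator w_star, then shifted along w_star so a positive margin exists.
n = 100
X = rng.normal(size=(n, 2))
w_star = np.array([1.0, 2.0])
y = np.sign(X @ w_star)
X += 0.5 * y[:, None] * w_star / np.linalg.norm(w_star)

def logistic_loss(w, X, y):
    # Average logistic loss over the dataset.
    return np.mean(np.log1p(np.exp(-y * (X @ w))))

# SGD with a FIXED learning rate: no decay schedule, no iterate averaging.
w = np.zeros(2)
eta = 0.1          # illustrative fixed step size
T = 20_000         # illustrative iteration count
initial_loss = logistic_loss(w, X, y)
for t in range(T):
    i = rng.integers(n)                         # sample one data point
    margin = y[i] * (X[i] @ w)
    grad = -y[i] * X[i] / (1.0 + np.exp(margin))  # single-sample gradient
    w -= eta * grad

final_loss = logistic_loss(w, X, y)
```

On separable data the unnormalized weights diverge (here `np.linalg.norm(w)` grows with `T`), which is consistent with the loss going to zero while only the direction `w / np.linalg.norm(w)` converges.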

Comments: 7 pages in main paper, 10 pages of proofs in appendix, 2 figures, 1 table
Categories: stat.ML, cs.LG
Related articles:
arXiv:1803.01905 [stat.ML] (Published 2018-03-05)
Convergence of Gradient Descent on Separable Data
arXiv:2408.02839 [stat.ML] (Published 2024-08-05)
Optimizing Cox Models with Stochastic Gradient Descent: Theoretical Foundations and Practical Guidances
arXiv:2407.07670 [stat.ML] (Published 2024-07-10)
Stochastic Gradient Descent for Two-layer Neural Networks