{ "id": "1912.00018", "version": "v1", "published": "2019-11-29T16:56:02.000Z", "updated": "2019-11-29T16:56:02.000Z", "title": "On the Heavy-Tailed Theory of Stochastic Gradient Descent for Deep Neural Networks", "authors": [ "Umut Şimşekli", "Mert Gürbüzbalaban", "Thanh Huy Nguyen", "Gaël Richard", "Levent Sagun" ], "comment": "32 pages. arXiv admin note: substantial text overlap with arXiv:1901.06053", "categories": [ "stat.ML", "cs.LG", "math.CA" ], "abstract": "The gradient noise (GN) in the stochastic gradient descent (SGD) algorithm is often considered to be Gaussian in the large data regime by assuming that the \\emph{classical} central limit theorem (CLT) kicks in. This assumption is often made for mathematical convenience, since it enables SGD to be analyzed as a stochastic differential equation (SDE) driven by a Brownian motion. We argue that the Gaussianity assumption might fail to hold in deep learning settings and hence render the Brownian motion-based analyses inappropriate. Inspired by non-Gaussian natural phenomena, we consider the GN in a more general context and invoke the \\emph{generalized} CLT, which suggests that the GN converges to a \\emph{heavy-tailed} $\\alpha$-stable random vector, where \\emph{tail-index} $\\alpha$ determines the heavy-tailedness of the distribution. Accordingly, we propose to analyze SGD as a discretization of an SDE driven by a L\\'{e}vy motion. Such SDEs can incur `jumps', which force the SDE and its discretization \\emph{transition} from narrow minima to wider minima, as proven by existing metastability theory and the extensions that we proved recently. In this study, under the $\\alpha$-stable GN assumption, we further establish an explicit connection between the convergence rate of SGD to a local minimum and the tail-index $\\alpha$. To validate the $\\alpha$-stable assumption, we conduct experiments on common deep learning scenarios and show that in all settings, the GN is highly non-Gaussian and admits heavy-tails. We investigate the tail behavior in varying network architectures and sizes, loss functions, and datasets. Our results open up a different perspective and shed more light on the belief that SGD prefers wide minima.", "revisions": [ { "version": "v1", "updated": "2019-11-29T16:56:02.000Z" } ], "analyses": { "keywords": [ "stochastic gradient descent", "deep neural networks", "heavy-tailed theory", "sgd prefers wide minima", "assumption" ], "note": { "typesetting": "TeX", "pages": 32, "language": "en", "license": "arXiv", "status": "editable" } } }