arXiv Analytics

arXiv:2002.04710 [cs.LG]

Unique Properties of Wide Minima in Deep Networks

Rotem Mulayoff, Tomer Michaeli

Published 2020-02-11 (Version 1)

It is well known that (stochastic) gradient descent has an implicit bias towards wide minima. In deep neural network training, this mechanism serves to screen out minima. However, the precise effect that this has on the trained network is not yet fully understood. In this paper, we characterize the wide minima of linear neural networks trained with a quadratic loss. First, we show that linear ResNets with zero initialization necessarily converge to the widest of all minima. We then prove that these minima correspond to nearly balanced networks, in which the gain from the input to any intermediate representation does not change drastically from one layer to the next. Finally, we show that consecutive layers in wide-minima solutions are coupled: one of the left singular vectors of each weight matrix equals one of the right singular vectors of the next matrix. This forms a distinct path from input to output which, as we show, is dedicated to the signal that experiences the largest end-to-end gain. Experiments indicate that these properties are characteristic of both linear and nonlinear models trained in practice.
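
The coupling property described above lends itself to a quick numerical check. The sketch below is an illustrative assumption, not code from the paper: it uses NumPy to compare the left singular vectors of one weight matrix with the right singular vectors of the next one. For the wide-minima solutions described in the abstract, one such overlap should be close to 1, whereas random weights show no alignment. Function names and the tolerance are hypothetical.

```python
import numpy as np

def layer_coupling(W_i, W_next, tol=1e-3):
    """Measure the alignment between the left singular vectors of W_i and
    the right singular vectors of W_next (the coupling property from the
    abstract). Returns the largest |cosine| found and whether it is ~1.
    Name and tolerance are illustrative assumptions, not from the paper."""
    U_i, _, _ = np.linalg.svd(W_i)          # left singular vectors of layer i
    _, _, Vt_next = np.linalg.svd(W_next)   # right singular vectors of layer i+1
    overlaps = np.abs(U_i.T @ Vt_next.T)    # pairwise |<u, v>| between the two bases
    return overlaps.max(), overlaps.max() > 1 - tol

# Toy usage on random (untrained) weights; for minima reached by gradient
# descent, the abstract predicts an overlap close to 1.
W1 = np.random.randn(8, 10)   # layer i:   R^10 -> R^8
W2 = np.random.randn(6, 8)    # layer i+1: R^8  -> R^6
print(layer_coupling(W1, W2))
```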

Related articles:
arXiv:1908.09375 [cs.LG] (Published 2019-08-25)
Theoretical Issues in Deep Networks: Approximation, Optimization and Generalization
arXiv:2007.10099 [cs.LG] (Published 2020-07-20)
Early Stopping in Deep Networks: Double Descent and How to Eliminate it
arXiv:1906.00150 [cs.LG] (Published 2019-06-01)
Sparsity Normalization: Stabilizing the Expected Outputs of Deep Networks