arXiv:1811.09358 [cs.LG]

A Sufficient Condition for Convergences of Adam and RMSProp

Fangyu Zou, Li Shen, Zequn Jie, Weizhong Zhang, Wei Liu

Published 2018-11-23 (Version 1)

Adam and RMSProp, two of the most influential adaptive stochastic algorithms for training deep neural networks, have been shown to diverge even in the convex setting via a few simple counterexamples. Many remedies have been proposed to make Adam/RMSProp-type algorithms converge, such as decreasing the adaptive learning rate, adopting a large batch size, incorporating a temporal decorrelation technique, and seeking an analogous surrogate. In contrast with existing approaches, we introduce an alternative, easy-to-check sufficient condition, which depends only on the base learning rate and the combination of historical second-order moments, to guarantee the global convergence of generic Adam/RMSProp for solving large-scale non-convex stochastic optimization. Moreover, we show that the convergence of several variants of Adam, such as AdamNC and AdaEMA, follows directly from the proposed sufficient condition in the non-convex setting. In addition, we illustrate that Adam is essentially a specifically weighted AdaGrad with exponential moving average momentum, which provides a novel perspective for understanding Adam and RMSProp. This observation, together with the sufficient condition, yields a deeper interpretation of their divergence. Finally, we validate the sufficient condition by applying Adam and RMSProp to the counterexamples and to training deep neural networks. The numerical results accord with the theoretical analysis.
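For orientation, below is a minimal sketch of the generic Adam update the abstract refers to, written with the standard parameterization (alpha, beta1, beta2, eps); these names and defaults are assumptions for illustration, not the paper's exact notation or its sufficient condition. Setting beta1 = 0 recovers RMSProp, and replacing the exponential moving average of squared gradients with a plain running sum recovers AdaGrad, which is the connection the abstract highlights.

```python
import numpy as np

def adam_step(x, g, m, v, t, alpha=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One generic Adam step on iterate x with stochastic gradient g (step t >= 1)."""
    m = beta1 * m + (1 - beta1) * g        # first-moment EMA (momentum)
    v = beta2 * v + (1 - beta2) * g ** 2   # second-moment EMA of squared gradients
    m_hat = m / (1 - beta1 ** t)           # bias correction
    v_hat = v / (1 - beta2 ** t)
    x = x - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return x, m, v
```

The sufficient condition studied in the paper constrains how the base learning rate and the second-moment combination may be chosen; its precise form is given in the paper and is not reproduced in this sketch.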

Comments: This paper provides a sufficient condition for the global convergence of Adam and RMSProp
Categories: cs.LG, cs.CV, cs.NA, math.OC, stat.ML