arXiv:2305.02790 [cs.LG]

BranchNorm: Robustly Scaling Extremely Deep Transformers

Yijin Liu, Xianfeng Zeng, Fandong Meng, Jie Zhou

Published 2023-05-04 (Version 1)

Recently, DeepNorm (Wang et al., 2022) scaled Transformers to extreme depths (i.e., 1,000 layers), revealing the promising potential of deep scaling. To stabilize the training of such deep models, DeepNorm constrains the model update to a constant value. Although this constraint benefits the early stage of training, it can leave the model undertrained over the whole training procedure. In this paper, we propose BranchNorm, which dynamically rescales the non-residual branch of the Transformer according to the training period. BranchNorm not only theoretically stabilizes training with smooth gradient norms in the early stage, but also encourages better convergence in the subsequent training stage. Experimental results on multiple translation tasks demonstrate that BranchNorm achieves a better trade-off between training stability and convergence performance.
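The abstract does not give the exact formulation, but the core idea, a residual sub-layer whose non-residual branch is scaled by a coefficient that depends on the training step, can be sketched as follows. This is a minimal illustration only: the linear warmup schedule, the warmup_steps parameter, and the placement of LayerNorm are assumptions made for the sketch, not the paper's published BranchNorm definition.

import torch
import torch.nn as nn

class BranchScaledResidual(nn.Module):
    """Residual sub-layer whose non-residual branch F(x) is rescaled by a
    training-step-dependent coefficient lambda(t).

    Illustrative sketch of the idea described in the abstract; the exact
    normalization placement and schedule in BranchNorm may differ.
    """

    def __init__(self, branch: nn.Module, d_model: int, warmup_steps: int = 4000):
        super().__init__()
        self.branch = branch              # e.g. a self-attention or FFN sub-layer
        self.norm = nn.LayerNorm(d_model)
        self.warmup_steps = warmup_steps
        # Track the training step so the scale can depend on training progress.
        self.register_buffer("step", torch.zeros((), dtype=torch.long))

    def current_scale(self) -> float:
        # Assumed schedule: ramp lambda(t) linearly from ~0 to 1 during the
        # early training period, then keep it at 1 (plain residual connection).
        return min(1.0, float(self.step.item() + 1) / self.warmup_steps)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        lam = self.current_scale()
        if self.training:
            self.step += 1
        # Downweight the non-residual branch early in training to keep the
        # per-layer model update small, then let it grow to full strength.
        return self.norm(x + lam * self.branch(x))

# Example usage with a feed-forward sub-layer:
ffn = nn.Sequential(nn.Linear(512, 2048), nn.ReLU(), nn.Linear(2048, 512))
block = BranchScaledResidual(ffn, d_model=512, warmup_steps=4000)
y = block(torch.randn(8, 32, 512))    # (batch, seq_len, d_model)

In this sketch the scale starts near zero, so early updates are kept small in the spirit of DeepNorm's constrained updates, and then grows to 1 so that later training is not held back; the schedule actually used by BranchNorm may differ from this assumption.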

Related articles:
arXiv:1504.00948 [cs.LG] (Published 2015-04-03)
The Child is Father of the Man: Foresee the Success at the Early Stage
arXiv:1809.03185 [cs.LG] (Published 2018-09-10)
Shallow vs deep learning architectures for white matter lesion segmentation in the early stages of multiple sclerosis
arXiv:2206.04472 [cs.LG] (Published 2022-06-09)
Early Transferability of Adversarial Examples in Deep Neural Networks