arXiv:2105.04553 [cs.CV]

Self-Supervised Learning with Swin Transformers

Zhenda Xie, Yutong Lin, Zhuliang Yao, Zheng Zhang, Qi Dai, Yue Cao, Han Hu

Published 2021-05-10, Version 1

We are witnessing a modeling shift from CNNs to Transformers in computer vision. In this paper, we present a self-supervised learning approach called MoBY, with Vision Transformers as its backbone architecture. The approach is essentially a combination of MoCo v2 and BYOL, tuned to achieve reasonably high accuracy on ImageNet-1K linear evaluation: 72.8% and 75.0% top-1 accuracy with DeiT-S and Swin-T, respectively, under 300-epoch training. The performance is slightly better than that of recent works such as MoCo v3 and DINO, which adopt DeiT as the backbone, while relying on much lighter tricks. More importantly, the general-purpose Swin Transformer backbone enables us to also evaluate the learnt representations on downstream tasks such as object detection and semantic segmentation, in contrast to a few recent approaches built on ViT/DeiT, which report only linear evaluation results on ImageNet-1K because ViT/DeiT are not well suited to these dense prediction tasks. We hope our results can facilitate more comprehensive evaluation of self-supervised learning methods designed for Transformer architectures. Our code and models are available at https://github.com/SwinTransformer/Transformer-SSL and will be continually enriched.
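
The abstract names the ingredients of MoBY (MoCo v2's contrastive loss with a key queue, BYOL's asymmetric online/target branches with a momentum-updated target). The sketch below illustrates how such a hybrid training step could be wired together in PyTorch. It is a minimal illustration under stated assumptions, not the authors' implementation from the linked repository: the MoBYSketch class, the toy MLP encoder standing in for Swin-T/DeiT-S, and the feature-dimension, queue-length, momentum, and temperature values are all hypothetical.

    # Minimal sketch of a MoBY-style training step (MoCo v2 + BYOL hybrid).
    # Hypothetical illustration only: encoder, dimensions, queue size,
    # momentum, and temperature are assumptions, not the paper's settings.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    def mlp(in_dim, hidden_dim, out_dim):
        return nn.Sequential(nn.Linear(in_dim, hidden_dim), nn.BatchNorm1d(hidden_dim),
                             nn.ReLU(inplace=True), nn.Linear(hidden_dim, out_dim))

    class MoBYSketch(nn.Module):
        def __init__(self, encoder_fn, feat_dim=512, proj_dim=256, queue_len=4096,
                     momentum=0.99, temperature=0.2):
            super().__init__()
            self.m, self.t = momentum, temperature
            # Online branch: encoder -> projector -> predictor (BYOL-style asymmetry).
            self.online_enc, self.online_proj = encoder_fn(), mlp(feat_dim, feat_dim, proj_dim)
            self.predictor = mlp(proj_dim, feat_dim, proj_dim)
            # Target branch: momentum copies of encoder and projector (no predictor).
            self.target_enc, self.target_proj = encoder_fn(), mlp(feat_dim, feat_dim, proj_dim)
            for p_t, p_o in zip(self._target_params(), self._online_params()):
                p_t.data.copy_(p_o.data)
                p_t.requires_grad = False
            # Key queue holding past target projections as extra negatives (MoCo-style).
            self.register_buffer("queue", F.normalize(torch.randn(queue_len, proj_dim), dim=1))
            self.register_buffer("ptr", torch.zeros(1, dtype=torch.long))

        def _online_params(self):
            return list(self.online_enc.parameters()) + list(self.online_proj.parameters())

        def _target_params(self):
            return list(self.target_enc.parameters()) + list(self.target_proj.parameters())

        @torch.no_grad()
        def _momentum_update(self):
            for p_t, p_o in zip(self._target_params(), self._online_params()):
                p_t.data.mul_(self.m).add_(p_o.data, alpha=1.0 - self.m)

        @torch.no_grad()
        def _enqueue(self, keys):
            # Assumes queue_len is a multiple of the number of enqueued keys.
            ptr, n = int(self.ptr), keys.shape[0]
            self.queue[ptr:ptr + n] = keys
            self.ptr[0] = (ptr + n) % self.queue.shape[0]

        def _contrastive(self, q, k):
            # InfoNCE: the momentum key is the positive, the queue supplies negatives.
            q, k = F.normalize(q, dim=1), F.normalize(k, dim=1)
            l_pos = (q * k).sum(dim=1, keepdim=True)            # (N, 1)
            l_neg = q @ self.queue.clone().detach().t()         # (N, K)
            logits = torch.cat([l_pos, l_neg], dim=1) / self.t
            labels = torch.zeros(q.shape[0], dtype=torch.long, device=q.device)
            return F.cross_entropy(logits, labels), k

        def forward(self, view1, view2):
            self._momentum_update()
            q1 = self.predictor(self.online_proj(self.online_enc(view1)))
            q2 = self.predictor(self.online_proj(self.online_enc(view2)))
            with torch.no_grad():
                k1 = self.target_proj(self.target_enc(view1))
                k2 = self.target_proj(self.target_enc(view2))
            # Symmetrized loss: each online query is contrasted with the other view's key.
            loss1, k2n = self._contrastive(q1, k2)
            loss2, k1n = self._contrastive(q2, k1)
            self._enqueue(torch.cat([k1n, k2n], dim=0))
            return loss1 + loss2

A toy usage example, with a small MLP standing in for the Swin-T or DeiT-S backbone:

    encoder = lambda: nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 512))
    model = MoBYSketch(encoder)
    x1, x2 = torch.randn(8, 3, 32, 32), torch.randn(8, 3, 32, 32)  # two augmented views
    loss = model(x1, x2)
    loss.backward()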

Related articles:
arXiv:1912.01991 [cs.CV] (Published 2019-12-04)
Self-Supervised Learning of Pretext-Invariant Representations
arXiv:1903.11412 [cs.CV] (Published 2019-03-27)
Self-Supervised Learning via Conditional Motion Propagation
arXiv:2103.13413 [cs.CV] (Published 2021-03-24)
Vision Transformers for Dense Prediction