arXiv:2105.04553 [cs.CV]

Self-Supervised Learning with Swin Transformers

Zhenda Xie, Yutong Lin, Zhuliang Yao, Zheng Zhang, Qi Dai, Yue Cao, Han Hu

Published 2021-05-10, Version 1

We are witnessing a modeling shift from CNNs to Transformers in computer vision. In this paper, we present a self-supervised learning approach called MoBY, with Vision Transformers as its backbone architecture. The approach is essentially a combination of MoCo v2 and BYOL, tuned to achieve reasonably high accuracy on ImageNet-1K linear evaluation: 72.8% and 75.0% top-1 accuracy with DeiT-S and Swin-T, respectively, under 300-epoch training. The performance is slightly better than that of recent works such as MoCo v3 and DINO, which adopt DeiT as the backbone, while relying on much lighter tricks. More importantly, the general-purpose Swin Transformer backbone enables us to also evaluate the learnt representations on downstream tasks such as object detection and semantic segmentation, in contrast to a few recent approaches built on ViT/DeiT, which report only linear evaluation results on ImageNet-1K because ViT/DeiT are not well suited to these dense prediction tasks. We hope our results can facilitate more comprehensive evaluation of self-supervised learning methods designed for Transformer architectures. Our code and models are available at https://github.com/SwinTransformer/Transformer-SSL and will be continually enriched.
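
The abstract names the ingredients of MoBY (MoCo v2's contrastive loss with a key queue, BYOL's asymmetric online/target branches with a momentum-updated target). The sketch below illustrates how such a hybrid training step could be wired together in PyTorch. It is a minimal illustration under stated assumptions, not the authors' implementation from the linked repository: the MoBYSketch class, the toy MLP encoder standing in for Swin-T/DeiT-S, and the feature-dimension, queue-length, momentum, and temperature values are all hypothetical.

    # Minimal sketch of a MoBY-style training step (MoCo v2 + BYOL hybrid).
    # Hypothetical illustration only: encoder, dimensions, queue size,
    # momentum, and temperature are assumptions, not the paper's settings.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    def mlp(in_dim, hidden_dim, out_dim):
        return nn.Sequential(nn.Linear(in_dim, hidden_dim), nn.BatchNorm1d(hidden_dim),
                             nn.ReLU(inplace=True), nn.Linear(hidden_dim, out_dim))

    class MoBYSketch(nn.Module):
        def __init__(self, encoder_fn, feat_dim=512, proj_dim=256, queue_len=4096,
                     momentum=0.99, temperature=0.2):
            super().__init__()
            self.m, self.t = momentum, temperature
            # Online branch: encoder -> projector -> predictor (BYOL-style asymmetry).
            self.online_enc, self.online_proj = encoder_fn(), mlp(feat_dim, feat_dim, proj_dim)
            self.predictor = mlp(proj_dim, feat_dim, proj_dim)
            # Target branch: momentum copies of encoder and projector (no predictor).
            self.target_enc, self.target_proj = encoder_fn(), mlp(feat_dim, feat_dim, proj_dim)
            for p_t, p_o in zip(self._target_params(), self._online_params()):
                p_t.data.copy_(p_o.data)
                p_t.requires_grad = False
            # Key queue holding past target projections as extra negatives (MoCo-style).
            self.register_buffer("queue", F.normalize(torch.randn(queue_len, proj_dim), dim=1))
            self.register_buffer("ptr", torch.zeros(1, dtype=torch.long))

        def _online_params(self):
            return list(self.online_enc.parameters()) + list(self.online_proj.parameters())

        def _target_params(self):
            return list(self.target_enc.parameters()) + list(self.target_proj.parameters())

        @torch.no_grad()
        def _momentum_update(self):
            for p_t, p_o in zip(self._target_params(), self._online_params()):
                p_t.data.mul_(self.m).add_(p_o.data, alpha=1.0 - self.m)

        @torch.no_grad()
        def _enqueue(self, keys):
            # Assumes queue_len is a multiple of the number of enqueued keys.
            ptr, n = int(self.ptr), keys.shape[0]
            self.queue[ptr:ptr + n] = keys
            self.ptr[0] = (ptr + n) % self.queue.shape[0]

        def _contrastive(self, q, k):
            # InfoNCE: the momentum key is the positive, the queue supplies negatives.
            q, k = F.normalize(q, dim=1), F.normalize(k, dim=1)
            l_pos = (q * k).sum(dim=1, keepdim=True)            # (N, 1)
            l_neg = q @ self.queue.clone().detach().t()         # (N, K)
            logits = torch.cat([l_pos, l_neg], dim=1) / self.t
            labels = torch.zeros(q.shape[0], dtype=torch.long, device=q.device)
            return F.cross_entropy(logits, labels), k

        def forward(self, view1, view2):
            self._momentum_update()
            q1 = self.predictor(self.online_proj(self.online_enc(view1)))
            q2 = self.predictor(self.online_proj(self.online_enc(view2)))
            with torch.no_grad():
                k1 = self.target_proj(self.target_enc(view1))
                k2 = self.target_proj(self.target_enc(view2))
            # Symmetrized loss: each online query is contrasted with the other view's key.
            loss1, k2n = self._contrastive(q1, k2)
            loss2, k1n = self._contrastive(q2, k1)
            self._enqueue(torch.cat([k1n, k2n], dim=0))
            return loss1 + loss2

A toy usage example, with a small MLP standing in for the Swin-T or DeiT-S backbone:

    encoder = lambda: nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 512))
    model = MoBYSketch(encoder)
    x1, x2 = torch.randn(8, 3, 32, 32), torch.randn(8, 3, 32, 32)  # two augmented views
    loss = model(x1, x2)
    loss.backward()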

Related articles:
arXiv:1912.01991 [cs.CV] (Published 2019-12-04)
Self-Supervised Learning of Pretext-Invariant Representations
arXiv:1903.11412 [cs.CV] (Published 2019-03-27)
Self-Supervised Learning via Conditional Motion Propagation
arXiv:2103.13413 [cs.CV] (Published 2021-03-24)
Vision Transformers for Dense Prediction