arXiv Analytics


arXiv:1505.00687 [cs.CV]

Unsupervised Learning of Visual Representations using Videos

Xiaolong Wang, Abhinav Gupta

Published 2015-05-04 (Version 1)

Is strong supervision necessary for learning a good visual representation? Do we really need millions of semantically-labeled images to train a ConvNet? In this paper, we present a simple yet surprisingly powerful approach for unsupervised learning of ConvNets. Specifically, we use hundreds of thousands of unlabeled videos from the web to learn visual representations. Our key idea is that we track millions of patches in these videos, and visual tracking provides the key supervision: two patches connected by a track should have similar visual representations in deep feature space, since they probably belong to the same object or object part. We design a Siamese-triplet network with a ranking loss function to train this ConvNet representation. Without using a single image from ImageNet, just using 100K unlabeled videos and the VOC 2012 dataset, we train an ensemble of unsupervised networks that achieves 52% mAP (no bounding box regression). This performance comes tantalizingly close to its ImageNet-supervised counterpart, an ensemble which achieves a mAP of 54.4%. We also show that our unsupervised network can perform competitively on other tasks such as surface-normal estimation.
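
To make the ranking objective concrete, here is a minimal PyTorch sketch of a triplet ranking loss over patch embeddings. The framework, the cosine-distance formulation, the margin value, and all names (TripletRankingLoss, embed) are illustrative assumptions, not the paper's exact implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class TripletRankingLoss(nn.Module):
    # Hedged sketch: push a query patch closer to the patch it was tracked to
    # than to a random patch from another video, by a fixed margin.
    def __init__(self, margin=0.5):
        super().__init__()
        self.margin = margin

    def forward(self, anchor, positive, negative):
        # Cosine distance = 1 - cosine similarity on L2-normalized embeddings.
        anchor = F.normalize(anchor, dim=1)
        positive = F.normalize(positive, dim=1)
        negative = F.normalize(negative, dim=1)
        d_pos = 1.0 - (anchor * positive).sum(dim=1)
        d_neg = 1.0 - (anchor * negative).sum(dim=1)
        return F.relu(d_pos - d_neg + self.margin).mean()

# Usage: embed the first patch of a track, the tracked patch, and a random
# patch with a shared ConvNet (here a placeholder), then apply the loss.
embed = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 128))
loss_fn = TripletRankingLoss(margin=0.5)
a = embed(torch.randn(8, 3, 64, 64))
p = embed(torch.randn(8, 3, 64, 64))
n = embed(torch.randn(8, 3, 64, 64))
loss = loss_fn(a, p, n)
loss.backward()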

Related articles:
arXiv:1511.04166 [cs.CV] (Published 2015-11-13)
Unsupervised Learning of Edges
arXiv:1804.00946 [cs.CV] (Published 2018-04-03, updated 2018-04-26)
Unsupervised Learning of Sequence Representations by Autoencoders
arXiv:2007.00062 [cs.CV] (Published 2020-06-30)
Deep Feature Space: A Geometrical Perspective