arXiv:2006.16228 [cs.CV]

Self-Supervised MultiModal Versatile Networks

Jean-Baptiste Alayrac, Adrià Recasens, Rosalia Schneider, Relja Arandjelović, Jason Ramapuram, Jeffrey De Fauw, Lucas Smaira, Sander Dieleman, Andrew Zisserman

Published 2020-06-29, Version 1

Videos are a rich source of multi-modal supervision. In this work, we learn representations using self-supervision by leveraging three modalities naturally present in videos: vision, audio and language. To this end, we introduce the notion of a multimodal versatile network -- a network that can ingest multiple modalities and whose representations enable downstream tasks in multiple modalities. In particular, we explore how best to combine the modalities, such that fine-grained representations of audio and vision can be maintained, whilst also integrating text into a common embedding. Driven by versatility, we also introduce a novel process of deflation, so that the networks can be effortlessly applied to the visual data in the form of video or a static image. We demonstrate how such networks trained on large collections of unlabelled video data can be applied on video, video-text, image and audio tasks. Equipped with these representations, we obtain state-of-the-art performance on multiple challenging benchmarks including UCF101, HMDB51 and ESC-50 when compared to previous self-supervised work.

Categories: cs.CV

Keywords: self-supervised multimodal versatile networks, ingest multiple modalities, representations enable downstream tasks, multiple challenging benchmarks, learn representations

Related articles:

arXiv:1707.09873 [cs.CV] (Published 2017-07-25)

Representation Learning on Large and Small Data

Chun-Nan Chou, Chuen-Kai Shie, Fu-Chieh Chang, Jocelyn Chang, Edward Y. Chang

arXiv:2007.13007 [cs.CV] (Published 2020-07-25)

HATNet: An End-to-End Holistic Attention Network for Diagnosis of Breast Biopsy Images

Sachin Mehta, Ximing Lu, Donald Weaver, Joann G. Elmore, Hannaneh Hajishirzi, Linda Shapiro

arXiv:2105.09270 [cs.CV] (Published 2021-05-19)

Do We Really Need to Learn Representations from In-domain Data for Outlier Detection?

Zhisheng Xiao, Qing Yan, Yali Amit