arXiv Analytics


arXiv:2406.08866 [cs.CV]

Zoom and Shift are All You Need

Jiahao Qin

Published 2024-06-13 (Version 1)

Feature alignment serves as the primary mechanism for fusing multimodal data. We propose a feature alignment approach that achieves full integration of multimodal information via an alternating process of shifting and expanding feature representations across modalities, yielding a consistent unified representation in a joint feature space. The technique reliably captures high-level interplay between features originating from distinct modalities, producing substantial gains in multimodal learning performance. We further demonstrate the superiority of our approach over other prevalent multimodal fusion schemes on a range of tasks: extensive experiments on multimodal datasets comprising time series, images, and text show that our method achieves state-of-the-art results.
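The abstract describes an alternation of two operations, shifting features toward another modality and expanding ("zooming") them into a wider joint space, repeated to align the modalities. The paper does not give implementation details here, so the following is only a minimal PyTorch-style sketch of one plausible reading; the class name ShiftZoomFusion, the layer layout, and all dimensions are illustrative assumptions, not the paper's actual architecture.

```python
# Hypothetical sketch of alternating shift-and-zoom fusion.
# All names and hyperparameters below are assumptions for
# illustration, not the method published in arXiv:2406.08866.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ShiftZoomFusion(nn.Module):
    """Aligns two modality embeddings by alternating a learned shift
    (translation in feature space) with a zoom (expansion into a wider
    joint space, then projection back)."""

    def __init__(self, d_model: int, d_joint: int, n_steps: int = 2):
        super().__init__()
        self.shift = nn.ModuleList(
            nn.Linear(d_model, d_model) for _ in range(n_steps))
        self.zoom_in = nn.ModuleList(
            nn.Linear(d_model, d_joint) for _ in range(n_steps))
        self.zoom_out = nn.ModuleList(
            nn.Linear(d_joint, d_model) for _ in range(n_steps))

    def forward(self, a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
        # a, b: (batch, d_model) embeddings from two modalities.
        for shift, zin, zout in zip(self.shift, self.zoom_in, self.zoom_out):
            b = b + shift(a)              # shift: pull modality B toward A
            joint = zin(a) + zin(b)       # zoom: expand both into joint space
            a = a + zout(F.gelu(joint))   # project back and refine A
        return torch.cat([a, b], dim=-1)  # fused joint representation

# Usage: fuse a 256-d image embedding with a 256-d text embedding.
fusion = ShiftZoomFusion(d_model=256, d_joint=512)
img, txt = torch.randn(8, 256), torch.randn(8, 256)
fused = fusion(img, txt)  # -> shape (8, 512)
```

One design note on this reading: interleaving the shift (which moves one modality's features relative to the other) with the zoom (which forces both through a shared wider space) is what lets the representations co-adapt step by step rather than being projected once and concatenated.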

Related articles (most relevant):
arXiv:2304.06708 [cs.CV] (Published 2023-04-13)
Verbs in Action: Improving verb understanding in video-language models
arXiv:2306.15045 [cs.CV] (Published 2023-06-26)
Action Anticipation with Goal Consistency
arXiv:1902.07304 [cs.CV] (Published 2019-02-19)
DeepBall: Deep Neural-Network Ball Detector