arXiv:1712.06317 [cs.CV]

Spatial-Temporal Memory Networks for Video Object Detection

Fanyi Xiao, Yong Jae Lee

Published 2017-12-18, Version 1

We introduce Spatial-Temporal Memory Networks (STMN) for video object detection. At the core of STMN is a novel Spatial-Temporal Memory module (STMM), which serves as the recurrent computation unit to model long-term temporal appearance and motion dynamics. The STMM's design enables the integration of ImageNet pre-trained backbone CNN weights for both the feature stack and the prediction head, which we find to be critical for accurate detection. Furthermore, to tackle object motion in videos, we propose a novel MatchTrans module to align the spatial-temporal memory from frame to frame. We compare our method to state-of-the-art detectors on ImageNet VID, and conduct ablation studies to dissect the contributions of our design choices. We obtain state-of-the-art results with the VGG backbone, and competitive results with the ResNet backbone. To our knowledge, this is the first video object detector equipped with an explicit memory mechanism to model long-term temporal dynamics.

Related articles:
arXiv:2009.09660 [cs.CV] (Published 2020-09-21)
Feature Flow: In-network Feature Flow Estimation for Video Object Detection
arXiv:1602.08465 [cs.CV] (Published 2016-02-26)
Seq-NMS for Video Object Detection
Wei Han et al.
arXiv:1712.05896 [cs.CV] (Published 2017-12-16)
Impression Network for Video Object Detection