{ "id": "2009.09660", "version": "v1", "published": "2020-09-21T07:55:50.000Z", "updated": "2020-09-21T07:55:50.000Z", "title": "Feature Flow: In-network Feature Flow Estimation for Video Object Detection", "authors": [ "Ruibing Jin", "Guosheng Lin", "Changyun Wen", "Jianliang Wang", "Fayao Liu" ], "categories": [ "cs.CV" ], "abstract": "Optical flow, which expresses pixel displacement, is widely used in many computer vision tasks to provide pixel-level motion information. However, with the remarkable progress of convolutional neural networks, recent state-of-the-art approaches solve problems directly at the feature level. Since the displacement of a feature vector is not consistent with the pixel displacement, a common approach is to forward optical flow to a neural network and fine-tune this network on the task dataset. With this method, they expect the fine-tuned network to produce tensors encoding feature-level motion information. In this paper, we rethink this de facto paradigm and analyze its drawbacks in the video object detection task. To mitigate these issues, we propose a novel network (IFF-Net) with an \\textbf{I}n-network \\textbf{F}eature \\textbf{F}low estimation module (IFF module) for video object detection. Without resorting to pre-training on any additional dataset, our IFF module is able to directly produce \\textbf{feature flow} which indicates the feature displacement. Our IFF module consists of a shallow module, which shares the features with the detection branches. This compact design enables our IFF-Net to accurately detect objects, while maintaining a fast inference speed. Furthermore, we propose a transformation residual loss (TRL) based on \\textit{self-supervision}, which further improves the performance of our IFF-Net. 
Our IFF-Net outperforms existing methods and achieves state-of-the-art performance on ImageNet VID.", "revisions": [ { "version": "v1", "updated": "2020-09-21T07:55:50.000Z" } ], "analyses": { "keywords": [ "video object detection", "in-network feature flow estimation", "tensors encoding feature-level motion", "encoding feature-level motion information" ], "note": { "typesetting": "TeX", "pages": 0, "language": "en", "license": "arXiv", "status": "editable" } } }