{ "id": "2205.03776", "version": "v1", "published": "2022-05-08T04:00:28.000Z", "updated": "2022-05-08T04:00:28.000Z", "title": "SparseTT: Visual Tracking with Sparse Transformers", "authors": [ "Zhihong Fu", "Zehua Fu", "Qingjie Liu", "Wenrui Cai", "Yunhong Wang" ], "comment": "Accepted by IJCAI2022 as a long oral presentation", "categories": [ "cs.CV" ], "abstract": "Transformers have been successfully applied to the visual tracking task and significantly promote tracking performance. The self-attention mechanism designed to model long-range dependencies is the key to the success of Transformers. However, self-attention lacks focusing on the most relevant information in the search regions, making it easy to be distracted by background. In this paper, we relieve this issue with a sparse attention mechanism by focusing the most relevant information in the search regions, which enables a much accurate tracking. Furthermore, we introduce a double-head predictor to boost the accuracy of foreground-background classification and regression of target bounding boxes, which further improve the tracking performance. Extensive experiments show that, without bells and whistles, our method significantly outperforms the state-of-the-art approaches on LaSOT, GOT-10k, TrackingNet, and UAV123, while running at 40 FPS. Notably, the training time of our method is reduced by 75% compared to that of TransT. The source code and models are available at https://github.com/fzh0917/SparseTT.", "revisions": [ { "version": "v1", "updated": "2022-05-08T04:00:28.000Z" } ], "analyses": { "keywords": [ "visual tracking", "sparse transformers", "search regions", "relevant information", "sparse attention mechanism" ], "note": { "typesetting": "TeX", "pages": 0, "language": "en", "license": "arXiv", "status": "editable" } } }