arXiv:1707.07998 [cs.CV]

Bottom-Up and Top-Down Attention for Image Captioning and VQA

Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, Lei Zhang

Published 2017-07-25 (Version 1)

Top-down visual attention mechanisms have been used extensively in image captioning and visual question answering (VQA) to enable deeper image understanding through fine-grained analysis and even multiple steps of reasoning. In this work, we propose a combined bottom-up and top-down attention mechanism that enables attention to be calculated at the level of objects and other salient image regions, which provide a natural basis for attention to operate on. Within our approach, the bottom-up mechanism (based on Faster R-CNN) proposes image regions, each with an associated feature vector, while the top-down mechanism determines the weighting of those features. Applying this approach to image captioning, our results on the MSCOCO test server establish a new state-of-the-art for the task, improving the best published CIDEr score from 114.7 to 117.9 and BLEU-4 from 35.2 to 36.9. Demonstrating the broad applicability of the method, the same approach applied to VQA achieves a new state-of-the-art on the VQA v2.0 dataset, with 70.2% overall accuracy.
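As a rough illustration of the mechanism the abstract describes, the sketch below implements soft top-down attention over a fixed set of bottom-up region features: a detector such as Faster R-CNN supplies K feature vectors per image, and a task-specific query (a partial-caption state or a question embedding) is used to score and pool them. The class name, layer sizes, and the additive scoring form are illustrative assumptions, not the authors' released implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class TopDownAttention(nn.Module):
    # Soft top-down attention over bottom-up region features.
    # Hypothetical sketch: dimensions and the additive scoring form are
    # assumptions for illustration, not the paper's exact architecture.
    def __init__(self, region_dim=2048, query_dim=512, hidden_dim=512):
        super().__init__()
        self.region_proj = nn.Linear(region_dim, hidden_dim)
        self.query_proj = nn.Linear(query_dim, hidden_dim)
        self.score = nn.Linear(hidden_dim, 1)

    def forward(self, regions, query):
        # regions: (batch, K, region_dim) features from the bottom-up detector
        # query:   (batch, query_dim) top-down context (caption state or question embedding)
        joint = torch.tanh(self.region_proj(regions) + self.query_proj(query).unsqueeze(1))
        weights = F.softmax(self.score(joint).squeeze(-1), dim=-1)  # one weight per region
        attended = (weights.unsqueeze(-1) * regions).sum(dim=1)     # weighted average of region features
        return attended, weights

# Example: 36 detected regions per image, a common bottom-up feature configuration.
attn = TopDownAttention()
regions = torch.randn(2, 36, 2048)
query = torch.randn(2, 512)
attended, weights = attn(regions, query)
print(attended.shape, weights.shape)  # (2, 2048) and (2, 36)

In a captioning or VQA model of this kind, the pooled vector would stand in for grid-based CNN features as input to the caption decoder or answer classifier, while the per-region weights indicate which detected objects the model attends to.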

Related articles (most relevant):
arXiv:1912.08226 [cs.CV] (Published 2019-12-17)
M$^2$: Meshed-Memory Transformer for Image Captioning
arXiv:2210.10914 [cs.CV] (Published 2022-10-19)
Prophet Attention: Predicting Attention with Future Attention for Improved Image Captioning
arXiv:1706.08474 [cs.CV] (Published 2017-06-26)
Paying More Attention to Saliency: Image Captioning with Saliency and Context Attention