arXiv:2104.03762 [cs.CV]

Video Question Answering with Phrases via Semantic Roles

Arka Sadhu, Kan Chen, Ram Nevatia

Published: 2021-04-08 (Version 1)

Video Question Answering (VidQA) evaluation metrics have been limited to single-word answers or to selecting a phrase from a fixed set of phrases, which restricts the application scenarios of VidQA models. In this work, we leverage semantic roles derived from video descriptions to mask out certain phrases and introduce VidQAP, which poses VidQA as a fill-in-the-phrase task. To enable evaluation of answer phrases, we compute the relative improvement of the predicted answer compared to an empty string. To reduce the influence of language bias in VidQA datasets, we retrieve a video having a different answer for the same question. To facilitate research, we construct ActivityNet-SRL-QA and Charades-SRL-QA and benchmark them by extending three vision-language models. We further perform extensive analysis and ablative studies to guide future work.
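
The relative-improvement scoring mentioned above can be illustrated with a minimal Python sketch. It assumes token-level F1 as a stand-in for whatever phrase-similarity metric the paper pairs the scheme with (e.g., BLEU, ROUGE, or BERTScore); the function names and the simple difference-based normalization are illustrative assumptions, not the paper's actual implementation.

from collections import Counter

def token_f1(pred: str, ref: str) -> float:
    # Token-level F1 between a predicted phrase and a reference phrase.
    # Stands in for any phrase-similarity metric the benchmark might use.
    pred_toks, ref_toks = pred.lower().split(), ref.lower().split()
    if not pred_toks or not ref_toks:
        return 0.0
    overlap = sum((Counter(pred_toks) & Counter(ref_toks)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_toks)
    recall = overlap / len(ref_toks)
    return 2 * precision * recall / (precision + recall)

def relative_improvement(pred: str, ref: str) -> float:
    # Score the predicted phrase relative to an empty-string baseline,
    # as described in the abstract. The paper's exact normalization may
    # differ; this hypothetical sketch uses a plain difference.
    baseline = token_f1("", ref)  # empty answer scores 0.0 under token F1
    return token_f1(pred, ref) - baseline

# Example: grading a fill-in-the-phrase answer against a reference phrase
print(relative_improvement("the man in the red shirt", "a man in a red shirt"))

Under this sketch, a prediction that scores no better than an empty string receives zero credit, which is the point of the relative measure: free-form phrase metrics otherwise reward degenerate short answers.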

Comments: NAACL 2021 camera-ready, including appendix
Categories: cs.CV, cs.CL
Related articles:
arXiv:1705.01253 [cs.CV] (Published 2017-05-03)
The Forgettable-Watcher Model for Video Question Answering
arXiv:1909.02218 [cs.CV] (Published 2019-09-05)
A Better Way to Attend: Attention with Trees for Video Question Answering
arXiv:2210.03941 [cs.CV] (Published 2022-10-08)
Learning Fine-Grained Visual Understanding for Video Question Answering via Decoupling Spatial-Temporal Modeling