arXiv:2210.09263 [cs.CV]

Vision-Language Pre-training: Basics, Recent Advances, and Future Trends

Zhe Gan, Linjie Li, Chunyuan Li, Lijuan Wang, Zicheng Liu, Jianfeng Gao

Published 2022-10-17, Version 1

This paper surveys vision-language pre-training (VLP) methods for multimodal intelligence that have been developed in the last few years. We group these approaches into three categories: ($i$) VLP for image-text tasks, such as image captioning, image-text retrieval, visual question answering, and visual grounding; ($ii$) VLP for core computer vision tasks, such as (open-set) image classification, object detection, and segmentation; and ($iii$) VLP for video-text tasks, such as video captioning, video-text retrieval, and video question answering. For each category, we present a comprehensive review of state-of-the-art methods, and discuss the progress that has been made and challenges still being faced, using specific systems and models as case studies. In addition, for each category, we discuss advanced topics being actively explored in the research community, such as big foundation models, unified modeling, in-context few-shot learning, knowledge, robustness, and computer vision in the wild, to name a few.

Comments: A survey paper/book on Vision-Language Pre-training (102 pages)

Categories: cs.CV, cs.CL

Keywords: vision-language pre-training, core computer vision tasks, big foundation models, image-text retrieval

Related articles:

arXiv:2202.09061 [cs.CV] (Published 2022-02-18)

VLP: A Survey on Vision-Language Pre-training

Feilong Chen, Duzhan Zhang, Minglun Han, Xiuyi Chen, Jing Shi, Shuang Xu, Bo Xu

arXiv:2111.12233 [cs.CV] (Published 2021-11-24)

Scaling Up Vision-Language Pre-training for Image Captioning

Xiaowei Hu, Zhe Gan, Jianfeng Wang, Zhengyuan Yang, Zicheng Liu, Yumao Lu, Lijuan Wang

arXiv:2212.08281 [cs.CV] (Published 2022-12-16)

HGAN: Hierarchical Graph Alignment Network for Image-Text Retrieval