{ "id": "2206.11894", "version": "v1", "published": "2022-06-23T17:59:33.000Z", "updated": "2022-06-23T17:59:33.000Z", "title": "MaskViT: Masked Visual Pre-Training for Video Prediction", "authors": [ "Agrim Gupta", "Stephen Tian", "Yunzhi Zhang", "Jiajun Wu", "Roberto Martín-Martín", "Li Fei-Fei" ], "comment": "Project page: https://maskedvit.github.io/", "categories": [ "cs.CV", "cs.LG", "cs.RO" ], "abstract": "The ability to predict future visual observations conditioned on past observations and motor commands can enable embodied agents to plan solutions to a variety of tasks in complex environments. This work shows that we can create good video prediction models by pre-training transformers via masked visual modeling. Our approach, named MaskViT, is based on two simple design decisions. First, for memory and training efficiency, we use two types of window attention: spatial and spatiotemporal. Second, during training, we mask a variable percentage of tokens instead of a fixed mask ratio. For inference, MaskViT generates all tokens via iterative refinement where we incrementally decrease the masking ratio following a mask scheduling function. On several datasets we demonstrate that MaskViT outperforms prior works in video prediction, is parameter efficient, and can generate high-resolution videos (256x256). Further, we demonstrate the benefits of inference speedup (up to 512x) due to iterative decoding by using MaskViT for planning on a real robot. Our work suggests that we can endow embodied agents with powerful predictive models by leveraging the general framework of masked visual modeling with minimal domain knowledge.", "revisions": [ { "version": "v1", "updated": "2022-06-23T17:59:33.000Z" } ], "analyses": { "keywords": [ "masked visual pre-training", "maskvit outperforms prior works", "simple design decisions", "minimal domain knowledge", "video prediction models" ], "tags": [ "github project" ], "note": { "typesetting": "TeX", "pages": 0, "language": "en", "license": "arXiv", "status": "editable" } } }