{ "id": "2303.15233", "version": "v1", "published": "2023-03-27T14:15:17.000Z", "updated": "2023-03-27T14:15:17.000Z", "title": "Text-to-Image Diffusion Models are Zero-Shot Classifiers", "authors": [ "Kevin Clark", "Priyank Jaini" ], "categories": [ "cs.CV", "cs.AI", "cs.LG" ], "abstract": "The excellent generative capabilities of text-to-image diffusion models suggest they learn informative representations of image-text data. However, what knowledge their representations capture is not fully understood, and they have not been thoroughly explored on downstream tasks. We investigate diffusion models by proposing a method for evaluating them as zero-shot classifiers. The key idea is using a diffusion model's ability to denoise a noised image given a text description of a label as a proxy for that label's likelihood. We apply our method to Imagen, using it to probe fine-grained aspects of Imagen's knowledge and comparing it with CLIP's zero-shot abilities. Imagen performs competitively with CLIP on a wide range of zero-shot image classification datasets. Additionally, it achieves state-of-the-art results on shape/texture bias tests and can successfully perform attribute binding while CLIP cannot. Although generative pre-training is prevalent in NLP, visual foundation models often use other methods such as contrastive learning. Based on our findings, we argue that generative pre-training should be explored as a compelling alternative for vision and vision-language problems.", "revisions": [ { "version": "v1", "updated": "2023-03-27T14:15:17.000Z" } ], "analyses": { "keywords": [ "text-to-image diffusion models", "zero-shot classifiers", "zero-shot image classification datasets", "clips zero-shot abilities", "diffusion models ability" ], "note": { "typesetting": "TeX", "pages": 0, "language": "en", "license": "arXiv", "status": "editable" } } }