The CLIP paper and repo show a straightforward way to use the model for zero-shot image classification. CLIP consists of two components: a vision model and a language model. While the language model is a standard Transformer, the vision model is either a modified ResNet or a Vision Transformer. Both were pretrained jointly on 400M image-text pairs in a contrastive setting.

CLIP mainly provides APIs to encode images and text into this shared latent space. After encoding a list of images, the embeddings can be used to find the image closest to an encoded text prompt. Alternatively, a list of text options can be encoded and used to find the best match for an encoded image.
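For illustration, here is a minimal sketch of that encoding API using the openai/clip package; the model name, image path, and example prompts are placeholders I chose, not part of the original write-up.

```python
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/16", device=device)

# Encode one image and a few text prompts into the shared latent space
image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)  # hypothetical image path
texts = clip.tokenize(["a photo of a cat", "a photo of a dog"]).to(device)

with torch.no_grad():
    image_features = model.encode_image(image)  # shape (1, 512) for ViT-B/16
    text_features = model.encode_text(texts)    # shape (2, 512)

# Normalize so that dot products are cosine similarities
image_features = image_features / image_features.norm(dim=-1, keepdim=True)
text_features = text_features / text_features.norm(dim=-1, keepdim=True)

similarity = image_features @ text_features.T   # larger value = closer match
```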

The latter option is presented as Zero-Shot Prediction: the text options effectively form a custom classifier, and the classification result is then retrieved for an encoded image. It's called zero-shot because the CLIP model was never explicitly trained on these labels.
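A minimal sketch of that zero-shot prediction flow, reusing the model and preprocessed image from the snippet above; the candidate labels are again placeholders.

```python
# Candidate labels for the improvised classifier
labels = ["cat", "dog", "bird"]
text = clip.tokenize(labels).to(device)

with torch.no_grad():
    # The forward pass returns scaled cosine similarities between the image
    # and every text option
    logits_per_image, logits_per_text = model(image, text)
    probs = logits_per_image.softmax(dim=-1)

predicted = labels[probs.argmax(dim=-1).item()]
print(predicted, probs.tolist())
```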

As detailed by OpenAI in the CLIP paper and the repo, and as actively researched for example by the BigScience working group on Prompt Engineering (also here), the framing of the classification label text can play a big role in prediction performance. For example, instead of the labels cat vs. dog, better performance can usually be achieved with the labels a photo of a cat and a photo of a dog, as sketched below.
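The CLIP repo goes a step further and averages several prompt templates per label to build a more robust zero-shot classifier. The sketch below follows that idea with a small, made-up template list and reuses the normalized image_features from the first snippet; the 100.0 scale approximates the model's learned logit scale.

```python
# Illustrative prompt templates; the lists used in the CLIP repo are much longer
templates = [
    "a photo of a {}.",
    "a blurry photo of a {}.",
    "a close-up photo of a {}.",
]
labels = ["cat", "dog"]

with torch.no_grad():
    class_embeddings = []
    for label in labels:
        tokens = clip.tokenize([t.format(label) for t in templates]).to(device)
        emb = model.encode_text(tokens)
        emb = emb / emb.norm(dim=-1, keepdim=True)
        emb = emb.mean(dim=0)          # average over all templates of this label
        emb = emb / emb.norm()         # re-normalize the averaged embedding
        class_embeddings.append(emb)
    classifier = torch.stack(class_embeddings, dim=1)  # (embed_dim, num_labels)

# image_features is the normalized image embedding from the first snippet
logits = 100.0 * image_features @ classifier
probs = logits.softmax(dim=-1)
print(dict(zip(labels, probs[0].tolist())))
```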


I’ve conducted some initial tests and found that CLIP (ViT-B/16) can recognize patterns in fashion (e.g. dots, stripes, flowers, checkered, print, …) zero-shot with ~60% accuracy, without significant prompt engineering or fine-tuning on additional data.

The work on multi-language fine-tuning of CLIP shows how more traditional optimization approaches can build on the extensively pretrained CLIP model. Last but not least, in combination with Facebook's faiss, I've found CLIP to be quite useful as part of a high-performance recommendation and/or search engine; a rough sketch follows below.
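As a sketch of the search-engine idea: index L2-normalized CLIP image embeddings in a flat faiss inner-product index and query it with encoded text. The array names build on the snippets above, and the number of neighbours is arbitrary.

```python
import faiss

# (n, 512) float32 array of L2-normalized CLIP image features,
# e.g. collected from encode_image calls as in the snippets above
image_embeddings = image_features.cpu().numpy().astype("float32")

# Inner product equals cosine similarity on normalized vectors
index = faiss.IndexFlatIP(image_embeddings.shape[1])
index.add(image_embeddings)

# Query with encoded text prompts to retrieve the closest images
query = text_features.cpu().numpy().astype("float32")
scores, ids = index.search(query, 5)   # top-5 neighbours per query (arbitrary k)
```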

More to come, no doubt!