After the surprising image classification results with zero-shot CLIP, I’ve run a benchmark over a number of deep computer vision architectures from torchvision, fastai and the awesome timm library by Ross Wightman.

I’m using a 20-class subset of DeepFashion with around 100k images as the dataset. To accommodate the memory requirements of the different architectures, I’ve set the batch size to 32 (this leaves room for improvement in the smaller models). All models are pretrained on ImageNet, get a newly initialized head, and are subsequently fine-tuned on DeepFashion for one epoch with frozen weights (new head only) and three epochs with unfrozen weights.
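
In fastai this setup boils down to a few lines. A minimal sketch (the folder layout, validation split and image size here are my assumptions, not necessarily the exact configuration):

```python
from fastai.vision.all import *

# Assumed layout: one folder per class; adjust to your DeepFashion subset.
path = Path("data/deepfashion_20")

dls = ImageDataLoaders.from_folder(
    path,
    valid_pct=0.2,          # assumption: simple random split
    item_tfms=Resize(224),
    bs=32,                  # fits the largest models into GPU memory
)

# ImageNet-pretrained backbone with a newly initialized head
learn = vision_learner(dls, resnet34, metrics=accuracy)

# one epoch with the body frozen (new head only), then three epochs unfrozen
learn.fine_tune(3, freeze_epochs=1)
```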


Training took ~20 minutes per epoch on average across all models on my hardware, so roughly 1:15h per model (the first epoch with frozen weights is generally a bit faster than the others). The full run took around 18 hours in total. Eventually, I might do a few more runs, just to get a feel for the actual distribution of results. Here are the results:

| Model | Library | Accuracy | Training time per epoch (mm:ss) |
|---|---|---|---|
| resnet34 | torch | 0.428767 | 12:34 |
| resnet50 | torch | 0.441164 | 18:08 |
| resnet101 | torch | 0.446995 | 24:50 |
| resnet152 | torch | 0.448151 | 32:15 |
| xresnet34 | fastai | 0.326014 | 12:44 |
| xresnext34 | fastai | 0.255726 | 15:24 |
| efficientnet_b0 | timm | 0.382433 | 13:39 |
| efficientnet_b2 | timm | 0.385690 | 16:47 |
| efficientnet_b4 | timm | 0.360475 | 24:01 |
| densenet121 | timm | 0.395409 | 20:14 |
| inception_v4 | timm | 0.395409 | 23:00 |
| inception_resnet_v2 | timm | 0.405652 | 25:18 |
| mobilenetv3_large_100 | timm | 0.385007 | 11:40 |
| vit_base_patch16_224 | timm | failed | n/a |
| xception41 | timm | 0.389262 | 23:40 |
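
For the timm entries, getting the pretrained backbone with a fresh 20-class head is a one-liner (a sketch; efficientnet_b0 stands in for any of the benchmarked timm models):

```python
import timm
import torch

# ImageNet weights, classifier head re-initialized for 20 classes
model = timm.create_model("efficientnet_b0", pretrained=True, num_classes=20)

# quick sanity check with a dummy batch
x = torch.randn(32, 3, 224, 224)
print(model(x).shape)  # torch.Size([32, 20])
```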

I was a little surprised how well the good old ResNets perform in this setting, even the smaller ones. Certainly in terms of efficiency (cost/time to accuracy), they are a great choice. I suspect it’s partly due to the training setup (dataset, hardware, batch size, epochs, …) that some models didn’t perform better. But for my use case, this was quite informative.


I’m investigating why the timm ViT model failed in my benchmark (possibly an input-size mismatch, since vit_base_patch16_224 expects a fixed 224×224 input), and also how to run the same test with one of the pretrained CLIP vision models (ViT or ResNet).
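
My current idea for the CLIP test is to wrap the CLIP image encoder with a newly initialized linear head, roughly like this (a sketch using OpenAI’s clip package; the 512 embedding width applies to ViT-B/32, and the class count is my dataset’s):

```python
import clip
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, preprocess = clip.load("ViT-B/32", device=device)

class ClipClassifier(nn.Module):
    def __init__(self, clip_model, n_classes=20):
        super().__init__()
        self.clip_model = clip_model
        self.head = nn.Linear(512, n_classes)  # 512 = ViT-B/32 embedding width

    def forward(self, x):
        # encode_image casts inputs to the (fp16) CLIP weights internally
        feats = self.clip_model.encode_image(x)
        return self.head(feats.float())

model = ClipClassifier(clip_model).to(device)

# freeze the CLIP backbone for the initial head-only phase
for p in model.clip_model.parameters():
    p.requires_grad = False
```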

Update:
I’ve wanted to try Weights & Biases for some time now, and this project provided a great opportunity. After comparing architectures, I’ve run a couple of (parallel) experiments locally and in Colab to compare ResNet34 with different learning rate schedules on the same dataset. Please see here for more details.
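
The logging setup itself is minimal. A sketch of what such a schedule comparison can look like with fastai’s WandbCallback (the project name and the two schedules are placeholders, not necessarily the ones I compared):

```python
import wandb
from fastai.vision.all import *
from fastai.callback.wandb import WandbCallback

# dls: the DataLoaders from the first sketch above
for schedule in ["one_cycle", "flat_cos"]:
    wandb.init(project="deepfashion-lr-schedules", name=schedule)
    learn = vision_learner(dls, resnet34, metrics=accuracy, cbs=WandbCallback())
    if schedule == "one_cycle":
        learn.fit_one_cycle(4, lr_max=1e-3)  # 1cycle policy (fastai default)
    else:
        learn.fit_flat_cos(4, lr=1e-3)       # flat LR with cosine annealing
    wandb.finish()
```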
