As far as I know, pre-trained models work well as feature extractors in many tasks, thanks to the large datasets they were trained on.
However, I'm wondering whether such a model, say VGG-16,
has the ability to extract "semantic" information from an input image.
If so, given an unlabeled dataset,
is it possible to cluster the images by measuring the semantic similarity of their extracted features?
Here is what I've tried so far:
- Load the pre-trained VGG-16 through PyTorch.
- Load the CIFAR-10 dataset and transform it into a batched tensor X of size (5000, 3, 224, 224).
- Adjust vgg.classifier so that its output dimension is 4096.
- Extract features (the full pipeline is sketched after this list):

      def extract_features(vgg, X):                        # X: (5000, 3, 224, 224)
          features = vgg.features(X).view(X.shape[0], -1)  # (5000, 25088)
          features = vgg.classifier(features)              # (5000, 4096)
          return features
- Try out cosine similarity, the inner product, and torch.cdist as similarity/distance measures (also sketched below), only to find several bad clusters.
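For concreteness, here is a minimal end-to-end sketch of the steps above. The batch size, the use of the test split, and dropping only the final linear layer of the classifier are illustrative choices, not necessarily the best ones:

    import torch
    import torchvision
    from torchvision import transforms

    # Pre-trained VGG-16 in eval mode (so dropout layers are inactive).
    # Newer torchvision versions use weights=... instead of pretrained=True.
    vgg = torchvision.models.vgg16(pretrained=True).eval()
    # Drop the final Linear(4096, 1000) so the classifier outputs 4096-d features.
    vgg.classifier = torch.nn.Sequential(*list(vgg.classifier.children())[:-1])

    # CIFAR-10 images upsampled to 224x224 and normalized with ImageNet stats.
    transform = transforms.Compose([
        transforms.Resize(224),
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406],
                             std=[0.229, 0.224, 0.225]),
    ])
    dataset = torchvision.datasets.CIFAR10(root="./data", train=False,
                                           download=True, transform=transform)
    loader = torch.utils.data.DataLoader(dataset, batch_size=64)

    feats = []
    with torch.no_grad():
        for X, _ in loader:                           # labels unused (unlabeled setting)
            f = vgg.features(X).view(X.size(0), -1)   # (B, 25088)
            feats.append(vgg.classifier(f))           # (B, 4096)
    features = torch.cat(feats)                       # (N, 4096)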
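And this is roughly how I compare the features and try to group the images. The k-means call (via scikit-learn) is just one concrete example of what I mean by "clustering", not a fixed choice; on L2-normalized features, Euclidean k-means is closely related to grouping by cosine similarity:

    import torch
    from sklearn.cluster import KMeans

    # features: (N, 4096) from the extraction step above.
    f = torch.nn.functional.normalize(features, dim=1)   # unit-norm rows

    cos_sim = f @ f.t()                        # (N, N) pairwise cosine similarities
    inner   = features @ features.t()          # (N, N) raw inner products
    dists   = torch.cdist(features, features)  # (N, N) Euclidean distances

    # Example clustering: k-means on the normalized features, k = 10 for CIFAR-10.
    labels = KMeans(n_clusters=10, n_init=10).fit_predict(f.numpy())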
Any suggestions? Thanks in advance.