1512.04150
[1512.04150] Learning Deep Features for Discriminative Localization [PDF] [notes]
Bolei Zhou, Aditya Khosla, Agata Lapedriza, Aude Oliva, Antonio Torralba
read 06/03/2018
Uses networks trained for classification to generate activation maps that reveal how much each image location contributed to the prediction. The goal is to generate reliable class activation maps that localize, for each class, the evidence in the image.
Uses the global average pooling (GAP) layer to preserve the localization ability of the network. This layer performs spatial averaging of the feature map activations and therefore produces one output per feature map. The architecture is:
- several convolutional layers
- GAP
- fully connected layer (linear layer)
- trained with softmax for classification
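The pipeline above can be sketched in NumPy as follows. This is only an illustration of the GAP, linear layer and softmax steps, not the authors' code: the function name, the 512 feature maps and the 10 classes are assumptions, the convolutional layers are replaced by a random array, and the bias term is left out.

```python
import numpy as np

def gap_classifier_forward(feature_maps, weights):
    """Forward pass of the GAP head.

    feature_maps: (K, H, W) output of the last convolutional layer.
    weights:      (C, K) weights of the linear layer (bias omitted here).
    """
    # Global average pooling: one scalar per feature map.
    pooled = feature_maps.mean(axis=(1, 2))          # shape (K,)
    # Linear layer: class scores are linear combinations of the averages.
    scores = weights @ pooled                        # shape (C,)
    # Numerically stable softmax over the class scores.
    exp = np.exp(scores - scores.max())
    probs = exp / exp.sum()
    return pooled, scores, probs

# Hypothetical sizes: 512 feature maps of 7x7, 10 classes.
rng = np.random.default_rng(0)
feature_maps = rng.random((512, 7, 7))
weights = rng.normal(size=(10, 512))
pooled, scores, probs = gap_classifier_forward(feature_maps, weights)
```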
The averages of the feature maps are linearly combined to produce a class score before the softmax layer. The linear weights therefore explicitly weight the contribution of each feature map's average to the final prediction score.
The activations of each feature map, before averaging, are a spatial representation of the cues that contribute to its average activation.
A class activation map (CAM) can therefore be produced by a weighted sum of the feature maps (before averaging), where the weights are the ones the fully connected linear layer learned to produce the prediction for a given class. The resulting map has low resolution, since it has the resolution of the output of the last convolutional layer (7x7 in this case). It gives the contribution of each location to the final class prediction score.
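A minimal NumPy sketch of this weighted sum, assuming a linear layer without bias; the function name and array sizes are hypothetical. It also checks the property that makes the map interpretable: averaging the CAM over all locations recovers the class score.

```python
import numpy as np

def class_activation_map(feature_maps, weights, class_idx):
    """Weighted sum of the (un-averaged) feature maps for one class.

    feature_maps: (K, H, W) output of the last convolutional layer.
    weights:      (C, K) weights of the linear layer after GAP.
    Returns an (H, W) map at the resolution of the last conv layer.
    """
    # Contract the channel axis: (K,) with (K, H, W) gives (H, W).
    return np.tensordot(weights[class_idx], feature_maps, axes=1)

# Hypothetical sizes: 512 feature maps of 7x7, 10 classes.
rng = np.random.default_rng(0)
feature_maps = rng.random((512, 7, 7))
weights = rng.normal(size=(10, 512))

cam = class_activation_map(feature_maps, weights, class_idx=3)

# Class score computed the usual way: GAP, then linear layer.
score = weights[3] @ feature_maps.mean(axis=(1, 2))
```

Because both averaging and the linear layer are linear operations, `cam.mean()` equals `score` up to floating-point error, so each entry of the map is that location's share of the class score.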
They evaluate the localization ability of the network on localization tasks. For this purpose they generate bounding boxes from the class activation maps with the following thresholding scheme:
- they segment the regions whose value is at least 20% of the max value of the CAM
- they take the bounding box that covers the largest connected component
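The two steps above can be sketched as follows. This is an illustrative implementation, not the authors' code: the function name is hypothetical, 4-connectivity is an assumption (the notes do not say which connectivity is used), and the box is returned in the coordinates of whatever grid the CAM is given on.

```python
from collections import deque

import numpy as np

def cam_to_bbox(cam, rel_threshold=0.2):
    """Bounding box (row_min, col_min, row_max, col_max) of the largest
    connected region where cam >= rel_threshold * cam.max().
    Returns None if no location passes the threshold."""
    mask = cam >= rel_threshold * cam.max()
    height, width = mask.shape
    visited = np.zeros((height, width), dtype=bool)
    best = []
    for r0 in range(height):
        for c0 in range(width):
            if not mask[r0, c0] or visited[r0, c0]:
                continue
            # Breadth-first search over one 4-connected component.
            component = []
            queue = deque([(r0, c0)])
            visited[r0, c0] = True
            while queue:
                r, c = queue.popleft()
                component.append((r, c))
                for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    nr, nc = r + dr, c + dc
                    if (0 <= nr < height and 0 <= nc < width
                            and mask[nr, nc] and not visited[nr, nc]):
                        visited[nr, nc] = True
                        queue.append((nr, nc))
            if len(component) > len(best):
                best = component
    if not best:
        return None
    rows = [r for r, _ in best]
    cols = [c for _, c in best]
    return min(rows), min(cols), max(rows), max(cols)

# Toy CAM: one isolated peak in the corner, one larger blob in the middle.
cam = np.array([
    [0.0, 0.0, 0.0, 0.0, 0.9],
    [0.0, 0.5, 1.0, 0.0, 0.0],
    [0.0, 0.6, 0.8, 0.0, 0.0],
    [0.0, 0.0, 0.1, 0.0, 0.0],
])
bbox = cam_to_bbox(cam)  # (1, 1, 2, 2): the 2x2 blob, not the corner peak
```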
This produces weakly supervised bounding boxes, although the network has never been explicitly trained for localization.