
1512.04150


CVPR 2016

[1512.04150] Learning Deep Features for Discriminative Localization

Bolei Zhou, Aditya Khosla, Agata Lapedriza, Aude Oliva, Antonio Torralba

read 06/03/2018

Objective

Use networks trained for classification to generate activation maps that reveal how much each image location contributed to the prediction. The goal is therefore to produce reliable class activation maps that localize, for each class, the evidence for that class in the image.

Synthesis

Uses the global average pooling (GAP) layer to preserve the localization ability of the network. This layer performs spatial averaging of the feature map activations and therefore produces one output per feature map.
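As a minimal illustration (PyTorch-style, not taken from the paper), GAP reduces each feature map to a single scalar by averaging over its spatial dimensions:

```python
import torch

# Assume a batch of conv feature maps: (batch, channels, height, width)
features = torch.randn(1, 1024, 7, 7)

# Global average pooling: spatial mean -> one scalar per feature map
gap = features.mean(dim=(2, 3))  # shape (1, 1024)
```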

Network Structure

  • several convolutional layers
  • GAP
  • fully connected layer (linear layer)
  • trained with softmax for classification
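A minimal sketch of this structure in PyTorch; the layer sizes are illustrative assumptions, not the paper's exact GoogLeNet/VGG/AlexNet variants:

```python
import torch.nn as nn

class GAPClassifier(nn.Module):
    def __init__(self, num_classes=1000):
        super().__init__()
        # Convolutional trunk (illustrative only)
        self.features = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.MaxPool2d(2),
            nn.Conv2d(64, 1024, kernel_size=3, padding=1), nn.ReLU(inplace=True),
        )
        self.gap = nn.AdaptiveAvgPool2d(1)      # global average pooling
        self.fc = nn.Linear(1024, num_classes)  # single linear layer on the GAP outputs

    def forward(self, x):
        fmaps = self.features(x)                # (B, 1024, H, W) conv feature maps
        pooled = self.gap(fmaps).flatten(1)     # (B, 1024): one value per feature map
        return self.fc(pooled)                  # class scores, trained with a softmax loss

# Training would use nn.CrossEntropyLoss (softmax + NLL) on the class scores.
```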

Class activation mapping

Averages of the feature maps are linearly combined to produce a class score before the softmax layer. The linear weights therefore explicitly weight the contribution of each feature-map average to the final prediction score.

The activations within each feature map are a spatial representation of the cues that lead to its average activation.

A class activation map (CAM) can therefore be produced by a weighted sum of the feature maps (before averaging), where the weights are the ones learned by the fully connected linear layer to produce the prediction for a given class. The resulting map has low resolution, since it matches the output resolution of the last convolutional layer (7x7 in this case). This map gives the contribution of each location to the final class prediction score.
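A hedged sketch of this computation, assuming access to the last conv feature maps and the weight matrix of the linear layer from a model like the one above (names and the upsampling step are assumptions for illustration):

```python
import torch
import torch.nn.functional as F

def compute_cam(fmaps, fc_weight, class_idx, out_size=(224, 224)):
    """Weighted sum of feature maps using the linear-layer weights of one class.

    fmaps: (C, H, W) activations of the last conv layer for one image
    fc_weight: (num_classes, C) weights of the linear layer after GAP
    """
    weights = fc_weight[class_idx]                   # (C,) weights for the chosen class
    cam = torch.einsum('c,chw->hw', weights, fmaps)  # (H, W), e.g. 7x7
    # Normalize to [0, 1] for visualization / thresholding (a common convention)
    cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)
    # Upsample to the input resolution to overlay on the image
    cam = F.interpolate(cam[None, None], size=out_size,
                        mode='bilinear', align_corners=False)[0, 0]
    return cam
```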

Experiments

They evaluate the localization ability of the network on weakly-supervised localization tasks. For this purpose they generate bounding boxes from the class activation maps with the following thresholding scheme:

  • they keep the regions whose CAM value is at least 20% of the CAM's max value
  • they take the bounding box that covers the largest connected component

This produces weakly-supervised bounding boxes although the network has never been explicitly trained for localization.
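A sketch of this thresholding scheme, assuming `cam` is a 2D NumPy array already upsampled to the image size (uses `scipy.ndimage` for the connected components; the function name is an assumption):

```python
import numpy as np
from scipy import ndimage

def cam_to_bbox(cam, threshold=0.2):
    """Bounding box of the largest connected region above 20% of the CAM's max value."""
    mask = cam >= threshold * cam.max()
    labels, num = ndimage.label(mask)              # connected components of the mask
    if num == 0:
        return None
    sizes = np.bincount(labels.ravel())
    sizes[0] = 0                                   # label 0 is background
    largest = sizes.argmax()                       # largest connected component
    ys, xs = np.where(labels == largest)
    return xs.min(), ys.min(), xs.max(), ys.max()  # (x_min, y_min, x_max, y_max)
```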
