
1512.04150


CVPR 2016

[1512.04150] Learning Deep Features for Discriminative Localization

Bolei Zhou, Aditya Khosla, Agata Lapedriza, Aude Oliva, Antonio Torralba

read 06/03/2018

Objective

Use networks trained for classification to generate activation maps that reveal how much each image location contributed to the prediction. The goal is therefore to produce reliable class activation maps that localize, for each class, the evidence for that class in the image.

Synthesis

Uses the global average pooling (GAP) layer to preserve the localization ability of the network. This layer performs spatial averaging of the feature map activations and therefore produces one output per feature map.
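As a minimal illustration (PyTorch-style, not taken from the paper), GAP reduces each feature map to a single scalar by averaging over its spatial dimensions:

```python
import torch

# Assume a batch of conv feature maps: (batch, channels, height, width)
features = torch.randn(1, 1024, 7, 7)

# Global average pooling: spatial mean -> one scalar per feature map
gap = features.mean(dim=(2, 3))  # shape (1, 1024)
```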

Network Structure

  • several convolutional layers
  • GAP
  • fully connected layer (linear layer)
  • trained with softmax for classification
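A minimal sketch of this structure in PyTorch; the layer sizes are illustrative assumptions, not the paper's exact GoogLeNet/VGG/AlexNet variants:

```python
import torch.nn as nn

class GAPClassifier(nn.Module):
    def __init__(self, num_classes=1000):
        super().__init__()
        # Convolutional trunk (illustrative only)
        self.features = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.MaxPool2d(2),
            nn.Conv2d(64, 1024, kernel_size=3, padding=1), nn.ReLU(inplace=True),
        )
        self.gap = nn.AdaptiveAvgPool2d(1)      # global average pooling
        self.fc = nn.Linear(1024, num_classes)  # single linear layer on the GAP outputs

    def forward(self, x):
        fmaps = self.features(x)                # (B, 1024, H, W) conv feature maps
        pooled = self.gap(fmaps).flatten(1)     # (B, 1024): one value per feature map
        return self.fc(pooled)                  # class scores, trained with a softmax loss

# Training would use nn.CrossEntropyLoss (softmax + NLL) on the class scores.
```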

Class activation mapping

Averages of the feature maps are linearly combined to produce a class score before the softmax layer. The linear weights therefore explicitly weight the contribution of each feature-map average to the final prediction score.

The activations within each feature map are a spatial representation of the cues that lead to its average activation.

A class activation map (CAM) can therefore be produced by a weighted sum of the feature maps (before averaging), where the weights are the ones learned by the fully connected linear layer to produce the prediction for a given class. The resulting map has low resolution, since it matches the output resolution of the last convolutional layer (7x7 in this case). This map gives the contribution of each location to the final class prediction score.
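A hedged sketch of this computation, assuming access to the last conv feature maps and the weight matrix of the linear layer from a model like the one above (names and the upsampling step are assumptions for illustration):

```python
import torch
import torch.nn.functional as F

def compute_cam(fmaps, fc_weight, class_idx, out_size=(224, 224)):
    """Weighted sum of feature maps using the linear-layer weights of one class.

    fmaps: (C, H, W) activations of the last conv layer for one image
    fc_weight: (num_classes, C) weights of the linear layer after GAP
    """
    weights = fc_weight[class_idx]                   # (C,) weights for the chosen class
    cam = torch.einsum('c,chw->hw', weights, fmaps)  # (H, W), e.g. 7x7
    # Normalize to [0, 1] for visualization / thresholding (a common convention)
    cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)
    # Upsample to the input resolution to overlay on the image
    cam = F.interpolate(cam[None, None], size=out_size,
                        mode='bilinear', align_corners=False)[0, 0]
    return cam
```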

Experiments

They evaluate the localization ability of the network on weakly-supervised localization tasks. For this purpose they generate bounding boxes from the class activation maps with the following thresholding scheme:

  • they keep the regions whose CAM value is at least 20% of the CAM's max value
  • they take the bounding box that covers the largest connected component

This produces weakly-supervised bounding boxes although the network has never been explicitly trained for localization.
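A sketch of this thresholding scheme, assuming `cam` is a 2D NumPy array already upsampled to the image size (uses `scipy.ndimage` for the connected components; the function name is an assumption):

```python
import numpy as np
from scipy import ndimage

def cam_to_bbox(cam, threshold=0.2):
    """Bounding box of the largest connected region above 20% of the CAM's max value."""
    mask = cam >= threshold * cam.max()
    labels, num = ndimage.label(mask)              # connected components of the mask
    if num == 0:
        return None
    sizes = np.bincount(labels.ravel())
    sizes[0] = 0                                   # label 0 is background
    largest = sizes.argmax()                       # largest connected component
    ys, xs = np.where(labels == largest)
    return xs.min(), ys.min(), xs.max(), ys.max()  # (x_min, y_min, x_max, y_max)
```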
