ieee 8014893

2017

CVPR Workshop

[ieee-8014893] Robust Hand Detection and Classification in Vehicles and in the Wild [PDF] [notes]

T. Hoang Ngan Le, Chenchen Zhu, Yutong Zheng, Khoa Luu, Marios Savvides

read 25/09/2017

Objective

Address hand detection with a ConvNet-based approach: Multiple Scale Region-based Fully Convolutional Networks (MS-RFCN)

All layers are convolutional and computed only once for the entire image.

Winning approach of the 2017 VIVA hand detection challenge (see leaderboard)

Synthesis

This article, like 1607.07155, focuses on multi-scale region proposal and relies on several feature maps of one network, extracted at different scales, to perform hand detection. Unlike 1607.07155, this paper takes a fully convolutional approach to the problem.

Structure

Region proposal

Extracts feature maps at 3 different depths of a ResNet network. (More precisely, they use conv3_1, conv4_1 and conv4_23, with stride-2 pooling applied to conv3_1 to bring the feature maps to the same size.)
They apply L2 normalization along the channel axis of each feature map and concatenate the results to generate the feature map used for region proposal.
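
The normalize-then-concatenate step can be sketched as follows. This is a minimal pure-Python illustration, not the authors' code: the function names, the `[C][H][W]` nested-list layout and the `eps` term are assumptions made for this example, and any learnable rescaling applied after normalization is left out.

```python
import math

def l2_normalize_channels(fmap, eps=1e-12):
    """L2-normalize a feature map stored as [C][H][W] along the channel axis."""
    channels, height, width = len(fmap), len(fmap[0]), len(fmap[0][0])
    out = [[[0.0] * width for _ in range(height)] for _ in range(channels)]
    for y in range(height):
        for x in range(width):
            # Norm of the channel vector at spatial position (y, x).
            norm = math.sqrt(sum(fmap[c][y][x] ** 2 for c in range(channels)))
            for c in range(channels):
                out[c][y][x] = fmap[c][y][x] / (norm + eps)
    return out

def concat_channels(fmaps):
    """Concatenate feature maps of equal spatial size along the channel axis."""
    merged = []
    for fmap in fmaps:
        merged.extend(fmap)
    return merged

# Toy inputs (hypothetical values): a "shallow" map with large responses
# and a "deep" map with small responses, both 1x2 spatially.
shallow = [[[3.0, 0.0]], [[4.0, 5.0]]]  # 2 channels
deep = [[[0.01, 0.02]]]                 # 1 channel
fused = concat_channels([l2_normalize_channels(shallow),
                         l2_normalize_channels(deep)])
```

After normalization every spatial position has a unit-length channel vector in each map, so the shallow map no longer dominates the concatenated features purely because of its larger raw magnitudes.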

MS-RFCN

They use ResNet's conv2_3, conv3_4 and conv5_3, and apply the same structure as conv5 to the outputs of conv2_3 and conv3_4; they name the resulting layers conv6 and conv7.
The features are then concatenated, and a position-sensitive RoI pooling layer generates scores for each RoI, which is classified as background or hand. Scores are produced for k × k locations, giving k × k × (c + 1) outputs, where c is the number of object categories (the + 1 is for background). Each RoI layer conducts selective pooling (by taking into account the region proposed by the RPN).
Channel sizes are reduced using 1x1 convolutional layers.
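
To make the k × k × (c + 1) bookkeeping concrete, here is a hedged sketch of position-sensitive RoI pooling in the R-FCN style: each of the k × k bins reads only from its own group of c + 1 score maps, average-pools inside the bin, and the per-bin scores are then averaged into one score per class. The function name, the channel ordering and the end-exclusive RoI convention are assumptions for illustration; the paper's implementation may differ.

```python
import math

def ps_roi_pool(score_maps, roi, k, num_classes):
    """Position-sensitive RoI pooling (illustrative sketch).

    score_maps: nested list [k*k*(num_classes+1)][H][W]
    roi: (x0, y0, x1, y1) in feature-map cells, end-exclusive,
         assumed to lie inside the map.
    Returns one score per class, index 0 being background.
    """
    x0, y0, x1, y1 = roi
    bin_w = (x1 - x0) / k
    bin_h = (y1 - y0) / k
    scores = [0.0] * (num_classes + 1)
    for i in range(k):          # bin row
        ys = y0 + math.floor(i * bin_h)
        ye = max(ys + 1, y0 + math.ceil((i + 1) * bin_h))
        for j in range(k):      # bin column
            xs = x0 + math.floor(j * bin_w)
            xe = max(xs + 1, x0 + math.ceil((j + 1) * bin_w))
            for cls in range(num_classes + 1):
                # Assumed layout: bins are the outer index, classes the inner one.
                channel = (i * k + j) * (num_classes + 1) + cls
                cells = [score_maps[channel][y][x]
                         for y in range(ys, ye) for x in range(xs, xe)]
                scores[cls] += sum(cells) / len(cells)
    # Vote: average the k*k position-sensitive scores of each class.
    return [s / (k * k) for s in scores]

# Toy example: k = 2, one object class (hand) plus background -> 8 score maps.
k, num_classes = 2, 1
maps = [[[float(ch)] * 4 for _ in range(4)]
        for ch in range(k * k * (num_classes + 1))]
background_score, hand_score = ps_roi_pool(maps, (0, 0, 4, 4), k, num_classes)
```

In this toy setup each score map is constant and equal to its channel index, so the hand score is the mean of channels 1, 3, 5 and 7 and the background score the mean of channels 0, 2, 4 and 6.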

The filter responses have different scales in each layer: the deeper the layer, the smaller the response values (according to empirical studies). This is what justifies the L2 normalization step.

Results

A box is considered positive if the overlap score (IoU) is above 0.5
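
For reference, the overlap criterion can be written as a short function. This is a generic IoU sketch; the `(x0, y0, x1, y1)` corner format and the helper names are assumptions, not taken from the paper.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x0, y0, x1, y1)."""
    ax0, ay0, ax1, ay1 = box_a
    bx0, by0, bx1, by1 = box_b
    # Width and height of the intersection rectangle (0 if boxes are disjoint).
    inter_w = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    inter_h = max(0.0, min(ay1, by1) - max(ay0, by0))
    inter = inter_w * inter_h
    union = (ax1 - ax0) * (ay1 - ay0) + (bx1 - bx0) * (by1 - by0) - inter
    return inter / union if union > 0 else 0.0

def is_positive(pred_box, gt_box, threshold=0.5):
    """A predicted box counts as positive if its IoU with the ground truth exceeds the threshold."""
    return iou(pred_box, gt_box) > threshold

# Two 10x10 boxes shifted by 2 pixels overlap enough; shifted by 5 they do not.
close_match = is_positive((0, 0, 10, 10), (2, 0, 12, 10))
loose_match = is_positive((0, 0, 10, 10), (5, 0, 15, 10))
```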

This architecture obtains state-of-the-art results on the VIVA database and the Oxford Hand Dataset

Notes

RoI pooling layer: performs max-pooling on inputs of non-uniform sizes to obtain fixed-size feature maps.

For this:

It divides the part of the feature map matching the region proposal into a fixed grid of roughly equal-sized sections (for instance 7 in width and 7 in height)
It keeps the largest value in each section as output
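
The two steps above can be sketched for a single-channel feature map as follows. This is a minimal illustration: the function name, the end-exclusive `(x0, y0, x1, y1)` RoI convention and the floor/ceil bin boundaries (which let neighbouring sections share a cell when the RoI size is not divisible by the grid size) are assumptions of this example.

```python
import math

def roi_max_pool(fmap, roi, out_h, out_w):
    """Max-pool the RoI of a single-channel map [H][W] into an out_h x out_w grid.

    roi: (x0, y0, x1, y1) in feature-map cells, end-exclusive,
         assumed non-empty and inside the map.
    """
    x0, y0, x1, y1 = roi
    bin_h = (y1 - y0) / out_h
    bin_w = (x1 - x0) / out_w
    pooled = []
    for i in range(out_h):
        ys = y0 + math.floor(i * bin_h)
        ye = max(ys + 1, y0 + math.ceil((i + 1) * bin_h))
        row = []
        for j in range(out_w):
            xs = x0 + math.floor(j * bin_w)
            xe = max(xs + 1, x0 + math.ceil((j + 1) * bin_w))
            # Keep the largest value found in this section.
            row.append(max(fmap[y][x]
                           for y in range(ys, ye) for x in range(xs, xe)))
        pooled.append(row)
    return pooled

# Toy 4x6 feature map whose value at (row, col) is row * 6 + col.
feature_map = [[r * 6 + c for c in range(6)] for r in range(4)]
full = roi_max_pool(feature_map, (0, 0, 6, 4), 2, 2)
partial = roi_max_pool(feature_map, (1, 0, 6, 4), 2, 2)
```

Both calls return a 2 × 2 grid even though the two RoIs have different widths, which is the point of the layer: regions of arbitrary size are mapped to a fixed-size output that later layers can consume.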