ieee 8014893
[ieee-8014893] Robust Hand Detection and Classification in Vehicles and in the Wild [PDF] [notes]
T. Hoang Ngan Le, Chenchen Zhu, Yutong Zheng, Khoa Luu, Marios Savvides
read 25/09/2017
Addresses hand detection with a ConvNet-based approach: Multiple Scale Region-based Fully Convolutional Networks (MS-RFCN).
All layers are convolutional and computed only once for the entire image.
Winning approach of the 2017 VIVA hand detection challenge (see leaderboard).
Like 1607.07155, this article focuses on multi-scale region proposal and relies on several feature maps of one network, extracted at different scales, to perform hand detection. Unlike 1607.07155, this paper takes a fully convolutional approach to the problem.
- Extracts feature maps at different depths of a ResNet (at 3 different levels). (More precisely, they use conv3_1, conv4_1 and conv4_23, with stride-2 pooling applied to conv3_1 to bring the feature maps to the same size.)
- They apply L2 normalization along the channel axis of each feature map and concatenate the results to generate the feature map for the region proposal.
- They use ResNet's conv2_3, conv3_4 and conv5_3, and apply the same structure as conv5 to the outputs of conv2_3 and conv3_4, which they name conv6 and conv7.
- Features are then concatenated, and a position-sensitive RoI pooling layer generates scores for each RoI, which is classified as background or hand. Scores are produced for k×k locations, giving k×k×(c + 1) outputs, where c is the number of object categories (the + 1 is for background). Each RoI layer conducts selective pooling (taking into account the region proposed by the RPN).
- Channel sizes are reduced using 1x1 convolutional layers.
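The position-sensitive scoring described above can be sketched in plain Python. This is a minimal illustration, not the paper's implementation: the function name `ps_roi_pool`, the channel ordering `(bin_index * (c + 1) + class)`, and the use of average pooling inside each bin followed by an average vote over bins are assumptions borrowed from the usual R-FCN formulation.

```python
import math

def ps_roi_pool(score_maps, roi, k, num_classes):
    """Position-sensitive RoI pooling (sketch).

    score_maps: list of k*k*(num_classes+1) channels, each an H x W list of lists.
    roi: (x0, y0, x1, y1) in feature-map coordinates, end-exclusive,
         assumed to lie inside the map.
    Returns num_classes+1 scores (one per class, plus background).
    """
    x0, y0, x1, y1 = roi
    n_out = num_classes + 1
    bin_w = (x1 - x0) / k
    bin_h = (y1 - y0) / k
    scores = [0.0] * n_out
    for i in range(k):              # bin row
        for j in range(k):          # bin column
            ys = int(math.floor(y0 + i * bin_h))
            ye = max(int(math.ceil(y0 + (i + 1) * bin_h)), ys + 1)
            xs = int(math.floor(x0 + j * bin_w))
            xe = max(int(math.ceil(x0 + (j + 1) * bin_w)), xs + 1)
            for c in range(n_out):
                # Each (bin, class) pair reads from its own dedicated channel.
                channel = score_maps[(i * k + j) * n_out + c]
                vals = [channel[y][x] for y in range(ys, ye) for x in range(xs, xe)]
                scores[c] += sum(vals) / len(vals)
    # Vote: average the k*k bin responses for each class.
    return [s / (k * k) for s in scores]

# Toy input: k = 2, one object class (hand) plus background -> 8 channels of 4x4.
# Channel number `ch` is filled with the constant value `ch`.
toy_maps = [[[float(ch)] * 4 for _ in range(4)] for ch in range(8)]
toy_scores = ps_roi_pool(toy_maps, (0, 0, 4, 4), k=2, num_classes=1)
```

With c = 1 (hand only), the layer consumes k×k×2 score maps and produces two scores per RoI.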
Filter responses have different scales in each layer: according to empirical studies, the deeper the layer, the smaller the responses. This is what justifies the L2 normalization step.
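To make the effect of the normalization concrete, here is a small sketch using nested Python lists (a real implementation would use an array library). The helper names `l2_normalize_channels` and `concat_channels` are hypothetical, and the toy values are made up to mimic a "shallow" map with large responses and a "deep" map with small ones.

```python
import math

def l2_normalize_channels(fmap, eps=1e-12):
    """Scale the channel vector at every (y, x) location to unit L2 norm.

    fmap: list of C channels, each an H x W list of lists.
    """
    C, H, W = len(fmap), len(fmap[0]), len(fmap[0][0])
    out = [[[0.0] * W for _ in range(H)] for _ in range(C)]
    for y in range(H):
        for x in range(W):
            norm = math.sqrt(sum(fmap[c][y][x] ** 2 for c in range(C))) + eps
            for c in range(C):
                out[c][y][x] = fmap[c][y][x] / norm
    return out

def concat_channels(*fmaps):
    """Concatenate feature maps of identical spatial size along the channel axis."""
    return [channel for fmap in fmaps for channel in fmap]

# Two 2-channel, 1x1 toy maps whose responses differ by four orders of magnitude.
shallow = [[[300.0]], [[400.0]]]
deep = [[[0.03]], [[0.04]]]

fused = concat_channels(l2_normalize_channels(shallow),
                        l2_normalize_channels(deep))
# After normalization both maps contribute channel vectors of comparable size.
```

Without the normalization, the concatenated feature would be dominated by whichever layer has the largest responses.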
A box is considered positive if the overlap score (IoU) is above 0.5
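For reference, the overlap criterion can be written as a short sketch. Boxes are assumed to be `(x0, y0, x1, y1)` tuples with `x1 > x0` and `y1 > y0`; the function names are illustrative.

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes (x0, y0, x1, y1)."""
    ax0, ay0, ax1, ay1 = box_a
    bx0, by0, bx1, by1 = box_b
    inter_w = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    inter_h = max(0.0, min(ay1, by1) - max(ay0, by0))
    inter = inter_w * inter_h
    union = (ax1 - ax0) * (ay1 - ay0) + (bx1 - bx0) * (by1 - by0) - inter
    return inter / union if union > 0 else 0.0

def is_positive(box, ground_truth, threshold=0.5):
    """A box counts as positive if its IoU with the ground truth exceeds the threshold."""
    return iou(box, ground_truth) > threshold

gt = (0, 0, 10, 10)
close_box = (2, 0, 12, 10)   # intersection 80, union 120
far_box = (5, 0, 15, 10)     # intersection 50, union 150
```

Here `close_box` passes the 0.5 threshold and `far_box` does not.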
This architecture obtains state-of-the-art results on the VIVA database and the Oxford Hand Dataset.
RoI pooling layer: performs max-pooling on inputs of non-uniform sizes to obtain fixed-size feature maps.
For this:
- it divides the feature map matching the region proposal into equal-sized sections (for instance 7 in width and 7 in height)
- it keeps the largest value in each section as output
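The two steps above can be sketched for a single channel as follows. This is a simplified illustration with integer bin boundaries (floor for the start, ceiling for the end), so neighbouring sections may overlap by one cell when the region size is not divisible by the output size; the name `roi_max_pool` and the end-exclusive `(x0, y0, x1, y1)` convention are assumptions.

```python
def ceil_div(a, b):
    """Integer ceiling division for non-negative a and positive b."""
    return -(-a // b)

def roi_max_pool(fmap, roi, out_h=7, out_w=7):
    """Max-pool the region `roi` of a single-channel map to a fixed out_h x out_w grid.

    fmap: H x W list of lists.
    roi: (x0, y0, x1, y1), end-exclusive, assumed non-empty and inside the map.
    """
    x0, y0, x1, y1 = roi
    h, w = y1 - y0, x1 - x0
    pooled = []
    for i in range(out_h):
        ys = y0 + (i * h) // out_h
        ye = max(y0 + ceil_div((i + 1) * h, out_h), ys + 1)  # at least one row
        row = []
        for j in range(out_w):
            xs = x0 + (j * w) // out_w
            xe = max(x0 + ceil_div((j + 1) * w, out_w), xs + 1)  # at least one column
            # Keep the largest value in this section.
            row.append(max(fmap[y][x] for y in range(ys, ye) for x in range(xs, xe)))
        pooled.append(row)
    return pooled

# 4x4 toy map with values 0..15 in row-major order.
toy = [[4 * r + c for c in range(4)] for r in range(4)]
square = roi_max_pool(toy, (0, 0, 4, 4), out_h=2, out_w=2)   # 4x4 region
narrow = roi_max_pool(toy, (0, 0, 3, 4), out_h=2, out_w=2)   # 3-wide, 4-tall region
```

Both regions produce a 2x2 output even though their sizes differ, which is the point of the layer.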