1605.03688
[arXiv 1605.03688] Going Deeper into First-Person Activity Recognition [project page] [PDF] [notes]
Minghuang Ma, Haoqi Fan, Kris M Kitani
Two conv nets, ObjectNet and ActionNet, trained respectively on object labels and action labels
The outputs of these two networks are then fused (concatenated) to form a joint representation of action, object and activity. There are therefore three labels for this last stage: action, object and activity
First localize and then recognize the object of interest (the object that is being interacted with)
Most often, the object of interest is in the vicinity of the hands, so hand appearance is used to predict the location of the object of interest
A hand segmentation network is trained on images and binary hand masks; it outputs a hand probability map and is trained with a loss defined as the sum of per-pixel two-class softmax losses
The network is then fine-tuned to produce a pixel-level object occurrence probability map (2D Gaussian distribution as ground truth, per-pixel Euclidean loss)
Final object region prediction: the network is run on the image sequence to generate object heatmap predictions, then the probability map is thresholded and the centroid of the largest blob is used as the predicted center of the object. The object is then cropped out of the raw image using a fixed-size centered bounding box.
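A minimal sketch of the two per-pixel losses described above, assuming PyTorch; tensor shapes, sigma and the example coordinates are illustrative assumptions, not the paper's exact values:

```python
import torch
import torch.nn.functional as F

# Hand segmentation: per-pixel two-class softmax (cross-entropy) loss.
# logits: (N, 2, H, W) network output, hand_mask: (N, H, W) binary ground truth.
logits = torch.randn(4, 2, 64, 64)
hand_mask = torch.randint(0, 2, (4, 64, 64))
seg_loss = F.cross_entropy(logits, hand_mask)

# Object localization: per-pixel Euclidean loss against a 2D Gaussian
# ground-truth heatmap centered on the annotated object location.
def gaussian_heatmap(h, w, cx, cy, sigma=5.0):
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    return torch.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))

pred_heatmap = torch.rand(4, 1, 64, 64)
gt_heatmap = gaussian_heatmap(64, 64, cx=30, cy=20).expand(4, 1, 64, 64)
loc_loss = F.mse_loss(pred_heatmap, gt_heatmap)
```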
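A rough sketch of turning a predicted heatmap into the fixed-size crop, assuming numpy/scipy; the threshold value, crop size and fallback behaviour are assumptions:

```python
import numpy as np
from scipy import ndimage

def crop_object(frame, heatmap, threshold=0.5, crop_size=256):
    """Threshold the heatmap, take the centroid of the largest blob,
    and cut a fixed-size box around it out of the raw frame."""
    binary = heatmap > threshold
    labels, num = ndimage.label(binary)
    if num == 0:  # no blob found: fall back to the image center (assumption)
        cy, cx = np.array(frame.shape[:2]) // 2
    else:
        sizes = ndimage.sum(binary, labels, range(1, num + 1))
        largest = int(np.argmax(sizes)) + 1
        cy, cx = ndimage.center_of_mass(binary, labels, largest)
    half = crop_size // 2
    y0 = int(np.clip(cy - half, 0, frame.shape[0] - crop_size))
    x0 = int(np.clip(cx - half, 0, frame.shape[1] - crop_size))
    return frame[y0:y0 + crop_size, x0:x0 + crop_size]
```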
The sequence of crops is the input to the object recognition CNN.
Base CNN model: CNN-M-2048
Trained on object labels, with softmax as loss function
At test time, choose the label with the largest mean score over the sequence of frames
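The test-time rule (average the per-frame scores, then take the argmax) might look like this, assuming per-frame softmax scores stacked in a numpy array; the frame and class counts are placeholders:

```python
import numpy as np

# frame_scores: (num_frames, num_classes) softmax outputs of ObjectNet
# for every cropped frame in one sequence (random values as placeholders).
frame_scores = np.random.rand(30, 40)
predicted_object = int(frame_scores.mean(axis=0).argmax())
```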
Background motion is often a good approximation of head motion, and might thus be useful to keep in order to help recognize actions
Input:
- optical flow of consecutive frames and encode horizontal and vertical flow separately
- several consecutive optical flow images are stacked as one input sample of the network; in practice, 10 optical flow images are stacked (see the sketch below)
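A minimal sketch of how such a stacked flow input could be assembled, assuming dense flow fields of shape (H, W, 2) per frame pair; the channel ordering is an assumption:

```python
import numpy as np

def stack_flow(flows):
    """flows: list of 10 optical flow fields, each (H, W, 2) with
    horizontal (x) and vertical (y) components.
    Returns a (20, H, W) array: x and y channels kept separate."""
    channels = []
    for flow in flows:
        channels.append(flow[..., 0])  # horizontal flow
        channels.append(flow[..., 1])  # vertical flow
    return np.stack(channels, axis=0)

# Example: 10 dummy flow fields for a 224x224 input.
sample = stack_flow([np.zeros((224, 224, 2)) for _ in range(10)])
assert sample.shape == (20, 224, 224)
```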
Train on action labels with softmax as loss function.
Over all frames in a sequence, the action class with the maximum average score is picked as the predicted label of the action
Concatenate the action and object networks into one network at the second-to-last fully connected layer + add a fully connected layer on top.
- another layer for activity on top. There are therefore three weighted losses, for the action, object and activity labels.
Train by transferring weights and finetuning on activity recognition
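A hedged PyTorch sketch of this late fusion: features from the two streams are concatenated at the second-to-last fully connected layer, an extra fully connected layer is added, and three softmax losses are combined with weights; layer sizes, class counts and names are assumptions, not the paper's exact architecture:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusionNet(nn.Module):
    def __init__(self, feat_dim=2048, n_action=10, n_object=40, n_activity=60):
        super().__init__()
        # Fusion layer over the concatenated ObjectNet/ActionNet features,
        # plus one classification head per label type.
        self.fc_fuse = nn.Linear(2 * feat_dim, feat_dim)
        self.fc_action = nn.Linear(feat_dim, n_action)
        self.fc_object = nn.Linear(feat_dim, n_object)
        self.fc_activity = nn.Linear(feat_dim, n_activity)

    def forward(self, object_feat, action_feat):
        fused = F.relu(self.fc_fuse(torch.cat([object_feat, action_feat], dim=1)))
        return self.fc_action(fused), self.fc_object(fused), self.fc_activity(fused)

def fusion_loss(action_logits, object_logits, activity_logits,
                action_gt, object_gt, activity_gt,
                w_action=0.2, w_object=0.2, w_activity=1.0):
    # Weighted sum of the three softmax (cross-entropy) losses,
    # using the 0.2 / 0.2 / 1.0 weights mentioned in these notes.
    return (w_action * F.cross_entropy(action_logits, action_gt)
            + w_object * F.cross_entropy(object_logits, object_gt)
            + w_activity * F.cross_entropy(activity_logits, activity_gt))
```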
Evaluated on GTEA, GTEA Gaze (Gaze) and GTEA Gaze+ (Gaze+).
Leave-one-subject-out cross-validation
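The leave-one-subject-out protocol could be sketched as follows; the subject-to-sequences mapping and the train/evaluate helpers are hypothetical placeholders:

```python
def leave_one_subject_out(samples_by_subject, train_fn, eval_fn):
    """samples_by_subject: dict mapping subject id -> list of sequences.
    Train on all subjects but one, test on the held-out subject,
    and average accuracy over all folds."""
    accuracies = []
    for held_out in samples_by_subject:
        train = [s for subj, seqs in samples_by_subject.items()
                 if subj != held_out for s in seqs]
        model = train_fn(train)  # hypothetical training helper
        accuracies.append(eval_fn(model, samples_by_subject[held_out]))
    return sum(accuracies) / len(accuracies)
```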
One neuron in the object-of-interest localization network is specialized on hand appearance
Fine-tuning with loss weights of 0.2 for action and object and 1 for activity also boosts the performance of the action and object networks