1605.03688
[arXiv 1605.03688] Going Deeper into First-Person Activity Recognition [project page] [PDF] [notes]
Minghuang Ma, Haoqi Fan, Kris M Kitani
Two conv nets, ObjectNet and ActionNet, trained respectively on object labels and action labels
The outputs of these two networks are then fused (concatenated) to form a joint representation of action, object and activity. There are therefore three labels for this last stage: action, object and activity
First localize and then recognize the object of interest (the object that is being interacted with)
Most often, the object of interest is in the vicinity of the hands, so hand appearance is used to predict the location of the object of interest
A hand segmentation network is trained on images and binary hand masks; it outputs a hand probability map and is trained with a loss defined as the sum of per-pixel two-class softmax losses
The network is then fine-tuned to produce a pixel-level object occurrence probability map (2D Gaussian distribution as ground truth, per-pixel Euclidean loss)
Final object region prediction: the network is run on the image sequence to generate object heatmap predictions, then the probability map is thresholded and the centroid of the largest blob is used as the predicted center of the object. The object is then cropped out of the raw image using a fixed-size centered bounding box.
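A minimal sketch of the two per-pixel losses described above, assuming PyTorch; tensor shapes, sigma and the example coordinates are illustrative assumptions, not the paper's exact values:

```python
import torch
import torch.nn.functional as F

# Hand segmentation: per-pixel two-class softmax (cross-entropy) loss.
# logits: (N, 2, H, W) network output, hand_mask: (N, H, W) binary ground truth.
logits = torch.randn(4, 2, 64, 64)
hand_mask = torch.randint(0, 2, (4, 64, 64))
seg_loss = F.cross_entropy(logits, hand_mask)

# Object localization: per-pixel Euclidean loss against a 2D Gaussian
# ground-truth heatmap centered on the annotated object location.
def gaussian_heatmap(h, w, cx, cy, sigma=5.0):
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    return torch.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))

pred_heatmap = torch.rand(4, 1, 64, 64)
gt_heatmap = gaussian_heatmap(64, 64, cx=30, cy=20).expand(4, 1, 64, 64)
loc_loss = F.mse_loss(pred_heatmap, gt_heatmap)
```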
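A rough sketch of turning a predicted heatmap into the fixed-size crop, assuming numpy/scipy; the threshold value, crop size and fallback behaviour are assumptions:

```python
import numpy as np
from scipy import ndimage

def crop_object(frame, heatmap, threshold=0.5, crop_size=256):
    """Threshold the heatmap, take the centroid of the largest blob,
    and cut a fixed-size box around it out of the raw frame."""
    binary = heatmap > threshold
    labels, num = ndimage.label(binary)
    if num == 0:  # no blob found: fall back to the image center (assumption)
        cy, cx = np.array(frame.shape[:2]) // 2
    else:
        sizes = ndimage.sum(binary, labels, range(1, num + 1))
        largest = int(np.argmax(sizes)) + 1
        cy, cx = ndimage.center_of_mass(binary, labels, largest)
    half = crop_size // 2
    y0 = int(np.clip(cy - half, 0, frame.shape[0] - crop_size))
    x0 = int(np.clip(cx - half, 0, frame.shape[1] - crop_size))
    return frame[y0:y0 + crop_size, x0:x0 + crop_size]
```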
The sequence of crops is the input to the object recognition CNN.
Base CNN model: CNN-M-2048
Trained on object labels, with softmax as loss function
At test time, choose the label with the largest mean score over the sequence of frames
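The test-time rule (average the per-frame scores, then take the argmax) might look like this, assuming per-frame softmax scores stacked in a numpy array; the frame and class counts are placeholders:

```python
import numpy as np

# frame_scores: (num_frames, num_classes) softmax outputs of ObjectNet
# for every cropped frame in one sequence (random values as placeholders).
frame_scores = np.random.rand(30, 40)
predicted_object = int(frame_scores.mean(axis=0).argmax())
```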
Background motion is often a good approximation of head motion, and might thus be useful to keep in order to help recognize actions
Input:
- optical flow of consecutive frames and encode horizontal and vertical flow separately
- several consecutive optical flow images are stacked as one input sample of the network; in practice, 10 optical flow images are stacked (see the sketch below)
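A minimal sketch of how such a stacked flow input could be assembled, assuming dense flow fields of shape (H, W, 2) per frame pair; the channel ordering is an assumption:

```python
import numpy as np

def stack_flow(flows):
    """flows: list of 10 optical flow fields, each (H, W, 2) with
    horizontal (x) and vertical (y) components.
    Returns a (20, H, W) array: x and y channels kept separate."""
    channels = []
    for flow in flows:
        channels.append(flow[..., 0])  # horizontal flow
        channels.append(flow[..., 1])  # vertical flow
    return np.stack(channels, axis=0)

# Example: 10 dummy flow fields for a 224x224 input.
sample = stack_flow([np.zeros((224, 224, 2)) for _ in range(10)])
assert sample.shape == (20, 224, 224)
```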
Train on action labels with softmax as loss function.
Over all frames in a sequence, the action class with the maximum average score is picked as the predicted label of the action
Concatenate the action and object networks into one network at the second-to-last fully connected layer + add a fully connected layer on top.
- another layer for activity on top. There are therefore three weighted losses, for the action, object and activity labels.
Train by transferring weights and finetuning on activity recognition
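A hedged PyTorch sketch of this late fusion: features from the two streams are concatenated at the second-to-last fully connected layer, an extra fully connected layer is added, and three softmax losses are combined with weights; layer sizes, class counts and names are assumptions, not the paper's exact architecture:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusionNet(nn.Module):
    def __init__(self, feat_dim=2048, n_action=10, n_object=40, n_activity=60):
        super().__init__()
        # Fusion layer over the concatenated ObjectNet/ActionNet features,
        # plus one classification head per label type.
        self.fc_fuse = nn.Linear(2 * feat_dim, feat_dim)
        self.fc_action = nn.Linear(feat_dim, n_action)
        self.fc_object = nn.Linear(feat_dim, n_object)
        self.fc_activity = nn.Linear(feat_dim, n_activity)

    def forward(self, object_feat, action_feat):
        fused = F.relu(self.fc_fuse(torch.cat([object_feat, action_feat], dim=1)))
        return self.fc_action(fused), self.fc_object(fused), self.fc_activity(fused)

def fusion_loss(action_logits, object_logits, activity_logits,
                action_gt, object_gt, activity_gt,
                w_action=0.2, w_object=0.2, w_activity=1.0):
    # Weighted sum of the three softmax (cross-entropy) losses,
    # using the 0.2 / 0.2 / 1.0 weights mentioned in these notes.
    return (w_action * F.cross_entropy(action_logits, action_gt)
            + w_object * F.cross_entropy(object_logits, object_gt)
            + w_activity * F.cross_entropy(activity_logits, activity_gt))
```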
Evaluated on GTEA, GTEA Gaze (Gaze) and GTEA Gaze+ (Gaze+).
Leave-one-subject-out cross-validation
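The leave-one-subject-out protocol could be sketched as follows; the subject-to-sequences mapping and the train/evaluate helpers are hypothetical placeholders:

```python
def leave_one_subject_out(samples_by_subject, train_fn, eval_fn):
    """samples_by_subject: dict mapping subject id -> list of sequences.
    Train on all subjects but one, test on the held-out subject,
    and average accuracy over all folds."""
    accuracies = []
    for held_out in samples_by_subject:
        train = [s for subj, seqs in samples_by_subject.items()
                 if subj != held_out for s in seqs]
        model = train_fn(train)  # hypothetical training helper
        accuracies.append(eval_fn(model, samples_by_subject[held_out]))
    return sum(accuracies) / len(accuracies)
```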
One neuron in the object-of-interest localization network is specialized on hand appearance
Fine-tuning with loss weights of 0.2 for action and object and 1 for activity also boosts the performance of the action and object networks