
1605.03688


CVPR 2016

[Arxiv 1605.03688] Going Deeper into First-Person Activity Recognition [project page] [PDF] [notes]

Minghuang Ma, Haoqi Fan, Kris M Kitani

Synthesis

Pipeline

Two conv nets, ObjectNet and ActionNet, trained respectively on object labels and action labels

The outputs of these two networks are then fused (concatenated) to form a joint representation of action, object and activity. There are therefore three labels at this last stage: activity, action, object

ObjectNet

First localize and then recognize the object of interest (the object that is being interacted with)

Most often, the object of interest is in the vicinity of the hand, so hand appearance is used to predict the location of the object of interest

Hand Segmentation

A hand segmentation network is trained on images and binary hand masks; it outputs a hand probability map and is trained with a per-pixel two-class softmax loss summed over all pixels
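A minimal sketch of this loss, assuming a PyTorch-style fully convolutional segmentation network; the tensor shapes and variable names are illustrative, not from the paper's code:

```python
import torch
import torch.nn.functional as F

def hand_seg_loss(logits, hand_mask):
    """Per-pixel two-class softmax loss, summed over all pixels.

    logits:    (B, 2, H, W) raw scores for background / hand at each pixel
    hand_mask: (B, H, W) binary ground-truth hand mask (0 = background, 1 = hand)
    """
    # cross_entropy applies a softmax per pixel; reduction='sum' sums the per-pixel losses
    return F.cross_entropy(logits, hand_mask.long(), reduction='sum')

# The hand probability map used downstream would be the softmax of the "hand" channel:
# prob_map = torch.softmax(logits, dim=1)[:, 1]   # (B, H, W)
```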

Object Localization

Fine-tune to produce a pixel-level object occurrence probability map (2D Gaussian distribution as ground truth, per-pixel Euclidean loss)

Final object region prediction: run the network on the image sequence to generate object heatmap predictions, then threshold the probability map and take the centroid of the largest blob as the predicted center of the object. Finally, crop the object out of the raw image using a fixed-size centered bounding box.
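A sketch of the two pieces described above: building the 2D Gaussian ground-truth heatmap for the Euclidean loss, and turning a predicted probability map into a fixed-size crop. NumPy/SciPy only; the threshold, Gaussian width and crop size are assumed values, not taken from the paper:

```python
import numpy as np
from scipy import ndimage

def gaussian_heatmap(h, w, cx, cy, sigma=10.0):
    """2D Gaussian centred on the annotated object location
    (ground truth for the per-pixel Euclidean loss)."""
    ys, xs = np.mgrid[0:h, 0:w]
    return np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))

def crop_object(frame, prob_map, thresh=0.5, crop=128):
    """Threshold the predicted map, take the centroid of the largest blob,
    and crop a fixed-size box around it."""
    blobs, n = ndimage.label(prob_map > thresh)
    if n == 0:
        # no confident region: fall back to the image centre
        cy, cx = prob_map.shape[0] // 2, prob_map.shape[1] // 2
    else:
        sizes = np.bincount(blobs.ravel())[1:]          # blob sizes, ignoring background
        largest = int(np.argmax(sizes)) + 1
        cy, cx = ndimage.center_of_mass(blobs == largest)
    top = int(np.clip(cy - crop // 2, 0, frame.shape[0] - crop))
    left = int(np.clip(cx - crop // 2, 0, frame.shape[1] - crop))
    return frame[top:top + crop, left:left + crop]
```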

Object recognition CNN

The sequence of crops is the input to the object recognition CNN.

Base CNN model: CNN-M-2048

Trained on object labels, with softmax as loss function

At test time, choose the label with the largest mean score over the sequence of frames
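A sketch of this test-time aggregation, assuming per-frame class scores (e.g. softmax outputs) are already available as a (num_frames, num_classes) array; the names are illustrative:

```python
import numpy as np

def sequence_label(frame_scores):
    """frame_scores: (num_frames, num_classes) per-frame class scores.
    Returns the class whose mean score over the sequence is largest."""
    return int(np.argmax(frame_scores.mean(axis=0)))
```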

ActionNet

Background motion is often a good approximation of head motion, and might thus be useful to keep in order to help recognize actions

Input:

  • optical flow of consecutive frames, encoding horizontal and vertical flow separately
  • several consecutive optical flow images are stacked as one input sample of the network; in practice, 10 optical flow images are stacked (see the sketch after this list)
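A sketch of how the stacked-flow input could be assembled, assuming per-frame horizontal/vertical flow fields are available as NumPy arrays; the stack length of 10 matches the note above, everything else is illustrative:

```python
import numpy as np

def stack_flow(flow_x, flow_y, start, length=10):
    """flow_x, flow_y: lists of (H, W) horizontal / vertical flow images.
    Returns a (2 * length, H, W) input volume: horizontal and vertical flow
    of `length` consecutive frames, kept in separate channels."""
    channels = []
    for t in range(start, start + length):
        channels.append(flow_x[t])
        channels.append(flow_y[t])
    return np.stack(channels, axis=0)   # shape (20, H, W) for length=10
```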

Train on action labels with softmax as loss function.

Over all frames in the sequence, pick the action class with the maximum average score as the predicted action label

Fusion

Concatenate the action and object networks into one network at the second-to-last fully connected layer and add a fully connected layer on top.

  • another fully connected layer for activity is added on top. There are therefore three weighted losses, for the action, activity and object labels (see the sketch below).
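A minimal PyTorch-style sketch of the fusion head and the three weighted losses; the layer sizes, class counts and feature dimension are assumptions for illustration, and the default loss weights (0.2 / 0.2 / 1.0) are the ones mentioned in the Results section below:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusionHead(nn.Module):
    def __init__(self, feat_dim=2048, n_action=10, n_object=20, n_activity=40):
        super().__init__()
        # fused representation built from the concatenated second-to-last FC features
        self.fc_fuse = nn.Linear(2 * feat_dim, feat_dim)
        self.action_cls = nn.Linear(feat_dim, n_action)      # on the ActionNet stream
        self.object_cls = nn.Linear(feat_dim, n_object)      # on the ObjectNet stream
        self.activity_cls = nn.Linear(feat_dim, n_activity)  # on the fused features

    def forward(self, action_feat, object_feat):
        fused = F.relu(self.fc_fuse(torch.cat([action_feat, object_feat], dim=1)))
        return (self.action_cls(action_feat),
                self.object_cls(object_feat),
                self.activity_cls(fused))

def fusion_loss(outputs, labels, w_action=0.2, w_object=0.2, w_activity=1.0):
    """Weighted sum of the three softmax (cross-entropy) losses."""
    act_out, obj_out, actv_out = outputs
    act_lbl, obj_lbl, actv_lbl = labels
    return (w_action * F.cross_entropy(act_out, act_lbl)
            + w_object * F.cross_entropy(obj_out, obj_lbl)
            + w_activity * F.cross_entropy(actv_out, actv_lbl))
```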

Train by transferring weights and finetuning on activity recognition

Evaluation

On GTEA, GTEA Gaze (Gaze) and GTEA Gaze+ (Gaze+).

Leave-one-subject-out cross-validation
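A sketch of the leave-one-subject-out protocol, where `train_and_evaluate` is a hypothetical callable and samples are assumed to be grouped by subject id:

```python
def leave_one_subject_out(samples_by_subject, train_and_evaluate):
    """samples_by_subject: dict mapping subject id -> list of samples.
    train_and_evaluate: hypothetical callable(train_samples, test_samples) -> accuracy."""
    scores = {}
    for held_out, test_samples in samples_by_subject.items():
        # train on every subject except the held-out one
        train_samples = [s for subj, group in samples_by_subject.items()
                         if subj != held_out for s in group]
        scores[held_out] = train_and_evaluate(train_samples, test_samples)
    # report the mean accuracy over held-out subjects
    return sum(scores.values()) / len(scores)
```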

Results

One neuron for object-of-interest localization is specialized for hand appearance

Fine-tuning with loss weights of 0.2 for action and object and 1 for activity also boosts the performance of the action and object networks
