
IEEE 7298625


CVPR 2015

[IEEE 7298625] Delving into Egocentric Actions [project page] [PDF] [notes]

Yin Li, Zhefan Ye, James M. Rehg

read 19/07/2017

Objective

Evaluate the performance of hand-crafted features for action recognition in first-person view

Provides baselines on GTEA, GTEA Gaze and GTEA Gaze+ for the individual features (measured as action recognition accuracy) and their combinations

Synthesis

Traditional hand-crafted spatio-temporal features such as STIP perform poorly because of camera motion

  • Removing camera motion allows local features to perform well

  • but camera motion is an action cue

Features

Motion features

  • Trajectory features
  • Histogram of Flow
  • Motion boundary histogram (gradient of optical flow in horizontal and vertical directions); see the sketch below
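
Rough sketch (assuming OpenCV and NumPy, with illustrative parameter choices) of how a motion boundary histogram can be computed from dense optical flow: the horizontal and vertical flow components are differentiated spatially and their gradient orientations are binned into a histogram.

```python
# Minimal sketch: motion boundary histogram (MBH) from dense optical flow.
# Assumes OpenCV and NumPy; bin count and normalization are illustrative choices.
import cv2
import numpy as np

def motion_boundary_histogram(prev_gray, curr_gray, n_bins=8):
    # Dense optical flow between two consecutive grayscale frames
    flow = cv2.calcOpticalFlowFarneback(prev_gray, curr_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    histograms = []
    for c in range(2):  # 0: horizontal flow, 1: vertical flow
        # Spatial gradients of this flow component (the "motion boundaries")
        gx = cv2.Sobel(flow[..., c], cv2.CV_32F, 1, 0)
        gy = cv2.Sobel(flow[..., c], cv2.CV_32F, 0, 1)
        mag, ang = cv2.cartToPolar(gx, gy)
        # Orientation histogram weighted by gradient magnitude
        hist, _ = np.histogram(ang, bins=n_bins, range=(0, 2 * np.pi),
                               weights=mag)
        histograms.append(hist / (hist.sum() + 1e-8))
    # MBHx and MBHy concatenated
    return np.concatenate(histograms)
```

A histogram of flow (HOF) can be built the same way by binning the flow vectors themselves instead of their spatial gradients.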

Object features

  • Histogram of Oriented Gradients (HOG), which encodes 2D image boundaries
  • Local Binary Patterns (compare the central pixel to its neighbors, encode each comparison as 1 or 0 for above or below, then build a histogram of the codes; see the sketch below)
  • Histogram of LAB color (L for lightness, a and b for the green-red and blue-yellow color opponents)
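
Small sketch (NumPy only, 8-neighbour version, illustrative) of the LBP idea: compare each pixel to its 8 neighbours, pack the comparisons into a byte, and histogram the resulting codes.

```python
# Minimal sketch of 8-neighbour local binary patterns (LBP); NumPy only.
import numpy as np

def lbp_histogram(gray):
    gray = gray.astype(np.float32)
    center = gray[1:-1, 1:-1]
    # 8 neighbours in clockwise order, each contributing one bit
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    codes = np.zeros(center.shape, dtype=int)
    for bit, (dy, dx) in enumerate(offsets):
        neighbour = gray[1 + dy:gray.shape[0] - 1 + dy,
                         1 + dx:gray.shape[1] - 1 + dx]
        # Bit set when the neighbour is >= the central pixel
        codes = codes + ((neighbour >= center).astype(int) << bit)
    # Histogram of the 256 possible codes, normalized
    hist = np.bincount(codes.ravel(), minlength=256).astype(np.float32)
    return hist / (hist.sum() + 1e-8)
```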

Egocentric features

  • Hand feature : manipulation point (point where the person is most likely to be manipulating an object), obtained from hand segmentation

  • Head feature : corresponds to camera motion

  • Gaze direction : 2D image point on each frame

Feature Engineering

  • Removing camera motion : subtract the estimated camera motion from the dense optical flow (see the sketch after this list). This produces better motion features and selects trajectories on foreground regions that move differently from the camera motion

  • Trajectory selection : use local descriptors in the vicinity of the manipulation and gaze points
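
Rough sketch of the camera-motion removal idea (illustrative only, not the paper's exact pipeline): fit a homography to keypoint matches between consecutive frames, derive the camera-induced flow from it, and subtract it from the dense optical flow so that only foreground motion remains.

```python
# Rough sketch: subtract camera-induced flow from dense optical flow.
# Assumes OpenCV and NumPy; keypoint/matching choices are illustrative.
import cv2
import numpy as np

def foreground_flow(prev_gray, curr_gray):
    # Dense optical flow (camera motion + object/hand motion)
    flow = cv2.calcOpticalFlowFarneback(prev_gray, curr_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)

    # Estimate global camera motion with a homography fitted to ORB
    # keypoint matches (RANSAC downweights moving foreground points)
    orb = cv2.ORB_create()
    kp1, des1 = orb.detectAndCompute(prev_gray, None)
    kp2, des2 = orb.detectAndCompute(curr_gray, None)
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = matcher.match(des1, des2)
    src = np.float32([kp1[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
    dst = np.float32([kp2[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)
    H, _ = cv2.findHomography(src, dst, cv2.RANSAC, 3.0)

    # Camera-induced flow: where the homography moves each pixel, minus the pixel
    h, w = prev_gray.shape
    ys, xs = np.mgrid[0:h, 0:w].astype(np.float32)
    pts = np.stack([xs.ravel(), ys.ravel()], axis=1).reshape(-1, 1, 2)
    warped = cv2.perspectiveTransform(pts, H).reshape(h, w, 2)
    camera_flow = warped - np.dstack([xs, ys])

    # Residual flow: foreground regions that move differently from the camera
    return flow - camera_flow
```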

Implementation

Extract a set of local descriptors (HOG, LAB, LBP, ...) aggregated along trajectories

Each trajectory's space-time volume is divided into a 2x2x3 grid of cells, and the feature histograms of the cells are concatenated
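
A small pooling sketch (NumPy; the per-pixel histogram volume is a hypothetical input layout) of the 2x2x3 grid aggregation described above.

```python
# Minimal sketch: pool per-pixel feature histograms over a 2x2x3
# space-time grid around a trajectory and concatenate the cells.
# Assumes NumPy; the input layout is a hypothetical choice.
import numpy as np

def grid_pool(volume):
    """volume: (H, W, T, B) array of per-pixel histograms (B bins)
    around one trajectory; returns a 2*2*3*B descriptor."""
    H, W, T, B = volume.shape
    cells = []
    for ys in np.array_split(np.arange(H), 2):          # 2 splits in height
        for xs in np.array_split(np.arange(W), 2):      # 2 splits in width
            for ts in np.array_split(np.arange(T), 3):  # 3 temporal splits
                cell = volume[np.ix_(ys, xs, ts)]
                hist = cell.sum(axis=(0, 1, 2))
                cells.append(hist / (hist.sum() + 1e-8))
    return np.concatenate(cells)
```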

Encode descriptors using the Improved Fisher Vector (deviations from the means and variances of a Gaussian mixture model)
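
Simplified sketch of improved Fisher vector encoding (assuming scikit-learn's GaussianMixture; the paper's implementation may differ): local descriptors are soft-assigned to GMM components, first- and second-order deviations are stacked per component, then power- and L2-normalized.

```python
# Simplified sketch of improved Fisher vector encoding with a diagonal GMM.
# Assumes scikit-learn and NumPy; the number of components is illustrative.
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_gmm(train_descriptors, n_components=64):
    # Diagonal-covariance GMM fitted on a training pool of local descriptors
    gmm = GaussianMixture(n_components=n_components, covariance_type='diag')
    gmm.fit(train_descriptors)
    return gmm

def fisher_vector(descriptors, gmm):
    """Encode a (T, D) set of local descriptors into a 2*K*D Fisher vector."""
    T, D = descriptors.shape
    q = gmm.predict_proba(descriptors)          # (T, K) soft assignments
    mu = gmm.means_                             # (K, D)
    sigma = np.sqrt(gmm.covariances_)           # (K, D) diagonal std devs
    w = gmm.weights_                            # (K,)

    # Normalized deviations of descriptors from each component mean
    diff = (descriptors[:, None, :] - mu[None, :, :]) / sigma[None, :, :]

    # First- and second-order statistics (gradients w.r.t. means and variances)
    g_mu = np.einsum('tk,tkd->kd', q, diff) / (T * np.sqrt(w)[:, None])
    g_sigma = np.einsum('tk,tkd->kd', q, diff ** 2 - 1.0) / (T * np.sqrt(2 * w)[:, None])

    fv = np.concatenate([g_mu.ravel(), g_sigma.ravel()])
    # "Improved" FV: signed square root then L2 normalization
    fv = np.sign(fv) * np.sqrt(np.abs(fv))
    return fv / (np.linalg.norm(fv) + 1e-8)
```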

Results

Using object + motion + egocentric + manipulation point trajectory cues produces the best results

Obtained accuracies

  • GTEA Gaze+ : 60%
  • GTEA with 17 or 61 classes : 60%
  • GTEA Gaze with 25 classes : 60%, with 40 : 40%