[arXiv 1702.02738] Joint Discovery of Object States and Manipulation Actions [project page] [PDF] [code (matlab)] [notes]
Jean-Baptiste Alayrac, Josef Sivic, Ivan Laptev, Simon Lacoste-Julien
read 2019/07/10
Propose to automatically discover object states and the manipulation actions that modify these states. Introduce a dataset of real-life object manipulation videos.
- two distinct object states are temporally separated by a manipulation action
- cluster:
  - groups object states with similar appearance and consistent temporal locations with respect to the action
  - finds similar manipulation actions separating those object states
- only one object is manipulated at a time
- get several videos of manipulation of the same object
- a priori model of the corresponding object in the form of a pre-trained object detector
- jointly cluster the appearance of the object into two classes (states), while localizing a consistent action that causes the state transition
- localize the temporal extent of the action
- spatially/temporally localize the manipulated object and identify its states over time.
- rely on pre-trained object detectors to spatially localize manipulated objects
- temporally identify individual states
- each clip n is accompanied by a set of M_n tracklets (less than 1 sec each, hoping that each tracklet covers only one state of the object) of the object of interest
- The object states of video n are described by a binary matrix Y_n of size M_n x 2, where (Y_n)_{m,k} = 1 indicates that tracklet m shows the object in state k; potentially (Y_n)_{m,1} = 0 and (Y_n)_{m,2} = 0, i.e. the tracklet shows an object that is in neither state 1 nor state 2 (this can model false detections of the object)
- tracklets are represented by d_s-dimensional features (d_s for dim_state); X_s is an M x d_s matrix that contains the features of all the tracklets
- W_s is the object state classifier learnt on top of these features to classify each tracklet as state 1, state 2, or neither
- Constraints (a minimal sketch of this clustering setup follows the list)
  - Only one tracklet is in state 1 or 2 at a specific time (only one object manipulated)
  - Both states are present and state 1 appears before state 2 (ordering constraint)
  - At least one tracklet is labeled as state 1 and one as state 2 (at-least-one constraint)
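A minimal sketch of how the state side might look. The notes only say that W_s is learnt on top of the tracklet features, so the DIFFRAC-style square-loss discriminative clustering cost below is an assumption; the function names, `lam`, and the constraint checks are illustrative readings of the bullets above.

```python
import numpy as np

def state_clustering_cost(X_s, Y, lam=1e-3):
    """Cost of a candidate state assignment Y (M x 2, binary) for tracklet
    features X_s (M x d_s), assuming a square-loss discriminative clustering
    objective min_W ||X_s W - Y||^2 + lam ||W||^2; the inner minimization
    over the state classifier W_s has a closed form (ridge regression)."""
    M, d_s = X_s.shape
    W_s = np.linalg.solve(X_s.T @ X_s + lam * np.eye(d_s), X_s.T @ Y)
    return np.linalg.norm(X_s @ W_s - Y) ** 2 + lam * np.linalg.norm(W_s) ** 2

def satisfies_state_constraints(Y, times):
    """One reading of the constraints above, for a single clip.
    Y: (M_n, 2) binary state assignment, times: (M_n,) array of tracklet times."""
    s1, s2 = times[Y[:, 0] == 1], times[Y[:, 1] == 1]
    if len(s1) == 0 or len(s2) == 0:               # at least one tracklet per state
        return False
    if s1.max() >= s2.min():                       # state 1 appears before state 2
        return False
    labelled = times[Y.sum(axis=1) >= 1]
    if len(labelled) != len(np.unique(labelled)):  # one labelled tracklet per time step
        return False
    return True
```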
The action is expressed as an assignment vector Z_n in {0, 1}^{T_n} for each clip n (T_n time steps).
(Z_n)_t = 1 means the action is present at time step t, whereas (Z_n)_t = 0 means the action is absent.
W_v is the action classifier.
- Constraints
  - Exactly one time interval for an action per clip (action saliency constraint)
  - Enforce that the action occurs between the two object states
  - Penalize state 1 being detected after the action
  - Penalize state 2 being detected before the action
- Links the times of the tracklets to the time of the action (note that time did not enter the object state loss so far)
$ d(Z_n, Y_n) = \sum_{y \in S_1(Y_n)} [t_{Z_n} - t_y]_{-} + \sum_{y \in S_2(Y_n)} [t_{Z_n} - t_y]_{+} $
where [x]_{+} = max(x, 0) is the positive part, [x]_{-} = max(-x, 0) is the negative part, t_y is the time of tracklet y, t_{Z_n} is the time of the action, and S_k(Y_n) is the set of tracklets assigned to state k.
"The function penalizes the inconsistent assignment of objects states
The problem is NP-hard in general. To relax it, they allow values in [0, 1] instead of {0, 1}.
- run object detector for each of the involved objects
- subdivide each track into small tracklets
- fine-tune the object detector (Fast-RCNN) for the specific object classes (~500 labeled images per class)
- track objects using generic object tracker
- average object features from ROI pooling over each tracklet (8k-dimensional)
- motion features: 2k-dimensional bag-of-words of HOF (histograms of optical flow)
- appearance features: 1k-dimensional bag-of-words of conv5 features from VGG16
- giving a 3k-dimensional feature vector for each chunk of 10 frames (see the sketch after this list)
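A sketch of how the per-chunk action descriptor could be assembled, assuming the bag-of-words encodings are simple hard-assignment histograms over pre-computed codebooks; the codebook sizes 2000 and 1000 come from the dimensions quoted above, everything else (names, normalization) is illustrative.

```python
import numpy as np

def bow_histogram(local_descriptors, codebook):
    """Hard-assignment bag-of-words: quantize each local descriptor to its
    nearest codeword and return an L1-normalized histogram of codeword counts."""
    d2 = ((local_descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    words = d2.argmin(axis=1)
    hist = np.bincount(words, minlength=len(codebook)).astype(float)
    return hist / max(hist.sum(), 1.0)

def chunk_descriptor(hof_descs, conv5_descs, hof_codebook, conv5_codebook):
    """3k-dim descriptor for a 10-frame chunk: 2k HOF BoW + 1k conv5 BoW."""
    motion = bow_histogram(hof_descs, hof_codebook)            # 2000-dim
    appearance = bow_histogram(conv5_descs, conv5_codebook)    # 1000-dim
    return np.concatenate([motion, appearance])                # 3000-dim
```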
- 630 annotated occurrences of ground truth manipulation actions
- 5 distinct objects
- 7 actions
- label ground truth temporal extent of actions
- label ground truth object states in the 40 seconds before and after manipulation action
- annotate 19k tracklets, of which 35 are state 1, 40 are state 2, 25% are ambiguous (e.g. a half-full cup), and 40% (!) are false positives. Note that these annotations are used for evaluation, not model training.
- use a precision score averaged over all videos (a small evaluation sketch follows this list)
- temporal action localization counts as correct if the predicted time falls within the ground truth time interval
- a state prediction is correct if it matches the ground truth state
- use random features to avoid a non-trivial analytic derivation for the "Constraints only" problem
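A sketch of the evaluation as described above, with illustrative data structures: one predicted action time per video, counted correct if it falls inside the ground-truth interval, and precision averaged over videos; state precision is over tracklet-level predictions.

```python
def action_localization_precision(pred_times, gt_intervals):
    """pred_times: {video_id: predicted action time}
    gt_intervals: {video_id: (t_start, t_end)} ground-truth temporal extent.
    A prediction is correct if it falls within the interval; scores are
    averaged over videos."""
    scores = [1.0 if gt_intervals[v][0] <= t <= gt_intervals[v][1] else 0.0
              for v, t in pred_times.items()]
    return sum(scores) / max(len(scores), 1)

def state_precision(pred_states, gt_states):
    """pred_states / gt_states: {tracklet_id: 1 or 2}; a prediction is
    correct if it matches the ground-truth state label."""
    correct = sum(1 for t, s in pred_states.items() if gt_states.get(t) == s)
    return correct / max(len(pred_states), 1)
```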
The joint model mostly improves action localization, while actions only mildly benefit object state detection.