
1702.02738


ICCV 2017

[Arxiv 1702.02738] Joint Discovery of Object States and Manipulation Actions [project page] [PDF] [code (matlab)] [notes]

Jean-Baptiste Alayrac, Josef Sivic, Ivan Laptev, Simon Lacoste-Julien

read 2019/07/10

Objective

Propose to automatically discover object states and the manipulation actions that modify them. Introduce a dataset of real-life object manipulations.

Synthesis

Model

Hypotheses

  • two distinct object states are temporally separated by a manipulation action

  • the clustering jointly:
    • groups object states with similar appearance and consistent temporal locations with respect to the action
    • finds similar manipulation actions separating those object states

  • only one object is manipulated at a time

  • several videos of manipulation of the same object are available

  • an a-priori model of the object is available, in the form of a pre-trained object detector

  • jointly cluster appearance of the object into two classes, while localizing a consistent action that causes the state transition

Objective

  • localize the temporal extent of the action
  • spatially/temporally localize the manipulated object and identify its states over time.

Discovering object states

  • rely on pre-trained object detectors to spatially localize manipulated objects
  • temporally identify individual states
  • each clip n comes with a set of M_n tracklets of the object of interest (each shorter than 1 s, the hope being that a tracklet covers only one state of the object)
  • The object states of video n are described by a binary matrix Y_n of size M_n x 2, where a 1 at position (m, k) indicates that tracklet m shows the object in state k; possibly (Y_n)_{m,1} == 0 and (Y_n)_{m,2} == 0, meaning the tracklet shows the object in neither state 1 nor state 2 (this models false detections of the object)
  • tracklets are represented by d_s-dimensional features (d_s for dim_state); X_s is an M x d_s matrix that contains the features of all the tracklets
  • W_s is the object state classifier learnt on top of these features to classify each tracklet as state 1, state 2, or neither
  • Constraints (a toy check is sketched after this list)
    • Only one tracklet is in state 1 or 2 at a given time (only one object manipulated at a time)
    • Both states are present, and state 1 must appear before state 2 (ordering constraint)
    • At least one tracklet is labeled as state 1 and one as state 2 (at-least-one constraint)
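A toy check of these constraints, assuming NumPy; the clip, the tracklet times and the exact formalisation of the constraints are made up for illustration and are not from the paper:

```python
import numpy as np

# Hypothetical toy clip: M_n = 4 tracklets, 2 states.
# Y_n[m, k] == 1 means tracklet m shows the object in state k + 1;
# an all-zero row marks a tracklet in neither state (e.g. a false detection).
Y_n = np.array([
    [1, 0],   # tracklet 0: state 1
    [0, 0],   # tracklet 1: neither state
    [0, 1],   # tracklet 2: state 2
    [0, 0],   # tracklet 3: neither state
])
t_n = np.array([2.0, 5.0, 9.0, 12.0])  # time stamp (seconds) of each tracklet

def satisfies_state_constraints(Y, t):
    """One possible reading of the three constraints listed above."""
    # at-least-one: each state is assigned to at least one tracklet
    if Y[:, 0].sum() < 1 or Y[:, 1].sum() < 1:
        return False
    # ordering: every state-1 tracklet occurs before every state-2 tracklet
    ordering = t[Y[:, 0] == 1].max() < t[Y[:, 1] == 1].min()
    # only one object manipulated: at any time stamp, at most one tracklet is in a state
    assigned_times = t[Y.sum(axis=1) >= 1]
    only_one = len(assigned_times) == len(np.unique(assigned_times))
    return ordering and only_one

print(satisfies_state_constraints(Y_n, t_n))  # True for this toy assignment
```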

Action localization

Action localization is expressed as a binary assignment vector Z_n in {0, 1}^{T_n} for each clip n, where T_n is the number of time steps of the clip.

(Z_n)_t = 1 means the action is present at time step t, whereas (Z_n)_t = 0 means the action is absent.

W_v is the action classifier.

  • Constraints
    • Exactly one time interval for an action per clip (action saliency constraint); see the sketch below
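A matching sketch for the action side, again with hypothetical values; the saliency constraint is read here as "exactly one contiguous run of ones in Z_n":

```python
import numpy as np

# Hypothetical toy clip: T_n = 8 time steps; Z_n[t] == 1 marks the action at step t.
Z_n = np.array([0, 0, 1, 1, 0, 0, 0, 0])

def has_single_action_interval(Z):
    """Action saliency constraint (assumed reading): exactly one contiguous interval."""
    padded = np.concatenate(([0], Z, [0]))
    # count 0 -> 1 transitions; a single interval gives exactly one rising edge
    rising_edges = int(np.sum((padded[1:] == 1) & (padded[:-1] == 0)))
    return rising_edges == 1

print(has_single_action_interval(Z_n))  # True
```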

Joint constraint

  • Enforce that the action occurs between the two object states
    • Penalize that state 1 is detected after the action
    • Penalize that state 2 is detected before the action
  • Links the time of the tracklet and the time of the action (note that time does not appear in the object state classification loss so far)

$ d(Z_n, Y_n) = \sum_{y \in S_1(Y_n)} [t_y - t_{Z_n}]_{+} + \sum_{y \in S_2(Y_n)} [t_{Z_n} - t_y]_{+} $

Where $t_{Z_n}$ is the time when action $Z_n$ occurs, and $t_y$ is the time tracklet y occurs.

$S_1(Y_n)$ and $S_2(Y_n)$ are the sets of tracklets of the n-th clip assigned to state 1 and state 2, respectively.

$[x]_{+} = \max(x, 0)$ denotes the positive part of x.

"The function penalizes the inconsistent assignment of objects states $Y_n$ by the amount of time that separates the incorrectly assigned tracklet and the manipulation action in the clip"

Optimization

The problem is NP-hard in general. To relax it, the binary variables in {0, 1} are allowed to take values in [0, 1].
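The notes do not describe how the relaxed solution is turned back into a binary assignment; purely as an illustration of the relax-then-round idea (the function and threshold below are hypothetical, not the paper's procedure), one simple rounding step could look like this:

```python
import numpy as np

def round_state_assignment(Y_relaxed, threshold=0.5):
    """Round a relaxed assignment in [0, 1]^{M x 2} back to {0, 1}^{M x 2}:
    keep the higher-scoring state per tracklet if it exceeds the threshold,
    otherwise leave the tracklet unassigned. The rounded matrix should still
    be checked against the constraints listed above."""
    Y = np.zeros_like(Y_relaxed, dtype=int)
    best_state = Y_relaxed.argmax(axis=1)
    confident = Y_relaxed.max(axis=1) >= threshold
    Y[np.flatnonzero(confident), best_state[confident]] = 1
    return Y

print(round_state_assignment(np.array([[0.9, 0.1], [0.3, 0.2], [0.2, 0.7]])))
# [[1 0]
#  [0 0]
#  [0 1]]
```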

Experiments

Getting tracklets

  • run the object detector for each of the involved objects
  • subdivide each track into small tracklets (see the sketch after this list)
  • fine-tune the object detector (Fast R-CNN) for the specific object classes (~500 labeled images per class)
  • track objects using a generic object tracker
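A sketch of the subdivision step (the fps, tracklet length and box format are assumptions, not from the paper):

```python
def split_track_into_tracklets(track_boxes, fps=30, max_len_s=1.0):
    """Split one per-frame box track into tracklets of less than ~1 second each,
    so that each tracklet hopefully covers a single object state."""
    step = max(1, int(fps * max_len_s))
    return [track_boxes[i:i + step] for i in range(0, len(track_boxes), step)]

# e.g. a 75-frame track at 30 fps -> tracklets of 30, 30 and 15 frames
boxes = [(10, 10, 50, 50)] * 75  # hypothetical (x, y, w, h) detections
print([len(t) for t in split_track_into_tracklets(boxes)])  # [30, 30, 15]
```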

Features

Object features

  • average of the ROI-pooled object features over the tracklet
  • 8k-dimensional

Action features

  • motion features: 2k-dimensional bag-of-words of HOF (histograms of local optical flow)
  • appearance features: 1k-dimensional bag-of-words of conv5 features from VGG16
  • concatenated into a 3k-dimensional feature vector for each chunk of 10 frames (see the sketch after this list)
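A sketch that only makes the feature dimensions explicit (the ROI features and the bag-of-words encodings are assumed to be computed elsewhere; the shapes follow the numbers above):

```python
import numpy as np

def object_feature(roi_features):
    """Tracklet descriptor: average the per-frame ROI-pooled object features
    (8k-dimensional) over the tracklet."""
    return np.mean(roi_features, axis=0)

def action_feature(hof_bow, conv5_bow):
    """Per-chunk action descriptor: concatenate the 2k-d HOF bag-of-words with
    the 1k-d conv5 bag-of-words into a 3k-d vector."""
    return np.concatenate([hof_bow, conv5_bow])

roi = np.random.rand(12, 8000)                                     # 12-frame tracklet
x_s = object_feature(roi)                                          # shape (8000,)
x_v = action_feature(np.random.rand(2000), np.random.rand(1000))  # shape (3000,)
print(x_s.shape, x_v.shape)
```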

Dataset

  • 630 annotated occurrences of ground truth manipulation actions
  • 5 distinct objects
  • 7 actions
  • label ground truth temporal extent of actions
  • label ground truth object states in the 40 seconds before and after manipulation action
  • annotate 19k tracklets, of which 35 are state 1, 40 are state 2, 25% ambiguous (half-full cup), 40% (!) false positives. Note that annotation is used for evaluation, not model training.

Evaluation

  • use precision score averaged over all the videos
  • temporal action localization is considered correct if it falls within the ground truth time interval (see the sketch after this list)
  • state prediction is correct if it matches the ground truth state
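A sketch of the localization metric under one assumed formalisation (one predicted action time per video, counted as correct if it lies inside the ground truth interval):

```python
def temporal_precision(predicted_times, gt_intervals):
    """Fraction of videos whose predicted action time falls inside
    the ground-truth interval."""
    correct = sum(start <= t <= end
                  for t, (start, end) in zip(predicted_times, gt_intervals))
    return correct / len(predicted_times)

# Hypothetical example: 2 of 3 predictions fall inside their intervals
print(temporal_precision([4.0, 12.0, 30.0], [(3, 6), (10, 11), (28, 35)]))  # 0.666...
```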

Baselines

  • use random features to avoid a non-trivial analytic derivation for the "Constraints only" problem

Results

Most of the improvement from the joint model is on action localization, while actions in turn only mildly benefit object state detection.

Further readings

Cited

  • P. Isola, J. J. Lim, and E. H. Adelson. Discovering states and transformations in image collections. In CVPR, 2015.
  • A. Farhadi, I. Endres, D. Hoiem, and D. Forsyth. Describing objects by their attributes. In CVPR, 2009.
  • X. Wang, A. Farhadi, and A. Gupta. Actions ~ Transformations. In CVPR, 2016.
  • F. Bach and Z. Harchaoui. DIFFRAC: A discriminative and flexible framework for clustering. In NIPS, 2007.
  • A. Joulin, K. Tang, and L. Fei-Fei. Efficient image and video co-localization with Frank-Wolfe algorithm. In ECCV, 2014.