[arXiv 1702.02738] Joint Discovery of Object States and Manipulation Actions [project page] [PDF] [code (matlab)] [notes]
Jean-Baptiste Alayrac, Josef Sivic, Ivan Laptev, Simon Lacoste-Julien
read 2019/07/10
Propose to automatically discover object states and the manipulation actions that modify these states. Introduce a dataset of real-life object manipulation videos.
- two distinct object states are temporally separated by a manipulation action
- cluster:
  - groups object states with similar appearance and consistent temporal locations with respect to the action
  - finds similar manipulation actions separating those object states
- only one object is manipulated at a time
- get several videos of manipulation of the same object
- a priori model of the corresponding object in the form of a pre-trained object detector
- jointly cluster the appearance of the object into two classes (states), while localizing a consistent action that causes the state transition
- localize the temporal extent of the action
- spatially/temporally localize the manipulated object and identify its states over time.
- rely on pre-trained object detectors to spatially localize manipulated objects
- temporally identify individual states
- each clip n is accompanied by a set of M_n tracklets (less than 1 sec each, hoping that each tracklet covers only one state of the object) of the object of interest
- The object states of video n are described by a binary matrix Y_n of size M_n x 2, where (Y_n)_{m,k} = 1 indicates that tracklet m shows the object in state k; potentially (Y_n)_{m,1} = 0 and (Y_n)_{m,2} = 0, i.e. the tracklet shows an object that is in neither state 1 nor state 2 (this can model false detections of the object)
- tracklets are represented by d_s-dimensional features (d_s for dim_state); X_s is an M x d_s matrix that contains the features of all the tracklets
- W_s is the object state classifier learnt on top of these features to classify each tracklet as state 1, state 2, or neither
- Constraints (a minimal sketch of this clustering setup follows the list)
  - Only one tracklet is in state 1 or 2 at a specific time (only one object manipulated)
  - Both states are present and state 1 appears before state 2 (ordering constraint)
  - At least one tracklet is labeled as state 1 and one as state 2 (at-least-one constraint)
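A minimal sketch of how the state side might look. The notes only say that W_s is learnt on top of the tracklet features, so the DIFFRAC-style square-loss discriminative clustering cost below is an assumption; the function names, `lam`, and the constraint checks are illustrative readings of the bullets above.

```python
import numpy as np

def state_clustering_cost(X_s, Y, lam=1e-3):
    """Cost of a candidate state assignment Y (M x 2, binary) for tracklet
    features X_s (M x d_s), assuming a square-loss discriminative clustering
    objective min_W ||X_s W - Y||^2 + lam ||W||^2; the inner minimization
    over the state classifier W_s has a closed form (ridge regression)."""
    M, d_s = X_s.shape
    W_s = np.linalg.solve(X_s.T @ X_s + lam * np.eye(d_s), X_s.T @ Y)
    return np.linalg.norm(X_s @ W_s - Y) ** 2 + lam * np.linalg.norm(W_s) ** 2

def satisfies_state_constraints(Y, times):
    """One reading of the constraints above, for a single clip.
    Y: (M_n, 2) binary state assignment, times: (M_n,) array of tracklet times."""
    s1, s2 = times[Y[:, 0] == 1], times[Y[:, 1] == 1]
    if len(s1) == 0 or len(s2) == 0:               # at least one tracklet per state
        return False
    if s1.max() >= s2.min():                       # state 1 appears before state 2
        return False
    labelled = times[Y.sum(axis=1) >= 1]
    if len(labelled) != len(np.unique(labelled)):  # one labelled tracklet per time step
        return False
    return True
```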
The action is expressed as an assignment vector Z_n in {0, 1}^{T_n} for each clip n (T_n time steps).
(Z_n)_t = 1 means the action is present at time step t, whereas (Z_n)_t = 0 means the action is absent.
W_v is the action classifier.
- Constraints
  - Exactly one time interval for an action per clip (action saliency constraint)
  - Enforce that the action occurs between the two object states
  - Penalize state 1 being detected after the action
  - Penalize state 2 being detected before the action
- Links the times of the tracklets to the time of the action (note that time did not enter the object state loss so far)
$ d(Z_n, Y_n) = \sum_{y \in S_1(Y_n)} [t_{Z_n} - t_y]_{-} + \sum_{y \in S_2(Y_n)} [t_{Z_n} - t_y]_{+} $
where [x]_{+} = max(x, 0) is the positive part, [x]_{-} = max(-x, 0) is the negative part, t_y is the time of tracklet y, t_{Z_n} is the time of the action, and S_k(Y_n) is the set of tracklets assigned to state k.
"The function penalizes the inconsistent assignment of objects states
The problem is NP-hard in general. To relax it, they allow values in [0, 1] instead of {0, 1}.
- run object detector for each of the involved objects
- subdivide each track into small tracklets
- fine-tune the object detector (Fast-RCNN) for the specific object classes (~500 labeled images per class)
- track objects using generic object tracker
- average object features from ROI pooling over each tracklet (8k-dimensional)
- motion features: 2k-dimensional bag-of-words of HOF (histograms of optical flow)
- appearance features: 1k-dimensional bag-of-words of conv5 features from VGG16
- giving a 3k-dimensional feature vector for each chunk of 10 frames (see the sketch after this list)
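A sketch of how the per-chunk action descriptor could be assembled, assuming the bag-of-words encodings are simple hard-assignment histograms over pre-computed codebooks; the codebook sizes 2000 and 1000 come from the dimensions quoted above, everything else (names, normalization) is illustrative.

```python
import numpy as np

def bow_histogram(local_descriptors, codebook):
    """Hard-assignment bag-of-words: quantize each local descriptor to its
    nearest codeword and return an L1-normalized histogram of codeword counts."""
    d2 = ((local_descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    words = d2.argmin(axis=1)
    hist = np.bincount(words, minlength=len(codebook)).astype(float)
    return hist / max(hist.sum(), 1.0)

def chunk_descriptor(hof_descs, conv5_descs, hof_codebook, conv5_codebook):
    """3k-dim descriptor for a 10-frame chunk: 2k HOF BoW + 1k conv5 BoW."""
    motion = bow_histogram(hof_descs, hof_codebook)            # 2000-dim
    appearance = bow_histogram(conv5_descs, conv5_codebook)    # 1000-dim
    return np.concatenate([motion, appearance])                # 3000-dim
```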
- 630 annotated occurrences of ground truth manipulation actions
- 5 distinct objects
- 7 actions
- label ground truth temporal extent of actions
- label ground truth object states in the 40 seconds before and after manipulation action
- annotate 19k tracklets, of which 35 are state 1, 40 are state 2, 25% are ambiguous (e.g. a half-full cup), and 40% (!) are false positives. Note that these annotations are used for evaluation, not model training.
- use a precision score averaged over all videos (a small evaluation sketch follows this list)
- temporal action localization counts as correct if the predicted time falls within the ground truth time interval
- a state prediction is correct if it matches the ground truth state
- use random features to avoid a non-trivial analytic derivation for the "Constraints only" problem
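A sketch of the evaluation as described above, with illustrative data structures: one predicted action time per video, counted correct if it falls inside the ground-truth interval, and precision averaged over videos; state precision is over tracklet-level predictions.

```python
def action_localization_precision(pred_times, gt_intervals):
    """pred_times: {video_id: predicted action time}
    gt_intervals: {video_id: (t_start, t_end)} ground-truth temporal extent.
    A prediction is correct if it falls within the interval; scores are
    averaged over videos."""
    scores = [1.0 if gt_intervals[v][0] <= t <= gt_intervals[v][1] else 0.0
              for v, t in pred_times.items()]
    return sum(scores) / max(len(scores), 1)

def state_precision(pred_states, gt_states):
    """pred_states / gt_states: {tracklet_id: 1 or 2}; a prediction is
    correct if it matches the ground-truth state label."""
    correct = sum(1 for t, s in pred_states.items() if gt_states.get(t) == s)
    return correct / max(len(pred_states), 1)
```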
The joint model mostly improves action localization, while actions only mildly benefit object state detection.