
CVPR19 1273


CVPR 2019

[CVPR19 1273] Completeness Modeling and Context Separation for Weakly Supervised Temporal Action Localization [PDF] [Sup Mat] [code] [notes]

Daochang Liu, Tingting Jiang, Yizhou Wang

read 2019/07/12

Objective

Synthesis

Focuses on temporal action localization in untrimmed videos: models actions as compositions of parts with a multi-branch network and separates actions from their semantically close context, in order to perform weakly-supervised action localization.

Definitions

  • action: temporal composition of elementary sub-actions
  • action-context separation: differentiate the action from a background video class that still contains semantically close context
  • context clips: co-occur with action clips but do not contain the action (e.g. a video of a billiard table while nobody is playing)
  • background clips: class-independent clips not containing a specific action or its context (video of a landscape)

Model

Classification

Train time

Predict a (class_nb + 1)-dimensional one-hot encoding vector y, where the extra class is the background.

Test time

Produces a start time, an end time, a class, and a confidence score for each predicted segment.

Context-video (hard-negative) mining

  • compute TV-L1 optical flow
  • get the average optical flow magnitude for each frame
  • the 25% of frames with the lowest optical flow magnitude are picked out of each video
  • these low-flow frames are concatenated into pseudo-videos labelled as the background class (yana: risk of sharp transitions in these pseudo-videos? e.g. if frames are sampled before and after the main action, with a gap in the middle)
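
A minimal numpy sketch of this selection step, assuming the per-frame mean TV-L1 flow magnitudes have already been computed (the function name and the 25% default are illustrative, not from the paper):

```python
import numpy as np

def mine_context_frames(flow_magnitudes, ratio=0.25):
    """Pick the indices of the frames with the lowest mean optical flow
    magnitude; these low-motion frames are treated as context.

    flow_magnitudes: 1D array with the mean TV-L1 flow magnitude per frame.
    ratio: fraction of frames to keep (25% in the paper).
    """
    num_keep = max(1, int(len(flow_magnitudes) * ratio))
    # argsort ascending: frames with the least motion come first
    low_motion = np.argsort(flow_magnitudes)[:num_keep]
    # restore temporal order before concatenating into a pseudo-video
    return np.sort(low_motion)
```

The selected frames would then be stacked into a pseudo-video and labelled with the background class, as described above.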

Multi-branch network

Feature extraction

  • extract video features X ∈ R^{T×D}, where T is the number of snippets and D the snippet feature dimension (I3D or UntrimmedNet embeddings)
  • a temporal convolutional layer + ReLU is used to embed these features so as to specialize them for action localization (i.e. a one-layer non-linear embedding)
  • multi-branch classification
    • K classification branches in parallel
    • each branch takes the embedded feature sequence as input and performs temporal convolutions to output a sequence of classification scores in R^{T×(C+1)}, followed by a softmax along the category dimension, resulting in a class activation sequence (CAS): a class distribution at each time location
  • diversity loss: sum of pairwise cosine scores between the CASes. This encourages branches to produce activations in different action parts (see the sketch after this list)
  • the final score is obtained by averaging the CASes from the multiple branches and passing the result through a softmax along the category dimension
  • additionally, there appears to be a need to normalize the norms of the CASes to be close to their average norm, so that no single branch dominates the others (otherwise the authors observe explosion/collapse)
  • this formulation allows the network to discover diverse action parts without full supervision
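
A rough PyTorch sketch of the multi-branch CAS, the diversity loss, and the norm guidance described in this list. Layer shapes, kernel sizes and the exact form of the normalization term are assumptions, not the paper's implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiBranchCAS(nn.Module):
    """K parallel classification branches on top of a shared embedding,
    each producing a class activation sequence (CAS)."""

    def __init__(self, feat_dim, num_classes, num_branches=4):
        super().__init__()
        # one-layer non-linear embedding of the snippet features
        self.embed = nn.Conv1d(feat_dim, feat_dim, kernel_size=3, padding=1)
        self.branches = nn.ModuleList(
            [nn.Conv1d(feat_dim, num_classes + 1, kernel_size=1)
             for _ in range(num_branches)])

    def forward(self, x):  # x: (B, T, D) snippet features
        h = F.relu(self.embed(x.transpose(1, 2)))  # (B, D, T)
        # each CAS: (B, T, C+1), softmax along the category dimension
        return [F.softmax(branch(h).transpose(1, 2), dim=-1)
                for branch in self.branches]

def diversity_loss(cas_list):
    """Mean pairwise cosine similarity between branch CASes, computed
    along the temporal dimension; minimizing it pushes branches to
    activate on different action parts."""
    loss, pairs = 0.0, 0
    for i in range(len(cas_list)):
        for j in range(i + 1, len(cas_list)):
            loss = loss + F.cosine_similarity(
                cas_list[i], cas_list[j], dim=1).mean()
            pairs += 1
    return loss / pairs

def norm_guidance(cas_list):
    """Penalize deviation of each branch's CAS norm from the average
    norm so that no single branch dominates (assumed form)."""
    norms = torch.stack([cas.norm(dim=1) for cas in cas_list])  # (K, B, C+1)
    return ((norms - norms.mean(dim=0, keepdim=True)) ** 2).mean()
```

The diversity term would be minimized jointly with the classification loss, so the branches are pushed to fire on different temporal regions of the same action.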

Temporal attention

Feeds the feature sequence X into a temporal convolutional layer followed by a softmax along the temporal dimension to obtain attention weights. Video-level predictions are obtained by applying a softmax along the category dimension to the temporal-attention-weighted summation of the CAS.

  • Classification is performed using standard cross-entropy loss
  • training is performed using a weighted sum of the losses
  • thresholding on the CAS scores is performed to obtain the candidate action segments (sketched below)
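
A small sketch of the attention pooling and the CAS thresholding, under the same shape assumptions as the previous block (the attention layer, helper names and the threshold value are illustrative):

```python
import torch.nn.functional as F

def video_level_scores(features, cas, attn_conv):
    """Temporal attention head: a temporal convolution scores each
    snippet, a softmax along time yields attention weights, and the
    attention-weighted CAS gives video-level class scores.
    `attn_conv` is assumed to be an nn.Conv1d(feat_dim, 1, 1)."""
    # features: (B, T, D), cas: (B, T, C+1)
    logits = attn_conv(features.transpose(1, 2)).squeeze(1)  # (B, T)
    attn = F.softmax(logits, dim=1)                          # over time
    pooled = (attn.unsqueeze(-1) * cas).sum(dim=1)           # (B, C+1)
    return F.softmax(pooled, dim=-1), attn

def threshold_segments(cas_c, thresh=0.5):
    """Turn one class's CAS (shape (T,)) into candidate (start, end)
    segments by thresholding; the threshold value is illustrative."""
    above = (cas_c > thresh).tolist() + [False]  # sentinel closes open runs
    segments, start = [], None
    for t, flag in enumerate(above):
        if flag and start is None:
            start = t
        elif not flag and start is not None:
            segments.append((start, t - 1))
            start = None
    return segments
```

At test time, `threshold_segments` would be run per class on the averaged CAS to produce the (start, end, class, confidence) outputs mentioned earlier.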

Experiments

  • Using at least two classification heads is important; more does not bring visible improvements
  • the diversity loss is important and brings a significant improvement
  • multiple heads bring more improvement than hard-negative mining
  • temporal attention and diversity loss both bring marginal improvements

To read

Citing

Zhao et al., Temporal action detection with structured segment networks, ICCV 2017

Hou et al., Real-time temporal action localization in untrimmed videos by sub-action discovery, BMVC 2017

Yuan et al., Temporal action localization by structured maximal sums, CVPR 2017

structures action clips into beginning, middle and end to model temporal evolution

Shou et al., AutoLoc: weakly-supervised temporal action localization in untrimmed videos, ECCV 2018

Selection module in UntrimmedNet: Wang et al., UntrimmedNets for weakly supervised action recognition and detection, CVPR 2017
