
CVPR19 1273


CVPR 2019

[CVPR19 1273] Completeness Modeling and Context Separation for Weakly Supervised Temporal Action Localization [PDF] [Sup Mat] [code] [notes]

Daochang Liu, Tingting Jiang, Yizhou Wang

read 2019/07/12

Objective

Synthesis

Focuses on temporal action localization in untrimmed videos: models actions as compositions of parts with a multi-branch network and separates actions from their semantically close context, in order to perform weakly-supervised action localization.

Definitions

  • action: temporal composition of elementary sub-actions
  • action-context separation: differentiate the action from a background video class that still contains semantically close context
  • context clips: co-occur with action clips but do not contain the action (e.g. a video of a billiard table while nobody is playing)
  • background clips: class-independent clips not containing a specific action or its context (video of a landscape)

Model

Classification

Train time

Predict a (class_nb + 1)-dimensional one-hot encoding vector y, where the extra class is the background.

Test time

Produces a start time, an end time, a class, and a confidence score for each predicted segment.

Context-video (hard-negative) mining

  • compute TV-L1 optical flow
  • get the average optical flow magnitude for each frame
  • the 25% of frames with the lowest optical flow magnitude are picked out of each video
  • these low-flow frames are concatenated into pseudo-videos labelled as the background class (yana: risk of sharp transitions in these pseudo-videos? e.g. if frames are sampled before and after the main action, with a gap in the middle)
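
A minimal numpy sketch of this selection step, assuming the per-frame mean TV-L1 flow magnitudes have already been computed (the function name and the 25% default are illustrative, not from the paper):

```python
import numpy as np

def mine_context_frames(flow_magnitudes, ratio=0.25):
    """Pick the indices of the frames with the lowest mean optical flow
    magnitude; these low-motion frames are treated as context.

    flow_magnitudes: 1D array with the mean TV-L1 flow magnitude per frame.
    ratio: fraction of frames to keep (25% in the paper).
    """
    num_keep = max(1, int(len(flow_magnitudes) * ratio))
    # argsort ascending: frames with the least motion come first
    low_motion = np.argsort(flow_magnitudes)[:num_keep]
    # restore temporal order before concatenating into a pseudo-video
    return np.sort(low_motion)
```

The selected frames would then be stacked into a pseudo-video and labelled with the background class, as described above.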

Multi-branch network

Feature extraction

  • extract video features X ∈ R^{T×D}, where T is the number of snippets and D the snippet feature dimension (I3D or UntrimmedNet embeddings)
  • a temporal convolutional layer + ReLU is used to embed these features so as to specialize them for action localization (i.e. a one-layer non-linear embedding)
  • multi-branch classification
    • K classification branches in parallel
    • each branch takes the embedded feature sequence as input and performs temporal convolutions to output a sequence of classification scores in R^{T×(C+1)}, followed by a softmax along the category dimension, resulting in a class activation sequence (CAS): a class distribution at each time location
  • diversity loss: sum of pairwise cosine scores between the CASes. This encourages branches to produce activations in different action parts (see the sketch after this list)
  • the final score is obtained by averaging the CASes from the multiple branches and passing the result through a softmax along the category dimension
  • additionally, there appears to be a need to normalize the norms of the CASes to be close to their average norm, so that no single branch dominates the others (otherwise the authors observe explosion/collapse)
  • this formulation allows the network to discover diverse action parts without full supervision
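
A rough PyTorch sketch of the multi-branch CAS, the diversity loss, and the norm guidance described in this list. Layer shapes, kernel sizes and the exact form of the normalization term are assumptions, not the paper's implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiBranchCAS(nn.Module):
    """K parallel classification branches on top of a shared embedding,
    each producing a class activation sequence (CAS)."""

    def __init__(self, feat_dim, num_classes, num_branches=4):
        super().__init__()
        # one-layer non-linear embedding of the snippet features
        self.embed = nn.Conv1d(feat_dim, feat_dim, kernel_size=3, padding=1)
        self.branches = nn.ModuleList(
            [nn.Conv1d(feat_dim, num_classes + 1, kernel_size=1)
             for _ in range(num_branches)])

    def forward(self, x):  # x: (B, T, D) snippet features
        h = F.relu(self.embed(x.transpose(1, 2)))  # (B, D, T)
        # each CAS: (B, T, C+1), softmax along the category dimension
        return [F.softmax(branch(h).transpose(1, 2), dim=-1)
                for branch in self.branches]

def diversity_loss(cas_list):
    """Mean pairwise cosine similarity between branch CASes, computed
    along the temporal dimension; minimizing it pushes branches to
    activate on different action parts."""
    loss, pairs = 0.0, 0
    for i in range(len(cas_list)):
        for j in range(i + 1, len(cas_list)):
            loss = loss + F.cosine_similarity(
                cas_list[i], cas_list[j], dim=1).mean()
            pairs += 1
    return loss / pairs

def norm_guidance(cas_list):
    """Penalize deviation of each branch's CAS norm from the average
    norm so that no single branch dominates (assumed form)."""
    norms = torch.stack([cas.norm(dim=1) for cas in cas_list])  # (K, B, C+1)
    return ((norms - norms.mean(dim=0, keepdim=True)) ** 2).mean()
```

The diversity term would be minimized jointly with the classification loss, so the branches are pushed to fire on different temporal regions of the same action.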

Temporal attention

Feeds the feature sequence X into a temporal convolutional layer followed by a softmax along the temporal dimension to obtain attention weights. Video-level predictions are obtained by applying a softmax along the category dimension to the temporal-attention-weighted summation of the CAS.

  • Classification is performed using standard cross-entropy loss
  • training is performed using a weighted sum of the losses
  • thresholding on the CAS scores is performed to obtain the candidate action segments (sketched below)
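
A small sketch of the attention pooling and the CAS thresholding, under the same shape assumptions as the previous block (the attention layer, helper names and the threshold value are illustrative):

```python
import torch.nn.functional as F

def video_level_scores(features, cas, attn_conv):
    """Temporal attention head: a temporal convolution scores each
    snippet, a softmax along time yields attention weights, and the
    attention-weighted CAS gives video-level class scores.
    `attn_conv` is assumed to be an nn.Conv1d(feat_dim, 1, 1)."""
    # features: (B, T, D), cas: (B, T, C+1)
    logits = attn_conv(features.transpose(1, 2)).squeeze(1)  # (B, T)
    attn = F.softmax(logits, dim=1)                          # over time
    pooled = (attn.unsqueeze(-1) * cas).sum(dim=1)           # (B, C+1)
    return F.softmax(pooled, dim=-1), attn

def threshold_segments(cas_c, thresh=0.5):
    """Turn one class's CAS (shape (T,)) into candidate (start, end)
    segments by thresholding; the threshold value is illustrative."""
    above = (cas_c > thresh).tolist() + [False]  # sentinel closes open runs
    segments, start = [], None
    for t, flag in enumerate(above):
        if flag and start is None:
            start = t
        elif not flag and start is not None:
            segments.append((start, t - 1))
            start = None
    return segments
```

At test time, `threshold_segments` would be run per class on the averaged CAS to produce the (start, end, class, confidence) outputs mentioned earlier.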

Experiments

  • Using at least two classification heads is important; more does not bring visible improvements
  • the diversity loss is important and brings a significant improvement
  • multiple heads bring more improvement than hard-negative mining
  • temporal attention and diversity loss both bring marginal improvements

To read

Citing

Zhao et al., Temporal action detection with structured segment networks, ICCV 2017

Hou et al., Real-time temporal action localization in untrimmed videos by sub-action discovery, BMVC 2017

Yuan et al., Temporal action localization by structured maximal sums, CVPR 2017

structures action clips into beginning, middle and end to model temporal evolution

Shou et al., AutoLoc: weakly-supervised temporal action localization in untrimmed videos, ECCV 2018

Selection module in UntrimmedNet: Wang et al., UntrimmedNets for weakly supervised action recognition and detection, CVPR 2017
