[CVPR19 1273] Completeness Modeling and Context Separation for Weakly Supervised Temporal Action Localization [PDF] [Sup Mat] [code] [notes]
Daochang Liu, Tingting Jiang, Yizhou Wang
read 2019/07/12
Focuses on temporal action localization in untrimmed videos: models action completeness with multiple parallel classification branches and separates actions from their context to perform weakly-supervised action localization.
- action: temporal composition of elementary sub-actions
- action-context separation: differentiate the action from clips of the same video class that contain semantically close context
- context clips: co-occur with action clips but do not contain the action (e.g. a video of a billiard table while nobody is playing)
- background clips: class-independent clips not containing a specific action or its context (e.g. a video of a landscape)
Predicts a (C+1)-dimensional one-hot encoding vector y, where C is the number of action classes and the extra dimension is the background class.
Produces a start time, an end time, a class, and a confidence score for that class.
- compute TV-L1 optical flow
- get the average optical flow magnitude for each frame
- pick the 25% of frames with the lowest optical flow intensity out of each video
- concatenate these low-flow frames into pseudo videos labelled as the background class; see the sketch after this list (yana: risk of sharp transitions in these pseudo-videos? e.g. if frames are sampled before and after the main action, with a gap in the middle)
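A minimal numpy sketch of this mining step, assuming per-frame TV-L1 flow magnitudes have already been computed; the function name and array layout are hypothetical:

```python
import numpy as np

def mine_background_frames(frames, flow_mag, ratio=0.25):
    """Pick the stillest frames of a video as pseudo-background material.

    frames:   (T, H, W, 3) decoded video frames
    flow_mag: (T,) average TV-L1 optical-flow intensity per frame
    ratio:    fraction of frames to keep (25% in the notes above)
    """
    num_keep = max(1, int(len(frames) * ratio))
    # indices of the lowest-flow frames, restored to temporal order
    idx = np.sort(np.argsort(flow_mag)[:num_keep])
    # concatenating these frames forms a pseudo video labelled "background"
    return frames[idx]
```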
- extract a video feature $X \in \mathbb{R}^{T \times D}$ where T is the number of snippets and D the snippet feature dimension (I3D or UntrimmedNet embeddings)
- a temporal convolutional layer + ReLU embeds these features to specialize them for action localization (i.e. a one-layer non-linear embedding; minimal sketch below)
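A minimal PyTorch sketch of that embedding; the channel width and kernel size are assumptions, not values from the paper:

```python
import torch.nn as nn

# snippet features X: (batch, T, D), e.g. D = 1024 for I3D (dimension is an assumption)
embed = nn.Sequential(
    nn.Conv1d(in_channels=1024, out_channels=1024, kernel_size=3, padding=1),
    nn.ReLU(),
)
# nn.Conv1d expects (batch, channels, T), so transpose around the call:
# E = embed(X.transpose(1, 2)).transpose(1, 2)   # (batch, T, D)
```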
- multi-branch classification
- K classification branches in parallel
- takes the embedded feature sequence as input and performs temporal convolutions to output a sequence of classification scores in $\mathbb{R}^{T \times (C+1)}$, followed by a softmax along the category dimension, resulting in a class activation sequence (CAS): a class distribution at each temporal location
- diversity loss: sum of pairwise cosine similarities between the CASes; this encourages the branches to produce activations on different action parts
- the final score is obtained by averaging the CASes from the multiple branches and passing the result through a softmax along the category dimension
- additionally, the norms of the CASes need to be regularized to stay close to their average norm, so that no single branch dominates the others (otherwise they observe explosion/collapse); a sketch of the branch head and both auxiliary terms follows this list
- this formulation allows the network to discover diverse action parts without full supervision
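A hedged PyTorch sketch of the multi-branch head and the two auxiliary terms described above; the kernel size, the exact form of the norm regularization, and where the softmax sits are my reading of these notes, not the paper's verified implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiBranchCAS(nn.Module):
    """K parallel classification branches, each yielding a class activation sequence."""
    def __init__(self, dim=1024, num_classes=20, num_branches=2):
        super().__init__()
        self.branches = nn.ModuleList(
            [nn.Conv1d(dim, num_classes + 1, kernel_size=1) for _ in range(num_branches)]
        )

    def forward(self, x):                      # x: (B, T, D) embedded features
        x = x.transpose(1, 2)                  # (B, D, T) for Conv1d
        # softmax along the category dimension -> a class distribution per snippet
        cas = [F.softmax(b(x).transpose(1, 2), dim=-1) for b in self.branches]
        return torch.stack(cas, dim=1)         # (B, K, T, C+1)

def diversity_loss(cas):
    """Sum of pairwise cosine similarities between the branch CASes."""
    flat = cas.flatten(start_dim=2)            # (B, K, T*(C+1))
    loss = cas.new_zeros(())
    K = cas.shape[1]
    for i in range(K):
        for j in range(i + 1, K):
            loss = loss + F.cosine_similarity(flat[:, i], flat[:, j], dim=-1).mean()
    return loss

def norm_regularization(cas):
    """Keep every branch's CAS norm close to the average norm across branches."""
    norms = cas.flatten(start_dim=2).norm(dim=-1)   # (B, K)
    return (norms - norms.mean(dim=1, keepdim=True)).abs().mean()
```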
Feeds the feature sequence X into a temporal convolutional layer followed by a softmax along the temporal dimension to produce temporal attention weights. Video-level predictions are obtained with a softmax along the category dimension on the temporal-attention-weighted sum of the CASes.
- Classification is performed using a standard cross-entropy loss
- training minimizes a weighted sum of the losses (see the sketch below)
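A sketch of the temporal attention and the combined training loss, reusing `diversity_loss` and `norm_regularization` from the sketch above; the lambda weights are placeholders since the notes only mention a weighted sum:

```python
import torch.nn as nn
import torch.nn.functional as F

class TemporalAttention(nn.Module):
    """Temporal conv + softmax over time -> one attention weight per snippet."""
    def __init__(self, dim=1024):
        super().__init__()
        self.conv = nn.Conv1d(dim, 1, kernel_size=1)

    def forward(self, x):                        # x: (B, T, D)
        a = self.conv(x.transpose(1, 2))         # (B, 1, T)
        return F.softmax(a, dim=-1).squeeze(1)   # (B, T), sums to 1 over time

def video_loss(cas, attn, labels, lambda_div=0.1, lambda_norm=0.1):
    """Cross-entropy on video-level labels plus the auxiliary terms sketched above.

    cas: (B, K, T, C+1) branch CASes; attn: (B, T); labels: (B,) class indices.
    The lambda_* weights are placeholders, not the paper's values.
    """
    cas_avg = cas.mean(dim=1)                           # (B, T, C+1) averaged CAS
    pooled = (attn.unsqueeze(-1) * cas_avg).sum(dim=1)  # (B, C+1) attention-weighted sum
    cls = F.cross_entropy(pooled, labels)               # category softmax applied inside
    return cls + lambda_div * diversity_loss(cas) + lambda_norm * norm_regularization(cas)
```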
- thresholding on the CAS scores yields the candidate action segments (sketch below)
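Candidate segments can be read off by grouping consecutive above-threshold snippets; a minimal numpy sketch (the threshold value and the use of the mean CAS score as confidence are assumptions):

```python
import numpy as np

def cas_to_segments(cas_c, threshold=0.5):
    """Group consecutive snippets whose class-c CAS score exceeds the threshold.

    cas_c: (T,) activation sequence for one class
    returns: list of (start_idx, end_idx, score) candidate segments
    """
    above = cas_c > threshold
    segments, start = [], None
    for t, on in enumerate(above):
        if on and start is None:
            start = t                            # segment opens
        elif not on and start is not None:
            segments.append((start, t - 1, float(cas_c[start:t].mean())))
            start = None                         # segment closes
    if start is not None:                        # segment runs to the end
        segments.append((start, len(cas_c) - 1, float(cas_c[start:].mean())))
    return segments
```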
- Using at least two classification heads is important; more do not bring visible improvements
- the diversity loss is important and brings a significant improvement
- multiple heads bring more improvement than hard negative mining
- temporal attention and the diversity loss both bring marginal improvements
Zhao et al., Temporal action detection with structured segment networks, ICCV 2017
Hou et al., Real-time temporal action localization in untrimmed videos by sub-action discovery, BMVC 2017
Yuan et al., Temporal action localization by structured maximal sums, CVPR 2017
- structure action clips into beginning, middle, and end to model temporal evolution
Shou et al., AutoLoc: weakly-supervised temporal action localization in untrimmed videos, ECCV 2018
Selection module in UntrimmedNet: Wang et al., UntrimmedNets for weakly supervised action recognition and detection, CVPR 2017