1812.02772.md
Christopher Xie, Yu Xiang, Zaid Harchaoui, Dieter Fox
Formulate the object discovery problem as foreground motion clustering
Introduce an encoder-decoder to learn feature embeddings of pixels in videos that combine appearance and motion cues.
Learn pixel trajectory embeddings using RNN
Use foreground masks as attention for pixel trajectory clustering
Take RGB as one input and optical flow as another, encode each with a separate stream, then decode with residual connections to produce dense pixel embeddings (per-pixel vectors that encode both appearance and motion cues).
An additional conv + logit head on top of the pixel embeddings predicts the foreground mask
Trajectory embedding: a weighted sum of the pixel embeddings along the trajectory (the foreground mask provides the weights)
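A minimal numpy sketch of this weighted sum. The normalization of the weights and the final unit-normalization (so that cosine distances make sense downstream) are my assumptions about the details:

```python
import numpy as np

def trajectory_embedding(pixel_embeddings, foreground_probs):
    """Weighted sum of per-pixel embeddings along one trajectory.

    pixel_embeddings: (T, D) embedding at each step of the trajectory.
    foreground_probs: (T,) foreground mask values used as attention weights (assumed).
    """
    w = foreground_probs / (foreground_probs.sum() + 1e-8)  # normalize the weights
    emb = (w[:, None] * pixel_embeddings).sum(axis=0)       # (D,) weighted sum
    return emb / (np.linalg.norm(emb) + 1e-8)               # unit-normalize for cosine distance
```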
Compute pixel trajectories using forward/backward optical flow consistency for pixels classified as foreground
Not super clear to me what the RNN is doing; as far as I understand, its output is also dense
The setting is fully supervised for the foreground segmentation task (binary cross-entropy loss)
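For reference, the per-pixel binary cross-entropy on raw logits, in the standard numerically stable form:

```python
import numpy as np

def bce_loss(logits, targets):
    """Mean binary cross-entropy from raw logits (numerically stable form).

    Uses the identity: BCE(z, t) = max(z, 0) - z * t + log(1 + exp(-|z|)).
    """
    return np.mean(np.maximum(logits, 0) - logits * targets
                   + np.log1p(np.exp(-np.abs(logits))))
```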
For trajectory embeddings, use cosine distances: encourage embeddings of spatio-temporal points belonging to the same object to be close (specifically, close to their spherical mean), and encourage embeddings from different objects to have angles greater than some margin.
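A numpy sketch of the two terms. The exact form (per-point vs mean-to-mean distances, the margin value, the hinge) is my assumption about how such a loss is typically written, not the paper's precise formulation:

```python
import numpy as np

def spherical_mean(embs):
    """Mean direction of unit-norm embeddings, renormalized back to the sphere."""
    m = embs.mean(axis=0)
    return m / (np.linalg.norm(m) + 1e-8)

def intra_loss(embs):
    """Pull embeddings of one object toward their spherical mean (cosine distance)."""
    mu = spherical_mean(embs)
    return np.mean(1.0 - embs @ mu)

def inter_loss(embs_a, embs_b, margin=0.5):
    """Push the spherical means of two objects apart by at least `margin` in cosine distance."""
    mu_a, mu_b = spherical_mean(embs_a), spherical_mean(embs_b)
    return max(0.0, margin - (1.0 - mu_a @ mu_b))  # hinge on the cosine distance
```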
Use clustering on the trajectory embeddings to produce spatio-temporal, instance-specific masks. Specifically, use the von Mises-Fisher mean shift algorithm, which produces both the number of clusters and the clusters themselves.
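A toy numpy sketch of mean shift on the unit sphere with a von Mises-Fisher kernel; the concentration `kappa`, the mode-merging threshold, and the fixed iteration count are my assumptions, and the paper's variant may differ in these details:

```python
import numpy as np

def vmf_mean_shift(X, kappa=10.0, iters=50, merge_cos=0.99):
    """Mean shift on the unit sphere with a von Mises-Fisher kernel.

    X: (N, D) unit-norm trajectory embeddings. Returns cluster labels; the
    number of clusters falls out of how many modes the points converge to.
    """
    Y = X.copy()
    for _ in range(iters):
        W = np.exp(kappa * (Y @ X.T))  # vMF kernel weights, (N, N)
        Y = W @ X                      # weighted sum of the data points
        Y /= np.linalg.norm(Y, axis=1, keepdims=True) + 1e-8  # back onto the sphere
    # merge converged points whose modes nearly coincide
    labels = -np.ones(len(X), dtype=int)
    modes = []
    for i, y in enumerate(Y):
        for k, m in enumerate(modes):
            if y @ m > merge_cos:
                labels[i] = k
                break
        else:
            modes.append(y)
            labels[i] = len(modes) - 1
    return labels
```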
Performs better than early/late fusion baselines (early: concatenating RGB and flow; late: two separate U-Nets followed by a final conv layer). The improvement looks significant (+5-7% absolute IoU on two datasets, no improvement on a third).