1812.02772.md
Christopher Xie, Yu Xiang, Zaid Harchaoui, Dieter Fox
Formulate the object discovery problem as foreground motion clustering
Introduce an encoder-decoder to learn feature embeddings of pixels in videos that combine appearance and motion cues.
Learn pixel trajectory embeddings using RNN
Use foreground masks as attention for pixel trajectory clustering
Take RGB as one input and optical flow as another, encode each with a separate stream, then decode with residual connections to produce dense pixel embeddings (per-pixel vectors that encode both appearance and motion cues).
An additional conv + logit head on top of the pixel embeddings predicts the foreground mask
Trajectory embedding: a weighted sum of the pixel embeddings along the trajectory (the foreground mask provides the weights)
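A minimal numpy sketch of this weighted sum. The normalization of the weights and the final unit-normalization (so that cosine distances make sense downstream) are my assumptions about the details:

```python
import numpy as np

def trajectory_embedding(pixel_embeddings, foreground_probs):
    """Weighted sum of per-pixel embeddings along one trajectory.

    pixel_embeddings: (T, D) embedding at each step of the trajectory.
    foreground_probs: (T,) foreground mask values used as attention weights (assumed).
    """
    w = foreground_probs / (foreground_probs.sum() + 1e-8)  # normalize the weights
    emb = (w[:, None] * pixel_embeddings).sum(axis=0)       # (D,) weighted sum
    return emb / (np.linalg.norm(emb) + 1e-8)               # unit-normalize for cosine distance
```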
Compute pixel trajectories using forward/backward optical flow consistency for pixels classified as foreground
Not super clear to me what the RNN is doing; as far as I understand, its output is also dense
The setting is fully supervised for the foreground segmentation task (binary cross-entropy loss)
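For reference, the per-pixel binary cross-entropy on raw logits, in the standard numerically stable form:

```python
import numpy as np

def bce_loss(logits, targets):
    """Mean binary cross-entropy from raw logits (numerically stable form).

    Uses the identity: BCE(z, t) = max(z, 0) - z * t + log(1 + exp(-|z|)).
    """
    return np.mean(np.maximum(logits, 0) - logits * targets
                   + np.log1p(np.exp(-np.abs(logits))))
```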
For trajectory embeddings, use cosine distances: encourage embeddings of spatio-temporal points belonging to the same object to be close (specifically, close to their spherical mean), and encourage embeddings from different objects to have angles greater than some margin.
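A numpy sketch of the two terms. The exact form (per-point vs mean-to-mean distances, the margin value, the hinge) is my assumption about how such a loss is typically written, not the paper's precise formulation:

```python
import numpy as np

def spherical_mean(embs):
    """Mean direction of unit-norm embeddings, renormalized back to the sphere."""
    m = embs.mean(axis=0)
    return m / (np.linalg.norm(m) + 1e-8)

def intra_loss(embs):
    """Pull embeddings of one object toward their spherical mean (cosine distance)."""
    mu = spherical_mean(embs)
    return np.mean(1.0 - embs @ mu)

def inter_loss(embs_a, embs_b, margin=0.5):
    """Push the spherical means of two objects apart by at least `margin` in cosine distance."""
    mu_a, mu_b = spherical_mean(embs_a), spherical_mean(embs_b)
    return max(0.0, margin - (1.0 - mu_a @ mu_b))  # hinge on the cosine distance
```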
Use clustering on the trajectory embeddings to produce spatio-temporal, instance-specific masks. Specifically, use the von Mises-Fisher mean shift algorithm, which produces both the number of clusters and the clusters themselves.
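A toy numpy sketch of mean shift on the unit sphere with a von Mises-Fisher kernel; the concentration `kappa`, the mode-merging threshold, and the fixed iteration count are my assumptions, and the paper's variant may differ in these details:

```python
import numpy as np

def vmf_mean_shift(X, kappa=10.0, iters=50, merge_cos=0.99):
    """Mean shift on the unit sphere with a von Mises-Fisher kernel.

    X: (N, D) unit-norm trajectory embeddings. Returns cluster labels; the
    number of clusters falls out of how many modes the points converge to.
    """
    Y = X.copy()
    for _ in range(iters):
        W = np.exp(kappa * (Y @ X.T))  # vMF kernel weights, (N, N)
        Y = W @ X                      # weighted sum of the data points
        Y /= np.linalg.norm(Y, axis=1, keepdims=True) + 1e-8  # back onto the sphere
    # merge converged points whose modes nearly coincide
    labels = -np.ones(len(X), dtype=int)
    modes = []
    for i, y in enumerate(Y):
        for k, m in enumerate(modes):
            if y @ m > merge_cos:
                labels[i] = k
                break
        else:
            modes.append(y)
            labels[i] = len(modes) - 1
    return labels
```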
Performs better than early/late fusion baselines (early: concatenating RGB and flow; late: two separate U-Nets followed by a final conv layer). The improvement looks significant (+5-7% absolute IoU on two datasets, no improvement on a third).