2001.04583.md
EGO-TOPO: Environment Affordances from Egocentric Video, ArXiv'20 {project page} {paper} {notes}
Tushar Nagarajan, Yanghao Li, Christoph Feichtenhofer, Kristen Grauman
- Get an understanding of egocentric space by leveraging the fact that the video depicts a persistent physical environment
- Re-organize egocentric video into "visits" to known zones, rather than a series of unconnected clips
- Demonstrate their model on two tasks:
  - inferring likely object interactions in novel views more robustly
  - long-term action anticipation (SOTA)
- Localize commonly visited spaces in the video as "zones"
  - train a siamese similarity network that takes pairs of frames as input and predicts whether they show the same zone (see the sketch after this list)
    - frames are treated as similar if
      - they are close in time, or
      - a consistent homography exists between them, supported by at least 10 keypoint matches (to capture repeated backgrounds); SuperPoint keypoint descriptors are used to estimate homographies, and Euclidean distance between pretrained ResNet-152 features measures visual similarity
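
A minimal sketch of these pairing criteria, not the authors' implementation: ORB stands in for the paper's SuperPoint descriptors, and the time window and feature-distance thresholds (`max_dt`, `max_feat_dist`) are made-up illustrative values.

```python
import cv2
import numpy as np
import torch
import torchvision.models as models
import torchvision.transforms as T

def homography_inliers(gray_a, gray_b):
    """Count keypoint matches consistent with a single homography."""
    orb = cv2.ORB_create(nfeatures=1000)
    kp_a, des_a = orb.detectAndCompute(gray_a, None)
    kp_b, des_b = orb.detectAndCompute(gray_b, None)
    if des_a is None or des_b is None:
        return 0
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = matcher.match(des_a, des_b)
    if len(matches) < 4:  # need >= 4 correspondences to fit a homography
        return 0
    src = np.float32([kp_a[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
    dst = np.float32([kp_b[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)
    H, mask = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)
    return int(mask.sum()) if mask is not None else 0

# Pretrained ResNet-152 as a global appearance descriptor.
resnet = models.resnet152(pretrained=True)
resnet.fc = torch.nn.Identity()  # drop the classification head
resnet.eval()
preprocess = T.Compose([
    T.ToPILImage(), T.Resize(256), T.CenterCrop(224), T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def visual_distance(img_a, img_b):
    """Euclidean distance between ResNet-152 features of two RGB frames."""
    with torch.no_grad():
        feats = resnet(torch.stack([preprocess(img_a), preprocess(img_b)]))
    return torch.dist(feats[0], feats[1]).item()

def same_zone(img_a, img_b, t_a, t_b, max_dt=2.0, max_feat_dist=20.0):
    """Label a frame pair as 'same zone' (assumed thresholds)."""
    if abs(t_a - t_b) <= max_dt:  # criterion 1: close in time
        return True
    gray_a = cv2.cvtColor(img_a, cv2.COLOR_RGB2GRAY)
    gray_b = cv2.cvtColor(img_b, cv2.COLOR_RGB2GRAY)
    return (homography_inliers(gray_a, gray_b) >= 10   # criterion 2: geometry
            and visual_distance(img_a, img_b) <= max_feat_dist)
```

Presumably pairs labeled this way supervise the siamese network, which then scores frame pairs on its own at inference time.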
- Generate an action localization graph (see the sketch after this list)
- visits: clips from the egocentric video at that location
- nodes: groups of frames with high similarity
- edges: connect temporally adjacent nodes
- uncertain frames are ignored
  - aggregate nodes across videos using an action classifier + object detector; link nodes with functional similarity
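
A hedged sketch of the visit/node/edge bookkeeping above, assuming a per-frame zone assignment `zone_ids` already produced by the similarity network (uncertain frames marked `None`), with `networkx` as a stand-in graph structure:

```python
import networkx as nx

def build_localization_graph(zone_ids, frame_times):
    """zone_ids[i] is the zone of frame i, or None if uncertain."""
    G = nx.Graph()
    prev_zone = None
    for zone, t in zip(zone_ids, frame_times):
        if zone is None:           # uncertain frames are ignored
            continue
        if zone not in G:
            G.add_node(zone, visits=[])
        if prev_zone == zone:      # extend the current visit (clip)
            G.nodes[zone]["visits"][-1].append(t)
        else:
            G.nodes[zone]["visits"].append([t])  # start a new visit
            if prev_zone is not None:
                G.add_edge(prev_zone, zone)      # temporal adjacency edge
        prev_zone = zone
    return G
```

Cross-video aggregation by functional similarity would add further edges between nodes whose detected actions/objects agree; that step is omitted here.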
- Affordances
  - the action localization graph allows predicting new action/object pairs at locations where they have never been observed (by transferring them from previously seen locations)
  - framed as multi-class classification of possible interactions (see the sketch below)
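
A sketch of this classification framing under assumptions: the feature dimension, mean-pooling over visits, and `num_interactions` are illustrative, not the paper's architecture. Since a zone can afford several interactions, the sketch scores each interaction independently with a sigmoid; swap in a softmax if a strict single-label reading is intended.

```python
import torch
import torch.nn as nn

class ZoneAffordanceHead(nn.Module):
    """Scores possible interactions for a zone from its visit features."""
    def __init__(self, feat_dim=2048, num_interactions=120):
        super().__init__()
        self.classifier = nn.Linear(feat_dim, num_interactions)

    def forward(self, visit_feats):
        # visit_feats: (num_visits, feat_dim) clip features for one zone.
        zone_feat = visit_feats.mean(dim=0)  # pool visits into a zone descriptor
        return self.classifier(zone_feat)    # one score per interaction

# Usage: score interactions for a zone summarized by 5 visit features.
head = ZoneAffordanceHead()
scores = head(torch.randn(5, 2048))
probs = torch.sigmoid(scores)  # per-interaction likelihoods
```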