
2001.04583.md

EGO-TOPO: Environment Affordances from Egocentric Video, ArXiv'20 {project page} {paper} {notes}

Tushar Nagarajan, Yanghao Li, Christoph Feichtenhofer, Kristen Grauman

Objective
  • Build an understanding of the egocentric space by leveraging the fact that egocentric video depicts a persistent physical environment

  • re-organize egocentric video into “visits” to known zones, rather than treating it as a series of unconnected clips

Demonstrate application of their model on two tasks:

  • inferring likely object interactions in novel views more robustly

  • long-term action anticipation (state-of-the-art results)

Datasets
Method
  • localize commonly visited spaces in the video as 'zones'

    • train a siamese network that takes pairs of frames as input and predicts whether they show the same zone (see the sketch after this list)
    • a frame pair is labeled similar if
      • the frames are close in time, or
      • a consistent homography exists between them, supported on at least 10 keypoints (to capture repeated backgrounds); SuperPoint keypoint descriptors are used to estimate homographies, and Euclidean distance between pretrained ResNet-152 features measures visual similarity
  • generate an action localization graph (construction sketched after this list)

    • visits: clips from the egocentric video at that location
    • nodes: groups of frames with high similarity
    • edges: connect temporally adjacent nodes
    • uncertain frames are ignored
    • aggregate nodes across videos using an action classifier + object detector, and link nodes with functional similarity
  • Affordances

    • the action localization graph makes it possible to predict action/object pairs at locations where they have never been observed, by transferring them from functionally similar zones seen elsewhere (see the sketch after this list)
    • affordance prediction is framed as multi-label classification over the set of possible interactions
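
A minimal sketch of the frame-pair labelling that supervises the zone-similarity (siamese) network above. The temporal window, the SuperPoint/matching helper functions and the RANSAC threshold are illustrative assumptions; only the "consistent on at least 10 keypoints" criterion comes from the notes.

```python
import cv2
import numpy as np

TEMPORAL_WINDOW = 2.0   # seconds; assumed threshold for "close in time"
MIN_KEYPOINTS = 10      # homography must be consistent on at least 10 keypoints

def superpoint_keypoints(frame):
    """Placeholder: return (keypoints Nx2, descriptors NxD) from SuperPoint."""
    raise NotImplementedError

def match_descriptors(desc_a, desc_b):
    """Placeholder: mutual nearest-neighbour matching, returns (i, j) index pairs."""
    raise NotImplementedError

def same_zone(frame_a, time_a, frame_b, time_b):
    """Label a frame pair as positive (same zone) or negative."""
    # positive if the frames are close in time ...
    if abs(time_a - time_b) <= TEMPORAL_WINDOW:
        return True
    # ... or if a consistent homography relates them (repeated background)
    kps_a, desc_a = superpoint_keypoints(frame_a)
    kps_b, desc_b = superpoint_keypoints(frame_b)
    matches = match_descriptors(desc_a, desc_b)
    if len(matches) < MIN_KEYPOINTS:
        return False
    src = np.float32([kps_a[i] for i, _ in matches])
    dst = np.float32([kps_b[j] for _, j in matches])
    H, inliers = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)
    return H is not None and int(inliers.sum()) >= MIN_KEYPOINTS
```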
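
A minimal sketch of building the per-video localization graph once every frame has been assigned a zone id by the similarity network (or None when uncertain). The use of networkx, the fps constant and the visit representation are illustrative choices, not the authors' implementation.

```python
import networkx as nx

def build_zone_graph(zone_ids, fps=30):
    """zone_ids: per-frame zone id, or None for uncertain frames (ignored)."""
    G = nx.Graph()
    prev_zone, visit_start = None, None
    for t, z in enumerate(zone_ids):
        if z is None:                       # uncertain frames are ignored
            continue
        G.add_node(z)
        G.nodes[z].setdefault("visits", [])
        if z != prev_zone:
            if prev_zone is not None:
                # close the previous visit and connect temporally adjacent zones
                G.nodes[prev_zone]["visits"].append((visit_start / fps, t / fps))
                G.add_edge(prev_zone, z)
            visit_start, prev_zone = t, z
    if prev_zone is not None:               # close the last open visit
        G.nodes[prev_zone]["visits"].append((visit_start / fps, len(zone_ids) / fps))
    return G
```

Aggregating graphs across videos would then merge or link nodes whose detected actions/objects indicate functional similarity; that step is omitted here.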
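
A minimal sketch of affordance prediction as multi-label classification over a zone's visual feature: one independent probability per possible interaction, trained with binary cross-entropy against the interactions observed at that zone and at zones linked to it. The vocabulary size, feature dimension and the plain linear head are simplifying assumptions; the paper reasons over the graph rather than over isolated zone features.

```python
import torch
import torch.nn as nn

NUM_INTERACTIONS = 120   # assumed size of the interaction (action/object) vocabulary
FEAT_DIM = 2048          # assumed zone feature dimension (e.g. a ResNet embedding)

class AffordanceHead(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(FEAT_DIM, NUM_INTERACTIONS)

    def forward(self, zone_feat):
        # one independent probability per possible interaction at this zone
        return torch.sigmoid(self.fc(zone_feat))

head = AffordanceHead()
loss_fn = nn.BCELoss()
probs = head(torch.randn(4, FEAT_DIM))                        # batch of 4 zone features
targets = torch.randint(0, 2, (4, NUM_INTERACTIONS)).float()  # observed interactions
loss = loss_fn(probs, targets)
```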
Experiments