
2001.04583.md

EGO-TOPO: Environment Affordances from Egocentric Video, ArXiv'20 {project page} {paper} {notes}

Tushar Nagarajan, Yanghao Li, Christoph Feichtenhofer, Kristen Grauman

Objective
  • Build an understanding of the egocentric space by leveraging the fact that egocentric video depicts a persistent physical environment

  • re-organize egocentric video into “visits” to known zones, rather than treating it as a series of unconnected clips

Demonstrate application of their model on two tasks:

  • inferring likely object interactions in novel views more robustly

  • long-term action anticipation (state-of-the-art results)

Datasets
Method
  • localize commonly visited spaces in the video as 'zones'

    • train a siamese network that takes pairs of frames as input and predicts whether they show the same zone (see the sketch after this list)
    • a frame pair is labeled similar if
      • the frames are close in time, or
      • a consistent homography exists between them, supported on at least 10 keypoints (to capture repeated backgrounds); SuperPoint keypoint descriptors are used to estimate homographies, and Euclidean distance between pretrained ResNet-152 features measures visual similarity
  • generate an action localization graph (construction sketched after this list)

    • visits: clips from the egocentric video at that location
    • nodes: groups of frames with high similarity
    • edges: connect temporally adjacent nodes
    • uncertain frames are ignored
    • aggregate nodes across videos using an action classifier + object detector, and link nodes with functional similarity
  • Affordances

    • the action localization graph makes it possible to predict action/object pairs at locations where they have never been observed, by transferring them from functionally similar zones seen elsewhere (see the sketch after this list)
    • affordance prediction is framed as multi-label classification over the set of possible interactions
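
A minimal sketch of the frame-pair labelling that supervises the zone-similarity (siamese) network above. The temporal window, the SuperPoint/matching helper functions and the RANSAC threshold are illustrative assumptions; only the "consistent on at least 10 keypoints" criterion comes from the notes.

```python
import cv2
import numpy as np

TEMPORAL_WINDOW = 2.0   # seconds; assumed threshold for "close in time"
MIN_KEYPOINTS = 10      # homography must be consistent on at least 10 keypoints

def superpoint_keypoints(frame):
    """Placeholder: return (keypoints Nx2, descriptors NxD) from SuperPoint."""
    raise NotImplementedError

def match_descriptors(desc_a, desc_b):
    """Placeholder: mutual nearest-neighbour matching, returns (i, j) index pairs."""
    raise NotImplementedError

def same_zone(frame_a, time_a, frame_b, time_b):
    """Label a frame pair as positive (same zone) or negative."""
    # positive if the frames are close in time ...
    if abs(time_a - time_b) <= TEMPORAL_WINDOW:
        return True
    # ... or if a consistent homography relates them (repeated background)
    kps_a, desc_a = superpoint_keypoints(frame_a)
    kps_b, desc_b = superpoint_keypoints(frame_b)
    matches = match_descriptors(desc_a, desc_b)
    if len(matches) < MIN_KEYPOINTS:
        return False
    src = np.float32([kps_a[i] for i, _ in matches])
    dst = np.float32([kps_b[j] for _, j in matches])
    H, inliers = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)
    return H is not None and int(inliers.sum()) >= MIN_KEYPOINTS
```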
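
A minimal sketch of building the per-video localization graph once every frame has been assigned a zone id by the similarity network (or None when uncertain). The use of networkx, the fps constant and the visit representation are illustrative choices, not the authors' implementation.

```python
import networkx as nx

def build_zone_graph(zone_ids, fps=30):
    """zone_ids: per-frame zone id, or None for uncertain frames (ignored)."""
    G = nx.Graph()
    prev_zone, visit_start = None, None
    for t, z in enumerate(zone_ids):
        if z is None:                       # uncertain frames are ignored
            continue
        G.add_node(z)
        G.nodes[z].setdefault("visits", [])
        if z != prev_zone:
            if prev_zone is not None:
                # close the previous visit and connect temporally adjacent zones
                G.nodes[prev_zone]["visits"].append((visit_start / fps, t / fps))
                G.add_edge(prev_zone, z)
            visit_start, prev_zone = t, z
    if prev_zone is not None:               # close the last open visit
        G.nodes[prev_zone]["visits"].append((visit_start / fps, len(zone_ids) / fps))
    return G
```

Aggregating graphs across videos would then merge or link nodes whose detected actions/objects indicate functional similarity; that step is omitted here.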
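
A minimal sketch of affordance prediction as multi-label classification over a zone's visual feature: one independent probability per possible interaction, trained with binary cross-entropy against the interactions observed at that zone and at zones linked to it. The vocabulary size, feature dimension and the plain linear head are simplifying assumptions; the paper reasons over the graph rather than over isolated zone features.

```python
import torch
import torch.nn as nn

NUM_INTERACTIONS = 120   # assumed size of the interaction (action/object) vocabulary
FEAT_DIM = 2048          # assumed zone feature dimension (e.g. a ResNet embedding)

class AffordanceHead(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(FEAT_DIM, NUM_INTERACTIONS)

    def forward(self, zone_feat):
        # one independent probability per possible interaction at this zone
        return torch.sigmoid(self.fc(zone_feat))

head = AffordanceHead()
loss_fn = nn.BCELoss()
probs = head(torch.randn(4, FEAT_DIM))                        # batch of 4 zone features
targets = torch.randint(0, 2, (4, NUM_INTERACTIONS)).float()  # observed interactions
loss = loss_fn(probs, targets)
```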
Experiments