You2Me: Inferring Body Pose in Egocentric Video via First and Second Person Interactions, ArXiv'19 {notes} {paper} {project page} {missing dataset?}
Evonne Ng, Donglai Xiang, Hanbyul Joo, Kristen Grauman
Leverage visual cues from interactions with other people to infer the pose of the person equipped with a first-person camera.
Supervised learning of ego-poses from videos.
You2Me dataset
- 42 two-minute sequences from one-on-one interactions
- 10 different individuals
- chest-mounted GoPro
- four classes of activities: hand games, tossing and catching, sports, and conversation
- in 50% of frames, first-person body parts are not visible
- 2 capture methods
- Panoptic Studio (for accurate 3D poses): allows reconstructing ground-truth poses for both the camera wearer and the interacting subject
- Kinect (for background variability)
Revisits and improves "Seeing Invisible Poses" (CVPR'17).
Pose is user-centric and scale-normalized in the same way (a minimal normalization sketch follows).
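A minimal sketch of user-centric, scale-normalized pose preprocessing. The exact joint indices and the choice of reference length are assumptions for illustration, not details taken from the paper or the notes.

```python
# Sketch of user-centric, scale-normalized pose preprocessing.
# Joint indices and the torso-length reference are assumptions.
import numpy as np

def normalize_pose(joints, root_idx=0, neck_idx=1):
    """joints: (J, 3) array of 3D joint positions for one frame."""
    centered = joints - joints[root_idx]             # user-centric: root joint at origin
    torso_len = np.linalg.norm(centered[neck_idx])   # assumed reference scale (root-to-neck)
    return centered / (torso_len + 1e-8)             # scale-normalized
```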
-
Video features
- Uses the same motion features as CVPR'17 (scene homography, which relates to camera rotation); camera rotation is informative here because the camera is chest-mounted, whereas it would be less informative for head-mounted cameras.
- Uses ResNet-152 as the visual feature extractor for the static scene.
- OpenPose 2D keypoints of the second person, flattened into a vector; when the second person is absent, the keypoints are encoded as zeros (a feature-assembly sketch follows this list).
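A sketch of how the per-frame input vector could be assembled from these three cues. The feature dimensions and the helper functions `compute_homography_features`, `resnet152_features`, and `openpose_keypoints` are hypothetical placeholders, not names from the paper or its code.

```python
# Sketch of per-frame feature assembly.
# compute_homography_features, resnet152_features, openpose_keypoints are
# hypothetical helpers standing in for the three cues described above.
import numpy as np

NUM_KP = 25  # assumed OpenPose body keypoint count

def frame_features(prev_frame, frame):
    motion = compute_homography_features(prev_frame, frame)  # camera-rotation cue
    scene = resnet152_features(frame)                        # static scene features
    kp = openpose_keypoints(frame)                           # (NUM_KP, 2) or None
    if kp is None:
        kp_vec = np.zeros(NUM_KP * 2)                        # absent 2nd person -> zeros
    else:
        kp_vec = kp.reshape(-1)                              # flattened 2D keypoints
    return np.concatenate([motion, scene, kp_vec])
```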
-
LSTM for predicting temporal pose sequences (see the sketch after this list)
- outputs classes of pose clusters
- also uses the previous pose as input
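A minimal PyTorch sketch of such a recurrent pose decoder. The hidden size, the number of pose clusters, and the use of an embedding for the previous pose are assumptions for illustration.

```python
# Sketch of an LSTM that maps per-frame features + previous pose cluster to
# pose-cluster logits. Layer sizes and cluster count are assumptions.
import torch
import torch.nn as nn

class EgoPoseLSTM(nn.Module):
    def __init__(self, feat_dim, num_clusters=300, hidden=512):
        super().__init__()
        self.prev_pose_emb = nn.Embedding(num_clusters, 64)  # previous pose as input
        self.lstm = nn.LSTM(feat_dim + 64, hidden, batch_first=True)
        self.classifier = nn.Linear(hidden, num_clusters)    # pose-cluster logits

    def forward(self, feats, prev_pose_ids):
        # feats: (B, T, feat_dim), prev_pose_ids: (B, T) cluster indices
        x = torch.cat([feats, self.prev_pose_emb(prev_pose_ids)], dim=-1)
        out, _ = self.lstm(x)
        return self.classifier(out)                          # (B, T, num_clusters)
```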
Show that 2nd-person pose features are more important than static scene (ResNet) features.
Show that using ground-truth poses for the 2nd person is only marginally better than using estimated ones.