You2Me: Inferring Body Pose in Egocentric Video via First and Second Person Interactions, ArXiv'19 {notes} {paper} {project page} {missing dataset?}
Evonne Ng, Donglai Xiang, Hanbyul Joo, Kristen Grauman
Leverage visual cues from interactions with other people to infer the pose of the person equipped with a first-person camera.
Supervised learning of ego-poses from videos.
You2Me dataset
- 42 two-minute sequences from one-on-one interactions
- 10 different individuals
- chest-mounted GoPro
- four classes of activities: hand games, tossing and catching, sports, and conversation
- in 50% of frames, first-person body parts are not visible
- 2 capture methods
- Panoptic Studio (for accurate 3D poses): allows reconstructing ground-truth poses for both the camera wearer and the interacting subject
- Kinect (for background variability)
Revisits and improves "Seeing Invisible Poses" (CVPR'17).
Pose is user-centric and scale-normalized in the same way (a minimal normalization sketch follows).
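A minimal sketch of user-centric, scale-normalized pose preprocessing. The exact joint indices and the choice of reference length are assumptions for illustration, not details taken from the paper or the notes.

```python
# Sketch of user-centric, scale-normalized pose preprocessing.
# Joint indices and the torso-length reference are assumptions.
import numpy as np

def normalize_pose(joints, root_idx=0, neck_idx=1):
    """joints: (J, 3) array of 3D joint positions for one frame."""
    centered = joints - joints[root_idx]             # user-centric: root joint at origin
    torso_len = np.linalg.norm(centered[neck_idx])   # assumed reference scale (root-to-neck)
    return centered / (torso_len + 1e-8)             # scale-normalized
```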
-
Video features
- Uses the same motion features as CVPR'17 (scene homography, which relates to camera rotation); camera rotation is informative here because the camera is chest-mounted, whereas it would be less informative for head-mounted cameras.
- Uses ResNet-152 as the visual feature extractor for the static scene.
- OpenPose 2D keypoints of the second person, flattened into a vector; when the second person is absent, the keypoints are encoded as zeros (a feature-assembly sketch follows this list).
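A sketch of how the per-frame input vector could be assembled from these three cues. The feature dimensions and the helper functions `compute_homography_features`, `resnet152_features`, and `openpose_keypoints` are hypothetical placeholders, not names from the paper or its code.

```python
# Sketch of per-frame feature assembly.
# compute_homography_features, resnet152_features, openpose_keypoints are
# hypothetical helpers standing in for the three cues described above.
import numpy as np

NUM_KP = 25  # assumed OpenPose body keypoint count

def frame_features(prev_frame, frame):
    motion = compute_homography_features(prev_frame, frame)  # camera-rotation cue
    scene = resnet152_features(frame)                        # static scene features
    kp = openpose_keypoints(frame)                           # (NUM_KP, 2) or None
    if kp is None:
        kp_vec = np.zeros(NUM_KP * 2)                        # absent 2nd person -> zeros
    else:
        kp_vec = kp.reshape(-1)                              # flattened 2D keypoints
    return np.concatenate([motion, scene, kp_vec])
```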
-
LSTM for predicting temporal pose sequences (see the sketch after this list)
- outputs classes of pose clusters
- also uses the previous pose as input
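A minimal PyTorch sketch of such a recurrent pose decoder. The hidden size, the number of pose clusters, and the use of an embedding for the previous pose are assumptions for illustration.

```python
# Sketch of an LSTM that maps per-frame features + previous pose cluster to
# pose-cluster logits. Layer sizes and cluster count are assumptions.
import torch
import torch.nn as nn

class EgoPoseLSTM(nn.Module):
    def __init__(self, feat_dim, num_clusters=300, hidden=512):
        super().__init__()
        self.prev_pose_emb = nn.Embedding(num_clusters, 64)  # previous pose as input
        self.lstm = nn.LSTM(feat_dim + 64, hidden, batch_first=True)
        self.classifier = nn.Linear(hidden, num_clusters)    # pose-cluster logits

    def forward(self, feats, prev_pose_ids):
        # feats: (B, T, feat_dim), prev_pose_ids: (B, T) cluster indices
        x = torch.cat([feats, self.prev_pose_emb(prev_pose_ids)], dim=-1)
        out, _ = self.lstm(x)
        return self.classifier(out)                          # (B, T, num_clusters)
```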
Show that 2nd-person pose features are more important than static scene (ResNet) features.
Show that using ground-truth poses for the 2nd person is only marginally better than using estimated ones.