1705.01389
[arxiv 1705.01389] Learning to Estimate 3D Hand Pose from Single RGB Images [PDF] [synthetic dataset] [notes]
Christian Zimmermann, Thomas Brox
read 04/05/2017
Three networks are used sequentially (see the sketch after this list):
- hand localization through segmentation
- localization of the 21 2D keypoints on the hand
- deduction of the 3D hand pose from the 2D keypoints
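A minimal PyTorch sketch of that three-stage pipeline. The module names, tensor shapes, and the masking shortcut in stage 2 are assumptions for illustration, not the paper's exact architecture (the paper crops around the segmented hand before keypoint prediction):

```python
import torch
import torch.nn as nn


class HandPosePipeline(nn.Module):
    """Sketch of the three sequential stages; the three sub-networks are
    passed in as arguments and their internals are placeholders here."""

    def __init__(self, seg_net: nn.Module, pose_net: nn.Module, lift_net: nn.Module):
        super().__init__()
        self.seg_net = seg_net    # stage 1: hand segmentation / localization
        self.pose_net = pose_net  # stage 2: 2D score maps for the 21 keypoints
        self.lift_net = lift_net  # stage 3: 2D evidence -> 3D hand pose

    def forward(self, image: torch.Tensor):
        # 1) localize the hand: per-pixel hand/background probability
        hand_mask = torch.sigmoid(self.seg_net(image))        # (B, 1, H, W)

        # 2) predict one score map per keypoint; masking the image stands in
        #    for the crop-around-the-hand step to keep this sketch short
        heatmaps = self.pose_net(image * hand_mask)            # (B, 21, h, w)

        # 3) lift the 2D evidence to 3D keypoint coordinates
        pose_3d = self.lift_net(heatmaps.flatten(1))           # (B, 63)
        return hand_mask, heatmaps, pose_3d.view(-1, 21, 3)
```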
Synthesized dataset (freely available Mixamo human models rendered with Blender): 41,258 training and 2,728 testing images at 320x320 resolution, with 21 keypoints and 33 segmentation masks
Interesting analysis of the 2D-to-3D network's predictions when given more or fewer keypoints, which shows what the network predicts depending on the amount of input data
No existing dataset for 3D hand poses with enough variability ==> a synthetic one was created for this article
NYU is not a suitable dataset since only registered images are provided
Evaluates on Dexter
This paper separates the viewpoint estimation from the estimation of the keypoint positions in the canonical frame, which implies that the viewpoint is not used to estimate the coordinates.
Which one is more robust? If the viewpoint, could this knowledge be used to estimate the coordinates?
But both are minimized jointly (as an unweighted sum)
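A minimal sketch of that joint objective and of the viewpoint/canonical decomposition, assuming squared-L2 terms and a rotation-matrix parameterization of the viewpoint (both are assumptions for this sketch, not details taken from the paper):

```python
import torch


def joint_loss(canon_pred, canon_gt, rot_pred, rot_gt):
    """Unweighted sum of the two objectives noted above: canonical-frame
    keypoint coordinates and the viewpoint rotation."""
    loss_canon = ((canon_pred - canon_gt) ** 2).sum(dim=(1, 2)).mean()
    loss_view = ((rot_pred - rot_gt) ** 2).sum(dim=(1, 2)).mean()
    return loss_canon + loss_view  # no weighting between the two terms


def reconstruct_coords(canon_pred, rot_pred):
    """Map canonical-frame keypoints back to the camera frame by applying
    the predicted viewpoint rotation (batched matrix-vector product)."""
    # canon_pred: (B, 21, 3), rot_pred: (B, 3, 3)
    return torch.einsum('bij,bkj->bki', rot_pred, canon_pred)
```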
Also, the synthesized data contains no manipulation actions and therefore few occlusion examples, so the method almost always fails in such cases