
1711.07399


CVPR 2018

[arxiv 1711.07399] V2V-PoseNet: Voxel-to-Voxel Prediction Network for Accurate 3D Hand and Human Pose Estimation from a Single Depth Map [PDF] [code] [notes]

Gyeongsik Moon, Kyoung Mu Lee

read 02/07/2018

Objective

Real-time, accurate 3D hand pose estimation from a single depth map.

Compares different input/output representations and shows that the voxel-to-voxel representation performs best.

Placed first in the HANDS 2017 challenge.

Synthesis

3D voxel grids as both input and output ease the network's job (no need to learn to compensate for perspective distortion).

Details

Various inputs:

  • 2D depth map, h x w x 1
  • voxel grid, h x w x d, obtained by reprojecting the depth map from 2D to 3D and discretizing the resulting points (88x88x88, or 48x48x48 under GPU memory limitations); see the sketch below
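
To make the reprojection + discretization step concrete, here is a minimal NumPy sketch (not the authors' code); the pinhole intrinsics fx, fy, cx, cy, the cube size, and all names/defaults are illustrative assumptions:

```python
import numpy as np

def depth_to_voxels(depth, fx, fy, cx, cy, ref_xyz, cube_mm=300.0, res=88):
    """Reproject a (H, W) depth map (in mm) to 3D camera-space points, then
    discretize the points falling inside a cube of side cube_mm centered on
    ref_xyz into a binary (res, res, res) occupancy grid."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth.astype(np.float32)
    valid = z > 0  # drop missing-depth pixels
    # Pinhole back-projection: pixel (u, v) + depth -> (x, y, z) in mm.
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    pts = np.stack([x[valid], y[valid], z[valid]], axis=1)
    # Shift into the cube's frame and quantize to voxel indices (x, y, z axis order).
    idx = np.floor((pts - (np.asarray(ref_xyz) - cube_mm / 2)) / (cube_mm / res)).astype(int)
    inside = np.all((idx >= 0) & (idx < res), axis=1)
    grid = np.zeros((res, res, res), dtype=np.float32)
    grid[tuple(idx[inside].T)] = 1.0
    return grid
```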

Various outputs:

  • coordinates nb_joints x 3
  • 3D heatmaps (Gaussians); see the sketch below
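
Likewise, a minimal sketch of building the 3D Gaussian target volumes, assuming the joints are already expressed in (x, y, z) voxel coordinates of the grid above; σ is a hyperparameter and the default here is illustrative:

```python
import numpy as np

def joint_heatmaps(joint_vox, res=88, sigma=1.7):
    """Build (J, res, res, res) target volumes: one 3D Gaussian per joint,
    centered on the joint's (possibly fractional) voxel coordinates.
    joint_vox is (J, 3) in (x, y, z) voxel units; sigma is in voxels."""
    coords = np.arange(res, dtype=np.float32)
    # Axis order (x, y, z), matching the occupancy grid above.
    xx, yy, zz = np.meshgrid(coords, coords, coords, indexing="ij")
    maps = np.empty((len(joint_vox), res, res, res), dtype=np.float32)
    for j, (x, y, z) in enumerate(joint_vox):
        d2 = (xx - x) ** 2 + (yy - y) ** 2 + (zz - z) ** 2
        maps[j] = np.exp(-d2 / (2.0 * sigma ** 2))
    return maps
```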

Structure

3D convnet with an hourglass-like encoder-decoder structure (see the sketch below).
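
A toy PyTorch sketch of the voxel-to-voxel idea: a 3D encoder-decoder that maps the occupancy grid to one likelihood volume per joint at the same resolution. Depth, channel widths, and the single skip connection are illustrative, not the paper's exact architecture:

```python
import torch
import torch.nn as nn

def block(c_in, c_out):
    return nn.Sequential(
        nn.Conv3d(c_in, c_out, kernel_size=3, padding=1),
        nn.BatchNorm3d(c_out),
        nn.ReLU(inplace=True),
    )

class TinyV2V(nn.Module):
    """Hourglass-shaped 3D CNN: downsample, process, upsample back to the
    input resolution, with a skip connection at full resolution."""
    def __init__(self, n_joints=21):
        super().__init__()
        self.enc = block(1, 16)
        self.down = nn.Sequential(nn.MaxPool3d(2), block(16, 32))
        self.up = nn.Sequential(
            nn.ConvTranspose3d(32, 16, kernel_size=2, stride=2),
            nn.BatchNorm3d(16),
            nn.ReLU(inplace=True),
        )
        self.head = nn.Conv3d(16, n_joints, kernel_size=1)  # per-voxel scores

    def forward(self, vox):           # vox: (B, 1, D, D, D)
        skip = self.enc(vox)
        out = self.up(self.down(skip))
        return self.head(out + skip)  # (B, n_joints, D, D, D) likelihood volumes

# heatmaps = TinyV2V()(torch.zeros(1, 1, 48, 48, 48))
```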

Experiments

Input/output comparison

==> 2D depth maps are worse (as input representations) than the 3D voxelized grid

==> direct 3D coordinates are worse than per-voxel likelihoods as the output representation; the main improvement comes from switching the output from coordinate regression to per-voxel likelihoods
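
The per-voxel likelihood output still has to be decoded back to coordinates. A minimal sketch, assuming a simple per-joint argmax and the same cube/ref_xyz conventions as the voxelization sketch above:

```python
import numpy as np

def heatmaps_to_joints(maps, ref_xyz, cube_mm=300.0):
    """Decode (J, res, res, res) likelihood volumes into (J, 3) world
    coordinates in mm: take the per-joint argmax voxel, then invert the
    voxelization mapping."""
    n_joints, res = maps.shape[0], maps.shape[1]
    flat = maps.reshape(n_joints, -1).argmax(axis=1)
    # (J, 3) integer voxel indices in (x, y, z) axis order.
    idx = np.stack(np.unravel_index(flat, (res, res, res)), axis=1)
    centers = (idx.astype(np.float32) + 0.5) * (cube_mm / res)  # voxel centers, cube frame
    return np.asarray(ref_xyz) + centers - cube_mm / 2.0
```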

Tricks

  • get a better crop centered on the hand

    • from the simply thresholded depth map, regress the offset between the naive reference point and the ground-truth reference point (~center of the hand)
    • improves average 3D distance error from 11 to 9 mm
  • epoch ensembling:

    • average predictions from models saved at different epochs (see the sketch after this list)
    • further improves from 9 to 8 mm
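
A minimal sketch of the epoch ensemble at test time, assuming the averaging is done on the predicted likelihood volumes before decoding; the model class, checkpoint paths, and file format are illustrative:

```python
import torch

@torch.no_grad()
def ensemble_heatmaps(model, ckpt_paths, vox):
    """Average the likelihood volumes predicted by checkpoints saved at
    several (late) epochs, then decode the averaged volumes once."""
    total = None
    for path in ckpt_paths:  # e.g. ["epoch_8.pt", "epoch_9.pt"] (illustrative)
        model.load_state_dict(torch.load(path))  # assumes plain state_dict files
        model.eval()
        pred = model(vox)  # (B, n_joints, D, D, D)
        total = pred if total is None else total + pred
    return total / len(ckpt_paths)
```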

Notes

Time bottleneck

Said to be the voxelization step (23 ms vs 5 ms for the forward pass), which sounds like a lot given the simplicity of the task!
