[arxiv 1711.07399] V2V-PoseNet: Voxel-to-Voxel Prediction Network for Accurate 3D Hand and Human Pose Estimation from a Single Depth Map [PDF] [code] [notes]
Gyeongsik Moon, Kyoung Mu Lee
read 02/07/2018
Real-time, accurate 3D hand pose estimation from a single depth map.
Compares different input/output representations and shows that the voxel-to-voxel representation performs best.
Placed first in the HANDS 2017 challenge.
3D voxels as both input and output ease the network's job (no need to learn perspective distortion):
- 2D depth map: h x w x 1
- voxel grid: h x w x d, obtained by reprojecting the depth map from 2D to 3D and discretizing into voxels (88x88x88, or 48x48x48 under GPU memory limitations); see the sketch after this list
- 3D coordinates: nb_joints x 3
- 3D heatmaps (one Gaussian centered on each joint)
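
A minimal sketch of the depth-to-voxel conversion, assuming a pinhole camera with intrinsics fx, fy, cx, cy and a hypothetical cube size of 250 mm; none of these values are taken from the paper's code:

```python
import numpy as np

def depth_to_voxels(depth, ref_point, fx, fy, cx, cy, cube_mm=250.0, res=88):
    """Reproject a depth map (H x W, in mm) into 3D camera space, then fill
    a res^3 occupancy grid centered on ref_point (np.array [x, y, z], mm)."""
    h, w = depth.shape
    us, vs = np.meshgrid(np.arange(w), np.arange(h))
    valid = depth > 0
    # Back-project each valid pixel with the pinhole model.
    x = (us - cx) * depth / fx
    y = (vs - cy) * depth / fy
    pts = np.stack([x[valid], y[valid], depth[valid]], axis=1)  # N x 3
    # Shift into a cube centered on the reference point, then discretize.
    voxel_size = cube_mm / res
    idx = np.floor((pts - ref_point + cube_mm / 2) / voxel_size).astype(int)
    keep = np.all((idx >= 0) & (idx < res), axis=1)
    grid = np.zeros((res, res, res), dtype=np.float32)
    grid[idx[keep, 0], idx[keep, 1], idx[keep, 2]] = 1.0  # binary occupancy
    return grid
```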
3D convnet with an hourglass-like encoder-decoder architecture (minimal sketch below).
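
A minimal PyTorch sketch of such a voxel-to-voxel hourglass; the depth, channel widths, and layer choices are illustrative, not the paper's exact design:

```python
import torch
import torch.nn as nn

class Voxel2VoxelNet(nn.Module):
    """Tiny hourglass-style 3D encoder-decoder: occupancy grid in,
    one 3D heatmap per joint out (illustrative, not the paper's net)."""
    def __init__(self, n_joints, ch=32):
        super().__init__()
        self.enc1 = nn.Sequential(nn.Conv3d(1, ch, 3, padding=1),
                                  nn.BatchNorm3d(ch), nn.ReLU())
        self.down = nn.MaxPool3d(2)
        self.enc2 = nn.Sequential(nn.Conv3d(ch, 2 * ch, 3, padding=1),
                                  nn.BatchNorm3d(2 * ch), nn.ReLU())
        self.up = nn.ConvTranspose3d(2 * ch, ch, 2, stride=2)
        self.dec = nn.Sequential(nn.Conv3d(2 * ch, ch, 3, padding=1),
                                 nn.BatchNorm3d(ch), nn.ReLU())
        self.head = nn.Conv3d(ch, n_joints, 1)

    def forward(self, x):                 # x: B x 1 x D x H x W voxels
        e1 = self.enc1(x)
        e2 = self.enc2(self.down(e1))
        d = self.dec(torch.cat([self.up(e2), e1], dim=1))  # skip connection
        return self.head(d)               # B x n_joints x D x H x W heatmaps
```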
==> 2D depth maps are a worse input representation than the 3D voxelized grid
==> regressing 3D coordinates is worse than predicting per-voxel likelihoods, and the main improvement comes from switching the output from coordinate regression to per-voxel likelihoods
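
To make the per-voxel likelihood output concrete, a sketch of building a 3D Gaussian target around a ground-truth joint and decoding a predicted heatmap with a plain argmax; sigma and the decoding choice are my assumptions, not values from the paper:

```python
import numpy as np

def gaussian_target(joint_vox, res=88, sigma=1.7):
    """3D Gaussian heatmap centered at joint_vox (voxel coords, length 3)."""
    g = np.arange(res)
    zz, yy, xx = np.meshgrid(g, g, g, indexing='ij')
    d2 = ((zz - joint_vox[0]) ** 2 + (yy - joint_vox[1]) ** 2
          + (xx - joint_vox[2]) ** 2)
    return np.exp(-d2 / (2 * sigma ** 2)).astype(np.float32)

def decode_heatmap(heatmap):
    """Voxel coordinates of the likelihood peak for one joint."""
    return np.unravel_index(np.argmax(heatmap), heatmap.shape)
```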
Get a better crop centered on the hand:
- take the simply thresholded depth map and regress the offset between the naive reference point and the ground-truth one (~center of the hand); sketch after this list
- improves the average 3D distance error from 11 to 9 mm
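
A hedged sketch of that refinement step: a small 2D CNN (the architecture is my own illustration) regressing the 3D offset from the thresholded-depth reference point to the ground-truth one:

```python
import torch
import torch.nn as nn

class RefPointRefiner(nn.Module):
    """Illustrative 2D CNN regressing a (dx, dy, dz) offset in mm from a
    thresholded depth crop; not the paper's exact architecture."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.AdaptiveAvgPool2d(1))
        self.offset = nn.Linear(32, 3)

    def forward(self, depth_crop):        # B x 1 x H x W thresholded depth
        return self.offset(self.features(depth_crop).flatten(1))

# refined reference point = naive centroid + predicted offset:
# ref_point = centroid + refiner(crop)
```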
Epoch ensembling (sketch below):
- average the predictions of model snapshots saved at different training epochs
- further improvement from 9 to 8 mm
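
Assuming the ensemble averages the predicted 3D heatmaps before decoding (a hedged reading, not confirmed by the paper), a minimal sketch:

```python
import torch

@torch.no_grad()
def ensemble_predict(models, voxels):
    """Average per-voxel likelihoods over checkpoints from several epochs."""
    heatmaps = torch.stack([m(voxels) for m in models])  # E x B x J x D x H x W
    return heatmaps.mean(dim=0)
```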
The runtime bottleneck is said to be voxelization (23 ms vs 5 ms for the forward pass), which sounds like a lot given the simplicity of the task!