1610.04889

ECCV 2016

[arxiv 1610.04889] Real-time Joint Tracking of a Hand Manipulating an Object from RGB-D Input [PDF] [project page] [dataset]

Srinath Sridhar, Franziska Mueller, Michael Zollhöfer, Dan Casas, Antti Oulasvirta, Christian Theobalt

Objective

Real-time joint tracking of a hand and the object it manipulates, using an RGB-D camera

Synthesis

Uses a 3D articulated Gaussian mixture alignment strategy

Enforces contact points between hand and object with a regularizer, to take advantage of the physics of grasps

Uses a multi-layer random forest hand part classifier to guide the optimization: it segments hand and object and classifies hand parts

Pipeline

Preprocessing

Classify depth pixels as hand or object, and hand pixels into hand parts, using a two-layer random forest that takes occlusion into account
- training of the forest (three trees, each trained on a random distinct subset) is done on synthetic images created from a synthetic 3D model that is fit to a real image; sample object positions are generated between the thumb and another finger
- viewpoint selection: 4 views (front, back, thumb, little finger); the view is selected based on the best match with the previous frame's estimate
- first layer classifies hand and arm pixels
- second layer uses the hand pixels and further classifies them into hand parts (6 classes: fingers and palm)
- input: color and depth frames
- hand-object segmentation removes the object from the depth map based on RGB cues
- output: probability histogram that encodes class likelihood (object class: 1)
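
The two-layer cascade above can be sketched per pixel as follows. This is a minimal illustration, not the paper's implementation: the `Tree` callables are hypothetical stand-ins for trained trees, the scalar `feature` replaces real depth features, and the 0.5 hand threshold is an assumption. Averaging the leaf histograms of the trees is the usual way a forest produces its posterior.

```python
from typing import Callable, Dict, List, Optional

Histogram = Dict[str, float]
# Hypothetical stand-in for a trained tree: maps a pixel feature to a leaf histogram.
Tree = Callable[[float], Histogram]


def forest_posterior(trees: List[Tree], feature: float) -> Histogram:
    """Average the leaf histograms of all trees in one forest layer."""
    posterior: Histogram = {}
    for tree in trees:
        for label, p in tree(feature).items():
            posterior[label] = posterior.get(label, 0.0) + p / len(trees)
    return posterior


def classify_pixel(feature: float,
                   layer1: List[Tree],
                   layer2: List[Tree],
                   hand_threshold: float = 0.5) -> Optional[Histogram]:
    """Layer 1 separates hand from arm; layer 2 labels hand pixels with a part histogram."""
    first = forest_posterior(layer1, feature)
    if first.get("hand", 0.0) < hand_threshold:
        return None  # not a hand pixel, so the second layer is skipped
    return forest_posterior(layer2, feature)
```

Pixels rejected by the first layer never reach the second, so the hand-part histogram is only computed where it is meaningful.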

Pose estimation

Initialization

Parametrization of the articulated motion of the human hand: 26 DOF (20 joint angles and a 6-DOF transformation of the root joint)

The input depth and the scene (hand + object) are both expressed as 3D Gaussian Mixture Models (GMMs)

Each Gaussian is rigged to a bone of the hand: 30 Gaussians are manually attached to the kinematic chain to model its volumetric extent (std roughly the distance to the surface)

The object is fitted by a predefined number of Gaussians

Add a visibility factor $\in [0, 1]$ (0: totally occluded, 1: fully visible), computed using an occlusion map

GMM restricted to visible surface based on solution of the previous frame

Initialization of Gaussians: quadtree segmentation of the depth data based on depth variance; each leaf represents a Gaussian with $\mu_i$ the 3D center of gravity of the quad and $\sigma_i^2 = \left(\frac{a}{2}\right)^2$, where $a$ is the backprojected side length of the quad.
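
A rough sketch of the quadtree initialization, under several assumptions that are not in the notes: a pinhole camera with focal length `focal` and principal point at the image origin, a square power-of-two depth block, no invalid depth pixels, and a hypothetical variance threshold `var_thresh`. The variance is read as $(a/2)^2$ so that it has units of squared length.

```python
from statistics import mean, pvariance


def quadtree_gaussians(depth, x0, y0, size, var_thresh, focal, out=None):
    """Split a square depth block until its depth variance is below var_thresh.

    depth is a 2D list of depth values; size is the block side in pixels.
    Each leaf contributes (mu, sigma2): the 3D centre of gravity of the quad
    and (a / 2) ** 2, with a the quad side backprojected at its mean depth.
    """
    if out is None:
        out = []
    pixels = [(x, y, depth[y][x])
              for y in range(y0, y0 + size)
              for x in range(x0, x0 + size)]
    zs = [z for _, _, z in pixels]
    if size > 1 and pvariance(zs) > var_thresh:
        half = size // 2
        for dy in (0, half):
            for dx in (0, half):
                quadtree_gaussians(depth, x0 + dx, y0 + dy, half,
                                   var_thresh, focal, out)
        return out
    z_mean = mean(zs)
    # Backproject each pixel (pinhole model, assumed), then average the 3D points.
    mu = (mean([x * z / focal for x, _, z in pixels]),
          mean([y * z / focal for _, y, z in pixels]),
          z_mean)
    a = size * z_mean / focal  # backprojected side length of the quad
    out.append((mu, (a / 2) ** 2))
    return out
```

Flat regions end up as a few large Gaussians, while depth discontinuities are covered by many small ones.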

Optimization

Minimize two energies, one leveraging the depth observations and the other the hand part classifications, which yields 2 pose proposals

Both are optimized using gradient descent, initialized at the previous frame's pose.

The final pose is selected between the two proposals (each the minimizer of one of the energies) by choosing the one with the lowest weighted sum of the two energy terms
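
The two-proposal scheme can be sketched as below. This is only an illustration under stated assumptions: `e_depth` and `e_parts` are hypothetical callables standing in for the paper's energies, the weights `w_depth` and `w_parts` are placeholders, and finite-difference gradients replace whatever gradients the authors actually compute.

```python
def gradient_descent(energy, x0, step=0.1, iters=200, eps=1e-6):
    """Plain gradient descent with central finite-difference gradients."""
    x = list(x0)
    for _ in range(iters):
        grad = []
        for i in range(len(x)):
            hi = x[:i] + [x[i] + eps] + x[i + 1:]
            lo = x[:i] + [x[i] - eps] + x[i + 1:]
            grad.append((energy(hi) - energy(lo)) / (2 * eps))
        x = [xi - step * g for xi, g in zip(x, grad)]
    return x


def track_frame(prev_pose, e_depth, e_parts, w_depth=1.0, w_parts=1.0):
    """Optimize each energy from the previous pose, then keep the better proposal."""
    proposals = [gradient_descent(e_depth, prev_pose),
                 gradient_descent(e_parts, prev_pose)]

    def total(pose):
        # Weighted sum of both energies, used only to rank the proposals.
        return w_depth * e_depth(pose) + w_parts * e_parts(pose)

    return min(proposals, key=total)
```

Because each proposal starts from the previous frame but follows a different energy, one of them can still land near the true pose when the other is stuck in a poor local minimum.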

All components of the energies are detailed in the article (starting at the top of page 9).

Enforces anatomical constraints, speed consistency

Enforces a contact point objective (top of page 10), specific to the hand-object tracking scenario. Touch constraint: fingertip closer to the object than the sum of their stds
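
The touch condition can be written as a simple distance test between a fingertip Gaussian and an object Gaussian. The hinge-style `touch_violation` below is an assumption added for illustration; these notes do not record how the paper turns the condition into an energy term.

```python
import math


def is_touching(tip_mu, tip_sigma, obj_mu, obj_sigma):
    """True when the Gaussian centres are closer than the sum of their stds."""
    return math.dist(tip_mu, obj_mu) < tip_sigma + obj_sigma


def touch_violation(tip_mu, tip_sigma, obj_mu, obj_sigma):
    """Distance by which the fingertip misses contact (0 when touching).

    Hypothetical hinge penalty, not the paper's formulation.
    """
    return max(0.0, math.dist(tip_mu, obj_mu) - (tip_sigma + obj_sigma))
```

For example, two unit-std Gaussians whose centres are 3 units apart are not touching and miss contact by 1 unit.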

Handles occlusion by imposing that occluded parts move with the rest of the hand

Dataset

Dexter+Object

Created a dataset with fingertip positions and object pose (cuboid)

3k frames, manually annotated

Results

Having 2 proposals (from 2 separate energy terms) allows for better recovery from errors.

Datasets

Evaluation on:

- IJCV
- Tzionas
- Dexter

Definitions

Gaussian Mixture Alignment: the problem of finding the transformation that best aligns one Gaussian mixture with another; a generalization of ICP that takes into account the spatial proximity between Gaussians
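
To make "spatial proximity between Gaussians" concrete, here is a small sketch of a similarity score between two mixtures of isotropic 3D Gaussians. It uses the standard closed form for the integral of the product of two Gaussians; treating them as unnormalized and equally weighted is an assumption, and this is not claimed to be the paper's exact energy.

```python
import math


def gaussian_overlap(mu_p, sigma_p, mu_q, sigma_q):
    """Integral over R^3 of the product of two unnormalized isotropic Gaussians."""
    s2 = sigma_p ** 2 + sigma_q ** 2
    d2 = sum((a - b) ** 2 for a, b in zip(mu_p, mu_q))
    scale = (2 * math.pi * sigma_p ** 2 * sigma_q ** 2 / s2) ** 1.5
    return scale * math.exp(-d2 / (2 * s2))


def mixture_similarity(model, data):
    """Sum of pairwise overlaps; model and data are lists of (mu, sigma)."""
    return sum(gaussian_overlap(mp, sp, mq, sq)
               for mp, sp in model
               for mq, sq in data)
```

Unlike ICP, no hard correspondences are needed: every pair contributes, with distant pairs decaying smoothly toward zero, so the score is differentiable in the transformation being optimized.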