1611.08050

CVPR 2017

[arxiv 1611.08050] Realtime Multi-Person 2D Pose Estimation using Part Affinity Fields

Zhe Cao, Tomas Simon, Shih-En Wei, Yaser Sheikh

read 16/10/2017

Objective

Pose estimation of several persons in a scene often relies on two steps: first person detection, then pose estimation on crops centred on each detection (a top-down approach, from coarse detection to fine pose estimation).

Such methods have several shortcomings:

they cannot take advantage of the information of neighbouring persons to disambiguate joints in crowded scenes
computation time grows linearly with the number of persons

Here the goal is to do real-time pose estimation in crowded scenes.

Synthesis

Keypoint heatmaps

As in Convolutional Pose Machines and stacked hourglass networks, the location of each joint is estimated using heatmaps.

In this work, each keypoint is represented by a Gaussian centred on its position on the heatmap.

In order to represent several joints (belonging to different persons) on one heatmap, the max operator is applied over the Gaussians generated for each of the joints.

This generates one heatmap per body part (a part being for instance the right ankle or the neck).

Max is used instead of averaging in order to avoid smoothing between nearby Gaussians, which would result in a loss of precision and could merge close joints into a single peak.

multiple gaussians averaged or maxed
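The difference between max and average can be checked on a toy case. This is a minimal NumPy sketch, not the authors' code: the function names, the image size and `sigma=2.0` are assumptions chosen for illustration.

```python
import numpy as np

def gaussian_map(height, width, center, sigma):
    """Unnormalised 2D Gaussian that peaks at 1.0 on `center` = (x, y)."""
    ys, xs = np.mgrid[0:height, 0:width]
    d2 = (xs - center[0]) ** 2 + (ys - center[1]) ** 2
    return np.exp(-d2 / (2.0 * sigma ** 2))

def part_heatmap(height, width, centers, sigma=2.0, reduce="max"):
    """Combine one Gaussian per person into a single heatmap for one body part."""
    maps = np.stack([gaussian_map(height, width, c, sigma) for c in centers])
    return maps.max(axis=0) if reduce == "max" else maps.mean(axis=0)

# Two instances of the same part (say two right ankles), 4 pixels apart on row 8.
centers = [(10, 8), (14, 8)]
hm_max = part_heatmap(16, 24, centers, reduce="max")
hm_avg = part_heatmap(16, 24, centers, reduce="mean")
```

With these values, `hm_max` keeps two distinct peaks of height 1.0 at x=10 and x=14, while in `hm_avg` the midpoint x=12 ends up higher than either true location, so the two joints merge into one blob.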

Non-maximum suppression is then performed on the predicted confidence maps to obtain candidate part locations.
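One simple way to do this step is to keep pixels that are above a threshold and not smaller than any of their 8 neighbours. The sketch below assumes exactly that; the 3x3 neighbourhood, the `threshold` value and the name `find_peaks` are illustrative choices, not taken from the paper.

```python
import numpy as np

def find_peaks(heatmap, threshold=0.1):
    """Return (x, y, score) for pixels above `threshold` that are >= their 8 neighbours.

    Note: a flat plateau would yield several candidates with this simple rule.
    """
    h, w = heatmap.shape
    # Pad with -inf so that border pixels can still be local maxima.
    padded = np.pad(heatmap, 1, mode="constant", constant_values=-np.inf)
    is_peak = heatmap > threshold
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            if dy == 0 and dx == 0:
                continue
            neighbour = padded[1 + dy:1 + dy + h, 1 + dx:1 + dx + w]
            is_peak &= heatmap >= neighbour
    ys, xs = np.nonzero(is_peak)
    return [(int(x), int(y), float(heatmap[y, x])) for y, x in zip(ys, xs)]

# Toy confidence map: two isolated maxima and one weaker pixel next to the first.
toy = np.zeros((5, 7))
toy[1, 2] = 0.9
toy[1, 3] = 0.5   # suppressed: its neighbour at (x=2, y=1) is larger
toy[3, 5] = 0.6
peaks = find_peaks(toy)
```

Each returned tuple is a candidate location for that body part, one list per heatmap.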

Part affinity fields

Part affinity fields (PAFs) are used to encode both the location and the orientation of a limb.

There is one PAF, a 2D vector field, for each limb. At a point belonging to the limb joining the two associated body parts, the value of the field is a unit vector pointing along the orientation of the limb; elsewhere it is zero.

affinity fields

A point is said to belong to a limb if it is within some distance of the segment that joins its two extreme keypoints.
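A ground-truth field for one limb of one person could be built as follows. This is a sketch under assumptions: the limb region is taken as a rectangular band around the segment (points whose projection falls between the two keypoints and whose perpendicular distance is at most `limb_width`), and the handling of overlapping limbs from several persons is left out.

```python
import numpy as np

def limb_paf(height, width, joint_a, joint_b, limb_width=1.0):
    """Ground-truth PAF for one limb, shape (2, H, W); joints are (x, y)."""
    a = np.asarray(joint_a, dtype=float)
    b = np.asarray(joint_b, dtype=float)
    length = np.linalg.norm(b - a)
    paf = np.zeros((2, height, width))
    if length == 0:
        return paf  # degenerate limb: no orientation to encode
    v = (b - a) / length                      # unit vector from joint_a to joint_b
    ys, xs = np.mgrid[0:height, 0:width]
    dx, dy = xs - a[0], ys - a[1]
    along = dx * v[0] + dy * v[1]             # position along the limb axis
    across = np.abs(dx * v[1] - dy * v[0])    # perpendicular distance to the axis
    on_limb = (along >= 0) & (along <= length) & (across <= limb_width)
    paf[0][on_limb] = v[0]
    paf[1][on_limb] = v[1]
    return paf

# Horizontal limb from (x=2, y=5) to (x=7, y=5) in a 10x10 image.
paf = limb_paf(10, 10, (2, 5), (7, 5))
```

Here `paf[:, 5, 4]` is `(1, 0)`, the unit vector of the limb, while a pixel far from the segment such as `paf[:, 8, 4]` stays at zero.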

At test time, the strength of the association between two candidate keypoints is computed as the line integral of the limb's PAF along the segment connecting the two candidate joint locations.

In practice, the integral is approximated by sampling the PAF at uniformly spaced points on the segment and summing its projection onto the direction of the segment.
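The discrete version of this score can be sketched as below. Assumptions for illustration: nearest-pixel sampling, `num_samples=10`, and dividing by the number of samples so the score lies in [-1, 1]; the toy field is hand-made rather than predicted by a network.

```python
import numpy as np

def limb_score(paf, joint_a, joint_b, num_samples=10):
    """Average alignment between a PAF of shape (2, H, W) and the segment a -> b."""
    a = np.asarray(joint_a, dtype=float)
    b = np.asarray(joint_b, dtype=float)
    length = np.linalg.norm(b - a)
    if length == 0:
        return 0.0
    v = (b - a) / length                          # unit direction of the candidate limb
    total = 0.0
    for u in np.linspace(0.0, 1.0, num_samples):  # uniformly spaced points on the segment
        x, y = np.rint(a + u * (b - a)).astype(int)
        total += paf[0, y, x] * v[0] + paf[1, y, x] * v[1]
    return total / num_samples

# Toy field: unit vectors pointing in +x along row 5, between x=2 and x=7.
toy_paf = np.zeros((2, 10, 10))
toy_paf[0, 5, 2:8] = 1.0

score_true = limb_score(toy_paf, (2, 5), (7, 5))      # segment follows the field
score_reversed = limb_score(toy_paf, (7, 5), (2, 5))  # same segment, wrong direction
score_wrong = limb_score(toy_paf, (2, 5), (2, 9))     # segment perpendicular to the field
```

The candidate pair that follows the field scores 1.0, the reversed pair scores -1.0, and the perpendicular pair scores 0.0, which is what lets the score separate correct limbs from wrong associations.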

Training

A neural network is used to predict both the heatmaps and the PAFs using an iterative scheme.

two separate branches first predict heatmaps and PAFs from the input image features.
the outputs and the input are then stacked (heatmaps, PAFs and image features) and fed to the next two branches, which again predict heatmaps and PAFs separately.

paf and heatmap network

Intermediate supervision is applied: a loss is computed on the outputs of every stage, not only the last one.
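The wiring of the stages can be sketched without any deep learning framework. Everything below is a stand-in: `make_branch` is a hypothetical random 1x1 convolution in place of a real CNN branch, the channel counts are arbitrary, and the squared-error loss summed over stages is an assumption made to illustrate intermediate supervision.

```python
import numpy as np

def make_branch(in_channels, out_channels, rng):
    """Hypothetical stand-in for a CNN branch: a random 1x1 convolution."""
    weights = rng.normal(scale=0.1, size=(out_channels, in_channels))
    return lambda x: np.einsum("oc,chw->ohw", weights, x)

def run_stages(inputs, stages):
    """Run each (heatmap_branch, paf_branch) pair and keep every stage's outputs."""
    outputs = []
    x = inputs
    for heatmap_branch, paf_branch in stages:
        heatmaps, pafs = heatmap_branch(x), paf_branch(x)
        outputs.append((heatmaps, pafs))
        # The next stage sees the previous predictions stacked with the original input.
        x = np.concatenate([heatmaps, pafs, inputs], axis=0)
    return outputs

def intermediate_loss(outputs, gt_heatmaps, gt_pafs):
    """Sum of squared errors over all stages (assumed form of the loss)."""
    return sum(
        np.sum((h - gt_heatmaps) ** 2) + np.sum((p - gt_pafs) ** 2)
        for h, p in outputs
    )

rng = np.random.default_rng(0)
n_in, n_heat, n_paf = 3, 4, 6          # input channels, heatmaps, PAF channels (2 per limb)
inputs = rng.normal(size=(n_in, 8, 8))
stages = [(make_branch(n_in, n_heat, rng), make_branch(n_in, n_paf, rng))]
for _ in range(2):                      # two refinement stages
    stacked = n_heat + n_paf + n_in
    stages.append((make_branch(stacked, n_heat, rng), make_branch(stacked, n_paf, rng)))

outputs = run_stages(inputs, stages)
loss = intermediate_loss(outputs, np.zeros((n_heat, 8, 8)), np.zeros((n_paf, 8, 8)))
```

Only the first stage takes the raw input alone; every later stage takes `n_heat + n_paf + n_in` channels, and every stage contributes to the loss.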

Greedy parsing

Each limb is scored according to the line integral.

A graph is then constructed linking the nodes with edges weighted according to those scores.

Finding the optimal parse on this k-partite graph is an NP-hard problem. A k-partite graph is one whose nodes can be split into k independent sets, independent meaning that no two nodes in the same set are connected. In this case, k is the number of joints (for a single person).

In order to speed up computations, we restrict the links in the parts graph to joints that are connected by a limb.

k-partite graphs

The goal is to find a matching with maximum weight for the chosen edges.

A matching is a set of edges without common nodes (no two edges share a node).

By focusing only on connected joints, we can reframe the problem as a set of bipartite matching problems, one per limb type, where the goal is to pair up candidates from two connected sets of joints (for instance all the right elbows and all the right shoulders in the image).
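For one limb type, the matching can be sketched as below. This version is greedy (take the best remaining pair first), which is an assumption in line with the section title and does not guarantee the maximum-weight matching; an exact assignment solver could be swapped in. `scores[i, j]` is assumed to hold the line-integral score between candidate `i` of the first part and candidate `j` of the second.

```python
import numpy as np

def greedy_bipartite_match(scores, min_score=0.0):
    """Greedily pick high-scoring pairs so that no candidate is used twice."""
    pairs = [
        (scores[i, j], i, j)
        for i in range(scores.shape[0])
        for j in range(scores.shape[1])
        if scores[i, j] > min_score
    ]
    pairs.sort(reverse=True)              # best score first
    used_a, used_b, matches = set(), set(), []
    for score, i, j in pairs:
        if i not in used_a and j not in used_b:
            matches.append((i, j, float(score)))
            used_a.add(i)
            used_b.add(j)
    return matches

# Two right shoulders (rows) and two right elbows (columns).
scores = np.array([[0.9, 0.2],
                   [0.8, 0.7]])
matches = greedy_bipartite_match(scores)
```

On this toy matrix, shoulder 0 takes elbow 0 (score 0.9), which leaves shoulder 1 with elbow 1 (score 0.7), even though its own best score was with elbow 0.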

The intuition behind why local matching is enough for global performance is that the CNNs used for PAFs have large receptive fields, so PAFs from non-adjacent parts influence the predicted PAFs. The pairwise PAFs therefore carry part of the global information.

Limb connections that share a part candidate are then assembled to generate full-body poses.
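The assembly step can be sketched as a merge of matched limbs that share a joint candidate. This assumes limb types are processed in an order that walks the skeleton from a root part outwards, so the first part of each limb is normally already attached to a person; the part names and the data layout are hypothetical.

```python
def assemble_people(connections):
    """Merge matched limbs into persons.

    connections: {(part_a, part_b): [(idx_a, idx_b), ...]} in skeleton order.
    Returns a list of dicts mapping part name -> candidate index.
    """
    people = []
    for (part_a, part_b), pairs in connections.items():
        for idx_a, idx_b in pairs:
            for person in people:
                if person.get(part_a) == idx_a:
                    person[part_b] = idx_b      # extend an existing person
                    break
            else:
                # No person owns this candidate yet: start a new one.
                people.append({part_a: idx_a, part_b: idx_b})
    return people

connections = {
    ("neck", "r_shoulder"): [(0, 1), (1, 0)],
    ("r_shoulder", "r_elbow"): [(1, 0), (0, 1)],
}
people = assemble_people(connections)
```

The two limbs that share right-shoulder candidate 1 end up in the same person, and likewise for candidate 0, giving two three-joint poses.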

Results

Significant boost in performance compared to single-person prediction, especially on crowded scenes.

State of the art on COCO 2016 and multi-person MPII at a fraction of the competing algorithms' run time: 0.005 s vs 10 s per image for the fastest competing method, i.e. a factor of 2000, about 3 orders of magnitude (the paper quotes up to 6 orders of magnitude against slower bottom-up methods).

Number of stages

In the network, using two stages instead of one yields significant improvements, while additional stages bring only marginal (almost negligible) gains.

Ablation studies

Training with masks of unlabeled persons improves PCKh-0.5 performance by 2.3

When using ground-truth keypoints instead of heatmap maxima, the method obtains a mAP of 88.3 (a couple of percent better than with estimated keypoint locations).