
1712.07262


CVPR 2018

[arxiv 1712.07262] FoldingNet: Point Cloud Auto-encoder via Deep Grid Deformation [PDF] [notes]

Yaoqing Yang, Chen Feng, Yiru Shen, Dong Tian

read 2018/09/19

Objective

Deform a regular 2D grid into an object point cloud. Concurrent work with AtlasNet

Synthesis

  • Propose a deep auto-encoder that maps a 3D point cloud to a reconstructed 3D point cloud obtained as a deformation of a 2D square grid
  • They show that the obtained encoding outperforms other unsupervised encodings for classification purposes

Encoder

  • The input point set S is an n×3 matrix of point coordinates obtained by randomly sampling the model mesh's triangles

  • the produced codeword is of size 512

  • a graph is created where each point is associated with its 16 nearest neighbors

  • a local covariance matrix of size 3x3 is constructed for each neighborhood and flattened to 1x9

  • the point positions and the local covariances are concatenated and fed to a 3-layer perceptron

  • two graph layers follow, each max-pooling over the neighborhood of each node (a sketch of the full encoder is given below)
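
A minimal PyTorch sketch of this encoder, assuming specific layer widths, activations, and a simple max-over-neighbors graph layer (none of these details are taken from the paper's released code):

```python
import torch
import torch.nn as nn

def knn_indices(points, k=16):
    # points: (n, 3) -> (n, k) indices of the k nearest neighbors of each point
    dists = torch.cdist(points, points)                      # (n, n) pairwise distances
    return dists.topk(k + 1, largest=False).indices[:, 1:]   # drop the point itself

def local_covariances(points, idx):
    # 3x3 covariance of each point's neighborhood, flattened to 9 values
    neigh = points[idx]                                       # (n, k, 3)
    centered = neigh - neigh.mean(dim=1, keepdim=True)
    cov = centered.transpose(1, 2) @ centered / idx.shape[1]  # (n, 3, 3)
    return cov.reshape(points.shape[0], 9)

class GraphMaxPool(nn.Module):
    # hypothetical graph layer: max over neighbor features, then a linear map
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.fc = nn.Linear(in_dim, out_dim)
    def forward(self, feats, idx):
        pooled = feats[idx].max(dim=1).values                 # (n, in_dim)
        return torch.relu(self.fc(pooled))

class FoldingNetEncoder(nn.Module):
    def __init__(self, code_dim=512):
        super().__init__()
        self.mlp = nn.Sequential(                             # 3-layer perceptron on [xyz | cov]
            nn.Linear(12, 64), nn.ReLU(),
            nn.Linear(64, 64), nn.ReLU(),
            nn.Linear(64, 64), nn.ReLU())
        self.graph1 = GraphMaxPool(64, 128)
        self.graph2 = GraphMaxPool(128, 1024)
        self.out = nn.Linear(1024, code_dim)
    def forward(self, points):                                # points: (n, 3)
        idx = knn_indices(points, k=16)
        x = self.mlp(torch.cat([points, local_covariances(points, idx)], dim=1))
        x = self.graph1(x, idx)
        x = self.graph2(x, idx)
        return self.out(x.max(dim=0).values)                  # global max-pool -> 512-d codeword
```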

Decoder

  • they define the folding operation as the concatenation of the codeword to the grid points, followed by an MLP
  • they perform two folding operations sequentially: the first on the original grid, the second on the points output by the first folding operation (sketched below)
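
A minimal PyTorch sketch of the folding decoder; the grid resolution and MLP widths are illustrative assumptions:

```python
import torch
import torch.nn as nn

class FoldingDecoder(nn.Module):
    def __init__(self, code_dim=512, grid_size=45):
        super().__init__()
        # fixed 2D square grid, e.g. 45x45 = 2025 points in [-1, 1]^2
        lin = torch.linspace(-1.0, 1.0, grid_size)
        grid = torch.stack(torch.meshgrid(lin, lin, indexing="ij"), dim=-1).reshape(-1, 2)
        self.register_buffer("grid", grid)                        # (m, 2)
        # folding = concatenate the codeword to each point, then apply an MLP
        self.fold1 = nn.Sequential(nn.Linear(code_dim + 2, 512), nn.ReLU(),
                                   nn.Linear(512, 512), nn.ReLU(),
                                   nn.Linear(512, 3))
        self.fold2 = nn.Sequential(nn.Linear(code_dim + 3, 512), nn.ReLU(),
                                   nn.Linear(512, 512), nn.ReLU(),
                                   nn.Linear(512, 3))
    def forward(self, codeword):                                  # codeword: (code_dim,)
        m = self.grid.shape[0]
        code = codeword.expand(m, -1)                             # replicate per grid point
        pts = self.fold1(torch.cat([code, self.grid], dim=1))     # first fold: 2D grid -> 3D
        return self.fold2(torch.cat([code, pts], dim=1))          # second fold on its output
```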

Loss

  • A variant of the Chamfer distance handles the different sizes of the reconstructed and input point clouds: the loss is the max of the two directional terms of the usual Chamfer distance, which forces both to be simultaneously small (see the sketch below)
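
A minimal sketch of this max-of-both-directions Chamfer variant (function and variable names are mine):

```python
import torch

def chamfer_max(recon, target):
    # recon: (m, 3) reconstructed points, target: (n, 3) input points
    d = torch.cdist(recon, target)                  # (m, n) pairwise distances
    recon_to_target = d.min(dim=1).values.mean()    # each reconstructed point to its nearest input
    target_to_recon = d.min(dim=0).values.mean()    # each input point to its nearest reconstruction
    # taking the max forces both directions to be small at the same time
    return torch.max(recon_to_target, target_to_recon)
```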

Experiments

Data size ablation study

  • They show that the quality of the codeword (as evaluated on the ModelNet classification task) degrades gracefully as the fraction of the ShapeNet training set shrinks, e.g. 85% classification accuracy is obtained with a linear SVM on codewords learned from 20% of the training data, vs 89% when the whole dataset is used (the evaluation protocol is sketched below)
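
The transfer-classification evaluation could look roughly like the following scikit-learn sketch; the LinearSVC settings are an assumption, not taken from the paper:

```python
from sklearn.svm import LinearSVC

# codes: (num_shapes, 512) codewords produced by the frozen encoder on ModelNet shapes
# labels: (num_shapes,) class indices; the train/test split follows the usual ModelNet protocol
def linear_svm_accuracy(train_codes, train_labels, test_codes, test_labels):
    clf = LinearSVC(C=1.0)          # C is an assumption; the paper does not specify it here
    clf.fit(train_codes, train_labels)
    return clf.score(test_codes, test_labels)
```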

Comparison to other decoders

Fully connected network decoder

  • They show that a fully connected decoder achieves lower reconstruction error but lower classification accuracy on the generated codewords (84% vs. 89% for the folding-based decoder)
  • the fully connected decoder is a 3-layer network that goes from 512 --> 1024 --> 2048 features --> 2048x3 coordinate points (sketched below)
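
A sketch of that fully connected baseline as described in the bullet above; the choice of ReLU activations is an assumption:

```python
import torch.nn as nn

class FullyConnectedDecoder(nn.Module):
    # 512 -> 1024 -> 2048 features -> 2048 x 3 coordinates
    def __init__(self, code_dim=512, num_points=2048):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(code_dim, 1024), nn.ReLU(),
            nn.Linear(1024, 2048), nn.ReLU(),
            nn.Linear(2048, num_points * 3))
    def forward(self, codeword):                    # codeword: (code_dim,)
        return self.net(codeword).reshape(-1, 3)    # (2048, 3) point cloud
```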

Deconvolution decoder

Starting from the codeword, it is possible to perform successive deconvolution operations that progressively reduce the feature dimension (down to 3 final coordinates) while increasing the spatial resolution, going from the codeword to a 45x45 grid of 3D coordinates that can be interpreted as a point cloud.
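
A rough sketch of what such a deconvolution decoder could look like; the kernel sizes, strides, and channel widths below are guesses chosen only to reach a 45x45x3 output and are not the paper's architecture:

```python
import torch
import torch.nn as nn

class DeconvDecoder(nn.Module):
    # codeword (512) -> 1x1 feature map -> upsampled to a 45x45 grid of 3D coordinates
    def __init__(self, code_dim=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.ConvTranspose2d(code_dim, 256, kernel_size=5),         # 1x1 -> 5x5
            nn.ReLU(),
            nn.ConvTranspose2d(256, 64, kernel_size=3, stride=3),     # 5x5 -> 15x15
            nn.ReLU(),
            nn.ConvTranspose2d(64, 3, kernel_size=3, stride=3))       # 15x15 -> 45x45, 3 coords
    def forward(self, codeword):                                      # codeword: (code_dim,)
        x = codeword.reshape(1, -1, 1, 1)                             # treat the codeword as a 1x1 map
        grid = self.net(x)                                            # (1, 3, 45, 45)
        return grid.reshape(3, -1).t()                                # (2025, 3) point cloud
```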

This decoder produces worse reconstructions but marginally better classification performance.

Removing graph layers

  • marginally decreases performance

Increasing number of foldings

  • does not significantly improve classification scores (88.25% --> 88.41% classification accuracy when going from 2 to 3 folding operations)

Dimension of sampling grid

  • same performance when using a 3D cube of points as input instead of the 2D grid
  • marginally lower performance when using a 1D line instead of a 2D surface as input (88.41% --> 86.71% accuracy)