Science autoencoder: Reducing the Dimensionality of Data with Neural Networks
Geoffrey Hinton, Ruslan Salakhutdinov
Introduces autoencoder networks and presents a pre-training and fine-tuning method.
The initialization takes advantage of several stacked Restricted Boltzmann Machines that are easy to train in a stochastic manner.
Dimensionality reduction facilitates several tasks (classification, visualization of data, storage, ...), since it aims at capturing the structure of the data.
A common algorithm for dimensionality reduction is PCA. Unlike PCA, autoencoder networks are not linear.
The autoencoder can be decomposed into two networks:
- the encoder that takes a high-dimensional input and outputs a low-dimensional vector
- the decoder that takes the produced low-dimensional vector and goes back to a high-dimensional representation, trying to reconstruct the initial input
The networks are trained so that the distances between the high-dimensional inputs and their reconstructions are minimized, using gradient descent (which takes advantage of the chain rule).
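As a concrete illustration (not the paper's exact setup), here is a minimal numpy sketch of a one-hidden-layer autoencoder trained by gradient descent on the squared reconstruction error, with the gradients written out via the chain rule. The layer sizes, learning rate, and random data are illustrative assumptions; the paper's networks are much deeper.

```python
# Minimal autoencoder sketch: encoder + decoder trained to minimize
# the squared reconstruction error with hand-derived (chain rule) gradients.
import numpy as np

rng = np.random.default_rng(0)
n_in, n_code = 784, 30                        # high-dimensional input, low-dimensional code
W_enc = rng.normal(0, 0.1, (n_in, n_code))    # encoder weights
W_dec = rng.normal(0, 0.1, (n_code, n_in))    # decoder weights

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_step(x, lr=0.01):
    """One gradient-descent step on ||x - x_hat||^2 for a batch x."""
    global W_enc, W_dec
    code = sigmoid(x @ W_enc)                 # encoder: high-dim -> low-dim
    x_hat = sigmoid(code @ W_dec)             # decoder: low-dim -> reconstruction
    err = x_hat - x                           # d(loss)/d(x_hat), up to a factor 2
    # Chain rule through the logistic units:
    d_dec = err * x_hat * (1 - x_hat)
    d_enc = (d_dec @ W_dec.T) * code * (1 - code)
    W_dec -= lr * code.T @ d_dec
    W_enc -= lr * x.T @ d_enc
    return np.mean(err ** 2)

x = rng.random((16, n_in))                    # dummy batch of inputs in [0, 1]
for _ in range(100):
    loss = train_step(x)
```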
When there are multiple hidden layers, initialization of the weights is crucial:
- too high values of the weights lead to poor local minima
- small weights lead to small gradients in the early layers, and therefore training is slow
An idea is to build upon RBMs to produce efficient auto-encoders with good initialization.
Given the input, the hidden states are updated: each hidden unit j turns on with probability sigma(b_j + sum_i v_i w_ij), where b_j is the bias associated to that hidden unit, w_ij are the weights connecting the input units v_i (the pixel value of the input at location i) to the hidden unit, and sigma is the logistic function 1/(1 + exp(-x)).
Each hidden state is thus set to 1 or 0 stochastically according to this probability.
The input states are then updated in turn: each v_i is set to 1 with probability sigma(b_i + sum_j h_j w_ij), where b_i is the bias of i.
The hidden state is then updated once more.
The weight w_ij is then updated as delta w_ij = epsilon * (<v_i h_j>_data - <v_i h_j>_reconstruct), where epsilon is the learning rate.
<v_i h_j> is the fraction of times that the pixel i and the hidden state j are on together.
This learning rule works well in practice and makes it possible to train an RBM, which can be seen as a single layer connecting inputs (the visible states) and outputs (the hidden states).
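A minimal numpy sketch of this one-step contrastive-divergence update for a binary RBM; the sizes, learning rate, and dummy data are illustrative assumptions.

```python
# One-step contrastive divergence (CD-1) update for a binary RBM.
import numpy as np

rng = np.random.default_rng(0)
n_vis, n_hid = 784, 400
W = rng.normal(0, 0.1, (n_vis, n_hid))   # weights w_ij
b_vis = np.zeros(n_vis)                  # visible biases b_i
b_hid = np.zeros(n_hid)                  # hidden biases b_j

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_step(v0, lr=0.1):
    """One CD-1 update for a batch of binary visible vectors v0."""
    global W, b_vis, b_hid
    # Up pass: p(h_j = 1 | v) = sigma(b_j + sum_i v_i w_ij), then sample.
    p_h0 = sigmoid(v0 @ W + b_hid)
    h0 = (rng.random(p_h0.shape) < p_h0).astype(float)
    # Down pass ("reconstruction"): p(v_i = 1 | h) = sigma(b_i + sum_j h_j w_ij).
    p_v1 = sigmoid(h0 @ W.T + b_vis)
    v1 = (rng.random(p_v1.shape) < p_v1).astype(float)
    # Hidden units updated once more, from the reconstruction.
    p_h1 = sigmoid(v1 @ W + b_hid)
    # delta w_ij = lr * ( <v_i h_j>_data - <v_i h_j>_reconstruct )
    n = v0.shape[0]
    W += lr * (v0.T @ p_h0 - v1.T @ p_h1) / n
    b_vis += lr * (v0 - v1).mean(axis=0)
    b_hid += lr * (p_h0 - p_h1).mean(axis=0)

v0 = (rng.random((32, n_vis)) < 0.5).astype(float)   # dummy binary batch
cd1_step(v0)
```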
We can then iteratively train a new RBM by treating the hidden states of the trained RBM as the visible states of this new RBM, which is then trained with its own new hidden states.
This training method therefore iteratively trains RBMs in which the output (hidden states) of the previous RBM is the input of the next one. This produces a stack of RBMs which can be used to pre-train the layers of an autoencoder.
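A minimal sketch of this greedy layer-by-layer stacking; `train_rbm` is an illustrative, compressed CD-1 trainer (not the paper's code), the layer sizes follow the curves experiment mentioned later, and the data and hyperparameters are assumptions.

```python
# Greedy layer-wise stacking of RBMs: each new RBM is trained on the
# hidden activation probabilities produced by the previous one.
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_rbm(data, n_hid, epochs=10, lr=0.1):
    """Train one RBM with CD-1 and return (weights, hidden biases)."""
    n_vis = data.shape[1]
    W = rng.normal(0, 0.1, (n_vis, n_hid))
    b_vis, b_hid = np.zeros(n_vis), np.zeros(n_hid)
    for _ in range(epochs):
        p_h0 = sigmoid(data @ W + b_hid)
        h0 = (rng.random(p_h0.shape) < p_h0).astype(float)
        p_v1 = sigmoid(h0 @ W.T + b_vis)
        p_h1 = sigmoid(p_v1 @ W + b_hid)
        n = data.shape[0]
        W += lr * (data.T @ p_h0 - p_v1.T @ p_h1) / n
        b_vis += lr * (data - p_v1).mean(axis=0)
        b_hid += lr * (p_h0 - p_h1).mean(axis=0)
    return W, b_hid

layer_sizes = [784, 400, 200, 100, 50, 25, 6]   # 28*28-400-200-100-50-25-6
data = rng.random((256, layer_sizes[0]))        # dummy training inputs in [0, 1]

stack = []                                      # one (W, b_hid) pair per layer
for n_hid in layer_sizes[1:]:
    W, b_hid = train_rbm(data, n_hid)
    stack.append((W, b_hid))
    # Hidden activation probabilities become the "visible" data of the next RBM.
    data = sigmoid(data @ W + b_hid)
```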
After pretraining, the stack of RBMs is unfolded to give the encoder and decoder networks (the encoder by stacking the RBMs, the decoder by doing the same in reversed order).
The encoder and decoder therefore share the same weights after the pre-training stage.
After unfolding, to prepare for the final fine-tuning stage, stochastic activities are replaced by deterministic, real-valued activations. Back-prop is used through the whole network (so the weights, although initialized to the same values, will not be the same after fine-tuning).
All units are logistic except the center ones (with lowest dimension)
Inputs are real values.
Activations are brought into the range [0, 1] using the logistic function.
For training higher-level RBMs, the visible input values were real-valued, set to the activation probabilities of the hidden units of the previous RBM.
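A minimal sketch of the unfolding step and the deterministic forward pass, assuming the lowest-dimensional (code) units are linear and all others logistic. The `stack` of (weights, biases) pairs stands in for the pretrained RBMs (random here for self-containment), and decoder biases are initialized to zero for simplicity.

```python
# Unfold a pretrained RBM stack into an encoder and a decoder that
# initially share the same weight values, then run a deterministic pass.
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def unfold(stack):
    """Build encoder/decoder parameter lists with tied initial weight values."""
    encoder = [(W.copy(), b.copy()) for W, b in stack]
    decoder = [(W.T.copy(), np.zeros(W.shape[0])) for W, _ in reversed(stack)]
    return encoder, decoder

def forward(x, encoder, decoder):
    """Deterministic forward pass of the unfolded autoencoder."""
    h = x
    for i, (W, b) in enumerate(encoder):
        z = h @ W + b
        # Code layer taken as linear here; all other units are logistic.
        h = z if i == len(encoder) - 1 else sigmoid(z)
    code = h
    for W, b in decoder:
        h = sigmoid(h @ W + b)
    return code, h   # low-dimensional code and reconstruction

layer_sizes = [784, 400, 200, 100, 50, 25, 6]
# Stand-in for the pretrained stack (random here; normally the RBM weights).
stack = [(rng.normal(0, 0.1, (a, b)), np.zeros(b))
         for a, b in zip(layer_sizes[:-1], layer_sizes[1:])]
encoder, decoder = unfold(stack)
code, recon = forward(rng.random((4, layer_sizes[0])), encoder, decoder)
# Fine-tuning would now back-propagate the reconstruction error through the
# whole unfolded network, so encoder and decoder weights drift apart from
# their shared initialization.
```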
They experiment on a synthetic dataset: they generate curves from 3 given points in a 28*28 image space (the problem therefore has 6 intrinsic dimensions).
Learns without pretraining but training is sped up by pretraining
(28*28-400-200-100-50-25-6)
Without pretraining, converged to the average of the training data (note that no normalization is used; what would have happened if we had added a normalization stage that subtracted the mean of the data from the inputs?)
With pretraining, almost perfect reconstruction, significantly better than PCA.
Also applied to text for retrieval of newswire stories, by representing documents as document-specific vectors whose values are the probabilities of the 2000 most common word stems.
Works better than Latent Semantic Analysis, which is a PCA-based method (for the same reduced dimension).