Science autoencoder: Reducing the Dimensionality of Data with Neural Networks
Geoffrey Hinton, Ruslan Salakhutdinov
Introduces autoencoder networks and presents a pre-training and fine-tuning method.
The initialization takes advantage of several stacked Restricted Boltzmann Machines that are easy to train in a stochastic manner.
Dimensionality reduction facilitates several tasks (classification, visualization of data, storage, ...), since it aims at capturing the structure of the data.
A common algorithm for dimensionality reduction is PCA. Unlike PCA, autoencoder networks are not linear.
The autoencoder can be decomposed into two networks:
- the encoder that takes a high-dimensional input and outputs a low-dimensional vector
- the decoder that takes the produced low-dimensional vector and goes back to a high-dimensional representation, trying to reconstruct the initial input
The networks are trained so that the distances between the high-dimensional inputs and their reconstructions are minimized, using gradient descent (which takes advantage of the chain rule).
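As a concrete illustration (not the paper's exact setup), here is a minimal numpy sketch of a one-hidden-layer autoencoder trained by gradient descent on the squared reconstruction error, with the gradients written out via the chain rule. The layer sizes, learning rate, and random data are illustrative assumptions; the paper's networks are much deeper.

```python
# Minimal autoencoder sketch: encoder + decoder trained to minimize
# the squared reconstruction error with hand-derived (chain rule) gradients.
import numpy as np

rng = np.random.default_rng(0)
n_in, n_code = 784, 30                        # high-dimensional input, low-dimensional code
W_enc = rng.normal(0, 0.1, (n_in, n_code))    # encoder weights
W_dec = rng.normal(0, 0.1, (n_code, n_in))    # decoder weights

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_step(x, lr=0.01):
    """One gradient-descent step on ||x - x_hat||^2 for a batch x."""
    global W_enc, W_dec
    code = sigmoid(x @ W_enc)                 # encoder: high-dim -> low-dim
    x_hat = sigmoid(code @ W_dec)             # decoder: low-dim -> reconstruction
    err = x_hat - x                           # d(loss)/d(x_hat), up to a factor 2
    # Chain rule through the logistic units:
    d_dec = err * x_hat * (1 - x_hat)
    d_enc = (d_dec @ W_dec.T) * code * (1 - code)
    W_dec -= lr * code.T @ d_dec
    W_enc -= lr * x.T @ d_enc
    return np.mean(err ** 2)

x = rng.random((16, n_in))                    # dummy batch of inputs in [0, 1]
for _ in range(100):
    loss = train_step(x)
```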
When there are multiple hidden layers, initialization of the weights is crucial:
- too high values of the weights lead to poor local minima
- small weights lead to small gradients in the early layers, and therefore training is slow
An idea is to build upon RBMs to produce efficient auto-encoders with good initialization.
Given the input, the hidden states are updated: each hidden unit j turns on with probability sigma(b_j + sum_i v_i w_ij), where b_j is the bias associated to that hidden unit, w_ij are the weights connecting the input units v_i (the pixel value of the input at location i) to the hidden unit, and sigma is the logistic function 1/(1 + exp(-x)).
Each hidden state is thus set to 1 or 0 stochastically according to this probability.
The input states are then updated in turn: each v_i is set to 1 with probability sigma(b_i + sum_j h_j w_ij), where b_i is the bias of i.
The hidden state is then updated once more.
The weight w_ij is then updated as delta w_ij = epsilon * (<v_i h_j>_data - <v_i h_j>_reconstruct), where epsilon is the learning rate.
<v_i h_j> is the fraction of times that the pixel i and the hidden state j are on together.
This learning rule works well in practice and makes it possible to train an RBM, which can be seen as a single layer connecting inputs (the visible states) and outputs (the hidden states).
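A minimal numpy sketch of this one-step contrastive-divergence update for a binary RBM; the sizes, learning rate, and dummy data are illustrative assumptions.

```python
# One-step contrastive divergence (CD-1) update for a binary RBM.
import numpy as np

rng = np.random.default_rng(0)
n_vis, n_hid = 784, 400
W = rng.normal(0, 0.1, (n_vis, n_hid))   # weights w_ij
b_vis = np.zeros(n_vis)                  # visible biases b_i
b_hid = np.zeros(n_hid)                  # hidden biases b_j

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_step(v0, lr=0.1):
    """One CD-1 update for a batch of binary visible vectors v0."""
    global W, b_vis, b_hid
    # Up pass: p(h_j = 1 | v) = sigma(b_j + sum_i v_i w_ij), then sample.
    p_h0 = sigmoid(v0 @ W + b_hid)
    h0 = (rng.random(p_h0.shape) < p_h0).astype(float)
    # Down pass ("reconstruction"): p(v_i = 1 | h) = sigma(b_i + sum_j h_j w_ij).
    p_v1 = sigmoid(h0 @ W.T + b_vis)
    v1 = (rng.random(p_v1.shape) < p_v1).astype(float)
    # Hidden units updated once more, from the reconstruction.
    p_h1 = sigmoid(v1 @ W + b_hid)
    # delta w_ij = lr * ( <v_i h_j>_data - <v_i h_j>_reconstruct )
    n = v0.shape[0]
    W += lr * (v0.T @ p_h0 - v1.T @ p_h1) / n
    b_vis += lr * (v0 - v1).mean(axis=0)
    b_hid += lr * (p_h0 - p_h1).mean(axis=0)

v0 = (rng.random((32, n_vis)) < 0.5).astype(float)   # dummy binary batch
cd1_step(v0)
```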
We can then iteratively train a new RBM by treating the hidden states of the trained RBM as the visible states of this new RBM, which is then trained with its own new hidden states.
This training method therefore iteratively trains RBMs in which the output (hidden states) of the previous RBM is the input of the next one. This produces a stack of RBMs which can be used to pre-train the layers of an autoencoder.
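A minimal sketch of this greedy layer-by-layer stacking; `train_rbm` is an illustrative, compressed CD-1 trainer (not the paper's code), the layer sizes follow the curves experiment mentioned later, and the data and hyperparameters are assumptions.

```python
# Greedy layer-wise stacking of RBMs: each new RBM is trained on the
# hidden activation probabilities produced by the previous one.
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_rbm(data, n_hid, epochs=10, lr=0.1):
    """Train one RBM with CD-1 and return (weights, hidden biases)."""
    n_vis = data.shape[1]
    W = rng.normal(0, 0.1, (n_vis, n_hid))
    b_vis, b_hid = np.zeros(n_vis), np.zeros(n_hid)
    for _ in range(epochs):
        p_h0 = sigmoid(data @ W + b_hid)
        h0 = (rng.random(p_h0.shape) < p_h0).astype(float)
        p_v1 = sigmoid(h0 @ W.T + b_vis)
        p_h1 = sigmoid(p_v1 @ W + b_hid)
        n = data.shape[0]
        W += lr * (data.T @ p_h0 - p_v1.T @ p_h1) / n
        b_vis += lr * (data - p_v1).mean(axis=0)
        b_hid += lr * (p_h0 - p_h1).mean(axis=0)
    return W, b_hid

layer_sizes = [784, 400, 200, 100, 50, 25, 6]   # 28*28-400-200-100-50-25-6
data = rng.random((256, layer_sizes[0]))        # dummy training inputs in [0, 1]

stack = []                                      # one (W, b_hid) pair per layer
for n_hid in layer_sizes[1:]:
    W, b_hid = train_rbm(data, n_hid)
    stack.append((W, b_hid))
    # Hidden activation probabilities become the "visible" data of the next RBM.
    data = sigmoid(data @ W + b_hid)
```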
After pretraining, the stack of RBMs is unfolded to give the encoder and decoder networks (the encoder by stacking the RBMs, the decoder by doing the same in reversed order).
The encoder and decoder therefore share the same weights after the pre-training stage.
After unfolding, to prepare for the final fine-tuning stage, stochastic activities are replaced by deterministic, real-valued activations. Back-prop is used through the whole network (so the weights, although initialized to the same values, will not be the same after fine-tuning).
All units are logistic except the center ones (with lowest dimension)
Inputs are real values.
Activations are brought into the range [0, 1] using the logistic function.
For training higher-level RBMs, the visible input values were real-valued, set to the activation probabilities of the hidden units of the previous RBM.
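A minimal sketch of the unfolding step and the deterministic forward pass, assuming the lowest-dimensional (code) units are linear and all others logistic. The `stack` of (weights, biases) pairs stands in for the pretrained RBMs (random here for self-containment), and decoder biases are initialized to zero for simplicity.

```python
# Unfold a pretrained RBM stack into an encoder and a decoder that
# initially share the same weight values, then run a deterministic pass.
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def unfold(stack):
    """Build encoder/decoder parameter lists with tied initial weight values."""
    encoder = [(W.copy(), b.copy()) for W, b in stack]
    decoder = [(W.T.copy(), np.zeros(W.shape[0])) for W, _ in reversed(stack)]
    return encoder, decoder

def forward(x, encoder, decoder):
    """Deterministic forward pass of the unfolded autoencoder."""
    h = x
    for i, (W, b) in enumerate(encoder):
        z = h @ W + b
        # Code layer taken as linear here; all other units are logistic.
        h = z if i == len(encoder) - 1 else sigmoid(z)
    code = h
    for W, b in decoder:
        h = sigmoid(h @ W + b)
    return code, h   # low-dimensional code and reconstruction

layer_sizes = [784, 400, 200, 100, 50, 25, 6]
# Stand-in for the pretrained stack (random here; normally the RBM weights).
stack = [(rng.normal(0, 0.1, (a, b)), np.zeros(b))
         for a, b in zip(layer_sizes[:-1], layer_sizes[1:])]
encoder, decoder = unfold(stack)
code, recon = forward(rng.random((4, layer_sizes[0])), encoder, decoder)
# Fine-tuning would now back-propagate the reconstruction error through the
# whole unfolded network, so encoder and decoder weights drift apart from
# their shared initialization.
```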
They experiment on a synthetic dataset: they generate curves from 3 given points in a 28*28 image space (the problem therefore has 6 intrinsic dimensions).
Learns without pretraining but training is sped up by pretraining
(28*28-400-200-100-50-25-6)
Without pretraining, converged to the average of the training data (note that no normalization is used; what would have happened if we had added a normalization stage that subtracted the mean of the data from the inputs?)
With pretraining, almost perfect reconstruction, significantly better than PCA.
Also applied to text for retrieval of newswire stories, by representing documents as document-specific vectors whose values are the probabilities of the 2000 most common word stems.
Works better than Latent Semantic Analysis, which is a PCA-based method (for the same reduced dimension).