[jmlr-dropout] Dropout: A Simple Way to Prevent Neural Networks from Overfitting [PDF] [notes]
Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, Ruslan Salakhutdinov
Prevent feature co-adaptation (several neurons working together to detect the same pattern while only one of them would be sufficient)
Prevent over-fitting
We could train several networks and then ensemble them, but it would be computationally prohibitive.
A neural net with n units can be seen as a collection of 2^n possible thinned neural networks, depending on the choice of active units (each of the n units is independently either on or off, which gives 2^n configurations).
Weights are shared, so the total number of parameters is still O(n^2).
Training is therefore seen as training a collection of 2^n networks with extensive weight sharing.
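To make the 2^n count concrete, here is a minimal sketch that enumerates every on/off mask for a tiny layer. The helper name `thinned_masks` and the choice of 3 units are illustrative assumptions, not something taken from the paper.

```python
from itertools import product

def thinned_masks(n_units):
    """Return every on/off mask over n_units units (1 = kept, 0 = dropped)."""
    return list(product((0, 1), repeat=n_units))

masks = thinned_masks(3)
print(len(masks))  # 8, i.e. 2**3 thinned networks
for mask in masks:
    print(mask)
```

Each mask selects one thinned network; all of them read from the same underlying weight matrices, which is why the parameter count does not grow with the number of masks.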
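At test time, we use a single neural net without dropout (the original structure with all units active) but with scaled-down weights (see the scaling rule for a unit kept with probability p).
This ensures that, for any hidden unit, the expected output under the dropout distribution used at training time equals the actual output at test time.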
In the paper's experiments, dropout gives a lower generalization error than other regularization methods.
Dropping out 20% of the input units and 50% of the hidden units was often found to be optimal.
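The rates above can be sketched as a training-time forward pass in NumPy. This is a minimal illustration under stated assumptions: the one-hidden-layer network, the ReLU activation, the layer sizes, and the names `dropout_mask` and `train_forward` are all hypothetical; only the retention probabilities (0.8 for inputs, 0.5 for hidden units) come from the notes.

```python
import numpy as np

def dropout_mask(shape, keep_prob, rng):
    """Sample a binary mask: 1 with probability keep_prob, else 0."""
    return (rng.random(shape) < keep_prob).astype(np.float64)

def train_forward(x, W1, W2, rng, keep_input=0.8, keep_hidden=0.5):
    """Training-time forward pass with dropout on inputs and hidden units."""
    x_thin = x * dropout_mask(x.shape, keep_input, rng)    # drop ~20% of inputs
    h = np.maximum(0.0, x_thin @ W1)                       # ReLU (assumed activation)
    h_thin = h * dropout_mask(h.shape, keep_hidden, rng)   # drop ~50% of hidden units
    return h_thin @ W2

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 10))    # batch of 4 examples, 10 input features
W1 = rng.normal(size=(10, 6))   # input -> hidden weights
W2 = rng.normal(size=(6, 2))    # hidden -> output weights
out = train_forward(x, W1, W2, rng)
print(out.shape)
```

A fresh mask is sampled on every call, so each training case in each step sees a different thinned network.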
Standard stochastic gradient improvement techniques such as momentum, annealed learning rates, L2 weight decay, and max-norm regularization (bounding the norm of the incoming weight vector at each hidden unit by a constant) work well with dropout.
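The max-norm constraint can be sketched as a projection applied after each gradient update. This is an illustrative sketch, assuming a weight matrix of shape (n_in, n_out) where column j holds the incoming weights of unit j; the function name `max_norm_project` and the default bound `c=3.0` are assumptions chosen for the example.

```python
import numpy as np

def max_norm_project(W, c=3.0):
    """Rescale each column of W so that its L2 norm is at most c.

    Columns whose norm is already <= c are left unchanged.
    """
    norms = np.linalg.norm(W, axis=0, keepdims=True)
    # Guard against division by zero for all-zero columns.
    scale = np.minimum(1.0, c / np.maximum(norms, 1e-12))
    return W * scale

W = np.array([[3.0, 0.1],
              [4.0, 0.2]])
W_proj = max_norm_project(W, c=1.0)
print(np.linalg.norm(W_proj, axis=0))  # first column clipped to norm 1, second untouched
```

Because the norm is capped rather than penalized, large learning rates can be used without the weights blowing up.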
Features become more "meaningful": on MNIST they resemble strokes rather than random noise.
Dropout produces sparsity in the activations (fewer neurons are activated with high values, and each unit's average activation over the entire data set is low).
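Drawback: dropout increases training time (the paper reports that a dropout network typically takes 2-3 times longer to train than a standard network of the same architecture).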
A unit that is present with probability p during training, with outgoing weights w, is always present at test time, with its outgoing weights multiplied by p.
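For a single linear unit, the scaling rule can be checked exactly by averaging over every possible mask. The sketch below is an illustration under assumptions: the names `expected_train_output` and `scaled_output` and the input values are hypothetical. The equality is exact only for this linear case; for a deep network with nonlinearities, weight scaling is an approximation to averaging the thinned networks.

```python
import numpy as np
from itertools import product

def expected_train_output(x, w, p):
    """Exact expectation of (mask * x) @ w over all Bernoulli(p) masks."""
    total = 0.0
    for mask in product((0, 1), repeat=len(x)):
        m = np.array(mask, dtype=np.float64)
        # Probability of this particular mask: p per kept unit, (1 - p) per dropped unit.
        prob = np.prod(np.where(m == 1, p, 1 - p))
        total += prob * ((m * x) @ w)
    return total

def scaled_output(x, w, p):
    """Test-time output: all units present, weights multiplied by p."""
    return x @ (p * w)

x = np.array([1.0, -2.0, 0.5])
w = np.array([0.3, 0.7, -1.2])
p = 0.5
print(np.isclose(expected_train_output(x, w, p), scaled_output(x, w, p)))  # True
```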
"A dropout net should typically use 10-100 times the learning rate that was optimal for a standard neural net"
High momentum values (0.95 to 0.99) tend to work better than the 0.9 that is common for standard nets.
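As a reminder of where these two hyperparameters enter, here is a minimal sketch of a classical momentum update. The function name `momentum_step`, the toy quadratic loss, and the specific learning rate are assumptions for illustration; only the momentum range comes from the notes.

```python
import numpy as np

def momentum_step(w, velocity, grad, lr=0.1, momentum=0.95):
    """One SGD-with-momentum update; returns the new weights and velocity."""
    velocity = momentum * velocity - lr * grad
    return w + velocity, velocity

# Toy problem (assumed): minimize 0.5 * ||w||^2, whose gradient is w itself.
w = np.array([1.0, -2.0])
v = np.zeros_like(w)
for _ in range(500):
    w, v = momentum_step(w, v, grad=w)
print(np.linalg.norm(w))  # close to 0 after 500 steps
```

With high momentum the velocity accumulates over many steps, so a large learning rate combined with high momentum can make the weights grow quickly; this is one reason the max-norm constraint pairs well with these settings.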
Complex co-adaptation: a phenomenon where a feature detector is only helpful in the context of several other specific feature detectors. By analogy with genes, it is risky for a vital function to depend on several different genes all being present and working together; it is safer to have a small number of genes cooperating to achieve useful effects. Then, if one gene malfunctions, the impact is minimized.
"In a standard neural network, the derivative received by each parameter tells it how it should change so the final loss function is reduced, given what all other units are doing. Therefore, units may change in a way that they fix up the mistakes of the other units. This may lead to complex co-adaptations".
These co-adaptations do not generalize to unseen data.