1409.7495
[arxiv 1409.7495] Unsupervised Domain Adaptation by Backpropagation [PDF] [notes]
Yaroslav Ganin, Victor Lempitsky
read 03/08/2017
Extract features that are domain invariant across two domains (real and synthetic) and also useful for the final task (image classification for instance)
Feature extractor: convolutional network, shared between the two domains, that extracts the features
Domain classifier: discriminates between the two domains
Label predictor: task-specific classifier
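A minimal sketch of the three components, assuming PyTorch; the layer sizes are illustrative, not the paper's exact architectures:

```python
import torch.nn as nn

# Shared feature extractor: a small convnet (illustrative sizes only)
feature_extractor = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=5), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(32, 48, kernel_size=5), nn.ReLU(), nn.MaxPool2d(2),
    nn.Flatten(),
)

# Label predictor: task-specific classifier fed with the shared features
label_predictor = nn.Sequential(nn.LazyLinear(100), nn.ReLU(), nn.Linear(100, 10))

# Domain classifier: predicts source (0) vs target (1) from the same features
domain_classifier = nn.Sequential(nn.LazyLinear(100), nn.ReLU(), nn.Linear(100, 2))
```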
Standard backprop from the label predictor loss updates the label predictor and the feature extractor
Standard backprop from the domain classifier loss updates the domain classifier down to the feature layer, then the update in the shared feature extractor is negated. The negative update is obtained through a gradient reversal layer
If gradient reversal were not applied, backprop would push the feature extractor to minimize the domain loss, i.e. to create dissimilar features that make the two domains easy to tell apart. On the contrary, the negative update pushes it to maximize the domain loss, forcing it to create similar features across domains.
Since the domain classifier itself still receives standard updates, its discriminative power is preserved, while the feature extractor is pushed to create domain-invariant features
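A rough sketch of one joint update under these rules, assuming PyTorch; the modules, sizes and data are toy stand-ins, and `grad_reverse` here is a small functional trick (identity forward, gradient times -λ backward) equivalent to the gradient reversal layer described further down:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy stand-ins for the three components (flattened 28x28 inputs, illustrative sizes)
feat = nn.Sequential(nn.Linear(784, 128), nn.ReLU())
label_pred = nn.Linear(128, 10)
domain_clf = nn.Linear(128, 2)
params = list(feat.parameters()) + list(label_pred.parameters()) + list(domain_clf.parameters())
opt = torch.optim.SGD(params, lr=0.01)

def grad_reverse(x, lamb):
    # Identity in the forward pass, gradient multiplied by -lamb in the backward pass
    return x.detach() - lamb * (x - x.detach())

# One joint update: a labelled source batch and an unlabelled target batch (dummy data)
xs, ys = torch.randn(32, 784), torch.randint(0, 10, (32,))
xt = torch.randn(32, 784)
lamb = 0.5  # reversal weight, scheduled from 0 to 1 (see the schedule below)

fs, ft = feat(xs), feat(xt)
label_loss = F.cross_entropy(label_pred(fs), ys)  # standard backprop into label_pred and feat

# Domain loss: standard gradients for domain_clf, reversed gradients into the shared feat
dom_logits = domain_clf(torch.cat([grad_reverse(fs, lamb), grad_reverse(ft, lamb)]))
dom_labels = torch.cat([torch.zeros(32, dtype=torch.long), torch.ones(32, dtype=torch.long)])
domain_loss = F.cross_entropy(dom_logits, dom_labels)

opt.zero_grad()
(label_loss + domain_loss).backward()
opt.step()
```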
The weight λ of the negative constant in the gradient reversal layer is progressively increased from 0 to 1 following a predefined schedule
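The paper's schedule (if I recall it right) is λ_p = 2 / (1 + exp(-γ·p)) - 1 with γ = 10, where p is the training progress in [0, 1]:

```python
import math

def grl_lambda(p, gamma=10.0):
    # p = training progress in [0, 1]; ramps the reversal weight from 0 up to 1
    return 2.0 / (1.0 + math.exp(-gamma * p)) - 1.0

print(grl_lambda(0.0), grl_lambda(0.5), grl_lambda(1.0))  # 0.0, ~0.99, ~1.0
```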
Training is successful when the source-domain test error is low and the domain classifier error is high
MNIST vs MNIST-M (MNIST digits blended over random image patches)
Office dataset, which provides 3 domains with ~31 categories of office items taken from Amazon product images, a DSLR camera and a webcam
...
traffic signs
t-SNE visualization shows that the target and adapted source features overlap much more at the top feature-extractor layer than the non-adapted ones
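A quick sketch of how such a check could be done with scikit-learn, assuming the features of a source and a target batch have already been extracted (dummy arrays here):

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# feats_src / feats_tgt: (n, d) activations from the top feature-extractor layer (dummy data here)
feats_src = np.random.randn(500, 256)
feats_tgt = np.random.randn(500, 256) + 1.0

emb = TSNE(n_components=2, init="pca", perplexity=30).fit_transform(np.vstack([feats_src, feats_tgt]))
plt.scatter(emb[:500, 0], emb[:500, 1], s=5, label="source")
plt.scatter(emb[500:, 0], emb[500:, 1], s=5, label="target")
plt.legend()
plt.show()
```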
Significant improvement in numerical results
On synthetic and real traffic signs, at the end of training, the validation error on the real signs differs depending on the training data given
Both together are better than real data only, which is itself better than adapted (synthetic) data only
Gradient reversal layer:
- forwards its input as-is
- multiplies the gradient by a negative constant during backprop
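A common way to get these two behaviours in PyTorch, as a sketch via `torch.autograd.Function` (class and function names are mine):

```python
import torch

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lamb):
        ctx.lamb = lamb
        return x.view_as(x)  # identity in the forward pass

    @staticmethod
    def backward(ctx, grad_output):
        # negate (and scale by lamb) the gradient flowing back into the feature extractor
        return -ctx.lamb * grad_output, None

def grad_reverse(x, lamb=1.0):
    # inserted between the feature extractor and the domain classifier
    return GradReverse.apply(x, lamb)
```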
Covariate shift assumption: P_s(Y|X=x) = P_t(Y|X=x) but P_s(X) != P_t(X)
Where P_s is the source probability distribution while P_t is the target one
This means that given a sample, the probability of a given label is the same for the two distributions but the probability of drawing that sample is different in the two distributions.
This is a problem when the optimal model we select depends on P(X), which is the case in supervised learning: the final model is selected by minimizing a loss averaged over the samples, so it learns to perform well in regions where the source distribution is dense while neglecting regions where it is sparse. If these regions do not match between the source and target distributions, the model will be suboptimal on the target distribution.
A simple solution is to resample the source samples so that their distribution is closer to the target one.
Another one is to weight the contribution of each sample by the ratio P_t(x, y)/P_s(x, y) = P_t(x) P_t(y|x) / (P_s(x) P_s(y|x)), which under the covariate shift assumption reduces to P_t(x)/P_s(x). So we weight each sample's contribution by the target-to-source input density ratio.
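A rough sketch of that reweighting, assuming the density ratio P_t(x)/P_s(x) comes from some estimator (the `density_ratio` function here is hypothetical, e.g. derived from a classifier trained to separate source from target inputs):

```python
import torch
import torch.nn.functional as F

def weighted_source_loss(model, xs, ys, density_ratio):
    # density_ratio(xs) ~ P_t(x) / P_s(x); hypothetical estimator, e.g. obtained from a
    # classifier trained to separate source from target inputs
    w = density_ratio(xs)                               # one weight per source sample
    per_sample = F.cross_entropy(model(xs), ys, reduction="none")
    return (w * per_sample).sum() / w.sum()             # importance-weighted source risk
```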