1409.7495
[arxiv 1409.7495] Unsupervised Domain Adaptation by Backpropagation [PDF] [notes]
Yaroslav Ganin, Victor Lempitsky
read 03/08/2017
Extract features that are domain invariant across two domains (real and synthetic) and also useful for the final task (image classification for instance)
Feature extractor: convolutional network, shared between the two domains, that extracts the features
Domain classifier: discriminates between the two domains
Label predictor: task-specific classifier
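A minimal sketch of the three components, assuming PyTorch; the layer sizes are illustrative, not the paper's exact architectures:

```python
import torch.nn as nn

# Shared feature extractor: a small convnet (illustrative sizes only)
feature_extractor = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=5), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(32, 48, kernel_size=5), nn.ReLU(), nn.MaxPool2d(2),
    nn.Flatten(),
)

# Label predictor: task-specific classifier fed with the shared features
label_predictor = nn.Sequential(nn.LazyLinear(100), nn.ReLU(), nn.Linear(100, 10))

# Domain classifier: predicts source (0) vs target (1) from the same features
domain_classifier = nn.Sequential(nn.LazyLinear(100), nn.ReLU(), nn.Linear(100, 2))
```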
Standard backprop from the label predictor loss updates the label predictor and the feature extractor
Standard backprop from the domain classifier loss updates the domain classifier down to the feature layer, then the update in the shared feature extractor is negated. The negative update is obtained through a gradient reversal layer
If gradient reversal were not applied, backprop would push the feature extractor to minimize the domain loss, i.e. to create dissimilar features that make the two domains easy to tell apart. On the contrary, the negative update pushes it to maximize the domain loss, forcing it to create similar features across domains.
Since the domain classifier itself still receives standard updates, its discriminative power is preserved, while the feature extractor is pushed to create domain-invariant features
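A rough sketch of one joint update under these rules, assuming PyTorch; the modules, sizes and data are toy stand-ins, and `grad_reverse` here is a small functional trick (identity forward, gradient times -λ backward) equivalent to the gradient reversal layer described further down:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy stand-ins for the three components (flattened 28x28 inputs, illustrative sizes)
feat = nn.Sequential(nn.Linear(784, 128), nn.ReLU())
label_pred = nn.Linear(128, 10)
domain_clf = nn.Linear(128, 2)
params = list(feat.parameters()) + list(label_pred.parameters()) + list(domain_clf.parameters())
opt = torch.optim.SGD(params, lr=0.01)

def grad_reverse(x, lamb):
    # Identity in the forward pass, gradient multiplied by -lamb in the backward pass
    return x.detach() - lamb * (x - x.detach())

# One joint update: a labelled source batch and an unlabelled target batch (dummy data)
xs, ys = torch.randn(32, 784), torch.randint(0, 10, (32,))
xt = torch.randn(32, 784)
lamb = 0.5  # reversal weight, scheduled from 0 to 1 (see the schedule below)

fs, ft = feat(xs), feat(xt)
label_loss = F.cross_entropy(label_pred(fs), ys)  # standard backprop into label_pred and feat

# Domain loss: standard gradients for domain_clf, reversed gradients into the shared feat
dom_logits = domain_clf(torch.cat([grad_reverse(fs, lamb), grad_reverse(ft, lamb)]))
dom_labels = torch.cat([torch.zeros(32, dtype=torch.long), torch.ones(32, dtype=torch.long)])
domain_loss = F.cross_entropy(dom_logits, dom_labels)

opt.zero_grad()
(label_loss + domain_loss).backward()
opt.step()
```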
The weight λ of the negative constant in the gradient reversal layer is progressively increased from 0 to 1 following a predefined schedule
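The paper's schedule (if I recall it right) is λ_p = 2 / (1 + exp(-γ·p)) - 1 with γ = 10, where p is the training progress in [0, 1]:

```python
import math

def grl_lambda(p, gamma=10.0):
    # p = training progress in [0, 1]; ramps the reversal weight from 0 up to 1
    return 2.0 / (1.0 + math.exp(-gamma * p)) - 1.0

print(grl_lambda(0.0), grl_lambda(0.5), grl_lambda(1.0))  # 0.0, ~0.99, ~1.0
```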
Training is successful when the source-domain test error is low and the domain classifier error is high
MNIST vs MNIST-M (MNIST digits blended over random image patches)
Office dataset, which provides 3 domains with ~31 categories of office items taken from Amazon product images, a DSLR camera and a webcam
...
traffic signs
t-SNE visualization shows that the target and adapted source features overlap much more at the top feature-extractor layer than the non-adapted ones
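A quick sketch of how such a check could be done with scikit-learn, assuming the features of a source and a target batch have already been extracted (dummy arrays here):

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# feats_src / feats_tgt: (n, d) activations from the top feature-extractor layer (dummy data here)
feats_src = np.random.randn(500, 256)
feats_tgt = np.random.randn(500, 256) + 1.0

emb = TSNE(n_components=2, init="pca", perplexity=30).fit_transform(np.vstack([feats_src, feats_tgt]))
plt.scatter(emb[:500, 0], emb[:500, 1], s=5, label="source")
plt.scatter(emb[500:, 0], emb[500:, 1], s=5, label="target")
plt.legend()
plt.show()
```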
Significant improvement in numerical results
On synthetic and real traffic signs, at the end of training, the validation error on the real signs differs depending on the training data given
Both together are better than real data only, which is itself better than adapted (synthetic) data only
Gradient reversal layer:
- forwards its input as-is
- multiplies the gradient by a negative constant during backprop
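A common way to get these two behaviours in PyTorch, as a sketch via `torch.autograd.Function` (class and function names are mine):

```python
import torch

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lamb):
        ctx.lamb = lamb
        return x.view_as(x)  # identity in the forward pass

    @staticmethod
    def backward(ctx, grad_output):
        # negate (and scale by lamb) the gradient flowing back into the feature extractor
        return -ctx.lamb * grad_output, None

def grad_reverse(x, lamb=1.0):
    # inserted between the feature extractor and the domain classifier
    return GradReverse.apply(x, lamb)
```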
Covariate shift assumption: P_s(Y|X=x) = P_t(Y|X=x) but P_s(X) != P_t(X)
Where P_s is the source probability distribution while P_t is the target one
This means that given a sample, the probability of a given label is the same for the two distributions but the probability of drawing that sample is different in the two distributions.
This is a problem when the optimal model we select depends on P(X), which is the case in supervised learning: the final model is selected by minimizing a loss averaged over the samples, so it learns to perform well in regions where the source distribution is dense while neglecting regions where it is sparse. If these regions do not match between the source and target distributions, the model will be suboptimal on the target distribution.
A simple solution is to resample the source samples so that their distribution is closer to the target one.
Another one is to weight the contribution of each sample by the ratio P_t(x, y)/P_s(x, y) = P_t(x) P_t(y|x) / (P_s(x) P_s(y|x)), which under the covariate shift assumption reduces to P_t(x)/P_s(x). So we weight each sample's contribution by the target-to-source input density ratio.
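A rough sketch of that reweighting, assuming the density ratio P_t(x)/P_s(x) comes from some estimator (the `density_ratio` function here is hypothetical, e.g. derived from a classifier trained to separate source from target inputs):

```python
import torch
import torch.nn.functional as F

def weighted_source_loss(model, xs, ys, density_ratio):
    # density_ratio(xs) ~ P_t(x) / P_s(x); hypothetical estimator, e.g. obtained from a
    # classifier trained to separate source from target inputs
    w = density_ratio(xs)                               # one weight per source sample
    per_sample = F.cross_entropy(model(xs), ys, reduction="none")
    return (w * per_sample).sum() / w.sum()             # importance-weighted source risk
```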