[arxiv 1607.02533] Adversarial examples in the physical world [PDF] [notes]
Alexey Kurakin, Ian Goodfellow, Samy Bengio
read 08/08/2017
Demonstrates that adversarial examples transfer to the physical world (accessed through sensors such as a camera), and not just when the input is fed directly to the machine learning model
Adversarial examples that are generated digitally and then printed are often still misclassified by the targeted model once photographed and fed back through the sensor
Surprisingly, no extra effort is needed to generalize from direct model input to the sensor --> model pipeline
No experiments on transferability from one known model to another, but a black-box attack was performed with some success
Adversarial examples are constrained to stay close to the original sample x by per-pixel clipping into [x - \eps, x + \eps] and into the valid pixel range [0, 255]
This enforces an L_inf constraint on the perturbation: ||x_adv - x||_inf <= \eps
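A minimal sketch of this clipping step (NumPy; the function and argument names are my own, not from the paper):

```python
import numpy as np

def clip_adv(x_adv, x_orig, eps):
    """Project x_adv into the L_inf ball of radius eps around x_orig,
    then into the valid pixel range [0, 255]."""
    x_adv = np.clip(x_adv, x_orig - eps, x_orig + eps)  # stay within eps of the original
    return np.clip(x_adv, 0.0, 255.0)                   # stay a valid image
```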
Updates the sample according to X_adv = X + \eps * sign(\nabla_X J(X, y_true)), where y_true is the ground-truth label
The fast method does not target a specific label; instead, it just moves away from the true one (untargeted attack)
For this method, a single update is enough to produce adversarial examples (see the sketch below)
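A possible sketch of the fast method in PyTorch, assuming the model takes pixel values in [0, 255] as in the notes above (names are placeholders, not the authors' code):

```python
import torch
import torch.nn.functional as F

def fast_gradient_sign(model, x, y_true, eps):
    """One-step untargeted attack: push every pixel by eps in the direction
    that increases the loss for the true label, then clip to valid pixels."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y_true)   # J(X, y_true)
    loss.backward()
    x_adv = x + eps * x.grad.sign()            # X + eps * sign(grad_X J)
    return x_adv.clamp(0.0, 255.0).detach()
```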
Same as the fast method, but applied iteratively with a smaller step size, clipping back into the \eps-ball and valid pixel range after each step
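A hedged sketch of this iterative variant (the step size alpha and iteration count are illustrative, not the paper's exact settings):

```python
import torch
import torch.nn.functional as F

def basic_iterative(model, x, y_true, eps, alpha=1.0, n_iter=10):
    """Repeat small FGSM-like steps of size alpha, re-projecting into the
    [x - eps, x + eps] box and the [0, 255] pixel range after every step."""
    x_orig = x.clone().detach()
    x_adv = x_orig.clone()
    for _ in range(n_iter):
        x_adv = x_adv.clone().detach().requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y_true)
        loss.backward()
        x_adv = x_adv + alpha * x_adv.grad.sign()
        x_adv = torch.max(torch.min(x_adv, x_orig + eps), x_orig - eps)  # eps-ball
        x_adv = x_adv.clamp(0.0, 255.0)                                  # valid pixels
    return x_adv.detach()
```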
Target the class with the lowest predicted probability (the least-likely class) by making iterative steps in the direction of sign(\nabla_X log p(y_{least likely} | X)) = sign(-\nabla_X J(X, y_{least likely}))
As previously, this gives both a one-step and an iterative variant
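A sketch of the iterative least-likely-class variant along the same lines (placeholder names; the target class is taken from the model's prediction on the clean image):

```python
import torch
import torch.nn.functional as F

def iterative_least_likely(model, x, eps, alpha=1.0, n_iter=10):
    """Targeted attack: pick the class with the lowest predicted probability
    on the clean image, then step so as to decrease the loss for that class,
    i.e. increase log p(y_ll | x)."""
    x_orig = x.clone().detach()
    with torch.no_grad():
        y_ll = model(x_orig).argmin(dim=1)  # least-likely class (softmax is monotonic in the logits)
    x_adv = x_orig.clone()
    for _ in range(n_iter):
        x_adv = x_adv.clone().detach().requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y_ll)
        loss.backward()
        x_adv = x_adv - alpha * x_adv.grad.sign()   # minus sign: move toward y_ll
        x_adv = torch.max(torch.min(x_adv, x_orig + eps), x_orig - eps)
        x_adv = x_adv.clamp(0.0, 255.0)
    return x_adv.detach()
```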
They also test the resistance of adversarial images to other transformations (change of contrast, brightness, blur, noise, JPEG encoding)
Tested on ImageNet validation samples with epsilon in [0, 128] (pixel values)
Measure the destruction rate: the proportion of previously successful adversarial images that are no longer misclassified after some image transformation (such as printing and re-photographing); see the sketch below
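One way this could be computed (a sketch; the per-image boolean outcomes are assumed to have been collected beforehand):

```python
def destruction_rate(clean_correct, adv_correct, transformed_adv_correct):
    """Fraction of successful adversarial images that a transformation 'destroys',
    i.e. the model classifies them correctly again afterwards.

    Each argument is a list of booleans, one per image:
      clean_correct[k]           - clean image k classified correctly
      adv_correct[k]             - adversarial version classified correctly
      transformed_adv_correct[k] - transformed adversarial version classified correctly
    """
    successful = [c and not a for c, a in zip(clean_correct, adv_correct)]
    destroyed = sum(s and t for s, t in zip(successful, transformed_adv_correct))
    return destroyed / sum(successful) if any(successful) else 0.0
```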
Photographed images are cropped and warped back to squares of the same size as the originals (so the transformation introduces no effective rescaling, cropping, ...)
Resistance to other image transformations was tested on 1,000 randomly selected images
"an adversary using the fast method with ? = 16 could expect that about 2/3 of the images would be top-1 misclassified and about 1/3 of the images would be top-5 misclassifie"
The fast method's examples are more robust to image transformations (perhaps because they rely on less subtle / co-adapted features)
Changing color and brightness doesn't affect the adversarial power much (because of ImageNet's initial normalization?)
The cross-entropy cost function applied to the class labels equals the negative log probability of the true class: CE(g_theta(X), y) = -log p(y | X)
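For a one-hot label vector y this is a one-line derivation (writing p = g_\theta(X) for the predicted probability vector, an assumption on notation):

```latex
% y is one-hot at the true class y_true, p_i are the predicted probabilities
CE(g_\theta(X), y) = -\sum_i y_i \log p_i = -\log p_{y_{true}} = -\log p(y_{true} \mid X)
```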