
1607.02533


ICLR 2017

[arxiv 1607.02533] Adversarial examples in the physical world [PDF] [notes]

Alexey Kurakin, Ian Goodfellow, Samy Bengio

read 08/08/2017

Objective

Demonstrate that adversarial examples transfer to the physical world (which is accessed through sensors such as a camera), and not just when the input is fed directly to the machine learning model

Synthesis

Adversarial examples that are generated and then printed are often still misclassified by the targeted models when fed back through sensors

Surprisingly, no extra effort is needed to generalize from the direct-input setting to the sensor --> model setting.

No experiments on transferability from one known model to another, but a black-box attack was performed with some success

Adversarial examples are constrained to stay close to the original sample x by clipping, which keeps the example in [x - \eps, x + \eps] and in [0, 255]

This enforces an L_inf constraint on x_adv - x (||x_adv - x||_inf <= \eps)
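
A minimal NumPy sketch of this clipping, assuming images stored as float arrays with pixel values in [0, 255] (the helper name is illustrative, not from the paper):

```python
import numpy as np

def clip_eps(x_adv, x, eps):
    """Keep the adversarial image within eps of the original (L_inf ball)
    and within the valid pixel range [0, 255]."""
    x_adv = np.clip(x_adv, x - eps, x + eps)
    return np.clip(x_adv, 0, 255)
```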

Fast method

Updates the sample in a single step: X_adv = X + \eps * sign(\nabla_X J(X, y_true)), where y_true is the ground-truth label

The fast method does not target a specific label; instead, it just moves away from the true one (untargeted attack)

For this method, a single update step proved enough to produce adversarial examples
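
A minimal PyTorch sketch of the fast method, assuming a differentiable classifier `model` that outputs logits for images with pixel values in [0, 255]; names and defaults are illustrative, not the paper's code:

```python
import torch
import torch.nn.functional as F

def fast_method(model, x, y_true, eps):
    """Single-step untargeted attack: X_adv = clip(X + eps * sign(grad_X J(X, y_true)))."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y_true)    # J(X, y_true)
    loss.backward()
    x_adv = x + eps * x.grad.sign()             # move away from the true label
    return torch.clamp(x_adv, 0, 255).detach()  # keep valid pixel values
```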

Iterative method

Same as the fast method, but applied iteratively: a smaller step size \alpha is used at each step, with clipping to the \eps-ball (and to [0, 255]) after every step
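
A sketch of the iterative variant under the same assumptions; the step size and number of iterations below are illustrative defaults, not the paper's schedule:

```python
import torch
import torch.nn.functional as F

def iterative_method(model, x, y_true, eps, alpha=1.0, n_iter=10):
    """Repeated small fast steps, clipped to the eps-ball around the original
    image and to [0, 255] after every step."""
    x_orig = x.clone().detach()
    x_adv = x_orig.clone()
    for _ in range(n_iter):
        x_adv = x_adv.detach().requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y_true)
        loss.backward()
        x_adv = x_adv + alpha * x_adv.grad.sign()
        x_adv = torch.max(torch.min(x_adv, x_orig + eps), x_orig - eps)  # L_inf clip
        x_adv = torch.clamp(x_adv, 0, 255)
    return x_adv.detach()
```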

Iterative least-likely method

Targets the class with the lowest predicted score by iteratively stepping in the direction of sign(\nabla_X log p(y_LL | X)) = sign(-\nabla_X J(X, y_LL))

This gives both a fast and an iterative method, as previously
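
A sketch of the iterative least-likely variant under the same assumptions, where the least-likely class is taken from the model's prediction on the clean image:

```python
import torch
import torch.nn.functional as F

def iterative_least_likely(model, x, eps, alpha=1.0, n_iter=10):
    """Targeted variant: step towards the least-likely class, i.e. follow
    sign(-grad_X J(X, y_LL))."""
    x_orig = x.clone().detach()
    with torch.no_grad():
        y_ll = model(x_orig).argmin(dim=1)         # class with the lowest score
    x_adv = x_orig.clone()
    for _ in range(n_iter):
        x_adv = x_adv.detach().requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y_ll)
        loss.backward()
        x_adv = x_adv - alpha * x_adv.grad.sign()  # descend the loss towards y_LL
        x_adv = torch.max(torch.min(x_adv, x_orig + eps), x_orig - eps)
        x_adv = torch.clamp(x_adv, 0, 255)
    return x_adv.detach()
```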

They also test the resistance of adversarial images to other transformations (change of contrast, brightness, blur, noise, JPEG encoding)

Experiments

Tested on validation samples of ImageNet with epsilon in [0, 128] (pixel values)

Measure destruction rate: the proportion of adversarial images that are no longer misclassified after some image transformation (such as printing and taking a picture, for instance)
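
A rough sketch of that computation as I understand it, assuming boolean arrays of per-image classification outcomes (names are illustrative):

```python
import numpy as np

def destruction_rate(clean_correct, adv_correct, transformed_adv_correct):
    """Among images correctly classified when clean but misclassified adversarially,
    the fraction that is correctly classified again after the transformation."""
    was_adversarial = clean_correct & ~adv_correct
    destroyed = was_adversarial & transformed_adv_correct
    return destroyed.sum() / max(was_adversarial.sum(), 1)
```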

Images are cropped and warped to be squares of the same size as the originals (no effective rescaling, cropping, ...)

Resistance to other image transformations was tested on 1,000 randomly selected images

Results

Physical world attack

"an adversary using the fast method with ? = 16 could expect that about 2/3 of the images would be top-1 misclassified and about 1/3 of the images would be top-5 misclassifie"

Image transformation attack

The fast method is more resistant to image transformations (maybe because its perturbations rely on less subtle / co-adapted features)

Changing color and brightness doesn't affect the adversarial power much (because of ImageNet's initial normalization?)

Notes

The cross-entropy cost function applied to one-hot class labels equals the negative log-probability of the true class: CE(g_theta(X), y) = -log p(y|X)
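
A quick PyTorch check of this equivalence (shapes and values below are arbitrary):

```python
import torch
import torch.nn.functional as F

logits = torch.randn(1, 10)                         # g_theta(x): class scores
y = torch.tensor([3])                               # true class index
ce = F.cross_entropy(logits, y)                     # CE(g_theta(x), y)
neg_log_p = -F.log_softmax(logits, dim=1)[0, y[0]]  # -log p(y|X)
assert torch.allclose(ce, neg_log_p)
```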
