Explaining and Harnessing Adversarial Examples | Trustworthy AI Lab

Introduction
- The Mystery of Adversarial Examples
The Linear Explanation of Adversarial Examples
- Linearity in High-Dimensional Spaces
A Fast and Efficient Attack: FGSM (Fast Gradient Sign Method)
- Linear Perturbation Against Nonlinear Models
Improving Model Robustness: Adversarial Training
- A New Regularization Technique
Why Do Adversarial Examples Generalize?
- Transferability Across Models
Conclusion
- Key Summary and the Paradox of Optimization

Paper: Explaining and Harnessing Adversarial Examples

Authors: Ian J. Goodfellow, Jonathon Shlens & Christian Szegedy

Venue: In International Conference on Learning Representations (ICLR), 2015

URL: Explaining and Harnessing Adversarial Examples

Introduction

The Mystery of Adversarial Examples

Many machine learning models, including state-of-the-art neural networks, share a critical weakness: they are highly vulnerable to adversarial examples. An adversarial example is formed by taking an input drawn from the data distribution and adding a small perturbation, often imperceptible to the human eye, that is deliberately chosen to be worst-case for the model. Such subtle changes are enough to make a model output an incorrect prediction with very high confidence.

Early researchers speculated that this phenomenon was caused by the extreme nonlinearity of deep neural networks and overfitting due to insufficient regularization in supervised learning.

However, this paper argues that these hypotheses are unnecessary. Instead, it proposes that the primary reason neural networks are vulnerable to adversarial perturbations is that the models are too linear. This linear view also offers an explanation for why adversarial examples generalize across different architectures and training sets.

The Linear Explanation of Adversarial Examples

Linearity in High-Dimensional Spaces

Digital images often use 8 bits per pixel, so information below 1/255 of the dynamic range is discarded. Since the precision of the model’s input features is limited, if every element of a perturbation is smaller than the feature precision, the classifier should not respond differently to the original input and the adversarial input.

Consider a linear model as an example. Let $\tilde{x} = x + \eta$ be an adversarial example obtained by adding a perturbation $\eta$ to the input $x$. The dot product of a weight vector $w$ and $\tilde{x}$ is:

\[w^{\top}\tilde{x} = w^{\top}x + w^{\top}\eta\]

The adversarial perturbation increases the activation by $w^{\top}\eta$. To maximize this increase under the max-norm constraint $\|\eta\|_{\infty} \le \epsilon$, one can set $\eta = \epsilon \operatorname{sign}(w)$. If $w$ has $n$ dimensions and the average magnitude of its elements is $m$, the activation grows by $\epsilon m n$. The max norm of $\eta$ does not depend on the dimensionality, but the change in activation grows linearly with $n$, so in high-dimensional problems many very small changes to the input add up to one large change in the output. Linear behavior alone, without any nonlinearity, is therefore sufficient to produce adversarial examples.
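The accumulation effect can be checked numerically. The following is a minimal sketch, assuming random Gaussian weights as a stand-in for a trained linear model; the helper name `max_norm_activation_shift` and the budget `eps = 0.01` are illustrative choices, not taken from the paper. It computes the activation shift for a perturbation of the form $\epsilon \operatorname{sign}(w)$ and compares it with $\epsilon \|w\|_1$ as the dimension grows.

```python
import numpy as np

def max_norm_activation_shift(w, eps):
    """Return w @ eta for the max-norm-bounded perturbation eta = eps * sign(w)."""
    eta = eps * np.sign(w)
    return float(w @ eta)

rng = np.random.default_rng(0)
eps = 0.01  # illustrative per-element budget (assumption)

shifts = {}
for n in (10, 1_000, 100_000):
    w = rng.normal(size=n)  # hypothetical weights, not a trained model
    shifts[n] = max_norm_activation_shift(w, eps)
    # The shift equals eps * ||w||_1, i.e. eps * m * n for mean |w_i| = m.
    print(n, shifts[n], eps * np.abs(w).sum())
```

Each element of the perturbation stays at magnitude `eps` regardless of `n`, while the printed shift keeps growing with the number of dimensions.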

A Fast and Efficient Attack: FGSM (Fast Gradient Sign Method)

Linear Perturbation Against Nonlinear Models

Modern neural network architectures such as LSTMs, ReLU networks, and Maxout networks are intentionally designed to behave linearly in order to facilitate optimization. Even nonlinear models like sigmoid networks are tuned to operate primarily in their non-saturated, linear regime.

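Because of this property, cheap, analytical perturbations that work against linear models can also do severe damage to neural networks. Let $\theta$ be the model parameters, $x$ the input, $y$ the target associated with $x$, and $J(\theta, x, y)$ the cost used to train the network. The authors propose the Fast Gradient Sign Method (FGSM), which linearizes the cost function with respect to the input at the current value of $\theta$ and takes the optimal max-norm-constrained perturbation of that linearized cost: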

\[\eta = \epsilon \operatorname{sign}(\nabla_{x}J(\theta,x,y))\]

The required gradient can be computed very efficiently using backpropagation.
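As a concrete illustration, here is a minimal sketch of FGSM applied to a logistic regression model, where the input gradient has a closed form and no backpropagation library is needed. The labels in $\{-1, +1\}$, the softplus loss, the random weights, and the function names are assumptions made for this example; for a deep network the gradient would come from backpropagation instead.

```python
import numpy as np

def sigmoid(z):
    # tanh form avoids overflow for large |z|
    return 0.5 * (1.0 + np.tanh(0.5 * z))

def loss(w, b, x, y):
    """J = softplus(-y * (w.x + b)) for a label y in {-1, +1}."""
    return float(np.logaddexp(0.0, -y * (w @ x + b)))

def input_gradient(w, b, x, y):
    """Analytic gradient of the loss with respect to the input x."""
    return -y * sigmoid(-y * (w @ x + b)) * w

def fgsm(w, b, x, y, eps):
    """Single-step attack: x + eps * sign(grad_x J)."""
    return x + eps * np.sign(input_gradient(w, b, x, y))

rng = np.random.default_rng(0)
w = 0.05 * rng.normal(size=784)  # hypothetical weights, not a trained model
b = 0.0
x = rng.uniform(size=784)        # hypothetical input with values in [0, 1]
y = 1.0

x_adv = fgsm(w, b, x, y, eps=0.25)
print("clean loss:", loss(w, b, x, y))
print("adversarial loss:", loss(w, b, x_adv, y))
```

The attack costs one gradient evaluation, and every element of `x_adv - x` has magnitude at most `eps`.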

FGSM attack example on GoogLeNet (misclassifying a panda as a gibbon)

In practice, with $\epsilon = 0.25$, this method drove a shallow softmax classifier to a 99.9% error rate on the MNIST test set, and a Maxout network to an 89.4% error rate on the resulting adversarial examples.

Improving Model Robustness: Adversarial Training

A New Regularization Technique

Unlike shallow linear models, deep networks possess the capacity to represent functions that can at least resist adversarial perturbations. According to the universal approximator theorem, a network with a sufficient number of hidden units can approximate any function. The problem is that standard supervised training does not inherently encourage the model to learn resistance against adversarial examples.

The authors therefore propose adding an FGSM-based adversarial term directly to the training objective:

\[\tilde{J}(\theta,x,y) = \alpha J(\theta, x, y) + (1-\alpha)J(\theta, x + \epsilon \operatorname{sign}(\nabla_{x}J(\theta,x,y)), y)\]

The paper used $\alpha=0.5$ in its experiments. By continuously updating the supply of adversarial examples during training, this approach successfully reduced the error rate of a Maxout network with dropout from 0.94% to 0.84%, demonstrating an excellent regularization effect. The model’s error rate on adversarial examples also dropped dramatically from 89.4% to 17.9%.
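The mixed objective can be sketched on a small model. The example below is an illustration under stated assumptions, not the paper's Maxout setup: it uses logistic regression with labels in $\{-1, +1\}$, synthetic two-class data, plain gradient descent, and hypothetical function names. Adversarial examples are regenerated against the current parameters at every step.

```python
import numpy as np

def sigmoid(z):
    return 0.5 * (1.0 + np.tanh(0.5 * z))

def mean_loss(w, b, X, y):
    """Mean of softplus(-y * (w.x + b)) over the batch."""
    return float(np.mean(np.logaddexp(0.0, -y * (X @ w + b))))

def fgsm_batch(w, b, X, y, eps):
    """FGSM examples for every row of X, using the analytic input gradient."""
    coeff = -y * sigmoid(-y * (X @ w + b))
    grad_x = coeff[:, None] * w[None, :]
    return X + eps * np.sign(grad_x)

def adversarial_objective(w, b, X, y, eps, alpha=0.5):
    """alpha * J(clean) + (1 - alpha) * J(FGSM), averaged over the batch."""
    X_adv = fgsm_batch(w, b, X, y, eps)
    return alpha * mean_loss(w, b, X, y) + (1 - alpha) * mean_loss(w, b, X_adv, y)

def param_grads(w, b, X, y):
    """Gradient of the mean loss with respect to (w, b), inputs held fixed."""
    coeff = -y * sigmoid(-y * (X @ w + b))
    return X.T @ coeff / len(y), float(np.mean(coeff))

def train(X, y, eps, alpha=0.5, lr=0.1, steps=200):
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(steps):
        X_adv = fgsm_batch(w, b, X, y, eps)  # regenerate against current model
        gw_c, gb_c = param_grads(w, b, X, y)
        gw_a, gb_a = param_grads(w, b, X_adv, y)
        w = w - lr * (alpha * gw_c + (1 - alpha) * gw_a)
        b = b - lr * (alpha * gb_c + (1 - alpha) * gb_a)
    return w, b

rng = np.random.default_rng(0)
y = rng.choice([-1.0, 1.0], size=200)                # synthetic labels
X = y[:, None] * 1.0 + 0.5 * rng.normal(size=(200, 20))  # synthetic inputs
w, b = train(X, y, eps=0.1)
print("mixed objective after training:", adversarial_objective(w, b, X, y, eps=0.1))
```

Setting `alpha=1.0` recovers ordinary training on clean data, which makes it easy to compare the two regimes on the same data.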

Weight visualization of a Maxout network: (left) standard trained model, (right) adversarially trained model.

Why Do Adversarial Examples Generalize?

Transferability Across Models

One of the most intriguing properties of adversarial examples is that examples crafted to fool a specific model can also fool other models with entirely different architectures or training data. Moreover, the misclassified classes often agree across models. If extreme nonlinearity or overfitting were the cause, it would be impossible to explain why different models produce the same incorrect predictions for the same example.

From the linear perspective, adversarial examples do not exist in fine pockets scattered densely throughout space like rational numbers, but rather occur across broad subspaces. In other words, one can reliably generate adversarial examples simply by scaling the perturbation sufficiently in the correct direction – one that has a positive dot product with the gradient of the cost function. Since machine learning algorithms learn similar classification weights by generalizing from the same dataset, the stability of the weights directly translates to the stability (transferability) of adversarial examples.

Adversarial examples exist in continuous subspaces, not narrow pockets. Moving beyond a certain threshold (epsilon) in the correct direction reliably causes misclassification.
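The role of direction can be illustrated with a linear classifier. This is a minimal sketch with random weights, a random input, and hypothetical helper names, all of which are assumptions for the example. It sweeps $\epsilon$ along two directions that both have a positive dot product with the input gradient of the cost, and records for each $\epsilon$ whether the perturbed input is misclassified.

```python
import numpy as np

def predict(w, b, x):
    """Class of a linear classifier, encoded as +1.0 or -1.0."""
    return 1.0 if w @ x + b >= 0 else -1.0

def sweep_direction(w, b, x, y, direction, eps_values):
    """For each eps, report whether x + eps * direction is misclassified."""
    return [predict(w, b, x + eps * direction) != y for eps in eps_values]

rng = np.random.default_rng(0)
w = rng.normal(size=100)  # hypothetical weights, not a trained model
b = 0.0
x = rng.normal(size=100)  # hypothetical input
y = predict(w, b, x)      # use the clean prediction as the label

# For J = softplus(-y * (w.x + b)), grad_x J is a positive multiple of -y * w.
grad_dir = -y * w
fgsm_dir = np.sign(grad_dir)

# A different direction that still has a positive dot product with the gradient.
noisy_dir = np.sign(grad_dir + 0.5 * rng.normal(size=100))

eps_values = np.linspace(0.0, 2.0, 41)
fooled_fgsm = sweep_direction(w, b, x, y, fgsm_dir, eps_values)
fooled_noisy = sweep_direction(w, b, x, y, noisy_dir, eps_values)

print("first eps that fools (FGSM direction):", eps_values[fooled_fgsm.index(True)])
print("first eps that fools (noisy direction):", eps_values[fooled_noisy.index(True)])
```

Because the logit of a linear model changes at a constant rate along any fixed direction, each sweep flips from correct to misclassified once and stays misclassified for all larger `eps`, which is the contiguous-region picture described above.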

Conclusion

Key Summary and the Paradox of Optimization

The main results running through this paper are as follows:

- Adversarial examples are a property of high-dimensional dot products, arising from linearity rather than nonlinearity.
- Adversarial examples generalize across different models because perturbations are strongly aligned with the model’s weight vectors, and different models learn similar functions.
- The direction of the perturbation matters most, not the specific point in space.
- Models that are easy to optimize are easy to perturb.

Consequently, models that classify naturally occurring data very well can still harbor a serious flaw: they make incorrect predictions with excessively high confidence on inputs that lie off the data distribution. The “ease of optimization” that deep learning relies on comes at the cost of models that are easily misled.

The study shows that adversarial training can partially mitigate this flaw, and it points to a need for optimization techniques that can train models whose behavior is more locally stable.