In this post, we analyze the implementation code of the adversarial attack and defense techniques that were introduced in previous posts.

The attack code is based on the torchattacks library (Harry24k/adversarial-attacks-pytorch). torchattacks is a PyTorch-based collection of adversarial attack implementations that provides a wide variety of attacks under a concise interface. The defense code is based on the MAIR library (Harry24k/MAIR), which provides implementations of adversarial-training-based defense techniques.

The attack and defense techniques covered in this post are summarized in the table below.

Name | Type | Paper | Distance
---- | ---- | ----- | --------
FGSM | Attack | Explaining and Harnessing Adversarial Examples | \(L_{\infty}\)
C&W | Attack | Towards Evaluating the Robustness of Neural Networks | \(L_2\)
PGD | Attack | Towards Deep Learning Models Resistant to Adversarial Attacks | \(L_{\infty}\)
PGD-L2 | Attack | Towards Deep Learning Models Resistant to Adversarial Attacks | \(L_2\)
TPGD | Attack | Theoretically Principled Trade-off between Robustness and Accuracy | \(L_{\infty}\)
AT | Defense | Towards Deep Learning Models Resistant to Adversarial Attacks | -
TRADES | Defense | Theoretically Principled Trade-off between Robustness and Accuracy | \(L_{\infty}\)

Attack Code Review


FGSM

FGSM (Fast Gradient Sign Method) is the most basic technique for generating adversarial examples. A brief review of the FGSM formulation is as follows.

\[x_{adv} = x + \epsilon \cdot \text{sign}(\nabla_x J(\theta, x, y))\]

Under the \(L_{\infty}\) constraint, this equation extracts only the sign of the loss function’s gradient via the sign function and shifts each pixel by \(\epsilon\). Let’s go through the implementation.

def __init__(self, model, eps=8 / 255):
    super().__init__("FGSM", model)
    self.eps = eps
    self.supported_mode = ["default", "targeted"]

eps=8/255 corresponds to a perturbation magnitude that is virtually imperceptible to the human eye when images are normalized to [0, 1]. Because supported_mode includes targeted, it also supports targeted attacks that induce misclassification toward a specific class.

def forward(self, images, labels):
    images = images.clone().detach().to(self.device)
    labels = labels.clone().detach().to(self.device)

    if self.targeted:
        target_labels = self.get_target_label(images, labels)

clone().detach() separates the original tensor from the computation graph. This prevents requires_grad = True, which is set later, from affecting the original tensor.

    loss = nn.CrossEntropyLoss()
    images.requires_grad = True
    outputs = self.get_logits(images)

    if self.targeted:
        cost = -loss(outputs, target_labels)
    else:
        cost = loss(outputs, labels)

In ordinary training, gradients are computed with respect to model.parameters(), but here gradients are computed with respect to the input images. With model parameters fixed, we are computing how to modify the input image so that the loss increases. The reason -loss is used in targeted mode is that we want to manipulate the image in a direction that minimizes the loss with respect to the target label.

    grad = torch.autograd.grad(
        cost, images, retain_graph=False, create_graph=False
    )[0]

    adv_images = images + self.eps * grad.sign()
    adv_images = torch.clamp(adv_images, min=0, max=1).detach()

    return adv_images

retain_graph=False, create_graph=False are options that immediately release the computation graph after computing gradients. Since FGSM uses the gradient only once, these flags are set to save memory. grad.sign() extracts only the sign of the gradient, faithfully implementing the formulation, and torch.clamp ensures the resulting image stays within the valid pixel range [0, 1].
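
For reference, here is a minimal usage sketch of this attack through the torchattacks interface. The classifier model and the [0, 1]-normalized images/labels batch are placeholders, not part of the excerpt.

import torchattacks

atk = torchattacks.FGSM(model, eps=8/255)   # default (non-targeted) mode
adv_images = atk(images, labels)            # calls the forward() shown above internally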

C&W

C&W (Carlini & Wagner) is an attack technique that handles the [0, 1] box constraint through a change of variables, turning adversarial-example generation into an unconstrained optimization problem. A brief review of the C&W formulation is as follows.

\[\min_{w} \left\| \frac{1}{2}(\tanh(w)+1) - x \right\|^2_2 + c \cdot f\left(\frac{1}{2}(\tanh(w)+1)\right)\] \[f(x') = \max\left(\max_{i \neq t} Z(x')_i - Z(x')_t, -\kappa\right)\]

Under the \(L_2\) constraint, this equation optimizes in a direction that simultaneously induces misclassification through the loss function \(f\) while minimizing the \(L_2\) size of the perturbation. Let’s analyze the code in detail.

def __init__(self, model, c=1, kappa=0, steps=50, lr=0.01):
    super().__init__("CW", model)
    self.c = c
    self.kappa = kappa
    self.steps = steps
    self.lr = lr
    self.supported_mode = ["default", "targeted"]

c is a hyperparameter that adjusts the balance between the \(L_2\) distance and the misclassification loss. Larger c focuses more on misclassification, while smaller c focuses more on reducing the perturbation size. kappa is the confidence parameter that controls how confident the misclassification should be. The larger this value, the more confidently the model misclassifies the adversarial example.

def forward(self, images, labels):

    images = images.clone().detach().to(self.device)
    labels = labels.clone().detach().to(self.device)

    if self.targeted:
        target_labels = self.get_target_label(images, labels)

    w = self.inverse_tanh_space(images).detach()
    w.requires_grad = True

    best_adv_images = images.clone().detach()
    best_L2 = 1e10 * torch.ones((len(images))).to(self.device)
    prev_cost = 1e10
    dim = len(images.shape)

    MSELoss = nn.MSELoss(reduction="none")
    Flatten = nn.Flatten()

    optimizer = optim.Adam([w], lr=self.lr)

C&W does not optimize the image directly; instead it introduces a variable w. By using the tanh transformation, the image is guaranteed to stay within [0, 1] while the optimization itself remains unconstrained. inverse_tanh_space maps the original image to the w space for optimization, and tanh_space maps it back to image space. best_L2 is initialized to 1e10 so that it serves as a baseline that will later be replaced when smaller values are found.
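
As a quick sanity check of this change of variables (an illustrative snippet, not part of the library), \(w = \text{atanh}(2x - 1)\) and \(x = (\tanh(w) + 1)/2\) round-trip each other while keeping every pixel inside the box:

import torch

x = torch.rand(4, 3, 32, 32)                                   # fake image batch in [0, 1]
w = torch.atanh(torch.clamp(x * 2 - 1, -1 + 1e-6, 1 - 1e-6))   # inverse_tanh_space
x_rec = 0.5 * (torch.tanh(w) + 1)                              # tanh_space
print(torch.allclose(x, x_rec, atol=1e-4))                     # True: the mapping round-trips
print(x_rec.min().item() >= 0.0, x_rec.max().item() <= 1.0)    # True True: stays inside [0, 1]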

    for step in range(self.steps):
        adv_images = self.tanh_space(w)

        current_L2 = MSELoss(Flatten(adv_images), Flatten(images)).sum(dim=1)
        L2_loss = current_L2.sum()

        outputs = self.get_logits(adv_images)
        if self.targeted:
            f_loss = self.f(outputs, target_labels).sum()
        else:
            f_loss = self.f(outputs, labels).sum()

        cost = L2_loss + self.c * f_loss

        optimizer.zero_grad()
        cost.backward()
        optimizer.step()

At each step, w is mapped through tanh_space to produce the adversarial image, and the \(L_2\) distance loss and misclassification loss are summed to form cost (the total loss). Unlike FGSM, the update is performed iteratively by an Adam optimizer on w, not via torch.autograd.grad.

        # condition: 1 where the attack currently fools the model (computed in the
        # library code but omitted from the excerpt above)
        pre = torch.argmax(outputs.detach(), 1)
        condition = (pre == target_labels).float() if self.targeted else (pre != labels).float()

        mask = condition * (best_L2 > current_L2.detach())
        best_L2 = mask * current_L2.detach() + (1 - mask) * best_L2

        mask = mask.view([-1] + [1] * (dim - 1))
        best_adv_images = mask * adv_images.detach() + (1 - mask) * best_adv_images

best_adv_images is updated only when the attack succeeds in misclassifying the input AND the current \(L_2\) distance is smaller than the previous best. The mask is used to selectively update only the images that satisfy these conditions.

        if step % max(self.steps // 10, 1) == 0:
            if cost.item() > prev_cost:
                return best_adv_images
            prev_cost = cost.item()

    return best_adv_images

Loss convergence is checked at intervals of 10% of the total steps. If the current cost is larger than the previous cost, the loop terminates early.

def tanh_space(self, x):
    return 1 / 2 * (torch.tanh(x) + 1)

def inverse_tanh_space(self, x):
    return self.atanh(torch.clamp(x * 2 - 1, min=-1, max=1))

def atanh(self, x):
    # inverse hyperbolic tangent used by inverse_tanh_space
    return 0.5 * torch.log((1 + x) / (1 - x))

def f(self, outputs, labels):
    one_hot_labels = torch.eye(outputs.shape[1]).to(self.device)[labels]
    other = torch.max((1 - one_hot_labels) * outputs, dim=1)[0]
    real = torch.max(one_hot_labels * outputs, dim=1)[0]

    if self.targeted:
        return torch.clamp((other - real), min=-self.kappa)
    else:
        return torch.clamp((real - other), min=-self.kappa)

tanh_space maps w into the [0, 1] range, and inverse_tanh_space is its inverse. The f function implements the paper’s objective: in the non-targeted case, it computes the difference between the logit of the correct class and the maximum logit among the other classes. The clamp ensures this value never drops below \(-\kappa\).
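
A tiny worked example (illustrative, not from the library) makes the non-targeted case concrete: with positive toy logits [2.0, 5.0, 1.0] and true label 1, real = 5.0 and other = 2.0, so f = max(real - other, -kappa) = 3.0. The optimizer must drive real - other down to \(-\kappa\) (here 0) so that some other class's logit overtakes the true one.

import torch

outputs = torch.tensor([[2.0, 5.0, 1.0]])             # toy logits, class 1 is the true class
labels = torch.tensor([1])
kappa = 0.0

one_hot = torch.eye(outputs.shape[1])[labels]
other = torch.max((1 - one_hot) * outputs, dim=1)[0]  # best wrong-class logit: 2.0
real = torch.max(one_hot * outputs, dim=1)[0]         # true-class logit: 5.0
print(torch.clamp(real - other, min=-kappa))          # tensor([3.]) -> not yet misclassified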

PGD

PGD (Projected Gradient Descent) extends FGSM into an iterative optimization procedure. A brief review of the PGD formulation is as follows.

\[\max_{\delta \in \mathcal{S}} \mathcal{L}(\theta, x+\delta, y), \quad \mathcal{S}=\{\delta:\|\delta\|_\infty \le \epsilon\}\] \[x^{t+1} = \Pi_{x+\mathcal{S}}\Big(x^t + \alpha \cdot \mathrm{sign}\big(\nabla_x \mathcal{L}(\theta, x^t, y)\big)\Big)\]

Here \(\Pi_{x+\mathcal{S}}\) is a projection onto the \(L_\infty\) feasible set around the original \(x\).

The two key steps in the code are:

  1. Take one step in the direction that increases the loss.
  2. Project back into the \(L_\infty\) ball of radius \(\epsilon\) around the original input.

Here is the implementation.

def __init__(self, model, eps=8 / 255, alpha=2 / 255, steps=10, random_start=True):
    super().__init__("PGD", model)
    self.eps = eps
    self.alpha = alpha
    self.steps = steps
    self.random_start = random_start
    self.supported_mode = ["default", "targeted"]

def forward(self, images, labels):

    images = images.clone().detach().to(self.device)
    labels = labels.clone().detach().to(self.device)

    if self.targeted:
        target_labels = self.get_target_label(images, labels)

    loss = nn.CrossEntropyLoss()
    adv_images = images.clone().detach()

    if self.random_start:
        adv_images = adv_images + torch.empty_like(adv_images).uniform_(-self.eps, self.eps)
        adv_images = torch.clamp(adv_images, min=0, max=1).detach()

Random start performs coordinate-wise uniform sampling within the [-eps, eps] interval around the clean image. This diversifies the search starting point and reduces the risk of getting stuck in weak local optima.

    for _ in range(self.steps):
        adv_images.requires_grad = True
        outputs = self.get_logits(adv_images)

        if self.targeted:
            cost = -loss(outputs, target_labels)
        else:
            cost = loss(outputs, labels)

The non-targeted attack increases CE(outputs, labels) to push away from the correct class, while the targeted attack increases -CE(outputs, target) to move toward the target class.

        grad = torch.autograd.grad(cost, adv_images, retain_graph=False, create_graph=False)[0]
        adv_images = adv_images.detach() + self.alpha * grad.sign()
        delta = torch.clamp(adv_images - images, min=-self.eps, max=self.eps)
        adv_images = torch.clamp(images + delta, min=0, max=1).detach()

grad.sign() is the steepest-ascent direction under the \(L_\infty\) constraint. After the update, clipping delta to [-eps, eps] enforces the \(L_\infty\) constraint – a per-pixel cap on the change – and the final [0,1] clamp keeps the input within the valid range.
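
As a side note, a single PGD step with alpha = eps and random_start=False reduces exactly to FGSM, which is one way to see PGD as the iterative extension of FGSM. A short comparison sketch, assuming the torchattacks classes used above:

import torchattacks

fgsm_like = torchattacks.PGD(model, eps=8/255, alpha=8/255, steps=1, random_start=False)
fgsm = torchattacks.FGSM(model, eps=8/255)
# both produce x + eps * sign(grad), clamped to [0, 1]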

PGDL2

PGDL2 keeps the PGD structure but switches the constraint set to \(L_2\). A brief review of the PGDL2 formulation is as follows.

\[\max_{\delta \in \mathcal{S}} \mathcal{L}(\theta, x+\delta, y), \quad \mathcal{S}=\{\delta:\|\delta\|_2 \le \epsilon\}\] \[x^{t+1} = \Pi_{x+\mathcal{S}}\Big(x^t + \alpha \cdot \frac{\nabla_x \mathcal{L}(\theta, x^t, y)}{\|\nabla_x \mathcal{L}(\theta, x^t, y)\|_2+\eta}\Big)\]

That is, the normalized gradient direction is used in place of sign(grad), and the projection shrinks the vector length rather than clipping coordinates.

def __init__(self, model, eps=1.0, alpha=0.2, steps=10, random_start=True, eps_for_division=1e-10):
    super().__init__("PGDL2", model)
    self.eps = eps
    self.alpha = alpha
    self.steps = steps
    self.random_start = random_start
    self.eps_for_division = eps_for_division
    self.supported_mode = ["default", "targeted"]

eps_for_division corresponds to \(\eta\) in the equation: a small constant added for denominator stability when the gradient norm is very small.

if self.random_start:
    delta = torch.empty_like(adv_images).normal_()
    d_flat = delta.view(adv_images.size(0), -1)
    n = d_flat.norm(p=2, dim=1).view(adv_images.size(0), 1, 1, 1)
    r = torch.zeros_like(n).uniform_(0, 1)
    delta *= r / n * self.eps
    adv_images = torch.clamp(adv_images + delta, min=0, max=1).detach()

For each sample in random start:

  • delta/n normalizes the direction vector to L2 norm 1.
  • Multiplying by r*eps sets the final length to a value within [0, eps].

Geometrically, viewing this as a (high-dimensional) L2 ball, the center is the original sample \(x\) and delta/n is a unit vector (direction) from the center toward the surface. Multiplying by r*eps determines “how far to travel in that direction,” so the final starting point takes the form \(x + r\epsilon u\) (where \(u\) is a unit vector). In other words, random start picks a point inside the ball using “random direction + length between 0 and \(\epsilon\).”
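
The following standalone snippet (illustrative, with a random fake batch standing in for the clean images) reproduces the sampling above and confirms that every per-sample perturbation ends up with \(\|\delta\|_2 \le \epsilon\):

import torch

eps = 1.0
adv_images = torch.rand(8, 3, 32, 32)                 # fake batch in place of the clean images
delta = torch.empty_like(adv_images).normal_()
d_flat = delta.view(adv_images.size(0), -1)
n = d_flat.norm(p=2, dim=1).view(adv_images.size(0), 1, 1, 1)
r = torch.zeros_like(n).uniform_(0, 1)
delta *= r / n * eps                                  # final length is r * eps
print(delta.view(adv_images.size(0), -1).norm(p=2, dim=1))   # every norm is <= eps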

    grad = torch.autograd.grad(cost, adv_images, retain_graph=False, create_graph=False)[0]
    grad_norms = torch.norm(grad.view(batch_size, -1), p=2, dim=1) + self.eps_for_division
    grad = grad / grad_norms.view(batch_size, 1, 1, 1)
    adv_images = adv_images.detach() + self.alpha * grad

In the update step, the gradient is L2-normalized per sample and the move is of length alpha. That is, the update relies on the direction information rather than the magnitude information.

    delta = adv_images - images
    delta_norms = torch.norm(delta.view(batch_size, -1), p=2, dim=1)
    factor = self.eps / delta_norms
    factor = torch.min(factor, torch.ones_like(delta_norms))
    delta = delta * factor.view(-1, 1, 1, 1)
    adv_images = torch.clamp(images + delta, min=0, max=1).detach()

This block is the heart of the L2 projection. It checks “whether the current perturbation \(\delta\)’s length \((\|\delta\|_2)\) exceeds \(\epsilon\)” and, if so, shrinks only the length while keeping the direction.

  • factor = eps / ||delta||_2: scale ratio that resizes the current length to the allowed radius.
  • factor = min(factor, 1): when already inside (||delta||_2 <= eps), factor=1 (no change); when outside, factor<1 shrinks it.
  • delta *= factor: direction is preserved while length is adjusted, so any external point is projected exactly onto the boundary (||delta||_2 = eps).

In equation form:

\[\Pi_{\|\delta\|_2 \le \epsilon}(\delta)=\delta\cdot \min(1,\epsilon/\|\delta\|_2)\]

Finally, adv_images = torch.clamp(images + delta, min=0, max=1).detach() is applied so that the input itself stays within the valid pixel range [0,1], separately from the L2 radius constraint.
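
Pulled out into a standalone helper, the whole projection is just the equation above (an illustrative sketch, not a function from the library):

import torch

def project_l2(delta, eps):
    # per-sample scale factor min(1, eps / ||delta||_2); the small constant guards against /0
    norms = delta.view(delta.size(0), -1).norm(p=2, dim=1)
    factor = torch.clamp(eps / (norms + 1e-12), max=1.0)
    return delta * factor.view(-1, 1, 1, 1)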

TPGD

TPGD (TRADES PGD) is the adversarial-example generation method proposed for the inner maximization of the TRADES defense. A brief review of the TPGD formulation is as follows.

\[x_{adv}^{t+1} = \Pi_{\|x'-x\|_{\infty} \le \epsilon}\Big(x_{adv}^{t} + \alpha \cdot \text{sign}\big(\nabla_{x_{adv}^{t}} KL(f(x) \,\|\, f(x_{adv}^{t}))\big)\Big)\]

Instead of using the ground-truth label, this equation moves pixels in a direction that maximizes the Kullback-Leibler divergence (KL Divergence) between the predicted probability distribution on the original image and that on the adversarial image. Let’s examine the implementation.

def __init__(self, model, eps=8 / 255, alpha=2 / 255, steps=10):
    super().__init__("TPGD", model)
    self.eps = eps
    self.alpha = alpha
    self.steps = steps
    self.supported_mode = ["default"]

TPGD aims to widen the gap from the original image’s predicted distribution rather than to induce misclassification using the label itself, so it does not support ‘targeted’ mode and supports only ‘default’.

def forward(self, images, labels=None):
    images = images.clone().detach().to(self.device)
    logit_ori = self.get_logits(images).detach()

    adv_images = images + 0.001 * torch.randn_like(images)
    adv_images = torch.clamp(adv_images, min=0, max=1).detach()

    loss = nn.KLDivLoss(reduction="sum")

The original image’s prediction (logit_ori) is computed in advance and frozen with detach() to serve as the reference point. Before optimization begins, the initial position is randomly perturbed (random start) to avoid local optima. Note that, unlike standard PGD which uses a uniform distribution, the implementation here adds Gaussian noise (torch.randn_like) scaled by 0.001. The loss function is KLDivLoss with the reduction="sum" option.

    for _ in range(self.steps):
        adv_images.requires_grad = True
        logit_adv = self.get_logits(adv_images)

        # Calculate loss
        cost = loss(F.log_softmax(logit_adv, dim=1), F.softmax(logit_ori, dim=1))

At each step, the model’s output on the adversarial image (logit_adv) is computed. PyTorch’s KLDivLoss requires log-probabilities as the first argument and probabilities as the second argument, so log_softmax is applied to the adversarial output and softmax is applied to the original output. The KL value is then defined as the final cost.

        # Update adversarial images
        grad = torch.autograd.grad(
            cost, adv_images, retain_graph=False, create_graph=False
        )[0]

        adv_images = adv_images.detach() + self.alpha * grad.sign()
        delta = torch.clamp(adv_images - images, min=-self.eps, max=self.eps)
        adv_images = torch.clamp(images + delta, min=0, max=1).detach()

    return adv_images

After computing the gradient of cost with respect to adv_images, the sign (grad.sign()) is extracted and the image is updated by alpha. The difference (delta) from the original image is then constrained to stay within the eps range, and the final pixel values are clipped to the valid [0, 1] range.
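
Because the argument order of KLDivLoss is easy to get backwards, the short check below (illustrative, with random logits) confirms that KLDivLoss(log_softmax(adv), softmax(ori)) equals the manual \(KL(p_{ori} \,\|\, p_{adv}) = \sum_i p_{ori,i}(\log p_{ori,i} - \log p_{adv,i})\):

import torch
import torch.nn.functional as F

logit_ori = torch.randn(4, 10)
logit_adv = torch.randn(4, 10)

kl = torch.nn.KLDivLoss(reduction="sum")(
    F.log_softmax(logit_adv, dim=1), F.softmax(logit_ori, dim=1)
)

p_ori = F.softmax(logit_ori, dim=1)
manual = (p_ori * (F.log_softmax(logit_ori, dim=1) - F.log_softmax(logit_adv, dim=1))).sum()
print(torch.allclose(kl, manual, atol=1e-5))   # True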

Defense Code Review


AT

AT (Adversarial Training) is the most intuitive and widely used adversarial defense technique, in which adversarial examples are incorporated into the model training process itself. A brief review of the AT formulation is as follows.

\[\min_{\theta} \mathbb{E}_{(x, y)} \left[ \max_{\|x'-x\|_{\infty} \leq \epsilon} L(f_\theta(x'), y) \right]\]

This min-max objective alternates between an inner maximization that finds the adversarial example \(x'\) which maximizes the loss, and an outer minimization that updates the model parameters \(\theta\) to minimize the loss on those adversarial examples. Let’s analyze the code in detail.

def __init__(self, rmodel, eps, alpha, steps, random_start=True):
    super().__init__(rmodel)
    self.atk = PGD(rmodel, eps, alpha, steps, random_start)

To train a robust model, AT internally uses PGD (Projected Gradient Descent), a strong first-order attack. The constructor takes rmodel (the defense model), the attack hyperparameters (eps, alpha, steps), and random_start to decide whether to add initial noise, and instantiates the PGD attack module.

def calculate_cost(self, train_data, reduction="mean"):
    # ... (omitted: data device assignment) ...
    adv_images = self.atk(images, labels)
    logits_adv = self.rmodel(adv_images)

The original images and labels from a mini-batch are passed through the previously defined PGD attack module to generate adversarial examples (adv_images) on the fly. The defense model (rmodel) is then fed these adversarial examples to produce predictions (logits_adv). The key point is that the feed-forward during training is performed on the adversarial examples, not the original images.

    cost = nn.CrossEntropyLoss(reduction="none")(logits_adv, labels)
    self.add_record_item("CALoss", cost.mean().item())

    return cost.mean() if reduction == "mean" else cost

The Cross-Entropy loss (cost) is computed between the predictions on the generated adversarial examples and the true labels. reduction="none" is used to obtain the loss for each sample in the batch individually. The MAIR framework’s logging system then records the average loss under the name CALoss (Cross-Entropy Adversarial Loss) so that the per-epoch trend in robust loss can be tracked. Finally, the loss is returned according to the reduction mode requested by the caller (typically “mean”) and used for backpropagation.
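
Outside of MAIR, the same min-max loop can be written directly in a few lines of PyTorch. The sketch below is an illustrative stand-in rather than MAIR's trainer; model, train_loader, and device are placeholders.

import torch.nn as nn
import torch.optim as optim
import torchattacks

atk = torchattacks.PGD(model, eps=8/255, alpha=2/255, steps=10, random_start=True)
optimizer = optim.SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=5e-4)
criterion = nn.CrossEntropyLoss()

for images, labels in train_loader:
    images, labels = images.to(device), labels.to(device)
    adv_images = atk(images, labels)              # inner maximization: PGD on the current model
    optimizer.zero_grad()
    loss = criterion(model(adv_images), labels)   # outer minimization: CE on adversarial examples
    loss.backward()
    optimizer.step()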

TRADES

TRADES (TRadeoff-inspired Adversarial DEfense via Surrogate-loss minimization) is a defense technique that theoretically addresses the trade-off between natural accuracy and adversarial robustness. A brief review of the TRADES formulation is as follows.

\[\min_f \mathbb{E}\left[ L(f(x), y) + \beta \cdot \max_{\|x'-x\|_{\infty} \leq \epsilon} KL(f(x) \| f(x')) \right]\]

This objective simultaneously minimizes the Cross-Entropy loss on the clean image and the KL divergence between the clean prediction and the adversarial prediction, balanced by \(\beta\). Larger \(\beta\) emphasizes robustness, while smaller \(\beta\) emphasizes natural accuracy.

def __init__(self, rmodel, eps, alpha, steps, beta):
    super().__init__(rmodel)
    self.atk = TPGD(rmodel, eps, alpha, steps)
    self.beta = beta

TRADES uses TPGD for the inner maximization. Rather than using the ground-truth label, TPGD generates adversarial images in the direction that maximizes the KL divergence between the clean and adversarial predictions. beta is the trade-off parameter \(\beta\) in the equation.

def calculate_cost(self, train_data, reduction="mean"):

    images, labels = train_data
    images = images.to(self.device)
    labels = labels.to(self.device)

    logits_clean = self.rmodel(images)
    loss_ce = nn.CrossEntropyLoss(reduction=reduction)(logits_clean, labels)

    adv_images = self.atk(images)
    logits_adv = self.rmodel(adv_images)

The clean logits are computed first to obtain the Cross-Entropy loss; then TPGD is used to generate the adversarial images, and their logits are computed separately. The two sets of logits are computed separately because each plays a different role in the subsequent KL divergence computation.

    probs_clean = F.softmax(logits_clean, dim=1)
    log_probs_adv = F.log_softmax(logits_adv, dim=1)
    loss_kl = nn.KLDivLoss(reduction="none")(log_probs_adv, probs_clean).sum(dim=1)

    cost = loss_ce + self.beta * loss_kl

When computing the KL divergence, log_softmax is applied to the adversarial side and softmax to the clean side, because PyTorch’s KLDivLoss expects log-probabilities as its first argument (the input) and probabilities as its second argument (the target). The final cost is the Cross-Entropy loss plus the KL divergence loss weighted by beta, a direct implementation of the equation.

    self.add_record_item("Loss", cost.mean().item())
    self.add_record_item("CELoss", loss_ce.mean().item())
    self.add_record_item("KLLoss", loss_kl.mean().item())

During training, the total loss, the Cross-Entropy loss, and the KL divergence loss are each recorded. Tracking them separately makes it possible to monitor how the two losses change as beta is varied.
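
For comparison, one TRADES training step can also be written directly in PyTorch. This is an illustrative sketch rather than MAIR's trainer; model, images, labels, and optimizer are placeholders, and beta = 6.0 is a commonly used setting from the TRADES paper's experiments.

import torch.nn as nn
import torch.nn.functional as F
import torchattacks

beta = 6.0
atk = torchattacks.TPGD(model, eps=8/255, alpha=2/255, steps=10)

logits_clean = model(images)
loss_ce = nn.CrossEntropyLoss()(logits_clean, labels)

adv_images = atk(images)                      # inner maximization: KL-driven TPGD
logits_adv = model(adv_images)
loss_kl = nn.KLDivLoss(reduction="batchmean")(
    F.log_softmax(logits_adv, dim=1), F.softmax(logits_clean, dim=1)
)

loss = loss_ce + beta * loss_kl               # CE on clean images + beta * KL to adversarial ones
optimizer.zero_grad()
loss.backward()
optimizer.step()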

This concludes our code analysis of the basic adversarial-example generation attacks and the defense techniques that build on them.