Robustness

FGSM

class captum.robust.FGSM(forward_func, loss_func=None, lower_bound=-inf, upper_bound=inf)

Fast Gradient Sign Method (FGSM) is a one-step method for generating adversarial examples. For a non-targeted attack, the formulation is x' = x + epsilon * sign(gradient of L(theta, x, y) with respect to x), where y is the true label. For a targeted attack towards class t, the formulation is x' = x - epsilon * sign(gradient of L(theta, x, t) with respect to x). L(theta, x, y) is the model's loss as a function of the model parameters theta, the inputs x and the labels y.

More details on Fast Gradient Sign Method can be found in the original paper: https://arxiv.org/pdf/1412.6572.pdf
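To make the two formulations concrete, the sketch below applies one FGSM step to a toy logistic-regression model p = sigmoid(w . x + b) whose loss L is binary cross-entropy, so the input gradient can be written by hand as (p - label) * w. The model, the helper names and the plain-list representation are assumptions made for illustration; this is not Captum's implementation, which works on PyTorch tensors.

```python
import math
from typing import List


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def input_grad(w: List[float], b: float, x: List[float], label: int) -> List[float]:
    """Gradient of binary cross-entropy L(theta, x, label) w.r.t. the input x,
    for the toy model p = sigmoid(w . x + b) with theta = (w, b)."""
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
    return [(p - label) * wi for wi in w]


def sign(v: float) -> float:
    return float((v > 0) - (v < 0))


def fgsm(w, b, x, label, epsilon, targeted=False,
         lower_bound=-math.inf, upper_bound=math.inf):
    """One FGSM step. `label` is the true label y (non-targeted) or the target t."""
    grad = input_grad(w, b, x, label)
    # Non-targeted: move up the loss of y. Targeted: move down the loss of t.
    direction = -1.0 if targeted else 1.0
    perturbed = [xi + direction * epsilon * sign(g) for xi, g in zip(x, grad)]
    # Clamp every value into [lower_bound, upper_bound].
    return [min(max(v, lower_bound), upper_bound) for v in perturbed]
```

For w = [2.0, -1.0], b = 0.0, x = [0.5, 0.5] and true label 1, a non-targeted step with epsilon=0.1 moves x to approximately [0.4, 0.6], lowering the model's probability for class 1 from sigmoid(0.5) to sigmoid(0.2). A targeted step towards class 0 produces the same point in this binary case.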

Parameters
  • forward_func (callable) – The PyTorch model for which the attack is computed.

  • loss_func (callable, optional) – Loss function whose gradient is computed. The loss function should take in the outputs of the model and the labels, and return a loss tensor. The default loss function is the negative log of the model output for the target label (which assumes the model outputs probabilities).

  • lower_bound (float, optional) – Lower bound of input values. Default: -inf.

  • upper_bound (float, optional) – Upper bound of input values. For example, if image pixels must lie in the range 0-255, use lower_bound=0 and upper_bound=255. Default: inf.

bound

A function that clamps the input values to the range given by lower_bound and upper_bound. It can be overridden for custom use cases if necessary.

Type

Callable

zero_thresh

Gradient values whose magnitude is below this threshold are treated as zero. It can be modified for custom use cases if necessary.

Type

float
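The two attributes above can be pictured with the framework-free sketch below: `bound` clamps values into [lower_bound, upper_bound], and `zero_thresh` makes the sign step ignore gradients of negligible magnitude. The names make_bound and thresholded_sign, and the 1e-6 default used here, are illustrative assumptions; Captum itself operates on tensors.

```python
import math
from typing import Callable, List


def make_bound(lower_bound: float = -math.inf,
               upper_bound: float = math.inf) -> Callable[[List[float]], List[float]]:
    """Return a function that clamps every value into [lower_bound, upper_bound]."""
    def bound(values: List[float]) -> List[float]:
        return [min(max(v, lower_bound), upper_bound) for v in values]
    return bound


def thresholded_sign(grads: List[float], zero_thresh: float = 1e-6) -> List[float]:
    """sign(g), except that gradients with |g| below zero_thresh count as zero."""
    return [0.0 if abs(g) < zero_thresh else math.copysign(1.0, g) for g in grads]
```

With make_bound(0.0, 255.0), the values [-3.0, 100.0, 300.0] become [0.0, 100.0, 255.0]; thresholded_sign([1e-9, -0.5, 2.0]) gives [0.0, -1.0, 1.0], so a numerically negligible gradient does not move that input element at all.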

perturb(inputs, epsilon, target, additional_forward_args=None, targeted=False)

This method computes and returns the perturbed input for each input tensor. It supports both targeted and non-targeted attacks.

Parameters
  • inputs (tensor or tuple of tensors) – Input for which the adversarial attack is computed. It can be provided as a single tensor or a tuple of multiple tensors. If multiple input tensors are provided, the batch sizes must be aligned across all tensors.

  • epsilon (float) – Step size of the perturbation, i.e. the epsilon in the formulation above.

  • target (any) –

    True labels of inputs if non-targeted attack is desired. Target class of inputs if targeted attack is desired. Target will be passed to the loss function to compute loss, so the type needs to match the argument type of the loss function.

    If using the default negative log as loss function, labels should be of type int, tuple, tensor or list. For general 2D outputs, labels can be either:

    • a single integer or a tensor containing a single integer, which is applied to all input examples

    • a list of integers or a 1D tensor, with length matching the number of examples in inputs (dim 0). Each integer is applied as the label for the corresponding example.

    For outputs with > 2 dimensions, labels can be either:

    • A single tuple, which contains #output_dims - 1 elements. This label index is applied to all examples.

    • A list of tuples with length equal to the number of examples in inputs (dim 0), and each tuple containing #output_dims - 1 elements. Each tuple is applied as the label for the corresponding example.

  • additional_forward_args (any, optional) – If the forward function requires additional arguments other than the inputs being perturbed, this argument can be provided. These arguments are provided to forward_func in order following the arguments in inputs. Default: None.

  • targeted (bool, optional) – If attack should be targeted. Default: False.
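The label formats above can be illustrated with nested Python lists standing in for model outputs. select_targets is a hypothetical helper written for this sketch, not Captum's internal function; it only shows which output entry each label format picks for each example.

```python
from typing import Any, List, Tuple, Union

Label = Union[int, Tuple[int, ...]]


def select_targets(outputs: List[Any], target: Union[Label, List[Label]]) -> List[Any]:
    """Pick one output entry per example (dim 0) according to `target`.

    - int or tuple: the same label index is used for every example.
    - list: one label (int or tuple) per example, in batch order.
    """
    if isinstance(target, list):
        if len(target) != len(outputs):
            raise ValueError("need one label per example (dim 0)")
        labels = target
    else:
        labels = [target] * len(outputs)
    selected = []
    for example_out, label in zip(outputs, labels):
        index = label if isinstance(label, tuple) else (label,)
        value = example_out
        for i in index:  # walk the remaining #output_dims - 1 dimensions
            value = value[i]
        selected.append(value)
    return selected
```

For 2D probabilities [[0.1, 0.9], [0.7, 0.3]], target 1 selects [0.9, 0.3] and target [1, 0] selects [0.9, 0.7]; with the default negative log loss, the per-example loss would then be the negative log of each selected value.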

Returns

Perturbed input for each input tensor. The perturbed inputs have the same shape and dimensionality as the inputs. If a single tensor is provided as inputs, a single tensor is returned. If a tuple is provided for inputs, a tuple of corresponding sized tensors is returned.

Return type

  • perturbed inputs (tensor or tuple of tensors)

PGD

class captum.robust.PGD(forward_func, loss_func=None, lower_bound=-inf, upper_bound=inf)

Projected Gradient Descent is an iterative version of the one-step attack FGSM that can generate adversarial examples. It takes multiple gradient steps to search for an adversarial perturbation within the desired neighbor ball around the original inputs. In a non-targeted attack, the formulation is:

x_0 = x
x_(k+1) = Clip_r(x_k + alpha * sign(gradient of L(theta, x_k, y) with respect to x_k))

where Clip_r denotes the function that projects its argument onto the r-neighbor ball around x, so that the perturbation stays bounded. Alpha is the step size, k is the iteration index and y is the true label. L(theta, x, y) is the model's loss as a function of the model parameters, inputs and labels. In a targeted attack towards class t, the formulation is similar:

x_0 = x
x_(k+1) = Clip_r(x_k - alpha * sign(gradient of L(theta, x_k, t) with respect to x_k))

More details on Projected Gradient Descent can be found in the original paper: https://arxiv.org/pdf/1706.06083.pdf
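The iteration above can be sketched in plain Python for the 'Linf' norm, using a toy logistic-regression model p = sigmoid(w . x + b) with binary cross-entropy as the loss L, so that the input gradient is (p - label) * w. The model, its hand-written gradient and the function names are assumptions made for the illustration, not Captum's implementation.

```python
import math
from typing import List


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def input_grad(w: List[float], b: float, x: List[float], label: int) -> List[float]:
    """Gradient of binary cross-entropy w.r.t. x for p = sigmoid(w . x + b)."""
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
    return [(p - label) * wi for wi in w]


def sign(v: float) -> float:
    return float((v > 0) - (v < 0))


def pgd_linf(w, b, x, label, radius, step_size, step_num, targeted=False,
             lower_bound=-math.inf, upper_bound=math.inf):
    direction = -1.0 if targeted else 1.0
    x_adv = list(x)  # x_0 = x
    for _ in range(step_num):
        grad = input_grad(w, b, x_adv, label)  # gradient at the current x_k
        stepped = [xa + direction * step_size * sign(g) for xa, g in zip(x_adv, grad)]
        # Clip_r: project back onto the L-inf ball of the given radius around x.
        projected = [min(max(s, xo - radius), xo + radius) for s, xo in zip(stepped, x)]
        # Keep every value inside the valid input range.
        x_adv = [min(max(p, lower_bound), upper_bound) for p in projected]
    return x_adv
```

With w = [2.0, -1.0], b = 0.0, x = [0.5, 0.5], true label 1, radius 0.3 and step_size 0.1, three steps are enough to reach the border of the ball; later steps are clipped by Clip_r, so ten steps end at approximately [0.2, 0.8].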

Parameters
  • forward_func (callable) – The PyTorch model for which the attack is computed.

  • loss_func (callable, optional) – Loss function whose gradient is computed. The loss function should take in the outputs of the model and the labels, and return the loss for each input tensor. The default loss function is the negative log of the model output for the target label (which assumes the model outputs probabilities).

  • lower_bound (float, optional) – Lower bound of input values. Default: -inf.

  • upper_bound (float, optional) – Upper bound of input values. For example, if image pixels must lie in the range 0-255, use lower_bound=0 and upper_bound=255. Default: inf.

bound

A function that clamps the input values to the range given by lower_bound and upper_bound. It can be overridden for custom use cases if necessary.

Type

Callable

perturb(inputs, radius, step_size, step_num, target, additional_forward_args=None, targeted=False, random_start=False, norm='Linf')

This method computes and returns the perturbed input for each input tensor. It supports both targeted and non-targeted attacks.

Parameters
  • inputs (tensor or tuple of tensors) – Input for which the adversarial attack is computed. It can be provided as a single tensor or a tuple of multiple tensors. If multiple input tensors are provided, the batch sizes must be aligned across all tensors.

  • radius (float) – Radius of the neighbor ball centered around inputs, measured in the chosen norm. The perturbation is constrained to lie within this ball.

  • step_size (float) – Step size of each gradient step.

  • step_num (int) – Number of gradient steps. It is usually chosen large enough for the perturbation to be able to reach the border of the ball (e.g. step_num * step_size >= radius).

  • target (any) –

    True labels of inputs if non-targeted attack is desired. Target class of inputs if targeted attack is desired. Target will be passed to the loss function to compute loss, so the type needs to match the argument type of the loss function.

    If using the default negative log as loss function, labels should be of type int, tuple, tensor or list. For general 2D outputs, labels can be either:

    • a single integer or a tensor containing a single integer, which is applied to all input examples

    • a list of integers or a 1D tensor, with length matching the number of examples in inputs (dim 0). Each integer is applied as the label for the corresponding example.

    For outputs with > 2 dimensions, labels can be either:

    • A single tuple, which contains #output_dims - 1 elements. This label index is applied to all examples.

    • A list of tuples with length equal to the number of examples in inputs (dim 0), and each tuple containing #output_dims - 1 elements. Each tuple is applied as the label for the corresponding example.

  • additional_forward_args (any, optional) – If the forward function requires additional arguments other than the inputs being perturbed, this argument can be provided. These arguments are provided to forward_func in order following the arguments in inputs. Default: None.

  • targeted (bool, optional) – If attack should be targeted. Default: False.

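  • random_start (bool, optional) – Whether to add a random initialization to inputs, i.e. start the search from a random point in the neighbor ball rather than from inputs themselves. Default: False.

  • norm (str, optional) – Specifies the norm used to measure the distance from the original inputs: 'Linf' or 'L2'. Default: 'Linf'.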
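The two norm options differ in how Clip_r projects a candidate back into the ball. The sketch below shows the standard projections for 'Linf' and 'L2' on plain lists, plus one common way of drawing a random start for 'Linf'. The helper names and the uniform sampling scheme are assumptions, not necessarily what Captum does internally.

```python
import math
import random
from typing import List


def project_linf(x_adv: List[float], x: List[float], radius: float) -> List[float]:
    """'Linf': clamp each coordinate of x_adv - x into [-radius, radius]."""
    return [min(max(a, o - radius), o + radius) for a, o in zip(x_adv, x)]


def project_l2(x_adv: List[float], x: List[float], radius: float) -> List[float]:
    """'L2': if ||x_adv - x||_2 > radius, rescale the difference onto the sphere."""
    delta = [a - o for a, o in zip(x_adv, x)]
    norm = math.sqrt(sum(d * d for d in delta))
    if norm <= radius:
        return list(x_adv)
    scale = radius / norm
    return [o + d * scale for o, d in zip(x, delta)]


def random_start_linf(x: List[float], radius: float, rng: random.Random) -> List[float]:
    """Assumed random start for 'Linf': uniform noise in [-radius, radius] per coordinate."""
    return [o + rng.uniform(-radius, radius) for o in x]
```

The 'Linf' projection clips each coordinate independently, while the 'L2' projection keeps the direction of the perturbation and only shrinks its length: for example, a perturbation of [3.0, 4.0] with radius 1.0 becomes [0.6, 0.8].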

Returns

Perturbed input for each input tensor. The perturbed inputs have the same shape and dimensionality as the inputs. If a single tensor is provided as inputs, a single tensor is returned. If a tuple is provided for inputs, a tuple of corresponding sized tensors is returned.

Return type

  • perturbed inputs (tensor or tuple of tensors)

Attack Comparator

class captum.robust.AttackComparator(forward_func, metric, preproc_fn=None)

Allows measuring model robustness for a given attack or set of attacks. This class can be used with any metric(s) and any set of attacks, including attacks / perturbations from captum.robust such as FGSM or PGD as well as external augmentation methods or perturbations such as torchvision transforms.

Parameters
  • forward_func (callable or torch.nn.Module) – This can either be an instance of a PyTorch model or any modification of a model's forward function.

  • metric (callable) –

    This function is applied to the model output in order to compute the desired performance metric or metrics. This function should have the following signature:

    >>> def model_metric(
    >>>     model_out: Tensor, **kwargs: Any
    >>> ) -> Union[float, Tensor, Tuple[Union[float, Tensor], ...]]:

    All kwargs provided to evaluate are provided to the metric function, following the model output. A single metric can be returned as a float or tensor, and multiple metrics should be returned as either a tuple or named tuple of floats or tensors. For a tensor metric, the first dimension should match the batch size, corresponding to metrics for each example. Tensor metrics are averaged over the first dimension when aggregating multiple batch results. If tensor metrics represent results for the full batch, the size of the first dimension should be 1.

  • preproc_fn (callable, optional) – Optional method applied to inputs. Output of preproc_fn is then provided as input to model, in addition to additional_forward_args provided to evaluate.
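As an illustration of the expected metric shape, the sketch below mirrors a metric that returns two per-example metrics as a named tuple. It uses nested lists instead of tensors so it runs without PyTorch; with the real class, the metric would receive the model output Tensor plus any kwargs passed to evaluate (here, target). Metrics and accuracy_and_confidence are hypothetical names.

```python
from typing import List, NamedTuple


class Metrics(NamedTuple):
    accuracy: List[float]     # one entry per example (first dimension = batch)
    target_prob: List[float]  # score assigned to the target class


def accuracy_and_confidence(model_out: List[List[float]], target: List[int]) -> Metrics:
    """Plain-list stand-in for a metric returning two per-example metrics."""
    accuracy, target_prob = [], []
    for probs, t in zip(model_out, target):
        predicted = max(range(len(probs)), key=probs.__getitem__)  # argmax
        accuracy.append(float(predicted == t))
        target_prob.append(probs[t])
    return Metrics(accuracy, target_prob)
```

Because each field holds one value per example, it can be averaged over the first dimension when batch results are aggregated; a metric computed for the whole batch at once would instead have a first dimension of size 1.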

add_attack(attack, name=None, num_attempts=1, apply_before_preproc=True, attack_kwargs=None, additional_attack_arg_names=None)

Adds attack to be evaluated when calling evaluate.

Parameters
  • attack (perturbation or callable) – This can either be an instance of a Captum Perturbation / Attack or any other perturbation or attack function such as a torchvision transform.

  • name (str, optional) – Name or identifier for the attack, used as the key for attack results. It defaults to attack.__class__.__name__ if not provided and must be unique across all added attacks.

  • num_attempts (int) – Number of times the attack should be repeated. This should only be set to > 1 for non-deterministic attacks. The minimum, maximum, and average (best, worst, and average case) results are tracked across attack attempts.

  • apply_before_preproc (bool) – Defines whether attack should be applied before or after preproc function.

  • attack_kwargs (dict) – Additional arguments to be provided to given attack. This should be provided as a dictionary of keyword arguments.

  • additional_attack_arg_names (list[str]) – Any additional arguments for the attack which are specific to the particular input example or batch. An example of this is target, which is necessary for some attacks such as FGSM or PGD. These arguments are included if provided as a kwarg to evaluate.

Return type

None
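The effect of apply_before_preproc is a change in the order of function composition. In the toy sketch below, build_pipeline is a hypothetical helper, and the model, preprocessing and attack are stand-in scalar functions; it only makes the two orderings concrete.

```python
from typing import Callable


def build_pipeline(model: Callable, preproc_fn: Callable, attack: Callable,
                   apply_before_preproc: bool) -> Callable:
    """Return a function mapping a raw input to the model output, with the
    attack inserted at the requested position."""
    if apply_before_preproc:
        # raw input -> attack -> preproc_fn -> model
        return lambda inp: model(preproc_fn(attack(inp)))
    # raw input -> preproc_fn -> attack -> model
    return lambda inp: model(attack(preproc_fn(inp)))
```

With preprocessing that rescales pixels by 1/255 and an attack that adds 1.0, the model input shifts by 1/255 when the attack runs before preprocessing but by a full 1.0 when it runs after, so attack strengths such as epsilon have to be chosen for the space in which the attack is applied.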

evaluate(inputs, additional_forward_args=None, perturbations_per_eval=1, **kwargs)

Evaluate model and attack performance on provided inputs.

Parameters
  • inputs (any) – Input for which attack metrics are computed. It can be provided as a tensor, tuple of tensors, or any raw input type (e.g. PIL image or text string). This input is provided directly as input to the preproc function as well as to any attack applied before preprocessing. If no pre-processing function is provided, this input is provided directly to the main model and all attacks.

  • additional_forward_args (any, optional) – If the forward function requires additional arguments other than the preprocessing outputs (or inputs if preproc_fn is None), this argument can be provided. It must be either a single additional argument of a Tensor or arbitrary (non-tuple) type or a tuple containing multiple additional arguments including tensors or any arbitrary python types. These arguments are provided to forward_func in order following the arguments in inputs. For a tensor, the first dimension of the tensor must correspond to the number of examples. For all other types, the given argument is used for all forward evaluations. Default: None

  • perturbations_per_eval (int, optional) – Allows perturbations of multiple attacks to be grouped and evaluated in one call of forward_func. Each forward pass will contain a maximum of perturbations_per_eval * #examples samples. For DataParallel models, each batch is split among the available devices, so evaluations on each available device contain at most (perturbations_per_eval * #examples) / num_devices samples. In order to apply this functionality, the output of preproc_fn (or inputs itself if no preproc_fn is provided) must be a tensor or tuple of tensors. Default: 1

  • kwargs (any, optional) – Additional keyword arguments provided to the metric function as well as to selected attacks, based on the additional_attack_arg_names given to add_attack.
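The batching arithmetic described for perturbations_per_eval can be worked through with a small sketch. It assumes that each perturbed copy of the batch counts as one perturbation and ignores any other forward passes the comparator makes; both helpers are hypothetical and only compute sizes.

```python
import math
from typing import List


def forward_pass_sizes(num_perturbations: int, num_examples: int,
                       perturbations_per_eval: int) -> List[int]:
    """Number of samples in each forward pass when perturbed batches are grouped."""
    sizes = []
    remaining = num_perturbations
    while remaining > 0:
        group = min(perturbations_per_eval, remaining)
        sizes.append(group * num_examples)
        remaining -= group
    return sizes


def max_samples_per_device(num_examples: int, perturbations_per_eval: int,
                           num_devices: int) -> int:
    """Upper bound on samples per device when DataParallel splits a full pass."""
    return math.ceil(perturbations_per_eval * num_examples / num_devices)
```

For example, 5 perturbed copies of an 8-example batch with perturbations_per_eval=2 need three forward passes of 16, 16 and 8 samples, and on 4 devices each device sees at most 4 samples of a full pass.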

Returns:

  • attack results Dict: str -> Dict[str, Union[Tensor, Tuple[Tensor, …]]]:

    Dictionary containing attack results for provided batch. Maps attack name to dictionary, containing best-case, worst-case and average-case results for attack. Dictionary contains keys “mean”, “max” and “min” when num_attempts > 1 and only “mean” for num_attempts = 1, which contains the (single) metric result for the attack attempt. An additional key of ‘Original’ is included with metric results without any perturbations.
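The per-attack dictionary layout can be sketched as follows, assuming each attempt has already been reduced to a single float (for example, batch accuracy). summarize_attempts is a hypothetical helper; the real results may hold tensors or tuples of tensors under each key.

```python
from statistics import mean
from typing import Dict, List


def summarize_attempts(attempt_results: List[float]) -> Dict[str, float]:
    """Mirror the documented keys: only "mean" for one attempt,
    "mean", "max" and "min" for several attempts."""
    if len(attempt_results) == 1:
        return {"mean": attempt_results[0]}
    return {
        "mean": mean(attempt_results),
        "max": max(attempt_results),
        "min": min(attempt_results),
    }
```

For an accuracy metric, "min" is the worst case and "max" the best case across attempts of a non-deterministic attack.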

Examples:

>>> import torch
>>> from torch import Tensor
>>> from torchvision import transforms
>>> from captum.robust import AttackComparator, FGSM
>>> # resnet18, normalize and dataloader are assumed to be defined elsewhere
>>> def accuracy_metric(model_out: Tensor, target: Tensor):
>>>     return (torch.argmax(model_out, dim=1) == target).float()
>>> attack_metric = AttackComparator(forward_func=resnet18,
>>>                                  metric=accuracy_metric,
>>>                                  preproc_fn=normalize)
>>> random_rotation = transforms.RandomRotation(degrees=30)
>>> jitter = transforms.ColorJitter(brightness=0.5)
>>> attack_metric.add_attack(random_rotation, "Random Rotation",
>>>                          num_attempts=5)
>>> attack_metric.add_attack(jitter, "Jitter", num_attempts=1)
>>> attack_metric.add_attack(FGSM(resnet18), "FGSM 0.1", num_attempts=1,
>>>                          apply_before_preproc=False,
>>>                          attack_kwargs={"epsilon": 0.1},
>>>                          additional_attack_arg_names=["target"])
>>> for images, labels in dataloader:
>>>     batch_results = attack_metric.evaluate(inputs=images, target=labels)

Return type

Dict[str, Union[~MetricResultType, Dict[str, ~MetricResultType]]]

reset()

Reset stored average summary results for previous batches.

Return type

None

summary()

Returns average results over all previous batches evaluated.

Returns

str -> Dict[str, Union[Tensor, Tuple[Tensor, …]]]:

Dictionary containing summarized average attack results. Maps attack name (with “Mean Attempt”, “Max Attempt” and “Min Attempt” suffixes if num_attempts > 1) to a dictionary containing a key of “mean” holding the summarized results, which is the running mean of results over all batches since construction or the previous reset call. Tensor metrics are averaged over dimension 0 for each batch in order to aggregate metrics collected per batch.

Return type

  • summary Dict
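The summary bookkeeping can be sketched as a running mean over batches, where each batch's per-example metric is first averaged over dimension 0. RunningMean is a hypothetical class, and it assumes every batch is weighted equally regardless of its size, which the description above does not state explicitly.

```python
from typing import List


class RunningMean:
    """Unweighted running mean of per-batch values; reset() starts over."""

    def __init__(self) -> None:
        self.reset()

    def reset(self) -> None:
        self.count = 0
        self.mean = 0.0

    def update(self, per_example: List[float]) -> None:
        batch_value = sum(per_example) / len(per_example)  # average over dim 0
        self.count += 1
        self.mean += (batch_value - self.mean) / self.count
```

Under this assumption, batches with accuracies [1, 0, 1, 1] and [0, 0] give a summary of 0.375, so the 2-example batch counts as much as the 4-example one; per-example weighting would require tracking the total of all values and the number of examples instead.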

Min Param Perturbation

class captum.robust.MinParamPerturbation(forward_func, attack, arg_name, arg_min, arg_max, arg_step, mode='linear', num_attempts=1, preproc_fn=None, apply_before_preproc=False, correct_fn=None)

Identifies the minimal value of a perturbation parameter (the target variable) that causes misclassification (or another incorrect prediction) of the given input.

More specifically, given a perturbation parametrized by a single value (e.g. rotation by angle or mask percentage of top features based on attribution results), MinParamPerturbation helps identify the minimum value which leads to misclassification (or other model output change) with the corresponding perturbed input.

Parameters
  • forward_func (callable or torch.nn.Module) – This can either be an instance of a PyTorch model or any modification of a model's forward function.

  • attack (Perturbation or Callable) – This can either be an instance of a Captum Perturbation / Attack or any other perturbation or attack function such as a torchvision transform. The attack function must accept an additional keyword argument, named by arg_name, which is varied during the minimal perturbation search.

  • arg_name (str) – Name of the argument / variable parametrizing the attack; it must be a keyword argument of the attack. Examples are num_dropout or stdevs.

  • arg_min (int, float) – Minimum value of the target variable.

  • arg_max (int, float) – Maximum value of the target variable (not included in the range).

  • arg_step (int, float) – Minimum interval by which the target variable is increased.

  • mode (str, optional) – Mode for searching the minimum attack value: either ‘linear’ for a linear search over the variable, or ‘binary’ for a binary search over the variable. Default: ‘linear’

  • num_attempts (int, optional) – Number of attempts or trials with a given variable value. This should only be set to > 1 for non-deterministic perturbation / attack functions. Default: 1

  • preproc_fn (callable, optional) – Optional method applied to inputs. Output of preproc_fn is then provided as input to model, in addition to additional_forward_args provided to evaluate. Default: None

  • apply_before_preproc (bool, optional) – Defines whether attack should be applied before or after preproc function. Default: False

  • correct_fn (Callable, optional) –

    This determines whether the perturbed input leads to a correct or incorrect prediction. By default, this function is set to the standard classification test for correctness (comparing the argmax of the output with the target), which requires the model output to be a 2D tensor, returning True if all batch examples are correct and False otherwise. Setting this method allows any custom behavior defining whether the perturbation is successful at fooling the model. For non-classification use cases, a custom function must be provided which determines correctness.

    The first argument to this function must be the model output; any additional arguments should be provided through correct_fn_kwargs.

    This function should have the following signature:

    def correct_fn(model_out: Tensor, **kwargs: Any) -> bool

    The method should return True if the prediction is correct and False if it is incorrect. Default: None (applies the standard correct_fn for classification)
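The difference between mode='linear' and mode='binary' can be sketched as two searches over the grid arg_min, arg_min + arg_step, ... (excluding arg_max), where is_correct stands in for running the attack at a given value and applying correct_fn. This is an illustrative reconstruction, not Captum's code; in particular, the binary search assumes that once the prediction becomes incorrect it stays incorrect for all larger values, which is an assumption about the attack.

```python
import math
from typing import Callable, Optional, Union

Number = Union[int, float]


def candidate(arg_min: Number, arg_step: Number, i: int) -> Number:
    return arg_min + i * arg_step


def num_candidates(arg_min: Number, arg_max: Number, arg_step: Number) -> int:
    # Values arg_min, arg_min + arg_step, ... strictly below arg_max.
    return math.ceil((arg_max - arg_min) / arg_step)


def linear_search(is_correct: Callable[[Number], bool], arg_min: Number,
                  arg_max: Number, arg_step: Number) -> Optional[Number]:
    for i in range(num_candidates(arg_min, arg_max, arg_step)):
        value = candidate(arg_min, arg_step, i)
        if not is_correct(value):  # model fooled at this parameter value
            return value
    return None


def binary_search(is_correct: Callable[[Number], bool], arg_min: Number,
                  arg_max: Number, arg_step: Number) -> Optional[Number]:
    # Assumes monotonicity: once the model is fooled, larger values fool it too.
    lo, hi = 0, num_candidates(arg_min, arg_max, arg_step)
    first_fooled = None
    while lo < hi:
        mid = (lo + hi) // 2
        if is_correct(candidate(arg_min, arg_step, mid)):
            lo = mid + 1
        else:
            first_fooled = mid
            hi = mid
    return None if first_fooled is None else candidate(arg_min, arg_step, first_fooled)
```

Linear search needs up to one evaluation per grid value, while binary search needs only about log2 of the number of grid values (at most 7 evaluations for 100 values), at the cost of relying on that monotonicity.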

evaluate(inputs, additional_forward_args=None, target=None, perturbations_per_eval=1, attack_kwargs=None, correct_fn_kwargs=None)

This method evaluates the model at each perturbed input and identifies the minimum perturbation that leads to an incorrect model prediction.

It is recommended to provide a single input (batch size = 1) when using this to identify a minimal perturbation for the chosen example. If a batch of examples is provided, the default correct function identifies the minimal perturbation for at least 1 example in the batch to be misclassified. A custom correct_fn can be provided to customize this behavior and define correctness for the batch.
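The batch behaviour of the default correct function can be illustrated with a plain-Python stand-in that operates on a 2D list of scores. default_style_correct_fn is a hypothetical name, and the handling of a single integer target is an assumption made for the sketch.

```python
from typing import List, Sequence, Union


def default_style_correct_fn(model_out: List[List[float]],
                             target: Union[int, Sequence[int]]) -> bool:
    """True only if the argmax prediction matches the target for every example."""
    if isinstance(target, int):
        target = [target] * len(model_out)
    predictions = [max(range(len(row)), key=row.__getitem__) for row in model_out]
    return all(p == t for p, t in zip(predictions, target))
```

Because a single misclassified example already makes the whole batch "incorrect", the search stops at the smallest value that fools any one example, which is why a batch size of 1 is recommended for finding a per-example minimal perturbation.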

Parameters
  • inputs (Any) – Input for which minimal perturbation is computed. It can be provided as a tensor, tuple of tensors, or any raw input type (e.g. PIL image or text string). This input is provided directly as input to preproc function as well as any attack applied before preprocessing. If no pre-processing function is provided, this input is provided directly to the main model and all attacks.

  • additional_forward_args (any, optional) – If the forward function requires additional arguments other than the preprocessing outputs (or inputs if preproc_fn is None), this argument can be provided. It must be either a single additional argument of a Tensor or arbitrary (non-tuple) type or a tuple containing multiple additional arguments including tensors or any arbitrary python types. These arguments are provided to forward_func in order following the arguments in inputs. For a tensor, the first dimension of the tensor must correspond to the number of examples. For all other types, the given argument is used for all forward evaluations. Default: None

  • target (TargetType) – Target class for classification. This is required if using the default correct_fn.

  • perturbations_per_eval (int, optional) – Allows perturbations of multiple attacks to be grouped and evaluated in one call of forward_func. Each forward pass will contain a maximum of perturbations_per_eval * #examples samples. For DataParallel models, each batch is split among the available devices, so evaluations on each available device contain at most (perturbations_per_eval * #examples) / num_devices samples. In order to apply this functionality, the output of preproc_fn (or inputs itself if no preproc_fn is provided) must be a tensor or tuple of tensors. Default: 1

  • attack_kwargs (dictionary, optional) – Optional dictionary of keyword arguments provided to the attack function.

  • correct_fn_kwargs (dictionary, optional) – Optional dictionary of keyword arguments provided to the correct function.

Return type

Tuple[Any, Union[int, float, None]]

Returns

Tuple of (perturbed_inputs, param_val) if successful else Tuple of (None, None)

  • perturbed inputs (Any):

    Perturbed input (output of attack) which results in incorrect prediction.

  • param_val (int, float)

    Parameter value leading to the perturbed inputs that cause misclassification.

Examples:

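>>> import torch
>>> from torch import Tensor
>>> from captum.robust import MinParamPerturbation
>>> # resnet18 and dataloader are assumed to be defined elsewhere
>>> def gaussian_noise(inp: Tensor, std: float) -> Tensor:
>>>     return inp + std * torch.randn_like(inp)
>>> min_pert = MinParamPerturbation(forward_func=resnet18,
>>>                                 attack=gaussian_noise,
>>>                                 arg_name="std",
>>>                                 arg_min=0.0,
>>>                                 arg_max=2.0,
>>>                                 arg_step=0.01,
>>>                                 )
>>> for images, labels in dataloader:
>>>     noised_image, min_std = min_pert.evaluate(inputs=images, target=labels)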