LLM Attribution Classes

LLMAttribution

class captum.attr.LLMAttribution(attr_method, tokenizer, attr_target='log_prob')[source]

Attribution class for large language models. It wraps a perturbation-based attribution algorithm to produce the attribution results most commonly of interest for the text-generation use case. The wrapped instance calculates attribution exactly as configured in the original attribution algorithm, but provides a new “attribute” function which accepts text-based inputs and returns an LLMAttributionResult.

Parameters:
  • attr_method (Attribution) – instance of a supported perturbation attribution class, created with an LLM model that follows the HuggingFace-style interface convention. Supported methods include FeatureAblation, ShapleyValueSampling, ShapleyValues, Lime, and KernelShap. Lime and KernelShap do not support per-token attribution and will only return attribution for the full target sequence.

  • tokenizer (Tokenizer) – tokenizer of the LLM model used in attr_method

  • attr_target (str) – whether to attribute towards log probability or probability. Available values: [“log_prob”, “prob”]. Default: “log_prob”

attribute(inp, target=None, skip_tokens=None, num_trials=1, gen_args=None, use_cached_outputs=True, _inspect_forward=None, **kwargs)[source]
Parameters:
  • inp (InterpretableInput) – input prompt for which attributions are computed

  • target (str or Tensor, optional) – target response with respect to which attributions are computed. If None, it uses the model to generate the target based on the input and gen_args. Default: None

  • skip_tokens (List[int] or List[str], optional) – the tokens to skip in the output’s interpretable representation. Use this argument to exclude tokens of no interest, such as special tokens (e.g., sos, unk). It can be a list of token strings or a list of token ids. Default: None

  • num_trials (int, optional) – number of trials to run. The return is the average of the attributions over all trials. Default: 1

  • gen_args (dict, optional) – arguments for generating the target. Only used if target is not given. When None, the default arguments {“max_new_tokens”: 25, “do_sample”: False, “temperature”: None, “top_p”: None} are used. Default: None

  • **kwargs (Any) – any extra keyword arguments passed to the call of the underlying attribute function of the given attribution instance

Returns:

Attribution result. token_attr will be None if the attr method is Lime or KernelShap.

Return type:

attr (LLMAttributionResult)

attribute_future()[source]

This method is not implemented for LLMAttribution.

Return type:

Callable[[], LLMAttributionResult]

LLMGradientAttribution

class captum.attr.LLMGradientAttribution(attr_method, tokenizer)[source]

Attribution class for large language models. It wraps a gradient-based attribution algorithm to produce the attribution results most commonly of interest for the text-generation use case. The wrapped instance calculates attribution exactly as configured in the original attribution algorithm, with respect to the log probabilities of each generated token and of the whole sequence, and provides a new “attribute” function which accepts text-based inputs and returns an LLMAttributionResult.

Parameters:
  • attr_method (Attribution) – instance of a supported gradient attribution class, created with an LLM model that follows the HuggingFace-style interface convention

  • tokenizer (Tokenizer) – tokenizer of the LLM model used in attr_method

attribute(inp, target=None, skip_tokens=None, gen_args=None, **kwargs)[source]
Parameters:
  • inp (InterpretableInput) – input prompt for which attributions are computed

  • target (str or Tensor, optional) – target response with respect to which attributions are computed. If None, it uses the model to generate the target based on the input and gen_args. Default: None

  • skip_tokens (List[int] or List[str], optional) – the tokens to skip in the output’s interpretable representation. Use this argument to exclude tokens of no interest, such as special tokens (e.g., sos, unk). It can be a list of token strings or a list of token ids. Default: None

  • gen_args (dict, optional) – arguments for generating the target. Only used if target is not given. When None, the default arguments {“max_new_tokens”: 25, “do_sample”: False, “temperature”: None, “top_p”: None} are used. Default: None

  • **kwargs (Any) – any extra keyword arguments passed to the call of the underlying attribute function of the given attribution instance

Returns:

Attribution result

Return type:

attr (LLMAttributionResult)

attribute_future()[source]

This method is not implemented for LLMGradientAttribution.

Return type:

Callable[[], LLMAttributionResult]

LLMAttributionResult

class captum.attr.LLMAttributionResult(seq_attr, token_attr, input_tokens, output_tokens)[source]

Data class for the return result of LLMAttribution, holding the necessary properties of the attribution. It also provides utilities to present and plot the result in different forms.

plot_seq_attr(show=False)[source]

Generate a matplotlib plot for visualising the attribution of the output sequence.

Parameters:

show (bool) – whether to show the plot directly or return the figure and axis. Default: False

Return type:

Optional[Tuple[Figure, Axes]]

plot_token_attr(show=False)[source]

Generate a matplotlib plot for visualising the attribution of the output tokens.

Parameters:

show (bool) – whether to show the plot directly or return the figure and axis. Default: False

Return type:

Optional[Tuple[Figure, Axes]]