class captum.module.GaussianStochasticGates(n_gates, mask=None, reg_weight=1.0, std=0.5, reg_reduction='sum')[source]

Stochastic Gates with Gaussian distribution.

Stochastic Gates is a practical solution to add L0 norm regularization for neural networks. L0 regularization, which explicitly penalizes any present (non-zero) parameters, can help network pruning and feature selection, but directly optimizing L0 is a non-differentiable combinatorial problem. To surrogate L0, Stochastic Gate uses certain continuous probability distributions (e.g., Concrete, Gaussian) with hard-sigmoid rectification as a continuous smoothed Bernoulli distribution determining the weight of a parameter, i.e., gate. Then L0 is equal to the gates’s non-zero probability represented by the parameters of the continuous probability distribution. The gate value can also be reparameterized to the distribution parameters with a noise. So the expected L0 can be optimized through learning the distribution parameters via stochastic gradients.

GaussianStochasticGates adopts a gaussian distribution as the smoothed Bernoulli distribution of gate. While the smoothed Bernoulli distribution should be within 0 and 1, gaussian does not have boundaries. So hard-sigmoid rectification is used to “fold” the parts smaller than 0 or larger than 1 back to 0 and 1.

More details can be found in the original paper <>.

  • n_gates (int) – number of gates.

  • mask (Optional[Tensor]) – If provided, this allows grouping multiple input tensor elements to share the same stochastic gate. This tensor should be broadcastable to match the input shape and contain integers in the range 0 to n_gates - 1. Indices grouped to the same stochastic gate should have the same value. If not provided, each element in the input tensor (on dimensions other than dim 0 - batch dim) is gated separately. Default: None

  • reg_weight (Optional[float]) – rescaling weight for L0 regularization term. Default: 1.0

  • std (Optional[float]) – standard deviation that will be fixed throughout. Default: 0.5 (by paper reference)

  • reg_reduction (str, optional) – the reduction to apply to the regularization: ‘none’|’mean’|’sum’. ‘none’: no reduction will be applied and it will be the same as the return of get_active_probs, ‘mean’: the sum of the gates non-zero probabilities will be divided by the number of gates, ‘sum’: the gates non-zero probabilities will be summed. Default: ‘sum’


input_tensor (Tensor) – Tensor to be gated with stochastic gates


  • gated_input (Tensor): Tensor of the same shape weighted by the sampled

    gate values

  • l0_reg (Tensor): L0 regularization term to be optimized together with

    model loss, e.g. loss(model_out, target) + l0_reg

Return type

tuple[Tensor, Tensor]


Get the active probability of each gate, i.e, gate value > 0


  • probs (Tensor): probabilities tensor of the gates are active

    in shape(n_gates)

Return type



Get the gate values, which are the means of the underneath gate distributions, optionally clamped within 0 and 1.


clamp (bool) – whether to clamp the gate values or not. As smoothed Bernoulli variables, gate values are clamped within 0 and 1 by default. Turn this off to get the raw means of the underneath distribution (e.g., concrete, gaussian), which can be useful to differentiate the gates’ importance when multiple gate values are beyond 0 or 1. Default: True


  • gate_values (Tensor): value of each gate in shape(n_gates)

Return type