class: middle, center, title-slide
Lecture 10: Uncertainty
Prof. Gilles Louppe
[email protected]
???
R: Code the GMM example
R: Code the NF with coupling layers and visualize the transformations
class: middle
class: middle
.italic["Every time a scientific paper presents a bit of data, it's accompanied by an .bold[error bar] – a quiet but insistent reminder that no knowledge is complete or perfect. It's a .bold[calibration of how much we trust what we think we know]."]
.pull-right[Carl Sagan]
???
Knowledge is an artefact. It is a mental construct.
Uncertainty is how much we trust this construct.
How to estimate uncertainty with and of neural networks?
- Uncertainty
- Aleatoric uncertainty
- Epistemic uncertainty
class: middle
class: middle
Uncertainty refers to situations where there is .bold[imperfect or unknown information]. It can arise in predictions of future events, in physical measurements, or in situations where information is unknown.
Accounting for uncertainty is necessary for making optimal decisions. Not accounting for uncertainty can lead to suboptimal, wrong, or even catastrophic decisions.
class: middle
.italic[Case 1]. The first assisted-driving fatality, in May 2016: the perception system mistook the trailer's white side for bright sky.
.grid[
.kol-2-3[.center.width-100[]]
.kol-1-3[.center.width-100[
]]
]
.footnote[Credits: Kendall and Gal, What Uncertainties Do We Need in Bayesian Deep Learning for Computer Vision?, 2017.]
class: middle, center
.center[
class: middle
.italic[Case 2]. An image classification system erroneously identifies two African Americans as gorillas, raising concerns of racial discrimination.
.footnote[Credits: Kendall and Gal, What Uncertainties Do We Need in Bayesian Deep Learning for Computer Vision?, 2017.]
class: middle
.alert[The systems that made these errors were likely confident in their predictions. They did not account for uncertainty.]
class: middle
class: middle
Aleatoric uncertainty refers to the uncertainty arising from the inherent stochasticity of the true data generating process. This uncertainty .bold[cannot be reduced] with more data.
A common example is observational noise due to the limitations of the measurement devices. Collecting more data will not reduce the noise.
class: middle
Assumptions about the data generating process can help in distinguishing between different types of aleatoric uncertainty:
- Homoscedastic uncertainty, which is constant across the input space.
- Heteroscedastic uncertainty, which varies across the input space.
.center.width-90[![](figures/lec10/homo-vs-hetero.png)]
Consider training data $(\mathbf{x}, y) \sim p(\mathbf{x}, y)$, with
- $\mathbf{x} \in \mathbb{R}^p$,
- $y \in \mathbb{R}$.

We do not wish to learn a function $\hat{y} = f(\mathbf{x})$ that produces only a point estimate of $y$. Instead, we want to learn the full conditional density $$p(y|\mathbf{x}).$$
class: middle
We can model aleatoric uncertainty in the output by modelling the conditional distribution as a Gaussian distribution, $$p(y|\mathbf{x}) = \mathcal{N}(y; \mu(\mathbf{x}), \sigma^2(\mathbf{x})),$$ where $\mu(\mathbf{x})$ and $\sigma^2(\mathbf{x})$ are parametric functions to be learned, such as neural networks.
Note: The Gaussian distribution is a modelling choice. Other parametric distributions can be used.
class: middle
.center[Case 1: Homoscedastic aleatoric uncertainty]
class: middle
We have, $$\begin{aligned} &\arg \max_{\theta,\sigma^2} p(\mathbf{d}|\theta,\sigma^2) \\ &= \arg \max_{\theta,\sigma^2} \prod_{\mathbf{x}_i, y_i \in \mathbf{d}} p(y_i|\mathbf{x}_i, \theta,\sigma^2) \\ &= \arg \max_{\theta,\sigma^2} \prod_{\mathbf{x}_i, y_i \in \mathbf{d}} \frac{1}{\sqrt{2\pi} \sigma} \exp\left(-\frac{(y_i-\mu(\mathbf{x}_i))^2}{2\sigma^2}\right) \\ &= \arg \min_{\theta,\sigma^2} \sum_{\mathbf{x}_i, y_i \in \mathbf{d}} \frac{(y_i-\mu(\mathbf{x}_i))^2}{2\sigma^2} + \log(\sigma) + C \end{aligned}$$
.question[What if $\sigma^2$ was fixed in advance?]
class: middle
.center[Case 2: Heteroscedastic aleatoric uncertainty]
class: middle
Same as for the homoscedastic case, except that the variance is now a function of the input: $$\arg \min_{\theta} \sum_{\mathbf{x}_i, y_i \in \mathbf{d}} \frac{(y_i-\mu(\mathbf{x}_i))^2}{2\sigma^2(\mathbf{x}_i)} + \log \sigma(\mathbf{x}_i) + C.$$
.question[What is the purpose of the $2\sigma^2(\mathbf{x}_i)$ term? What about $\log \sigma(\mathbf{x}_i)$?]
???
Take care of properly parametrizing $\sigma^2(\mathbf{x})$ so that the predicted variance stays positive, e.g. by having the network output $\log \sigma^2(\mathbf{x})$ instead.
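To make this concrete, here is a minimal PyTorch sketch of heteroscedastic regression (illustrative, not from the original lecture); the network outputs $\mu(\mathbf{x})$ and $\log \sigma^2(\mathbf{x})$, and the loss is the negative log-likelihood derived above:

```python
import torch
import torch.nn as nn

class HeteroscedasticRegressor(nn.Module):
    # Predicts both the mean and the log-variance of p(y|x).
    def __init__(self, p, hidden=64):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(p, hidden), nn.ReLU())
        self.mu = nn.Linear(hidden, 1)
        self.log_var = nn.Linear(hidden, 1)  # predicting log(sigma^2) keeps sigma^2 > 0

    def forward(self, x):
        h = self.backbone(x)
        return self.mu(h), self.log_var(h)

def gaussian_nll(y, mu, log_var):
    # (y - mu)^2 / (2 sigma^2) + log(sigma), up to an additive constant.
    return (0.5 * (y - mu) ** 2 * torch.exp(-log_var) + 0.5 * log_var).mean()
```

The homoscedastic case is recovered by replacing the `log_var` head with a single learnable parameter shared across all inputs.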
class: middle
Modelling $p(y|\mathbf{x})$ as a unimodal (Gaussian) distribution can be inadequate, since the true conditional density may be multimodal.
???
Illustrate on the blackboard.
class: middle
A Gaussian mixture model (GMM) defines instead $p(y|\mathbf{x})$ as a mixture of $K$ Gaussian components, $$p(y|\mathbf{x}) = \sum_{k=1}^K \pi_k \mathcal{N}(y; \mu_k, \sigma_k^2),$$ where $0 \leq \pi_k \leq 1$ for all $k$ and $\sum_{k=1}^K \pi_k = 1$.
class: middle
A .bold[mixture density network] (MDN) is a neural network implementation of the Gaussian mixture model: the network takes $\mathbf{x}$ as input and outputs the mixture parameters $\pi_k(\mathbf{x})$, $\mu_k(\mathbf{x})$ and $\sigma_k^2(\mathbf{x})$.
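A minimal PyTorch sketch of an MDN for a scalar output (names and sizes are illustrative, not David Ha's original implementation); training minimizes the negative log-likelihood of the mixture:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MDN(nn.Module):
    # Maps an input x to the parameters (pi_k, mu_k, sigma_k) of a
    # K-component Gaussian mixture over a scalar output y.
    def __init__(self, p=1, hidden=32, K=5):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(p, hidden), nn.Tanh())
        self.pi = nn.Linear(hidden, K)         # mixing logits
        self.mu = nn.Linear(hidden, K)         # component means
        self.log_sigma = nn.Linear(hidden, K)  # log std, so sigma_k > 0

    def forward(self, x):
        h = self.backbone(x)
        return self.pi(h), self.mu(h), self.log_sigma(h)

def mdn_nll(pi_logits, mu, log_sigma, y):
    # -log p(y|x), computed stably as -logsumexp_k [log pi_k + log N(y; mu_k, sigma_k^2)].
    # y has shape (N, 1) so that it broadcasts against the K components.
    log_pi = F.log_softmax(pi_logits, dim=-1)
    log_norm = torch.distributions.Normal(mu, log_sigma.exp()).log_prob(y)
    return -torch.logsumexp(log_pi + log_norm, dim=-1).mean()
```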
class: middle
Let us consider training data generated randomly as $$y_i = 7\sin(0.75 x_i) + 0.5 x_i + \epsilon_i,$$ with $\epsilon_i \sim \mathcal{N}(0, 1)$.
class: middle
.center[
The data can be fit with a 2-layer network producing point estimates for $y$.
]
.footnote[Credits: David Ha, Mixture Density Networks, 2015.]
class: middle
.center[
If we flip $x_i$ and $y_i$, the mapping becomes one-to-many and the same network fails to fit the data.
]
.footnote[Credits: David Ha, Mixture Density Networks, 2015.]
class: middle
.center[
A mixture density network models the data correctly, as it predicts for each input a distribution for the output, rather than a point estimate (demo).
]
.footnote[Credits: David Ha, Mixture Density Networks, 2015.]
Gaussian mixture models are a flexible way to model multimodal distributions, but they are limited by the number of components $K$, which must be fixed in advance.
Normalizing flows are a more flexible way to model complex distributions.
class: middle
Assume $p(\mathbf{z})$ is uniformly distributed over a unit cube in $\mathbb{R}^3$ and let $\mathbf{x} = f(\mathbf{z}) = 2\mathbf{z}$. Since the total probability mass must be conserved, $$p(\mathbf{x}) = p(\mathbf{z}) \frac{V_\mathbf{z}}{V_\mathbf{x}} = \frac{1}{8}.$$
???
Motivate that picking a parametric family of distributions is not always easy. We want something more flexible.
class: middle
What if $f$ is a non-linear transformation?
.footnote[Image credits: Simon J.D. Prince, Understanding Deep Learning, 2023.]
class: middle
If $f$ is non-linear,
- the Jacobian $J_f(\mathbf{z})$ of $\mathbf{x} = f(\mathbf{z})$ represents the infinitesimal linear transformation in the neighborhood of $\mathbf{z}$;
- if the function is a bijective map, then the mass must be conserved locally.

Therefore, the local change of density yields $$p(\mathbf{x}) = p(\mathbf{z}) \left| \det J_f(\mathbf{z}) \right|^{-1}.$$

Similarly, for the inverse mapping $\mathbf{z} = g(\mathbf{x}) = f^{-1}(\mathbf{x})$, we have $$p(\mathbf{x}) = p(\mathbf{z}) \left| \det J_g(\mathbf{x}) \right|.$$
???
The Jacobian matrix of a function f: R^n -> R^m at a point z in R^n is an m x n matrix that represents the linear transformation induced by the function at that point. Geometrically, the Jacobian matrix can be thought of as a matrix of partial derivatives that describes how the function locally stretches or shrinks areas and volumes in the vicinity of the point z.
The determinant of the Jacobian matrix of f at z has a geometric interpretation as the factor by which the function locally scales areas or volumes. Specifically, if the determinant is positive, then the function locally expands areas and volumes, while if it is negative, the function locally contracts areas and volumes. The absolute value of the determinant gives the factor by which the function scales the areas or volumes.
class: middle
Assume $\mathbf{z} = (\mathbf{z}_a, \mathbf{z}_b)$ and $\mathbf{x} = (\mathbf{x}_a, \mathbf{x}_b)$ are split into two halves. An affine .bold[coupling layer] is an invertible transformation defined by
- the forward mapping $\mathbf{x} = f(\mathbf{z})$: $$\mathbf{x}_a = \mathbf{z}_a, \quad \mathbf{x}_b = \mathbf{z}_b \odot \exp(s(\mathbf{z}_a)) + t(\mathbf{z}_a),$$
- the inverse mapping $\mathbf{z} = g(\mathbf{x})$: $$\mathbf{z}_a = \mathbf{x}_a, \quad \mathbf{z}_b = (\mathbf{x}_b - t(\mathbf{x}_a)) \odot \exp(-s(\mathbf{x}_a)),$$

where $s$ and $t$ are arbitrary neural networks.
???
Draw the coupling layer on the blackboard.
class: middle
For this transformation, the Jacobian $J_f(\mathbf{z})$ is lower triangular, so its determinant reduces to the product of its diagonal terms, $$\det J_f(\mathbf{z}) = \prod_i \exp\left(s(\mathbf{z}_a)_i\right) = \exp\left(\sum_i s(\mathbf{z}_a)_i\right).$$

Therefore, the log-likelihood is $$\begin{aligned}\log p(\mathbf{x}) &= \log p(\mathbf{z}) - \sum_i s(\mathbf{z}_a)_i.\end{aligned}$$
class: middle
A normalizing flow is a change of variable obtained by composing a sequence of invertible transformations $f_1, ..., f_K$, mapping a simple base density $p(\mathbf{z})$ to a richer density $p(\mathbf{x})$.
.center.width-100[![](figures/lec10/FlowTransformLayers.svg)]
.footnote[Image credits: Simon J.D. Prince, Understanding Deep Learning, 2023.]
class: middle
Formally, $$\begin{aligned} &\mathbf{z}_0 \sim p(\mathbf{z}) \\ &\mathbf{z}_k = f_k(\mathbf{z}_{k-1}), \quad k=1,...,K \\ &\mathbf{x} = \mathbf{z}_K = f_K \circ ... \circ f_1(\mathbf{z}_0). \end{aligned}$$
The change of variable theorem yields $$\log p(\mathbf{x}) = \log p(\mathbf{z}_0) - \sum_{k=1}^K \log \left| \det J_{f_k}(\mathbf{z}_{k-1}) \right|.$$
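The two previous slides translate into a short PyTorch sketch (illustrative, not a reference implementation): each coupling layer returns its log-Jacobian-determinant, and the flow accumulates them to evaluate $\log p(\mathbf{x})$ for maximum-likelihood training.

```python
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    # x_a = z_a ; x_b = z_b * exp(s(z_a)) + t(z_a), as defined above.
    def __init__(self, dim, hidden=64):
        super().__init__()
        half = dim // 2
        self.s = nn.Sequential(nn.Linear(half, hidden), nn.ReLU(), nn.Linear(hidden, dim - half))
        self.t = nn.Sequential(nn.Linear(half, hidden), nn.ReLU(), nn.Linear(hidden, dim - half))

    def forward(self, z):
        za, zb = z.chunk(2, dim=-1)
        s = self.s(za)
        x = torch.cat([za, zb * s.exp() + self.t(za)], dim=-1)
        return x, s.sum(dim=-1)              # log|det J_f| = sum_i s(z_a)_i

    def inverse(self, x):
        xa, xb = x.chunk(2, dim=-1)
        s = self.s(xa)
        z = torch.cat([xa, (xb - self.t(xa)) * (-s).exp()], dim=-1)
        return z, -s.sum(dim=-1)             # log|det J_g| = -sum_i s(x_a)_i

class NormalizingFlow(nn.Module):
    # x = f_K o ... o f_1(z_0). In practice the two halves should be swapped
    # (or the dimensions permuted) between layers so that all of them get transformed.
    def __init__(self, dim, K=4):
        super().__init__()
        self.layers = nn.ModuleList([AffineCoupling(dim) for _ in range(K)])
        self.base = torch.distributions.Normal(0.0, 1.0)

    def log_prob(self, x):
        log_det = x.new_zeros(x.shape[0])
        for layer in reversed(self.layers):  # run the inverse maps g_K, ..., g_1
            x, ld = layer.inverse(x)
            log_det = log_det + ld
        return self.base.log_prob(x).sum(dim=-1) + log_det

flow = NormalizingFlow(dim=2)
loss = -flow.log_prob(torch.randn(128, 2)).mean()  # maximum-likelihood objective
```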
class: middle
.center[Normalizing flows can fit complex multimodal discontinuous densities.]
.footnote[Image credits: Wehenkel and Louppe, 2019.]
class: middle
Normalizing flows can also estimate conditional densities $p(\mathbf{x}|c)$:
- Transformations are made conditional by taking $c$ as an additional input. For example, in a coupling layer, the networks can be upgraded to $s(\mathbf{z}, c)$ and $t(\mathbf{z}, c)$.
- Optionally, the base distribution $p(\mathbf{z})$ can also be made conditional on $c$.

(Accordingly, the aleatoric uncertainty of an output $y$ given an input $\mathbf{x}$ can be modelled with a conditional normalizing flow for $p(y|\mathbf{x})$.)
class: middle
.footnote[Image credits: Winkler et al, 2019.]
class: middle
.grid[ .kol-1-2[ Replace the discrete sequence of transformations with a neural ODE with reversible dynamics, such that $$\begin{aligned} &\mathbf{z}_0 \sim p(\mathbf{z})\\ &\frac{d\mathbf{z}(t)}{dt} = f(\mathbf{z}(t), t, \theta)\\ &\mathbf{x} = \mathbf{z}(1) = \mathbf{z}_0 + \int_0^1 f(\mathbf{z}(t), t, \theta) dt. \end{aligned}$$ ] .kol-1-2.center[ ] ]
The instantaneous change of variable yields $$\log p(\mathbf{x}) = \log p(\mathbf{z}_0) - \int_0^1 \text{tr}\left( \frac{\partial f}{\partial \mathbf{z}(t)} \right) dt.$$
.footnote[Image credits: Grathwohl et al, 2018.]
class: middle
class: middle
Epistemic uncertainty accounts for uncertainty in the model or in its parameters. It captures our ignorance about which model can best explain the collected data. It .bold[can be explained away] given enough data.
.footnote[Credits: Kendall and Gal, What Uncertainties Do We Need in Bayesian Deep Learning for Computer Vision?, 2017.]
???
Once we have decided on a model of the true data generating process, we face uncertainty in how much we can trust the model or its parameters.
To capture epistemic uncertainty in a neural network, we model our ignorance with a prior distribution $p(\mathbf{\omega})$ over its weights, turning the network into a .bold[Bayesian neural network].
.center[
.width-60[] .circle.width-30[
]
]
class: middle
The prior predictive distribution at $\mathbf{x}$ is obtained by integrating over the weights, $$p(y|\mathbf{x}) = \int p(y|\mathbf{x}, \mathbf{\omega}) p(\mathbf{\omega}) d\mathbf{\omega}.$$

Given training data $\mathbf{d} = \{(\mathbf{x}_1, y_1), ..., (\mathbf{x}_N, y_N)\}$, Bayes' rule gives the posterior over the weights, $$p(\mathbf{\omega}|\mathbf{d}) = \frac{p(\mathbf{d}|\mathbf{\omega}) p(\mathbf{\omega})}{p(\mathbf{d})}.$$

The posterior predictive distribution is then given by $$p(y|\mathbf{x}, \mathbf{d}) = \int p(y|\mathbf{x}, \mathbf{\omega}) p(\mathbf{\omega}|\mathbf{d}) d\mathbf{\omega}.$$
class: middle
Bayesian neural networks are easy to formulate, but notoriously .bold[difficult] to perform inference in.
Therefore, we must rely on approximations.
Variational inference can be used for building an approximation $q(\mathbf{\omega};\nu)$ of the posterior $p(\mathbf{\omega}|\mathbf{d})$.

We can show that minimizing $\text{KL}(q(\mathbf{\omega};\nu) || p(\mathbf{\omega}|\mathbf{d}))$ with respect to the variational parameters $\nu$ is identical to maximizing the evidence lower bound objective (ELBO) $$\text{ELBO}(\nu) = \mathbb{E}_{q(\mathbf{\omega};\nu)}\left[\log p(\mathbf{d}|\mathbf{\omega})\right] - \text{KL}(q(\mathbf{\omega};\nu) || p(\mathbf{\omega})).$$
???
Do it on the blackboard.
class: middle
The integral in the ELBO is not tractable for almost all $q$, but it can be maximized with stochastic gradient ascent:
- Sample $\hat{\omega} \sim q(\mathbf{\omega};\nu)$.
- Do one step of maximization with respect to $\nu$ on $$\hat{L}(\nu) = \log p(\mathbf{d}|\hat{\omega}) - \log\frac{q(\hat{\omega};\nu)}{p(\hat{\omega})}.$$
In the context of Bayesian neural networks, this procedure is also known as Bayes by backprop (Blundell et al, 2015).
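A minimal sketch of one such step, for a single linear model with a mean-field Gaussian $q(\mathbf{\omega};\nu) = \mathcal{N}(\mu, \sigma^2)$; all names, sizes and the toy data below are illustrative:

```python
import torch
import torch.nn.functional as F

mu = torch.zeros(16, 1, requires_grad=True)          # variational means
rho = torch.full((16, 1), -3.0, requires_grad=True)  # sigma = softplus(rho) > 0
optimizer = torch.optim.Adam([mu, rho], lr=1e-2)
prior = torch.distributions.Normal(0.0, 1.0)         # p(omega)

x, y = torch.randn(128, 16), torch.randn(128, 1)     # toy (full-batch) data

sigma = F.softplus(rho)
w = mu + sigma * torch.randn_like(mu)                # sample w ~ q(omega; nu), reparameterized
q = torch.distributions.Normal(mu, sigma)

log_lik = -0.5 * ((x @ w - y) ** 2).sum()              # log p(d|w) for unit Gaussian noise, up to a constant
log_ratio = (q.log_prob(w) - prior.log_prob(w)).sum()  # log [q(w; nu) / p(w)]
loss = -(log_lik - log_ratio)                          # negative single-sample estimate of L(nu)
optimizer.zero_grad(); loss.backward(); optimizer.step()
```

Note that the reparameterization $\hat{\omega} = \mu + \sigma \epsilon$ is what lets gradients flow back to the variational parameters through the sample.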
Dropout is an empirical technique that was first proposed to avoid overfitting in neural networks.
At each training step:
- Remove each node in the network with a probability $p$.
- Update the weights of the remaining nodes with backpropagation.
???
Remind the students we used Dropout in Lec 8 when implementing a Transformer.
class: middle
At test time, either:
- Make predictions using the trained network without dropout, rescaling the weights by the keep probability $1-p$ (fast and standard).
- Sample $T$ neural networks using dropout and average their predictions (slower but better principled).
class: middle, center
class: middle
- It makes the learned weights of a node less sensitive to the weights of the other nodes.
- This forces the network to learn several independent representations of the patterns and thus decreases overfitting.
- It approximates Bayesian model averaging.
class: middle
What variational family $q(\omega;\nu)$ corresponds to dropout?
- Let us split the weights $\omega$ per layer, $\omega = \{ \mathbf{W}_1, ..., \mathbf{W}_L \}$, where $\mathbf{W}_i$ is further split per unit, $\mathbf{W}_i = \{ \mathbf{w}_{i,1}, ..., \mathbf{w}_{i,q_i} \}$.
- Variational parameters $\nu$ are split similarly into $\nu = \{ \mathbf{M}_1, ..., \mathbf{M}_L \}$, with $\mathbf{M}_i = \{ \mathbf{m}_{i,1}, ..., \mathbf{m}_{i,q_i} \}$.
- Then, the proposed $q(\omega;\nu)$ is defined as
$$\begin{aligned}
q(\omega;\nu) &= \prod_{i=1}^L q(\mathbf{W}_i; \mathbf{M}_i) \\
q(\mathbf{W}_i; \mathbf{M}_i) &= \prod_{k=1}^{q_i} q(\mathbf{w}_{i,k}; \mathbf{m}_{i,k}) \\
q(\mathbf{w}_{i,k}; \mathbf{m}_{i,k}) &= p\delta_0(\mathbf{w}_{i,k}) + (1-p)\delta_{\mathbf{m}_{i,k}}(\mathbf{w}_{i,k}),
\end{aligned}$$
where $\delta_a(x)$ denotes a (multivariate) Dirac distribution centered at $a$.
???
Note that this assumes the network is parameterized as a stack of fully connected layers, with one weight vector $\mathbf{w}_{i,k}$ per unit.
class: middle
Given the previous definition for $q$, sampling $\hat{\omega} \sim q(\omega;\nu)$ is done as follows:
- Draw binary $z_{i,k} \sim \text{Bernoulli}(1-p)$ for each layer $i$ and unit $k$.
- Compute $\hat{\mathbf{W}}_i = \mathbf{M}_i \text{diag}([z_{i,k}]_{k=1}^{q_{i-1}})$, where $\mathbf{M}_i$ denotes a matrix composed of the columns $\mathbf{m}_{i,k}$.
.grid[
.kol-3-5[
That is, multiplying $\mathbf{M}_i$ by a diagonal matrix of Bernoulli variables randomly sets some of its columns to zero.

This is strictly equivalent to dropout, i.e. removing units from the network with probability $p$.
]
.kol-2-5[.center.width-100[]]
]
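This equivalence can be checked in a few lines of PyTorch (sizes are illustrative):

```python
import torch

p = 0.5                                        # dropout probability
M = torch.randn(4, 3)                          # variational means M_i for one layer
z = torch.bernoulli((1 - p) * torch.ones(3))   # z_k ~ Bernoulli(1 - p)
W_hat = M * z                                  # scales each column m_k by z_k, i.e. M_i diag(z)
h = torch.relu(W_hat @ torch.randn(3))         # same as applying dropout to the layer inputs
```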
class: middle
Therefore, one step of stochastic gradient descent on the ELBO becomes:
- Sample $\hat{\omega} \sim q(\mathbf{\omega};\nu)$ $\Leftrightarrow$ Randomly set units of the network to zero $\Leftrightarrow$ Dropout.
- Do one step of maximization with respect to $\nu = \{ \mathbf{M}_i \}$ on $$\hat{L}(\nu) = \log p(\mathbf{d}|\hat{\omega}) - \text{KL}(q(\mathbf{\omega};\nu) || p(\mathbf{\omega})).$$
class: middle
Maximizing $\hat{L}(\nu)$ is equivalent to minimizing $$-\hat{L}(\nu) = -\log p(\mathbf{d}|\hat{\omega}) + \text{KL}(q(\mathbf{\omega};\nu) || p(\mathbf{\omega})).$$
This is also equivalent to one minimization step of a standard classification or regression objective:
- The first term is the typical objective (such as the cross-entropy).
- The second term forces $q$ to remain close to the prior $p(\omega)$.
  - If $p(\omega)$ is Gaussian, minimizing the $\text{KL}$ is equivalent to $\ell_2$ regularization.
  - If $p(\omega)$ is Laplacian, minimizing the $\text{KL}$ is equivalent to $\ell_1$ regularization.
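To see why (a worked version of the two bullet points above; the prior scales $\sigma_p$ and $b$ are illustrative symbols), note that minimizing the $\text{KL}$ with respect to $\nu$ only involves the cross-entropy term $\mathbb{E}_{q}\left[-\log p(\omega)\right]$, and
$$-\log p(\omega) = -\sum_i \log \mathcal{N}(\omega_i; 0, \sigma_p^2) = \frac{1}{2\sigma_p^2} \|\omega\|_2^2 + C \quad \text{(Gaussian prior, } \ell_2\text{)}$$
$$-\log p(\omega) = -\sum_i \log \frac{1}{2b} \exp\left(-\frac{|\omega_i|}{b}\right) = \frac{1}{b} \|\omega\|_1 + C' \quad \text{(Laplacian prior, } \ell_1\text{)}$$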
class: middle
Conversely, this shows that training a network with dropout and a standard classification or regression objective implicitly performs variational inference to match the posterior distribution of the weights.
class: middle
Proper uncertainty estimates at $\mathbf{x}$ can be obtained in a principled way using Monte-Carlo integration:
- Draw $T$ sets of network parameters $\hat{\omega}_t$ from $q(\omega;\nu)$.
- Compute the predictions for the $T$ networks, $\{ f(\mathbf{x};\hat{\omega}_t) \}_{t=1}^T$.
- Approximate the predictive mean and variance as
$$
\begin{aligned}
\mathbb{E}_{p(y|\mathbf{x},\mathbf{d})}\left[y\right] &\approx \frac{1}{T} \sum_{t=1}^T f(\mathbf{x};\hat{\omega}_t) \\
\mathbb{V}_{p(y|\mathbf{x},\mathbf{d})}\left[y\right] &\approx \sigma^2 + \frac{1}{T} \sum_{t=1}^T f(\mathbf{x};\hat{\omega}_t)^2 - \hat{\mathbb{E}}\left[y\right]^2,
\end{aligned}
$$
where $\sigma^2$ is the assumed level of noise in the observational model.
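A minimal PyTorch sketch of this procedure with dropout as $q$ (illustrative; `model` stands for any network containing `nn.Dropout` layers):

```python
import torch

def mc_dropout_predict(model, x, T=100, sigma2=0.0):
    # Keep dropout active at test time and average T stochastic forward passes.
    # Caveat: model.train() also switches layers such as batch-norm to training
    # mode; for a model whose only stochastic layers are dropout, this is safe.
    model.train()
    with torch.no_grad():
        ys = torch.stack([model(x) for _ in range(T)])  # shape (T, N, ...)
    mean = ys.mean(dim=0)                               # (1/T) sum_t f(x; w_t)
    var = sigma2 + (ys ** 2).mean(dim=0) - mean ** 2    # sigma^2 + (1/T) sum_t f^2 - mean^2
    return mean, var
```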
class: middle, center
(demo)
class: middle
.footnote[Credits: Kendall and Gal, What Uncertainties Do We Need in Bayesian Deep Learning for Computer Vision?, 2017.]
exclude: true
Consider the 1-layer MLP with a hidden layer of size $q$ and a bounded activation function $\sigma$: $$\begin{aligned}f(x) &= b + \sum_{j=1}^q v_j h_j(x) \\ h_j(x) &= \sigma(a_j + u_j x)\end{aligned}$$

Assume Gaussian priors $v_j \sim \mathcal{N}(0, \sigma_v^2)$, $b \sim \mathcal{N}(0, \sigma_b^2)$, $u_j \sim \mathcal{N}(0, \sigma_u^2)$ and $a_j \sim \mathcal{N}(0, \sigma_a^2)$.
exclude: true
class: middle
For a fixed value $x^{(1)}$, let us consider the prior distribution of $f(x^{(1)})$ implied by the prior distributions over the weights and biases.

We have $$\mathbb{E}\left[v_j h_j(x^{(1)})\right] = \mathbb{E}\left[v_j\right] \mathbb{E}\left[h_j(x^{(1)})\right] = 0,$$ since $v_j$ and $h_j(x^{(1)})$ are independent and $v_j$ has zero mean.

The variance of the contribution of each hidden unit $h_j$ is $$\mathbb{V}\left[v_j h_j(x^{(1)})\right] = \mathbb{E}\left[v_j^2\right] \mathbb{E}\left[h_j(x^{(1)})^2\right] = \sigma_v^2 \mathbb{E}\left[h_j(x^{(1)})^2\right],$$ which is finite since $h_j$ is bounded.

We define $V(x^{(1)}) \triangleq \mathbb{E}\left[h_j(x^{(1)})^2\right]$, which is the same for all $j$.
exclude: true
class: middle
By the Central Limit Theorem, as $q \to \infty$, the total contribution of the hidden units, $\sum_{j=1}^q v_j h_j(x^{(1)})$, to the value of $f(x^{(1)})$ becomes a Gaussian of variance $q \sigma_v^2 V(x^{(1)})$.

The bias $b$ is also Gaussian, of variance $\sigma_b^2$, so for large $q$ the prior distribution of $f(x^{(1)})$ is approximately a Gaussian of mean zero and variance $\sigma_b^2 + q \sigma_v^2 V(x^{(1)})$.
exclude: true
class: middle
Accordingly, setting $\sigma_v = \omega_v q^{-\frac{1}{2}}$ for some fixed $\omega_v$, the prior distribution of $f(x^{(1)})$ converges to a Gaussian of mean zero and variance $\sigma_b^2 + \omega_v^2 V(x^{(1)})$ as $q \to \infty$.

For two or more fixed values $x^{(1)}, x^{(2)}, ...$, a similar argument shows that, as $q \to \infty$, the joint distribution of the outputs converges to a multivariate Gaussian with means of zero and covariances $$\mathbb{E}\left[f(x^{(1)}) f(x^{(2)})\right] = \sigma_b^2 + \omega_v^2 C(x^{(1)}, x^{(2)}),$$ where $C(x^{(1)}, x^{(2)}) \triangleq \mathbb{E}\left[h_j(x^{(1)}) h_j(x^{(2)})\right]$.
exclude: true
class: middle
This result states that for any set of fixed points $x^{(1)}, x^{(2)}, ...$, the joint distribution of $f(x^{(1)}), f(x^{(2)}), ...$ is a multivariate Gaussian with zero mean and a fixed covariance function.
In other words, the infinitely wide 1-layer MLP converges towards a Gaussian process.
.center[(Neal, 1995)]
class: end-slide, center
count: false
The end.