Variational Laplace#

We have finally made it to the main topic of the book: variational Laplace.

We have seen in the previous section that if we want to answer a question of the kind “Is there a relationship between penguins’ flipper lengths and their weights?”, we can use a linear model. To answer our question in Bayesian terms, we want to obtain the posterior probability distribution, which tells us how probable each possible value of a parameter of our model is, given the data we have measured in our experiment. If it is highly unlikely that the parameter that relates penguin flipper lengths to their weights in the model (\(\beta_1\)) is equal to 0 (or close to it), then we can answer our question and say “yes, there is a relationship between penguin flipper lengths and their weights”, or even better, we can say “I am 95% confident that the relationship between the two lies between this value and that value”.

Following Bayes’ theorem, to compute the posterior we need to multiply the likelihood of the data under all values of our parameters by our prior belief about how likely each parameter value is, and divide the whole thing by the marginal likelihood, also called the model evidence, which is the likelihood of the data after marginalizing over all possible values of the parameters of our model. Because we are dealing with a linear model, the likelihood of our data is a multivariate normal distribution, due to the assumption that the error of our model is normally distributed.

There is, however, one main problem: we can’t solve the marginal likelihood analytically. This is because there is an integral in the formula of the marginal likelihood. To compute it, we would need to evaluate a function one by one for an infinity of parameter values, and then sum all these values together, which is of course impossible to do.

Saying the same thing with mathematical formulae#

We can also recapitulate all we have seen before with mathematical formulae. This does not say anything different from the above, it is just the same thing expressed in a different way (albeit a bit more detailed and precise).

The linear model:#

To answer our question “Is there a relationship between penguins’ flipper lengths and their weights?”, we can use a linear model:

\[y = \boldsymbol{X}\boldsymbol{\beta} + \epsilon\]

Where:

  • \(y= \begin{bmatrix}y_1\\y_2\\...\end{bmatrix}\): is a vector containing each data point we are trying to model (i.e. the penguin weights)

  • \(\boldsymbol{X}= \begin{bmatrix}x_{0, 1},\ x_{1, 1},\ ...\\x_{0, 2},\ x_{1, 2},\ ...\\...,\ ...,\ ...\end{bmatrix}\): is a matrix containing the regressors, also called the design matrix. Each column contains all the values of one regressor, and each row contains the values of all regressors associated with one data point. In our example, \(x_0\) is the intercept, so it is the same in each row: \(x_{0, 1} = x_{0, 2} = 1\), while \(x_1\) is the flipper length of our penguins, so \(x_{1, 1}\) is the flipper length of the first penguin, whose weight is \(y_1\), and so on for all our penguins (a short code sketch after this list shows how such a matrix could be built)

  • \(\boldsymbol{\beta}=\begin{bmatrix}\beta_0\\\beta_1\\...\end{bmatrix}\): the weights of our regressors. We don’t know the true values, but we can estimate the values of these parameters that generate the line that fits the data best, i.e. that minimizes the residual sum of squares (RSS), by using OLS (see below). The values found by OLS are the maximum likelihood estimates (MLE), which means that the likelihood of the data is maximized under these parameter values (see the likelihood function below)

  • \(\epsilon \sim \mathcal{N}(0, \sigma^2)\): the error term follows a normal distribution with mean 0 and variance \(\sigma^2\). This is an assumption, meaning that if it is not true, we shouldn’t use this kind of model. We assume that the mean of the distribution is 0, which means that negative errors (underestimations) are just as likely as positive errors (overestimations). The \(\sigma^2\), on the other hand, is not assumed and depends on the data. The variance parameter of the distribution controls its spread: a large \(\sigma^2\) implies that large errors are more likely than when \(\sigma^2\) is small. We can also find the MLE for this parameter, i.e. the value of \(\sigma^2\) under which the data are the most likely
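
To make the notation above concrete, here is a minimal sketch in Python of how \(y\) and \(\boldsymbol{X}\) could be built. The numbers are simulated and arbitrary (not the real penguin dataset), purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20

# Hypothetical flipper lengths (mm) for n penguins
flipper_length = rng.normal(200, 10, size=n)

# Simulated weights (g): an arbitrary "true" intercept and slope, plus normal noise
y = 1000 + 20 * flipper_length + rng.normal(0, 300, size=n)

# Design matrix X: a column of ones (the intercept regressor x_0)
# and a column with the flipper lengths (the regressor x_1)
X = np.column_stack([np.ones(n), flipper_length])
print(X.shape)  # (20, 2): one row per penguin, one column per regressor
```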

In this model, \(y\) and \(\boldsymbol{X}\) are fixed: they are the data we observed (i.e. measured, collected…). The likelihood of observing these values depends on the values of the parameters \(\beta\) and \(\sigma^2\). Accordingly, we can try to find the \(\beta\) and \(\sigma^2\) under which the data are most likely, which is what we do when we fit the model to the data. The formula for finding the MLE for the \(\beta\) is known as ordinary least squares (OLS):

\[\hat{\beta} = (\boldsymbol{X}^T\boldsymbol{X})^{-1}\boldsymbol{X}^Ty\]

This formula looks a bit different from the OLS formulae we saw before for calculating \(\hat{\beta_1}\) and \(\hat{\beta_0}\), because it is in vector format and it generalizes to all the \(\beta\), even if you have more than 2.
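
Continuing the same sketch (reusing the hypothetical `X` and `y` defined earlier), the OLS estimates can be computed directly from this formula; `np.linalg.solve` is used rather than an explicit matrix inverse, which is the numerically safer way to apply it:

```python
# MLE of the betas: beta_hat = (X^T X)^{-1} X^T y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)  # estimated intercept (beta_0) and slope (beta_1)
```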

The MLE of the variance term of the \(\epsilon\) distribution is:

\[\hat{\sigma}^2 = \frac{(y-\boldsymbol{X}\boldsymbol{\hat{\beta}})^T(y-\boldsymbol{X}\boldsymbol{\hat{\beta}})}{n}\]
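
Still with the same simulated data, this is simply the average squared residual:

```python
# MLE of the error variance: mean of the squared residuals
residuals = y - X @ beta_hat
sigma2_hat = (residuals @ residuals) / len(y)
print(sigma2_hat)
```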

In the case of the \(\epsilon\) of our model, we write \(\sim\) (“is distributed as”) rather than \(=\), because the error follows a distribution (a probability density function); it is not equal to it.

Bayes theorem applied to the linear model#

When we ask the question “Is there a relationship between penguins’ flipper lengths and their weights?”, we want to draw an inference on the true value of the \(\beta\) parameter. We can never know the true value, but we can infer how probable each value of the parameter is given the data, for which we need Bayes’ theorem, which states that:

\[P(\Theta|y) = \frac{P(y|\Theta)P(\Theta)}{P(y)}\]

In our specific case, we have:

\[P(\beta, \sigma^2|y) = \frac{P(y|\beta, \sigma^2)P(\beta, \sigma^2)}{P(y)}\]

Where \(\beta\) is a vector containing the values of the various \(\beta\) parameters of our model. So in our penguin example, it contains both \(\beta_0\) and \(\beta_1\).

The likelihood of the data is a multivariate normal distribution (because the error is normally distributed around 0), defined as:

\[P(y|\beta, \sigma^2) = \frac{1}{(2\pi\sigma^2)^{n/2}}exp(-\frac{1}{2\sigma^2}(y-\boldsymbol{X}\beta)^T(y-\boldsymbol{X}\beta))\]
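
As a small sanity check on this formula (reusing the simulated `X`, `y`, `beta_hat` and `sigma2_hat` from the sketch above), the same density can be evaluated with `scipy.stats.multivariate_normal`. The log is used because the raw density quickly underflows as the number of data points grows:

```python
from scipy.stats import multivariate_normal

def log_likelihood(beta, sigma2, X, y):
    # Multivariate normal with mean X @ beta and covariance sigma^2 * I
    # (independent, equal-variance errors)
    mean = X @ beta
    cov = sigma2 * np.eye(len(y))
    return multivariate_normal.logpdf(y, mean=mean, cov=cov)

# Evaluated at the MLEs, this is (by definition) the maximum of the log-likelihood
print(log_likelihood(beta_hat, sigma2_hat, X, y))
```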

The prior distribution is:

\[P(\beta, \sigma^2) = P(\beta) \times P(\sigma^2)\]

Where \(P(\beta)\) is also a multivariate normal distribution, like so:

\[P(\beta) = \frac{1}{(2\pi)^{p/2}|\Sigma|^{1/2}}exp(-\frac{1}{2}(\beta - \mu)^T\Sigma^{-1}(\beta-\mu))\]

Where:

  • \(\mu\) is the prior mean, meaning the value we think is the most likely for each \(\beta\) parameter

  • \(\Sigma\) is the prior variance-covariance matrix. Along the diagonal, it contains the variance of the prior distribution of each \(\beta\) parameter (i.e. how likely we think it is that the true value of that \(\beta\) is far away from the mean we declared), and off the diagonal it contains the covariance between two \(\beta\) parameters. We should set the off-diagonal terms to 0 if we think that each pair of \(\beta\) is independent, and to non-zero values otherwise. What we mean by the \(\beta\) not being independent is that we have reasons to believe that observing a certain value of one \(\beta\) makes another \(\beta\) likely to have a larger or smaller value (depending on the sign of the covariance)

And \(P(\sigma^2)\) is an inverse-gamma distribution with shape \(\alpha\) and scale \(b\):

\[P(\sigma^2) = \frac{b^\alpha}{\Gamma(\alpha)}(\sigma^2)^{-\alpha-1}exp(-\frac{b}{\sigma^2})\]
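
Putting the two parts of the prior together in code, with prior means, prior variances, \(\alpha\) and \(b\) chosen arbitrarily here as placeholders (they are not recommended values):

```python
from scipy.stats import invgamma, multivariate_normal

mu = np.zeros(2)              # prior means for beta_0 and beta_1
Sigma = np.diag([1e6, 1e2])   # prior variances; zero off-diagonal = a priori independent betas
alpha, b = 2.0, 1.0           # shape and scale of the inverse-gamma prior on sigma^2

def log_prior(beta, sigma2):
    # P(beta, sigma^2) = P(beta) * P(sigma^2), so the log-prior is a sum of logs
    log_p_beta = multivariate_normal.logpdf(beta, mean=mu, cov=Sigma)
    log_p_sigma2 = invgamma.logpdf(sigma2, a=alpha, scale=b)
    return log_p_beta + log_p_sigma2

print(log_prior(beta_hat, sigma2_hat))
```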

We can’t compute the marginal likelihood…#

The reason the book isn’t finished already is because we have a problem: we can’t compute the marginal likelihood. The marginal likelihood is:

\[P(y) = \int P(y|\boldsymbol{\beta}, \sigma^2)P(\boldsymbol{\beta}, \sigma^2)d\beta d\sigma^2 = \int P(y|\boldsymbol{\beta}, \sigma^2) \times P(\boldsymbol{\beta}) \times P(\sigma^2)d\beta d\sigma^2\]
\[P(y) = \int \frac{1}{(2\pi\sigma^2)^{n/2}}exp(-\frac{1}{2\sigma^2}(y-\boldsymbol{X}\beta)^T(y-\boldsymbol{X}\beta)) \times \frac{1}{(2\pi)^{p/2}|\Sigma|^{1/2}}exp(-\frac{1}{2}(\beta - \mu)^T\Sigma^{-1}(\beta-\mu)) \times \frac{b^\alpha}{\Gamma(\alpha)}(\sigma^2)^{-\alpha-1}exp(-\frac{b}{\sigma^2})\ d\beta d\sigma^2\]

The marginal likelihood requires integrating over all possible values of \(\beta\) and \(\sigma^2\), which is impossible to do by brute force because there is literally an infinity of possible values. This is a fact we have to accept.
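
To get an intuition for what this integral represents (and only for that; this is neither a practical solution nor the variational Laplace method), one can crudely approximate it by Monte Carlo: draw many \((\beta, \sigma^2)\) values from the prior and average the likelihood over the draws. The sketch below reuses the functions and placeholder priors defined earlier; the estimate is noisy and becomes hopeless as the number of parameters grows:

```python
from scipy.special import logsumexp

n_draws = 5000
beta_draws = rng.multivariate_normal(mu, Sigma, size=n_draws)
sigma2_draws = invgamma.rvs(a=alpha, scale=b, size=n_draws, random_state=rng)

log_liks = np.array([
    log_likelihood(beta_draws[i], sigma2_draws[i], X, y)
    for i in range(n_draws)
])

# log P(y) is approximated by log( (1/N) * sum_i P(y | beta_i, sigma2_i) ),
# computed in log space to avoid underflow
log_marginal_estimate = logsumexp(log_liks) - np.log(n_draws)
print(log_marginal_estimate)
```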

Variational Laplace: a method to approximate the marginal likelihood#

If we can’t compute the marginal likelihood, we can’t compute the posterior, and we can’t answer our original question. This is where the variational Laplace method comes into play. This method calls on rather advanced mathematics (by my standards at least), but the gist is the following: instead of trying to solve the marginal likelihood, we try to approximate it by calculating something that should approach it. There are many bits and pieces of mathematics associated with the method (both the implementation and the justification of why it works). None of them are incredibly complicated on their own, but it is a bit difficult to understand how they all fit together. We will explore each bit in a progressive order, so that by the end you hopefully have a good understanding of it! Let’s dive in.