Approximating the log posterior using quadratic approximation and the Laplace approximation#
Quick recap#
We started with our simple question: ‘Is there a relationship between the flipper length of penguins and their weight?’. We saw that we can use the linear model to answer that question based on some data. We also need Bayes’ theorem to be able to calculate the posterior, which tells us how likely different values of the parameters of our model are, given the data. And if, given our data, the parameter that characterizes the relationship between flipper length and weight is unlikely to be close to 0, then we can answer our question with some amount of confidence.
In order to apply Bayes’ theorem to our linear model, we need to multiply the likelihood of our data given any values of the parameters (\(P(y|\Theta)\)) by our prior (\(P(\Theta)\)), which characterizes our belief about likely values of the parameters of our model. Then, we need to divide that whole thing by a scaling constant: the marginal likelihood, or model evidence, which is the integral of the numerator (\(P(y|\Theta) * P(\Theta)\)) over all possible values of \(\Theta\). And as we saw, in the case of our model, the likelihood is a multivariate normal distribution, while the prior consists of the product of a normal distribution for the \(\beta\) parameters of our model and a Delta distribution for the \(\sigma^2\) parameter of our model. And unfortunately, with this combination of distributions, we can’t find an easy (i.e. analytical) solution for the integral in the denominator.
In other words, for our simple model, we can’t compute the model evidence. In the previous chapter, we found a way to circumvent that by relying on Jensen’s inequality. Instead of directly computing the model evidence, we can optimize the free energy functional, which is defined as:
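It is the expectation, under \(Q(\Theta)\) itself, of the difference between the log of the joint probability and the log of \(Q(\Theta)\): $\( \begin{align} F[Q(\Theta)] = \mathbb{E}_{Q(\Theta)}[ln P(y, \Theta) - ln Q(\Theta)] \end{align} \)$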
So we have to find the \(Q(\Theta)\) distribution for which the output of the free energy functional is maximal. That maximal output is the free energy, which is an approximation of the log of the marginal likelihood. And the distribution \(Q(\Theta)\) at which the functional is maximal is itself an approximation of the posterior.
So now, in order to answer our question ‘Is there a relationship between the flipper length of penguins and their weight?’, our goal is to estimate the parameters of our model that maximize the free energy functional. This will yield the model evidence and the approximate posterior, which gives us our confidence about the true value of our model parameter, and thereby the answer to our question. In other words, what we have to do conceptually is try out a lot of different possible \(Q(\Theta)\) distributions until we find the one that yields the largest value of \(F[Q(\Theta)]\).
Yeah? All good? Perfect, now let’s get to it.
Dealing with the integral#
Wait a minute: how do we actually calculate the result of the free energy functional for each \(Q(\Theta)\) we try? Because just to reiterate, even though the integral symbol is absent from the free energy equation, it is still there! It is hidden in the \(\mathbb{E}\) symbol, which is an integral over all values of \(\Theta\). We can actually rewrite it like so:
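Written with the integral made explicit: $\( \begin{align} F[Q(\Theta)] = \int Q(\Theta) \big[ln P(y, \Theta) - ln Q(\Theta)\big] d\Theta \end{align} \)$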
One thing we can do to make the free energy functional a bit easier to deal with is this:
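We can use the linearity of the expectation to split it into two separate terms: $\( \begin{align} F[Q(\Theta)] = \mathbb{E}_{Q(\Theta)}[ln P(y, \Theta)] - \mathbb{E}_{Q(\Theta)}[ln Q(\Theta)] \end{align} \)$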
That doesn’t solve anything yet. But now, we can deal with the expectation of the log of the joint probability \(P(y, \Theta)\) and the expectation of the log of the approximate posterior \(Q(\Theta)\) separately, meaning that if we can figure each part out, then we can simply subtract the latter from the former.
So how do we solve the integral in each of these parts? Unfortunately, we cannot solve it analytically for either the log of the joint probability or the log of the approximate posterior (which again means that to calculate the exact result we would need to compute the probability at every possible value of \(\Theta\), of which there are infinitely many, so it’s impossible to calculate them all). What we will do instead is approximate the part inside the expectation by some other function, hoping that we can use tricks to derive analytically the expectation of the function we use as an approximation. Or said differently, we will replace the function inside the expectation by some other function that we know should be roughly equal to it, and try to calculate analytically the expectation of that other function instead. It’s really important to understand this. Just to be extra clear: we will approximate what’s inside the expectation, so \(ln P(y, \Theta)\) and \(ln Q(\Theta)\), not the expectation itself directly.
This may all seem confusing and contradictory: if we approximate the function inside the expectation by another function, then we are calculating the expectation of the approximation and not of the function itself. The expectation of the approximation should yield different results from the expectation of the actual function we are interested in. So are we in fact calculating an approximation of the expectation of the function when we are calculating the expectation of the approximation? That is correct: we will in fact be calculating an approximation of each expectation as a way to sidestep the integral. But it is important to understand that we do so by finding a function that approximates what’s inside each of the expectations, and not by finding a function that approximates the expectation directly.
In this chapter and the next, the maths is going to be a little bit more complicated than what we have seen so far, and goes beyond the typical high school curriculum. In a sense, this chapter and the next are a bit the opposite of the previous chapter. In the previous chapter, the maths involved was fairly straightforward, but it was a bit difficult to wrap your head around conceptually, because it wasn’t quite clear where it was going. In this chapter and the next, the maths involved is more complex, but where we are going is more straightforward: calculate (or approximate) the expectation terms so that we can then compute the free energy for each \(Q(\Theta)\) we try.
In this chapter, we will focus on approximating the \(ln Q(\Theta)\) term and show how we can calculate the expectation of the approximation analytically.
Quadratic approximation#
We want to find an approximation of the log approximate posterior \(ln Q(\Theta)\), such that we can derive the expectation of the approximation analytically. But as we saw in the previous chapter, the only thing we know about \(Q(\Theta)\) is that it is a probability distribution; we don’t know much more than that: we don’t know what kind of probability distribution it is, nor how it is parametrized. Even if we knew, say, that \(Q(\Theta)\) is a multivariate normal distribution, we wouldn’t know the mean (i.e. the mode) nor the covariance of that distribution. So you might wonder: if we don’t have any clue what that distribution is (i.e. we know neither its formula nor its parameters), how can we know what function should be roughly equal to it, i.e. what would be a good approximation of it?
These are all very valid questions, and the goal of this whole chapter is to answer them. But for now, put these questions aside. Depending on your background, you might be familiar with something called the Taylor series expansion. This is a mathematical tool to approximate any function around a particular point by taking a sum involving the derivatives of that function at that point. In other words, any function for which you can compute derivatives can be approximated around a point \(x_0\), like so:
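(where \(f'\), \(f''\), \(f'''\)… are the successive derivatives of \(f\)): $\( \begin{align} f(x) \approx f(x_0) + \frac{f'(x_0)}{1!}(x-x_0) + \frac{f''(x_0)}{2!}(x-x_0)^2 + \frac{f'''(x_0)}{3!}(x-x_0)^3 + ... \end{align} \)$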
You can go as far as you want (or as far as the function lets you, depending on the function). But if you approximate a function by only going up to the second derivative, it is called a quadratic approximation. And we say that:
\(f(x_0) + \frac{f'(x_0)}{1!}(x-x_0) + \frac{f''(x_0)}{2!}(x-x_0)^2\)
Is the quadratic approximation of \(f(x)\)
We can write it a bit more simply:
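Since \(1! = 1\) and \(2! = 2\), the factorials can be dropped: $\( \begin{align} f(x) \approx f(x_0) + f'(x_0)(x-x_0) + \frac{1}{2}f''(x_0)(x-x_0)^2 \end{align} \)$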
You are probably familiar with the concept of a derivative of the first and second order. If you have a function:
\(f(x) = x^2\)
Then the first order derivative of that function is:
\(f'(x) = 2x\)
And the second order derivative is the derivative of the derivative:
\(f''(x) = 2\)
And you might also remember that the first order derivative defines the slope of the tangent to the function at any given point. So for our function \(f(x) = x^2\), the slope at say \(x=2.5\) is \(f'(2.5)=2*2.5=5\). In other words, the derivative is the rate of change of a function at a single point. Hopefully, that should already be familiar. The second order derivative is perhaps a bit less familiar, but it defines the curvature of a function at a particular point. What that exactly means isn’t very important right now, this is just a very quick refresher about what derivatives are.
Now one thing you might be less familiar with (at least I was before diving into this) is what happens to the concept of derivatives when we are dealing with functions that don’t take single values as inputs, but rather vectors. The whole reason why we are talking about any of this is because ultimately, we want to approximate \(Q(\Theta)\) and \(P(y, \Theta)\) to easily find their expectations. But \(\Theta\) isn’t a single value, it is a vector that contains a value for each parameter of our model. In the case of vectors, the first order derivative of a function is something we call a gradient. And the second order derivative is something we call a Hessian matrix. We won’t have time to dig into this much further here, but if you want to know more about each, I highly recommend this series of videos on youtube, which explains multivariate calculus very well.
In any case, when dealing with functions that take vectors as their input, the quadratic approximation becomes this:
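$\( \begin{align} f(x) \approx f(x_0) + \nabla f(x_0)^T(x-x_0) + \frac{1}{2}(x-x_0)^T H_f(x_0)(x-x_0) \end{align} \)$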
Where:
\(\nabla f(x_0)\) is the gradient (i.e. the first order derivative of a vector function) of \(f(x)\) at \(x_0\)
\(H_f(x_0)\) is the Hessian matrix at \(x_0\)
This may look a bit intimidating, and you might not know how to compute each of the parts. Just remember this: \(H_f(x_0)\) is a matrix which captures the curvature of a function at a particular point. The rest you don’t need to worry about, it will mostly disappear by the end of this chapter!
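If it helps to see this with concrete numbers, here is a small Python sketch (a made-up two-dimensional function, nothing to do with our penguin model) that builds the quadratic approximation from a hand-computed gradient and Hessian, and compares it with the true function both close to and far away from \(x_0\):

```python
import numpy as np

# A made-up function of a 2-dimensional vector: f(x) = x0^2 * x1 + sin(x1)
def f(x):
    return x[0]**2 * x[1] + np.sin(x[1])

def grad_f(x):
    # Gradient: the vector of first order partial derivatives
    return np.array([2 * x[0] * x[1], x[0]**2 + np.cos(x[1])])

def hess_f(x):
    # Hessian: the matrix of second order partial derivatives
    return np.array([[2 * x[1], 2 * x[0]],
                     [2 * x[0], -np.sin(x[1])]])

def quadratic_approx(x, x0):
    # f(x0) + gradient(x0)^T (x - x0) + 1/2 (x - x0)^T Hessian(x0) (x - x0)
    d = x - x0
    return f(x0) + grad_f(x0) @ d + 0.5 * (d @ hess_f(x0) @ d)

x0 = np.array([1.0, 2.0])
x_near = np.array([1.1, 1.9])
print(f(x_near), quadratic_approx(x_near, x0))  # very close to each other
x_far = np.array([3.0, -1.0])
print(f(x_far), quadratic_approx(x_far, x0))    # much less accurate far from x0
```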
The quadratic approximation of the approximate posterior#
Even if we don’t know what \(Q(\Theta)\) is, we can still write down the formula for the quadratic approximation of its log based on what we saw above. We could of course also write the quadratic approximation of \(Q(\Theta)\) itself and then take the log of that, but since we want to compute the expectation of the log of \(Q(\Theta)\), we will write the quadratic approximation of \(ln Q(\Theta)\) directly. The quadratic approximation of \(ln Q(\Theta)\) is the following:
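$\( \begin{align} ln Q(\Theta) \approx ln Q(\Theta_0) + \nabla ln Q(\Theta_0)^T(\Theta-\Theta_0) + \frac{1}{2}(\Theta-\Theta_0)^T H_{ln Q}(\Theta_0)(\Theta-\Theta_0) \end{align} \)$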
Or something like that. I have just replaced the \(f(x)\) by the actual function, and the \(x_0\) by \(\Theta_0\). There is actually a little notation issue with the formula above: I have written the Hessian as \(H_{ln Q}\). That’s a little bit unconventional, because it would typically be written as \(H_f\), with the function defined in a separate formula (“where f=…”). There is however a different notation that you can adopt, which is the following:
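One common way of writing it is to spell the Hessian out as the matrix of second order partial derivatives of the function, evaluated at the point of interest: $\( \begin{align} H_{ln Q}(\Theta_0) = \frac{\partial^2 ln Q(\Theta)}{\partial \Theta \partial \Theta^T}\bigg\rvert_{\Theta=\Theta_0} \end{align} \)$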
These two formulae mean the same thing (hence the equal sign), but the latter displays the function we are computing the Hessian of more clearly. So we can replace the \(H\) in the formula above by that notation. Again, it doesn’t change anything, it just makes it clearer and easier to read:
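$\( \begin{align} ln Q(\Theta) \approx ln Q(\Theta_0) + \nabla ln Q(\Theta_0)^T(\Theta-\Theta_0) + \frac{1}{2}(\Theta-\Theta_0)^T \frac{\partial^2 ln Q(\Theta)}{\partial \Theta \partial \Theta^T}\bigg\rvert_{\Theta=\Theta_0}(\Theta-\Theta_0) \end{align} \)$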
Okay, so if we want to approximate the log of \(Q(\Theta)\) around a certain point \(\Theta_0\), we need to calculate the log of \(Q(\Theta)\) at that point, plus the first order derivative (i.e. the gradient) at that point with some scaling, plus the second order derivative (the Hessian) with some scaling. And of course, you must be thinking “since we don’t know \(Q(\Theta)\), we can’t calculate any of this, so what’s the point?”. Again, keep that thought in the back of your mind for now. But now you may have another question: “What is \(\mu\) supposed to be? How do we find the mode of a distribution we don’t even know?”. Very good question, but once again, keep that in the back of your mind.
One other thing we need to do is decide what \(\Theta_0\) should be. This is basically a set of values for each of the parameters from which to start approximating. If you remember the 3D plots of multivariate distributions we saw before, \(\Theta_0\) is just a point on that distribution, which corresponds to a given value for each of the parameters. But which point should we choose? One thing to note about the quadratic approximation (and any kind of finite Taylor series expansion) is that the approximation is most accurate around the point we start from (i.e. around \(\Theta_0\), no matter which \(\Theta_0\) we choose). And remember that the reason we are approximating the log of the approximate posterior using the quadratic approximation is because perhaps we can solve the expectation (i.e. the integral) of that quadratic approximation. So in other words, our final goal is to approximate the expectation of the log of the approximate posterior by calculating the expectation of the quadratic approximation of the approximate posterior (I know, it’s a mouthful).
Accordingly, we should choose our \(\Theta_0\) to be the point of the distribution whose surroundings have the strongest impact on the expectation. So what is the point in a probability distribution that has the strongest impact on the expectation? The mode! The mode is the point in your distribution with the highest probability. And because the expectation is a weighted average, the mode of the distribution (the point with the highest probability) has the strongest influence. This is why we will take \(\Theta_0\) to be \(\mu\), where \(\mu\) stands for the mode. So we can rewrite the function above like this to make it more explicit:
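$\( \begin{align} ln Q(\Theta) \approx ln Q(\mu) + \nabla ln Q(\mu)^T(\Theta-\mu) + \frac{1}{2}(\Theta-\mu)^T \frac{\partial^2 ln Q(\Theta)}{\partial \Theta \partial \Theta^T}\bigg\rvert_{\Theta=\mu}(\Theta-\mu) \end{align} \)$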
There is one important consequence of taking the mode as our point to approximate from. The mode of a distribution is a peak, meaning that if you move in any direction from there, you go down. As we have explained, \(\nabla ln Q(\mu)\) is the gradient of the distribution, which is the same thing as the derivative you know from high school, just in a multidimensional space. But it describes the same thing: the slope of the tangent to a point in your distribution. Now if the point you calculate it at is a peak, the slope of the tangent is… 0. That’s right, if you approximate a function at its peak (i.e. mode), then the gradient term is equal to 0, so the above function simplifies to:
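$\( \begin{align} ln Q(\Theta) \approx ln Q(\mu) + \frac{1}{2}(\Theta-\mu)^T \frac{\partial^2 ln Q(\Theta)}{\partial \Theta \partial \Theta^T}\bigg\rvert_{\Theta=\mu}(\Theta-\mu) \end{align} \)$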
We still don’t know what \(Q(\Theta)\) is, but the approximation at the mode of this distribution seems easy enough to deal with. And remember that the Hessian (\(\frac{\partial^2 ln Q(\Theta)}{\partial \Theta \partial \Theta^T}\big\rvert_{\Theta=\mu}\)) is a matrix.
The Laplace approximation of \(Q(\Theta)\)#
Here is where we address the central question: “what on earth is the approximate posterior \(Q(\Theta)\)?”. We still don’t know what it is, but we know the quadratic approximation of its log. Granted, we also don’t really know what that quadratic approximation is either, because it still requires us to compute the log of \(Q(\Theta)\) at a particular point, which we can’t do if we don’t know \(Q(\Theta)\). This is all true.
But we can do something quite interesting. Since we know the approximation of the log of the distribution, the approximation of \(Q(\Theta)\) is simply the exponential of that:
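$\( \begin{align} Q(\Theta) &\approx exp\bigg(ln Q(\mu) + \frac{1}{2}(\Theta-\mu)^T \frac{\partial^2 ln Q(\Theta)}{\partial \Theta \partial \Theta^T}\bigg\rvert_{\Theta=\mu}(\Theta-\mu)\bigg) \\ &= Q(\mu) * exp\bigg(\frac{1}{2}(\Theta-\mu)^T \frac{\partial^2 ln Q(\Theta)}{\partial \Theta \partial \Theta^T}\bigg\rvert_{\Theta=\mu}(\Theta-\mu)\bigg) \end{align} \)$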
We just got rid of the log signs, and with the exponential, a sum becomes a multiplication, hence the formula on the second line. Now, there is one more important thing about approximating a probability distribution at its peak. The Hessian term represents the curvature at that particular point. And since we are dealing with a probability distribution, there is only one way we can go from the peak: down. This means that the Hessian at the peak will always be negative definite. And again, remember that it is a matrix. So we will rewrite the equation above like so:
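Writing \(\Lambda\) for the negative of the Hessian at the mode (so that \(\Lambda\) is a positive definite matrix), and \(G(x_0)\) for the constant factor \(Q(\mu)\) in front: $\( \begin{align} Q(\Theta) \approx G(x_0) * exp\bigg(-\frac{1}{2}(\Theta-\mu)^T \Lambda (\Theta-\mu)\bigg) \end{align} \)$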
Does that remind you of anything? We have seen a very similar formula in previous chapters: the formula of the multivariate normal distribution:
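For a \(p\)-dimensional variable \(x\) with mean \(\mu\) and covariance matrix \(\Sigma\): $\( \begin{align} P(x) = \frac{1}{(2*\pi)^{p/2}|\mathcal{\Sigma}|^{1/2}} * exp\bigg(-\frac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu)\bigg) \end{align} \)$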
In the multivariate normal distribution, we have an exponentiated term that gets divided by something (\((2*\pi)^{p/2}|\mathcal{\Sigma}|^{1/2}\)). That something is just a normalization constant to ensure that the whole thing integrates to 1. In the quadratic approximation of our probability distribution, we have some exponentiated term multiplied by some number \(G(x_0)\) (which is a single number, remember).
If we look at the exponentiated terms, they are almost exactly the same across both formulae, with the difference that in the multivariate normal distribution we have the term \(\Sigma^{-1}\), while in the quadratic approximation, we have \(\Lambda\). But if you remember, both these terms are matrices. So in fact, we have:
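$\( \begin{align} \Lambda = \Sigma^{-1} \quad \text{i.e.} \quad \Sigma = \Lambda^{-1} \end{align} \)$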
The exponentiated term is the same between the two formulae, so the only difference is the number by which we multiply it. In the multivariate normal distribution, this is the normalization constant that ensures that it all integrates to 1. In the case of the exponential of the quadratic approximation, we don’t quite know what it is. But since we multiply the exponentiated term by that number instead of the normalization constant, we say that the exponential of the quadratic approximation of a probability distribution is a scaled multivariate normal distribution.
This is what the Laplace approximation is. It is basically the observation that the exponential of the quadratic approximation of the log of a probability distribution around its mode is a scaled multivariate normal distribution (i.e. a Gaussian). That’s just the way it is.
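To build some intuition, here is a small Python sketch (a one-dimensional, made-up example, not our penguin model): we take a Gamma distribution, whose log-density we can differentiate by hand, build the quadratic approximation of its log at its mode, exponentiate, and compare the result to the true density. Near the mode the two curves are very close; in the tails they drift apart.

```python
import numpy as np
from scipy import stats

# Made-up example: Laplace-approximate a Gamma(a, rate=b) density at its mode.
a, b = 5.0, 2.0
mode = (a - 1) / b                 # where the Gamma log-density peaks
hessian = -(a - 1) / mode**2       # second derivative of the log-density at the mode
sigma2 = -1 / hessian              # negative inverse Hessian -> variance of the Gaussian

x = np.linspace(0.01, 6, 500)
true_pdf = stats.gamma.pdf(x, a, scale=1 / b)
laplace_pdf = stats.norm.pdf(x, loc=mode, scale=np.sqrt(sigma2))

# Compare the two densities: similar height at the peak, largest mismatch in the tails
print(true_pdf.max(), laplace_pdf.max())
print(np.abs(true_pdf - laplace_pdf).max())
```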
What you might wonder is what we should do about the \(G(x_0)\) term. Should we calculate it? Should we somehow figure out whether it is equal to the normalization constant? The truth is that we don’t quite care about it. Since it is a constant, it won’t impact the expectation, no matter what we set it to. However, since \(Q(\Theta)\) is a probability distribution, this value has to be a constant that ensures that the whole thing integrates to 1. So the only valid option is the following:
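$\( \begin{align} G(x_0) = \frac{1}{(2*\pi)^{p/2}|\mathcal{\Sigma}|^{1/2}} \end{align} \)$ In other words, the Laplace approximation of \(Q(\Theta)\) is exactly a multivariate normal distribution with mean \(\mu\) and covariance \(\Sigma\).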
Assumption of Gaussianity under the Laplace approximation#
There is an important implication to all of this. If we approximate the approximate posterior \(Q(\Theta)\) using the quadratic approximation, it implies that we approximate the posterior as a Gaussian. In other words, we assume that the posterior is a normal distribution. This is why you will often read that the free energy method for Bayesian statistics has an “assumption of Gaussianity” about the posterior, or that the method assumes that the posterior is a normal distribution. This is all true, but it is important to understand where this “assumption” comes from. It comes from the methodological choice to approximate \(Q(\Theta)\) using the Laplace approximation, i.e. the quadratic approximation of the log approximate posterior around its mode. It doesn’t come from a general assumption that any posterior will look like a normal distribution, but from the fact that around its mode, the posterior will look like one.
The expectation of the approximation of the approximate posterior#
So from now on, we will just take \(Q(\Theta)\) to be a multivariate Gaussian distribution. And we will now try to compute the expectation of the log of \(Q(\Theta)\) by trying to find a solution to the expectation of the log of a multivariate normal distribution.
Let’s first roll back to the quadratic approximation. We have written before that the approximation of the log of the approximate posterior is:
\(ln Q(\mu) - \frac{1}{2}(\Theta-\mu)^T \Sigma^{-1}(\Theta-\mu)\)
So the expectation of that is simply this:
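$\( \begin{align} \mathbb{E}_{Q(\Theta)}[ln Q(\Theta)] \approx \mathbb{E}_{Q(\Theta)}\bigg[ln Q(\mu) - \frac{1}{2}(\Theta-\mu)^T \Sigma^{-1}(\Theta-\mu)\bigg] \end{align} \)$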
This may seem utterly confusing. Didn’t we just say that we figured out \(Q(\Theta)\) is a normal distribution? Can’t we just see if we can calculate the expectation of the log of the multivariate normal distribution instead? The answer is no, we can’t: there is no analytical solution to the expectation of the log of a multivariate distribution. But then, how does the above help? We still have \(ln Q(\mu)\) inside the expectation! The devil is in the details: this is \(ln Q(\mu)\), not \(ln Q(\Theta)\)! So it is not a problem. \(ln Q(\mu)\) does not depend on \(\Theta\), so we can take it out of the expectation (same old trick). So we can rewrite the above function by taking out of the expectation everything that doesn’t depend on \(\Theta\):
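$\( \begin{align} \mathbb{E}_{Q(\Theta)}[ln Q(\Theta)] \approx ln Q(\mu) - \frac{1}{2}\mathbb{E}_{Q(\Theta)}\bigg[(\Theta-\mu)^T \Sigma^{-1}(\Theta-\mu)\bigg] \end{align} \)$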
Now that we know that \(Q(\Theta)\) is a normal distribution, let’s see if we can write down what the log of \(Q(\Theta)\) is at its mode:
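At \(\Theta = \mu\), the exponentiated term of the multivariate normal is \(exp(0) = 1\), so only the normalization constant is left (with \(n\) the number of parameters): $\( \begin{align} ln Q(\mu) = ln \frac{1}{(2\pi)^{n/2}|\Sigma|^{1/2}} = -\frac{1}{2}[ln(|\Sigma|) + n ln 2\pi] \end{align} \)$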
So we can plug that into the formula for the expectation of the approximation of the log approximate posterior: $\( \begin{align} \mathbb{E}_{Q(\Theta)}[ln Q(\Theta)] \approx -\frac{1}{2}[ln(|\Sigma|) + n ln 2\pi] - \frac{1}{2}\mathbb{E}_{Q(\Theta)}[(\Theta-\mu)^T \Sigma^{-1}(\Theta-\mu)] \end{align} \)$
Still wondering what the mode of the approximate posterior is? We are getting there, I promise, have patience. In the equation above, we only have to deal with the expectation in the right-hand term. Remember when I said we would use complicated maths in this chapter? Well, this is where that happens. In order to get rid of the expectation sign, we will use something called the trace trick. You don’t actually need to understand that part, but we might write a chapter to explain how it works (drop an issue on github if you’d like us to). We will simply reprise the derivation from “A primer on variational Laplace”:
“To simplify these expressions, note that each quadratic term inside the square brackets is a scalar. This means we can use the ‘trace trick’, \(tr(ABC) = tr(CAB)\). Applying this gives the simpler expressions:” $\( \begin{align} \mathbb{E}_{Q(\Theta)}[ln Q(\Theta)] &\approx -\frac{1}{2}[ln(|\Sigma|) + n ln 2\pi] - \frac{1}{2}\mathbb{E}_{Q(\Theta)}[tr\bigg((\Theta-\mu)^T \Sigma^{-1}(\Theta-\mu)\bigg)] \\ &= -\frac{1}{2}[ln(|\Sigma|) + n ln 2\pi] - \frac{1}{2}tr\bigg(\mathbb{E}_{Q(\Theta)}[(\Theta-\mu)(\Theta-\mu)^T\Sigma^{-1}]\bigg)\\ &= -\frac{1}{2}[ln(|\Sigma|) + n ln 2\pi] - \frac{1}{2}tr\bigg(\Sigma^{-1}\mathbb{E}_{Q(\Theta)}[(\Theta-\mu)(\Theta-\mu)^T]\bigg) \end{align} \)$
Okay, we are making progress. The term inside the expectation brackets probably looks familiar from previous chapters. Indeed, the expectation of \((\Theta-\mu)(\Theta-\mu)^T\) is the formula of the covariance matrix we saw before. And one thing we haven’t explained yet (but we will below) is that \(\Sigma\) is in fact also the covariance matrix. So in fact,
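$\( \begin{align} \mathbb{E}_{Q(\Theta)}[(\Theta-\mu)(\Theta-\mu)^T] = \Sigma \end{align} \)$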
Let’s replace it in the equation:
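$\( \begin{align} \mathbb{E}_{Q(\Theta)}[ln Q(\Theta)] \approx -\frac{1}{2}[ln(|\Sigma|) + n ln 2\pi] - \frac{1}{2}tr\big(\Sigma^{-1}\Sigma\big) \end{align} \)$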
As it turns out, multiplying a matrix by its inverse yields the identity matrix, and the trace of the identity matrix is equal to the number of rows of that matrix. Since \(\Sigma\) is the covariance matrix of our parameters, it has as many rows as we have parameters, so we have:
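$\( \begin{align} \mathbb{E}_{Q(\Theta)}[ln Q(\Theta)] \approx -\frac{1}{2}[ln(|\Sigma|) + n ln 2\pi] - \frac{n}{2} \end{align} \)$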
Where \(n\) is the number of parameters in our model! So here we are, we have gotten rid of the Expectation, and we now have an analytical solution to the expectation of the log of the approximate posterior!
The modes and covariance matrix of the distribution are what we optimize for#
The formula above seems straightforward enough and easy to calculate, and we will see in a bit how we do that. But there are two things that you are probably still confused about: \(\mu\) and \(\Sigma\). What should these values be? At this point, it’s important to step back and remind ourselves why we did everything we did so far. Otherwise, you’ll surely feel lost in the details of the maths and fail to see the whole picture. To go back to our penguin example, we are fitting the following model:
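Something along these lines, with \(y_i\) the weight of the \(i\)-th penguin and \(x_i\) its flipper length: $\( \begin{align} y_i = \beta_0 + \beta_1 x_i + \epsilon_i \end{align} \)$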
Where
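\(\epsilon_i \sim \mathcal{N}(0, \sigma^2)\) is the observation noise, with variance \(\sigma^2\)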
And the whole goal is to compute the probability of the \(\beta\) and \(\sigma^2\) parameters after seeing the data, which is the posterior:
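$\( \begin{align} P(\Theta|y) = P(\beta_0, \beta_1, \sigma^2|y) \end{align} \)$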
Where \(\Theta\) is a vector containing the parameters \(\beta_0\), \(\beta_1\) and \(\sigma^2\). The posterior is a probability distribution that gives us, for each value of the parameters, the probability of that value given the data we have observed. The whole tricky bit with Bayes is that getting this value requires calculating this:
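$\( \begin{align} P(\Theta|y) = \frac{P(y|\Theta) * P(\Theta)}{P(y)} \end{align} \)$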
Which is made difficult by the fact that \(P(y)\) contains an integral. But regardless, our ultimate goal is to be able to find the parameters of the posterior distribution, so that we can answer our final question.
As we have said in the previous chapter, with the free energy functional, we converted the problem of the integral into an issue of variational calculus. We can aim to find the approximate posterior \(Q(\Theta)\) that results in the highest possible value for the free energy, as that distribution should be closest to the true posterior. And we have seen in this chapter that if we use the quadratic approximation to find the expectation of the log of that approximate posterior, it implies that the approximate posterior is in fact a multivariate normal distribution.
So what that means is that our approximate posterior has the following formula:
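$\( \begin{align} Q(\Theta) = \frac{1}{(2*\pi)^{n/2}|\mathcal{\Sigma}|^{1/2}} * exp\bigg(-\frac{1}{2}(\Theta-\mu)^T \Sigma^{-1} (\Theta-\mu)\bigg) \end{align} \)$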
So based on the formula above, what does it mean to find the approximate posterior that maximizes the free energy functional? That’s right: it means finding the values of \(\mu\) and \(\Sigma\) that yield the highest possible value of the free energy functional. Accordingly, in the formula above, the fact that we don’t know what the values of \(\mu\) and \(\Sigma\) are is normal: that’s exactly what we are looking for! Indeed, if we know the modes and the covariance of the posterior, then we have found the posterior; in other words, we have solved (or approximated) Bayes’ theorem for our problem. So conceptually, the optimization of the free energy functional entails that we try many different values for the modes \(\mu\) and the covariance \(\Sigma\) of our approximate posterior distribution.
You can think about it that way: in the case of the likelihood function \(P(y|\Theta)\), the maximum likelihood estimates are the values of \(\beta\) and \(\sigma^2\) under which the data are the most likely. In the posterior, we are trying to find the same thing but the other way around: the values of \(\beta\) and \(\sigma^2\) that are the most likely given the data, but we also add the covariance matrix \(\Sigma\). This is one thing that’s perhaps confusing at first: aren’t \(\sigma^2\) and \(\Sigma\) both variance terms? If yes, why is \(\sigma^2\) part of the mode, separate from \(\Sigma\)? Shouldn’t it be inside it? The answer is no, because \(\Sigma\) captures the uncertainty about the parameters of the model. So it is also something about error, with variance in it, but it is not about the observations: it is about the parameters we have estimated instead.
Final recap#
In this chapter, we have seen that we can use the quadratic approximation to compute the expectation of the log of the approximate posterior analytically. And we have seen that using the quadratic method to approximate the log of the approximate posterior entails that the approximate posterior is a scaled Gaussian, i.e. a multivariate normal distribution, which is what the Laplace approximation is. With all that in mind, we could write a simple function for one side of the free energy functional:
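$\( \begin{align} \mathbb{E}_{Q(\Theta)}[ln Q(\Theta)] \approx -\frac{1}{2}[ln(|\Sigma|) + n ln 2\pi] - \frac{n}{2} \end{align} \)$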
Where:
\(\mu\) is the mode of the approximate posterior, i.e. the most likely values of \(\beta_0\), \(\beta_1\) and \(\sigma^2\) given the data in our penguin example
\(\Sigma\) is the covariance matrix of our parameters
\(n\) is the number of parameters (so 3 if we have \(\beta_0\), \(\beta_1\) and \(\sigma^2\))
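And because this term is now just a formula, it is easy to turn into code. Here is a minimal Python sketch (the function name and the example covariance values are made up for illustration):

```python
import numpy as np

def expected_log_q(Sigma):
    """E_Q[ln Q(Theta)] for a Gaussian approximate posterior with covariance Sigma:
    -1/2 * [ln|Sigma| + n*ln(2*pi)] - n/2, where n is the number of parameters."""
    n = Sigma.shape[0]
    _, logdet = np.linalg.slogdet(Sigma)   # log of the determinant of Sigma
    return -0.5 * (logdet + n * np.log(2 * np.pi)) - n / 2

# Made-up covariance for our 3 parameters (beta_0, beta_1, sigma^2)
Sigma = np.array([[0.50, 0.01, 0.00],
                  [0.01, 0.10, 0.00],
                  [0.00, 0.00, 0.20]])
print(expected_log_q(Sigma))
```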
And so when we try to optimize the free energy functional, we will try many different values of \(\beta_0\), \(\beta_1\), \(\sigma^2\) and \(\Sigma\), until we find the right combination that maximizes the formula:
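That is, the free energy with our newly derived term plugged in: $\( \begin{align} F[Q(\Theta)] = \mathbb{E}_{Q(\Theta)}[ln P(y, \Theta)] - \mathbb{E}_{Q(\Theta)}[ln Q(\Theta)] \approx \mathbb{E}_{Q(\Theta)}[ln P(y, \Theta)] + \frac{1}{2}[ln(|\Sigma|) + n ln 2\pi] + \frac{n}{2} \end{align} \)$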
But before we can do that, we still need to get rid of the expectation in that other term \(\mathbb{E}[ln P(y, \Theta)]\). We will see in the next chapter how we do that.