Approximating the Expectation of the log of the joint probability#
In the previous chapter, we figured out an approximation of the expectation of the log of the approximate posterior, which already simplifies the free energy functional.
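Writing \(F\) for the free energy functional, and taking \(Q(\Theta) = \mathcal{N}(\mu, \Sigma)\) over \(n\) parameters as in the previous chapter, that simplified form reads:

\[
F = \mathbb{E}[\ln P(y, \Theta)] + \frac{n}{2}\ln(2\pi) + \frac{1}{2}\ln|\Sigma| + \frac{n}{2}
\]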
So now, we will follow the exact same logic to approximate \(\mathbb{E}[\ln P(y, \Theta)]\). With everything we have seen up until now, this one should be easy!
Quadratic approximation of the log of the joint probability#
In order to approximate the log of the joint, we will use the same technique as we used for the log of the approximate posterior: the quadratic approximation at the mode. So, same as before, we expand around the mode.
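With \(\Theta_0\) standing for that mode (so the gradient term vanishes) and \(\nabla^{2}\ln P(y, \Theta_0)\) for the Hessian of the log of the joint probability evaluated at that mode, the approximation reads:

\[
\ln P(y, \Theta) \approx \ln P(y, \Theta_0) + \frac{1}{2}(\Theta - \Theta_0)^{T}\, \nabla^{2}\ln P(y, \Theta_0)\, (\Theta - \Theta_0)
\]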
One thing you might wonder about is why the \(y\) remains: why do we have \(\ln P(y, \Theta_0)\) and not \(\ln P(y_0, \Theta_0)\)? That’s because \(y\) is the observed data, which is fixed. We can’t try to approximate the joint probability by adjusting the observed data. That’s in contrast with \(\Theta\), which we are trying to estimate. What we are doing with the quadratic approximation is approximating the function around a certain value of \(\Theta\).
What of the Laplace approximation?#
In the approximation above, we have used the quadratic method to approximate the log of the joint probability. Doesn’t that imply that we are approximating the joint probability as a Gaussian? Indeed, if we were to take the exponential of the approximation above, we would be doing just that. But in fact, in the case of the joint probability, we don’t need to approximate it, because we know what it is: it’s just the numerator of Bayes’ theorem. And we have seen before that it’s the multiplication of the prior and the likelihood:
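That is, with \(P(y \mid \Theta)\) the likelihood and \(P(\Theta)\) the prior:

\[
P(y, \Theta) = P(y \mid \Theta)\, P(\Theta)
\]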
We do know how to calculate that. So in the approximation above, we don’t need to use the approximation formula to figure out what \(\ln P(y, \mu)\) is equal to: we can simply calculate it. But then, what’s the point of using the quadratic approximation in the first place? Remember, we are not using the quadratic approximation to find the distribution, but rather the expectation of the distribution. In the particular case of the joint probability, we are only using the quadratic approximation for that; it was only for the posterior that the quadratic approximation of the log at the mode was handy to realize that the posterior itself is a Gaussian.
Calculating the expectation of the quadratic approximation of the log joint probability#
So now, we can use the exact same tricks as before, the trace trick and everything.
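Taking the expectation of the quadratic approximation under \(Q(\Theta) = \mathcal{N}(\mu, \Sigma)\), with \(\mu\) now standing for the mode of the joint probability (more on why this is the same \(\mu\) as before in a moment):

\[
\mathbb{E}[\ln P(y, \Theta)] \approx \ln P(y, \mu) + \frac{1}{2}\,\mathbb{E}\!\left[(\Theta - \mu)^{T}\, \nabla^{2}\ln P(y, \mu)\, (\Theta - \mu)\right]
\]

The quadratic term is a scalar, so it is equal to its own trace; moving the expectation inside the trace and using \(\mathbb{E}[(\Theta - \mu)(\Theta - \mu)^{T}] = \Sigma\) gives:

\[
\mathbb{E}[\ln P(y, \Theta)] \approx \ln P(y, \mu) + \frac{1}{2}\operatorname{tr}\!\left(\nabla^{2}\ln P(y, \mu)\, \Sigma\right)
\]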
So just like that, we have found the approximation of the expectation of the log of the joint probability!
What of the Laplace approximation?#
At this point, you might wonder: aren’t we using the Laplace approximation here, and if not, why not? Well, if we were to take the exponential of the quadratic approximation, we would also get a scaled Gaussian, same as before. So in other words, the exponential of the quadratic approximation of the log of the joint probability would also be a Gaussian. But in this particular case, we actually don’t need to approximate the joint probability: we already know what it is and we can compute it, since it’s just the numerator of Bayes’ theorem. And we have seen before that it’s the multiplication of the prior and the likelihood:
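As before:

\[
P(y, \Theta) = P(y \mid \Theta)\, P(\Theta)
\]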
In the case of the joint probability, the only reason to use the quadratic approximation is to be able to approximate the expectation. So again, it’s not wrong to say that the approximation of the log of the joint probability we are using to calculate the expectation implies that the approximation of the joint probability is a Gaussian. But it’s completely inconsequential: unlike in the case of the approximate posterior, we don’t use this fact to work out the value at the mode, \(\ln P(y, \mu)\). With the joint probability, we can just calculate it directly.
What is the mode \(\mu\) of the joint probability \(P(y, \Theta)\)?#
In the previous chapter, we said that \(\mu\) is typically used to refer to the mode of a distribution, and we used the character \(\mu\) to denote the mode of the approximate posterior, i.e. the values of \(\Theta\) that are the most likely given the data. In the expression above, however, it is used to denote the mode of the joint probability. You are probably thinking that this is just a convention: we use the same letter for both the approximate posterior and the joint probability, but these should be different values, as they represent the modes of two different distributions, \(Q(\Theta)\) and \(P(y, \Theta)\). But the neat thing is that the approximate posterior \(Q(\Theta)\) and the joint probability \(P(y, \Theta)\) actually share the same mode.
It’s easy to understand why. If you remember, the joint probability is the likelihood multiplied by the prior, which is the numerator of Bayes’ theorem:
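Recalling Bayes’ theorem, with the model evidence \(P(y)\) as the denominator:

\[
P(\Theta \mid y) = \frac{P(y \mid \Theta)\, P(\Theta)}{P(y)} = \frac{P(y, \Theta)}{P(y)}
\]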
And as we said before, the denominator of Bayes’ theorem is the integral of the numerator, which is a single number. This single number is the normalization constant that makes sure the whole thing integrates to 1. What that means is that the posterior is a scaled version of the joint probability. And because the posterior is a scaled version of the joint probability, they both have exactly the same shape, just one might be “taller” or “shallower” than the other. But the mode of both distributions is the same, as the overall “shape” of the two is the same! And since the approximate posterior \(Q(\Theta)\) is centered on the mode of the posterior, the mode \(\mu\) of the joint probability is the same mode \(\mu\) as that of the approximate posterior \(Q(\Theta)\).
This is important, because it means that we can compute the free energy functional by simply trying out a set of \(\mu\) values shared by both the approximate posterior and the joint probability! So we can now solve the free energy functional quite easily.
The solution to the free energy functional#
So, the free energy functional is:
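\[
F = \mathbb{E}[\ln P(y, \Theta)] - \mathbb{E}[\ln Q(\Theta)]
\]

with both expectations taken under the approximate posterior \(Q(\Theta)\).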
We have seen in the previous chapter that the expectation of the log of the approximate posterior is:
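\[
\mathbb{E}[\ln Q(\Theta)] = -\frac{n}{2}\ln(2\pi) - \frac{1}{2}\ln|\Sigma| - \frac{n}{2}
\]

with \(Q(\Theta) = \mathcal{N}(\mu, \Sigma)\) and \(n\) the number of parameters, as before.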
And now, we have seen that the expectation of the log of the joint probability is:
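\[
\mathbb{E}[\ln P(y, \Theta)] \approx \ln P(y, \mu) + \frac{1}{2}\operatorname{tr}\!\left(\nabla^{2}\ln P(y, \mu)\, \Sigma\right)
\]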
So we now have the solution to the free energy functional:
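\[
F \approx \ln P(y, \mu) + \frac{1}{2}\operatorname{tr}\!\left(\nabla^{2}\ln P(y, \mu)\, \Sigma\right) + \frac{1}{2}\ln|\Sigma| + \frac{n}{2}\ln(2\pi) + \frac{n}{2}
\]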
where:
\(y\) is our observed data
\(n\) is the number of parameters
\(\mu\) is the mode of the posterior distribution
\(\Sigma\) is the covariance matrix of the approximate posterior \(Q(\Theta)\)
So we have finally made it: each term can actually be computed. Therefore, we can now optimize the free energy functional by trying out many possible combinations of \(\mu\) and \(\Sigma\) until we find the values that yield the highest possible value of the free energy functional. Once we have done that, the value we get out of the free energy functional is our approximation of the log of the model evidence from Bayes’ theorem, and the \(\mu\) and \(\Sigma\) that yield that value are the parameters of the approximate posterior. So in other words, we have found everything there is to find!
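To make this concrete, here is a minimal sketch of that optimization in Python. The toy model (a single parameter with a standard normal prior and one Gaussian observation), the helper names (`log_joint`, `hessian`, `free_energy`) and the use of `scipy.optimize.minimize` are illustrative choices rather than anything from the derivation above; the point is simply that every term of \(F\) can be computed and then maximized over \(\mu\) and \(\Sigma\).

```python
import numpy as np
from scipy.optimize import minimize

# Toy model (illustrative only): one parameter theta, a standard normal
# prior N(0, 1), and a single observation y assumed drawn from N(theta, 1).
y = 1.2

def log_joint(theta):
    """ln P(y, theta) = ln P(y | theta) + ln P(theta)."""
    log_lik = -0.5 * np.log(2 * np.pi) - 0.5 * (y - theta[0]) ** 2
    log_prior = -0.5 * np.log(2 * np.pi) - 0.5 * theta[0] ** 2
    return log_lik + log_prior

def hessian(theta):
    """Hessian of ln P(y, theta) with respect to theta (here constant, -2)."""
    return np.array([[-2.0]])

def free_energy(mu, Sigma):
    """The free energy F for Q = N(mu, Sigma), term by term as in the text."""
    n = len(mu)
    _, logdet = np.linalg.slogdet(Sigma)  # numerically stable ln|Sigma|
    return (log_joint(mu)
            + 0.5 * np.trace(hessian(mu) @ Sigma)
            + 0.5 * logdet
            + 0.5 * n * np.log(2 * np.pi)
            + 0.5 * n)

def neg_free_energy(params):
    """Negative F over (mu, ln sigma^2), so a standard minimizer can be used."""
    mu, log_var = params
    return -free_energy(np.array([mu]), np.array([[np.exp(log_var)]]))

result = minimize(neg_free_energy, x0=[0.0, 0.0])
mu_opt, var_opt = result.x[0], np.exp(result.x[1])
print(f"mode: {mu_opt:.3f}, variance: {var_opt:.3f}, F: {-result.fun:.3f}")
```

For this conjugate toy model the result can be checked by hand: the optimum is \(\mu = y/2 = 0.6\) with variance \(0.5\), and the maximized \(F\) equals the exact log of the model evidence, as expected since the true posterior is itself Gaussian.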