Back to our penguins

The previous chapters were quite intense: we barely spoke about our penguins, and we didn’t write any code. It was all very theoretical. We saw that we can use Jensen’s inequality to transform the impossible integral into a problem of variational calculus (i.e. optimization) over the free energy functional. Then we saw that we can use the quadratic approximation to get rid of the integrals in the free energy functional, in which case we are approximating the posterior distribution (which we aim to calculate) as a Gaussian, as per the Laplace approximation. So we now have the formula for the free energy functional:

\[F[Q(\Theta)] = \ln P(y, \mu) - \frac{n}{2} + \frac{1}{2} [\ln(|\Sigma|) + n\ln 2\pi e]\]

And the crucial thing is that the distribution \(Q(\Theta)\) that yields the highest possible value of the free energy functional is also the (approximate) posterior distribution we sought to calculate. So in order to find the posterior, we don’t actually need to go through the whole of Bayes’ theorem:

\[P(\Theta|y) = \frac{P(y|\Theta)P(\Theta)}{P(y)}\]

Instead, we can optimize the free energy functional, which kills two birds with one stone: it yields both the model evidence \(P(y)\) and the (approximate) posterior \(Q(\Theta)\approx P(\Theta|y)\).

That probably still all sounds a bit dry. To illustrate it practically, we will first get back to our penguins and relate each bit of the free energy functional to the concrete parts of our penguin example, and in the next chapter, we will provide a worked-out example with all the code needed.

Everything we have seen so far, but with our very concrete penguins

Okay, let’s once again restate the goal of it all. We started with a very simple question: is there any kind of relationship between penguins’ beak lengths and their weights? So we collected data from several penguins. To investigate that relationship, we used a linear model to express a penguin’s weight as a function of its beak length:

\[y = \beta_0 + \beta_1 x + \epsilon\]

And we want to know the true value of \(\beta_1\): if it is anything other than 0, then we would say that yes, there is a relationship between penguins’ beak lengths and their weights. Unfortunately, we can’t know the true value of \(\beta_1\); it’s not a quantity we can measure directly. Even if we were to measure the beak lengths and weights of all the penguins alive right now, we still wouldn’t know, because maybe 10 years from now, a penguin will be born whose weight really doesn’t match what the \(\beta_1\) we had measured across all penguins would predict from its beak length!
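To keep things concrete (and to have data for the code sketches in the rest of this chapter), here is a minimal simulation of such a dataset. All the “true” values below are made up for illustration; in real life they are exactly what we can never observe:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical "true" parameters, made up for this sketch
beta_0_true = -15.0  # intercept (kg)
beta_1_true = 0.8    # slope (kg of weight per mm of beak length)
sigma_true = 2.0     # standard deviation of the error term epsilon

n = 20
beak_length = rng.uniform(35, 60, size=n)  # x: beak lengths in mm
weight = (beta_0_true + beta_1_true * beak_length
          + rng.normal(0, sigma_true, size=n))  # y = beta_0 + beta_1*x + epsilon
```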

Since we cannot know the true value, whatever value we obtain through experiments (i.e. measuring many penguins) is only an estimate of \(\beta_1\) in our sample. So the next best thing we can do is to find out how confident we can be that some values of \(\beta_1\) are close to the true value, given the data we have observed. Bayes’ theorem enables us to calculate the posterior, which is a direct answer to that question:

\[P(\Theta|y) = \frac{P(y|\Theta)P(\Theta)}{P(y)}\]

The posterior gives us the probability of each value of the parameters of interest \(\Theta\) in our model, given the data we have observed. It is based on the multiplication of the likelihood of the data given any possible value of \(\Theta\), \(P(y|\Theta)\), by our prior belief about the probability of each value of each parameter, \(P(\Theta)\), the whole divided by the model evidence \(P(y)\). In the previous chapters, we have seen that the likelihood of the data given the parameters is this:

\[P(y|\beta, \sigma^2) = \left(\frac{1}{\sqrt{2\pi\sigma^2}}\right)^n\prod_{i=1}^{n}\exp\left(-\frac{(y_i-\mathbf{x}_i\boldsymbol{\beta})^2}{2\sigma^2}\right)\]

where \(\mathbf{x}_i\) is the \(i\)-th row of the design matrix \(X\).
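In code, we would work with the logarithm of this quantity to avoid numerical underflow (a product of many numbers smaller than 1 quickly rounds to zero). A sketch, with variable names of my own choosing:

```python
import numpy as np

def log_likelihood(y, X, beta, sigma2):
    """Log of the Gaussian likelihood above.

    y      : (n,) observed weights
    X      : (n, p) design matrix (a column of ones, then the beak lengths)
    beta   : (p,) candidate values for beta_0 and beta_1
    sigma2 : candidate value for the error variance
    """
    n = len(y)
    residuals = y - X @ beta
    return (-0.5 * n * np.log(2 * np.pi * sigma2)
            - np.sum(residuals**2) / (2 * sigma2))
```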

And we have also seen that the prior splits into two parts: a multivariate normal distribution for our parameters \(\boldsymbol{\beta}\) and an inverse gamma distribution for our variance parameter \(\sigma^2\):

\[P(\sigma^2) = \frac{\beta_p^{\alpha_p}}{\Gamma(\alpha_p)}(\sigma^2)^{-\alpha_p-1}\exp\left(-\frac{\beta_p}{\sigma^2}\right)\]

And:

\[P(\boldsymbol{\beta}) = \frac{1}{(2\pi)^{p/2}|\Sigma_p|^{1/2}}\exp\left(-\frac{1}{2}(\boldsymbol{\beta} - \boldsymbol{\mu}_p)^T\Sigma_p^{-1}(\boldsymbol{\beta}-\boldsymbol{\mu}_p)\right)\]

And the multiplication between the two is our prior:

\[P(\Theta) = P(\beta)P(\sigma^2)\]
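On the log scale that product becomes a sum, and both densities are available in scipy, so the whole prior fits in a few lines. A sketch; the hyperparameter names mirror the formulas above:

```python
from scipy import stats

def log_prior(beta, sigma2, mu_p, Sigma_p, alpha_p, beta_p):
    """ln P(Theta) = ln P(beta) + ln P(sigma^2)."""
    # Multivariate normal prior over the regression coefficients
    lp_beta = stats.multivariate_normal.logpdf(beta, mean=mu_p, cov=Sigma_p)
    # Inverse gamma prior over the error variance
    # (scipy parametrizes it with shape a=alpha_p and scale=beta_p)
    lp_sigma2 = stats.invgamma.logpdf(sigma2, a=alpha_p, scale=beta_p)
    return lp_beta + lp_sigma2
```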

In order to specify our priors, we need to pick the \(\alpha\) and \(\beta\) values for the error variance \(\sigma^2\). For the \(\boldsymbol{\beta}\) coefficients, we need to specify a vector of mean values, which is also the mode of the distribution, i.e. the values we think are the most likely, as well as a covariance matrix. The diagonal of that matrix represents how wide our prior is for each parameter, i.e. how plausible we consider values away from that parameter’s mean, and the off-diagonal values represent how correlated we think the \(\beta\)s are with each other.

But as we saw before, we can’t compute the denominator because of the integral. So instead, we figured out a way to approximate it by relying on Jensen’s inequality and the method of quadratic approximation. With this method, we try to optimize the free energy functional, and once we have done so, we have also found the approximate posterior:

\[F[Q(\Theta)] = \ln P(y, \mu) - \frac{n}{2} + \frac{1}{2} [\ln(|\Sigma|) + n\ln 2\pi e]\]
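As a function, this formula is short. Note that \(n\) here counts the parameters in \(\Theta\) (for our penguins, 3: \(\beta_0\), \(\beta_1\) and \(\sigma^2\)), since \(\Sigma\) is the covariance of the Gaussian \(Q(\Theta)\). A sketch:

```python
import numpy as np

def free_energy(log_joint_at_mu, d, Sigma):
    """The free energy formula above, for a Gaussian Q(Theta).

    log_joint_at_mu : ln P(y, mu), the log joint evaluated at Q's mean mu
    d               : number of parameters in Theta (the n in the formula)
    Sigma           : (d, d) covariance matrix of Q
    """
    _, logdet = np.linalg.slogdet(Sigma)  # numerically stable log |Sigma|
    return log_joint_at_mu - d / 2 + 0.5 * (logdet + d * np.log(2 * np.pi * np.e))
```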

So let’s walk through it conceptually with our penguins before we move to the full code implementation. Say we have measured the beak lengths and weights of 20 penguins: \(y\) is then a vector of 20 data points (the weights), and the corresponding beak lengths of all 20 penguins form our \(X\). \(X\) doesn’t appear explicitly in the free energy functional, but that’s just because it’s inside \(\ln P(y, \mu)\). We also need to specify our priors. We have to define the means of \(\beta_0\) and \(\beta_1\) and the covariance matrix for those. So let’s set them to the following:

  • \(\boldsymbol{\mu}_p=\begin{bmatrix}20\\0\end{bmatrix}\)

  • \(\boldsymbol{\Sigma}_p= \begin{bmatrix}10 & 0\\0 & 10\end{bmatrix}\)

That’s basically saying that I believe the average penguin weight is 20 kg, and that I think the most likely value for \(\beta_1\) is 0, meaning that my initial guess is that there might not be a relationship between penguins’ beak lengths and their weights. But then, with the covariance matrix, I am basically saying that I wouldn’t be terribly surprised if the average penguin weight turned out to be 15 kg, and I also wouldn’t be terribly surprised if there were a strong relationship between penguin beak length and weight. So in other words, despite my initial guess about the mean of each parameter, I am actually giving it quite a bit of leeway.

We also need to specify our prior for the error. Since it is defined using an inverse gamma distribution, we have to specify it using the \(\alpha\) and \(\beta\) parameters (careful, not the same \(\beta\) as the parameters we are trying to estimate!). Let’s set those like this:

  • \(\alpha_p=1\)

  • \(\beta_p=1\)

Just because these seem reasonable when you look at the chapter The prior of the linear mixed model.

The way I am choosing these priors is very informal and arbitrary, but that’s just for the sake of illustrating how to do everything from start to finish. We will see later that there are rules and conventions for how to do these things, but for now what we have is good enough.
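Translated into code, these prior choices are just a couple of arrays and scalars:

```python
import numpy as np

# Prior over the regression coefficients (beta_0, beta_1)
mu_p = np.array([20.0, 0.0])       # prior means: ~20 kg average weight, no slope
Sigma_p = np.array([[10.0, 0.0],
                    [0.0, 10.0]])  # wide, uncorrelated prior

# Prior over the error variance sigma^2 (inverse gamma)
alpha_p = 1.0
beta_p = 1.0
```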

With all of this, we are now equipped to compute the approximate posterior. And once we have computed the approximate posterior, we can answer the question “Is there a relationship between penguins’ beak lengths and their weights?” by saying something like “I am 95% confident that the relationship between their beak length and their weight is between 0.5 and 1.32, which doesn’t include 0, so I can confidently answer yes”. And that is the best possible response you can give to that question!
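To preview that final step: since \(Q(\Theta)\) is a Gaussian, such a 95% interval is just the posterior mean plus or minus 1.96 posterior standard deviations. The numbers below are made up so that they reproduce the interval quoted above:

```python
# Hypothetical posterior summary for beta_1, for illustration only
mu_beta1 = 0.91  # made-up posterior mean
sd_beta1 = 0.21  # made-up posterior standard deviation

lower = mu_beta1 - 1.96 * sd_beta1
upper = mu_beta1 + 1.96 * sd_beta1
print(f"95% credible interval for beta_1: [{lower:.2f}, {upper:.2f}]")
# -> [0.50, 1.32]
```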

To compute that posterior, we need to calculate the free energy for many different \(Q(\Theta)\) and find the maximum value. We have the following:

\[\begin{split} \begin{align} F[Q(\Theta)] &= \ln P(y, \mu) - \frac{n}{2} + \frac{1}{2} [\ln(|\Sigma|) + n\ln 2\pi e] \\ \end{align} \end{split}\]

And we know that:

\[\begin{split} \begin{align} \ln P(y, \mu) &= \ln P(y|\mu) + \ln P(\mu) \\ \end{align} \end{split}\]
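Reusing the `log_likelihood` and `log_prior` sketches from earlier, that decomposition is one extra function, where `mu` stacks the candidate values for \(\beta_0\), \(\beta_1\) and \(\sigma^2\):

```python
def log_joint(y, X, mu, mu_p, Sigma_p, alpha_p, beta_p):
    """ln P(y, mu) = ln P(y | mu) + ln P(mu)."""
    beta, sigma2 = mu[:2], mu[2]
    return (log_likelihood(y, X, beta, sigma2)
            + log_prior(beta, sigma2, mu_p, Sigma_p, alpha_p, beta_p))
```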

So now, concretely, what we will do is randomly pick some values for \(\mu_{\beta_0}\), \(\mu_{\beta_1}\), \(\mu_{\sigma^2}\) and \(\Sigma\), and calculate the result of the above function. We can then rinse and repeat until we find the largest possible value. Once we have done that, we will have found the posterior modes for \(\beta_0\), \(\beta_1\) and \(\sigma^2\) as well as our posterior covariance matrix, and we are done. Easy peasy! As a reminder, here is everything we need:

\[\begin{split} \begin{align} P(y|\Theta) &= \left(\frac{1}{\sqrt{2\pi\sigma^2}}\right)^n\prod_{i=1}^{n}\exp\left(-\frac{(y_i-\mathbf{x}_i\boldsymbol{\beta})^2}{2\sigma^2}\right) \\ P(\Theta) &= P(\beta)P(\sigma^2)\\ P(\beta) &=\frac{1}{(2\pi)^{p/2}|\Sigma_p|^{1/2}}\exp\left(-\frac{1}{2}(\beta - \mu_p)^T\Sigma_p^{-1}(\beta-\mu_p)\right)\\ P(\sigma^2) &= \frac{\beta_p^{\alpha_p}}{\Gamma(\alpha_p)}(\sigma^2)^{-\alpha_p-1}\exp\left(-\frac{\beta_p}{\sigma^2}\right)\\ \end{align} \end{split}\]
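Now let’s implement the code! Below is a deliberately naive sketch of the rinse-and-repeat search described above, reusing the simulated data, the priors and the functions from the earlier snippets. One caveat: with the formula exactly as written here, the \(\frac{1}{2}\ln(|\Sigma|)\) term only grows as \(\Sigma\) gets wider (the term that counteracts it comes from the curvature of \(\ln P(y, \Theta)\) around \(\mu\), which the quadratic approximation supplies), so this sketch searches only over the means and keeps \(\Sigma\) fixed at an arbitrary guess. The search ranges and iteration count are also arbitrary choices of mine; the next chapter will do all of this properly:

```python
import numpy as np

rng = np.random.default_rng(0)

# Design matrix: a column of ones for beta_0, then the beak lengths
X = np.column_stack([np.ones(len(weight)), beak_length])
d = 3  # number of parameters in Theta: beta_0, beta_1, sigma^2

Sigma = np.diag([1.0, 0.01, 0.5])  # fixed guess for Q's covariance (arbitrary)
best_F, best_mu = -np.inf, None

for _ in range(50_000):
    # Randomly pick candidate posterior means (arbitrary search ranges)
    mu = np.array([
        rng.uniform(-40, 40),   # mu_beta0
        rng.uniform(-2, 2),     # mu_beta1
        rng.uniform(0.01, 10),  # mu_sigma2 (the variance must stay positive)
    ])
    lj = log_joint(weight, X, mu, mu_p, Sigma_p, alpha_p, beta_p)
    F = free_energy(lj, d, Sigma)
    if F > best_F:  # rinse and repeat, keeping the best candidate so far
        best_F, best_mu = F, mu

print("Best free energy found:", best_F)
print("Approximate posterior modes (beta_0, beta_1, sigma^2):", best_mu)
```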