Bayes theorem applied to the linear model

As a reminder, the Bayes theorem is this:

\[P(\Theta|y) = \frac{P(y|\Theta)P(\Theta)}{P(y)}\]

Where:

\(P(y|\Theta)\): likelihood, probability of the data given the parameters
\(P(\Theta)\): prior, probability of the parameters based on previous knowledge, i.e. what we believe the most likely values for the parameters are
\(P(y)\): marginal likelihood or model evidence, probability of \(y\) after marginalizing out all possible values of \(\Theta\)
\(P(\Theta|y)\): posterior, probability of the values for the parameters given the data we have seen
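
To make the four terms concrete, here is a minimal sketch that evaluates each of them on a grid of candidate values for the coin toss case. The data (7 heads in 10 tosses), the flat prior and the grid resolution are all made up for illustration; they are not taken from the previous example.

```python
from math import comb

import numpy as np

# Hypothetical data: 10 tosses, 7 heads (illustrative numbers only)
n_tosses, n_heads = 10, 7

# Candidate values for Theta = P(X=1), evenly spaced between 0 and 1
theta = np.linspace(0, 1, 101)
d_theta = theta[1] - theta[0]

# P(y | Theta): binomial probability of the observed count for each candidate Theta
likelihood = comb(n_tosses, n_heads) * theta**n_heads * (1 - theta) ** (n_tosses - n_heads)

# P(Theta): flat prior density on [0, 1] (an assumption made for this sketch)
prior = np.ones_like(theta)

# P(y): likelihood times prior, summed over the grid (Riemann-sum approximation of the integral)
evidence = np.sum(likelihood * prior) * d_theta

# P(Theta | y): posterior density on the grid
posterior = likelihood * prior / evidence

# Most probable value of Theta under the posterior
theta_map = theta[np.argmax(posterior)]
print(f"evidence: {evidence:.4f}, posterior mode: {theta_map:.2f}")
```

With a flat prior, the posterior has the same shape as the likelihood, so its peak sits at the observed proportion of heads. Swapping in a different `prior` array changes the posterior without touching the other lines.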

A few things might be confusing at first. In the previous example, we didn’t really have a model: we flipped a coin many times and counted how often we got heads. We determined that the likelihood is the binomial distribution, which tells us how probable our observed number of heads is for each value of \(\Theta\) (i.e. \(P(X=1)\)). We didn’t need a model then, so why do we need one now?

That’s because in the coin toss example, the parameter of interest was directly observable. If our question was ‘What is the average weight of a penguin?’, we also wouldn’t need a model: we would simply weigh many penguins (just like we threw the coin many times to estimate the true probability of getting heads) and use the Bayes theorem to establish how confident we are that the true average weight is what we measured, given our observations. However, our current question is a little more complicated: we want to know how two different variables relate to one another. This isn’t something we can directly observe, so we need the model and some basic math to ‘extract’ that parameter from the data.

At the risk of being overly abstract, we can say that in the coin toss example (or if we want to know the average weight of a penguin), there is in fact a model. It is just a bit ‘hidden’, and we would say that it is implicit. That implicit model is that each coin toss is an independent Bernoulli trial with probability of success \(\Theta\) (don’t worry if that doesn’t make perfect sense to you, it is not critical to understand).

The key message is that the Bayes theorem is agnostic as to which parameter you are trying to estimate, and whether it can be estimated directly (i.e. by taking the number of heads out of your number of throws, or by weighing many penguins) or not. It is simply a way to relate the probability of your data given any values of your parameters to the probability of those parameter values given the data (based on the prior and marginal likelihood, of course). So it works for any model you can think of that may give rise to the data you are interested in. We will see below how that works for our simple linear model, and hopefully that will make it clear how it could work in principle for other kinds of models.

But before we get started, you might wonder about another difference from the previous example. In the coin toss example, there was only one parameter we were interested in (\(P(X=1)\)). In the linear model, there are two: \(\beta_0\) and \(\beta_1\). In fact, as you will see below, there are also other parameters we will try to estimate, which relate to the error. Yet the formulation of the Bayes theorem we have seen so far has a single term, \(\Theta\). That is not a problem: \(\Theta\) is a general term that can refer to one parameter, but also to many. So in our example, you can write something like this:

\[P(\Theta|y) = \frac{P(y|\Theta)P(\Theta)}{P(y)}\]

where:

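\[\Theta=(\beta, \sigma^2)\]

Here \(\beta\) stands for the regression coefficients (\(\beta_0\) and \(\beta_1\)) and \(\sigma^2\) for the parameter relating to the error.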

Or you could rewrite the Bayes theorem directly like this:

\[P(\beta, \sigma^2|y) = \frac{P(y|\beta, \sigma^2)P(\beta, \sigma^2)}{P(y)}\]

That’s also fine.
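
Either way, the denominator still marginalizes over every parameter at once. Assuming the parameters are continuous, \(P(y)\) in this notation is the integral of the numerator over all of them:

\[P(y) = \int\!\!\int P(y|\beta, \sigma^2)P(\beta, \sigma^2)\, d\beta\, d\sigma^2\]

Since \(\beta\) itself contains \(\beta_0\) and \(\beta_1\), the integral over \(\beta\) is really an integral over both coefficients, so adding parameters adds dimensions to this integral.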

So with all of that out of the way, we can now attack the Bayesian problem. As in the previous example, we need to figure out what the likelihood function of this particular model is, then what our priors should be, and finally integrate over the product of these two things to get the model evidence.
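
As a rough preview of these three steps, the sketch below applies them to a linear model on a grid of candidate values. Everything specific in it is an assumption made for illustration and not necessarily what the following sections settle on: the data are simulated, the errors are assumed Gaussian, \(\sigma\) is treated as known (so that the grid stays two-dimensional), and the priors on \(\beta_0\) and \(\beta_1\) are flat over the grid.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data (hypothetical): y = 1 + 2x + Gaussian noise
true_b0, true_b1, sigma = 1.0, 2.0, 0.5
x = np.linspace(0, 1, 50)
y = true_b0 + true_b1 * x + rng.normal(0, sigma, size=x.size)

# Grid of candidate values for the intercept (b0) and slope (b1)
b0 = np.linspace(-1, 3, 201)
b1 = np.linspace(0, 4, 201)
B0, B1 = np.meshgrid(b0, b1, indexing="ij")
cell = (b0[1] - b0[0]) * (b1[1] - b1[0])  # area of one grid cell

# Step 1: log-likelihood log P(y | b0, b1), ASSUMING Gaussian errors with known sigma
resid = y - (B0[..., None] + B1[..., None] * x)  # shape (201, 201, 50)
log_lik = (-0.5 * np.sum((resid / sigma) ** 2, axis=-1)
           - x.size * np.log(sigma * np.sqrt(2 * np.pi)))

# Step 2: log-prior log P(b0, b1), ASSUMED flat over the grid
grid_area = (b0[-1] - b0[0]) * (b1[-1] - b1[0])
log_prior = np.full_like(log_lik, -np.log(grid_area))

# Step 3: model evidence, i.e. likelihood times prior summed over the grid
log_joint = log_lik + log_prior
shift = log_joint.max()  # subtract the maximum before exponentiating, for numerical stability
unnorm = np.exp(log_joint - shift)
log_evidence = shift + np.log(unnorm.sum() * cell)

# Posterior density P(b0, b1 | y) on the grid
posterior = unnorm / (unnorm.sum() * cell)

# Grid point with the highest posterior density
i, j = np.unravel_index(np.argmax(posterior), posterior.shape)
b0_map, b1_map = b0[i], b1[j]
print(f"posterior mode: b0={b0_map:.2f}, b1={b1_map:.2f}, log evidence={log_evidence:.2f}")
```

The grid approach is only practical for a handful of parameters, since the number of grid points grows quickly with each added dimension, but it keeps the correspondence with the terms of the Bayes theorem easy to see.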