Intermediary recap#
The content we have covered so far was quite dense, so before we move forward, let’s recapitulate everything we have seen and keep things straight.
The linear model#
We started from a simple problem: we have two variables and we want to know if they are correlated with one another. We took a concrete example to illustrate this: imagine you want to know if the length of penguin flippers is related to their weight. We used several expressions to express the same idea:
Is the flipper length correlated with the weight of the penguins?
Is the flipper length predictive of the weight of the penguins?
And other variations of this. For any of these questions, if the answer is yes, it means that changes in the value of one variable are reflected in the other. If the answer is no, then changes in one variable are inconsequential to the other variable and the two are independent. And as you have probably heard countless times before, correlation doesn’t imply causation: if two variables are dependent on each other, it can be that one causes the other, that one is a consequence of the other, or alternatively that they are both driven by a third variable. In our example, the weight of the penguins may be related to their flipper length, but it would be odd to suggest that the flipper length has any causal relationship with the weight of the penguins. If you were to investigate whether penguin weight is associated with their food intake, it might seem more reasonable to assume causality.
If you want to know whether two variables are linked to one another, there are several ways you can go. You could compute the correlation coefficient between the two and see whether it is positive, negative, or null. The alternative approach we chose is to go for a linear model, which has the following formula:
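$$y_i = \beta_0 + \beta_1 x_i + \epsilon_i$$

where \(y_i\) is the weight of penguin \(i\), \(x_i\) its flipper length, \(\beta_0\) the intercept, \(\beta_1\) the slope, and \(\epsilon_i\) the error term.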
This approach enables us to answer our question, just as other approaches would. But as it turns out, other approaches (such as computing a correlation) are specific cases of the above. The linear model is therefore more general than other approaches. If you are already familiar with statistics, it is really important to clear up some confusions you may have. A linear model is not a way to estimate the correlation between two parameters, nor is it a way to know if the correlation between two (or more) variables is significant. Instead, it is a generative model for your data. In other words, it is a hypothesis. You are basically saying “I believe that my observed data are a weighted sum of some variables, give or take some error”. So in the case of our penguins, it is saying “I believe that the weight of the penguins is equal to the length of their flippers weighted by a constant, plus an intercept, though with some error because I don’t believe that it is going to be exactly correct”.
You can then use various other formulas and tools to estimate the weights for your regressors (flipper length): that would be the ordinary least squares (OLS) method. Here again, it is important to be clear about what we mean. When you “estimate the parameters” of a linear model, you are trying to find the values of the \(\beta\) parameters that yield the smallest error overall. You are looking for the \(\beta\) values with which the distance between each point and the regression line is on average as small as possible. Because it is equally bad to have a line that is below or above an observation (i.e. positive and negative misestimations are equally bad), you want to minimize the square of the error. So the OLS finds the \(\beta\) values that minimize the Residual Sum of Squares (RSS). And we saw that the \(\beta\) values that result in the smallest error are also the values under which the observations are the most likely based on the likelihood function.
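If you want to see what this looks like in practice, here is a minimal sketch using `numpy` (the flipper lengths and weights below are made-up numbers for illustration, not the actual penguin data):

```python
import numpy as np

# Made-up flipper lengths (mm) and weights (g), for illustration only
flipper_length = np.array([180.0, 185.0, 190.0, 195.0, 200.0, 205.0, 210.0, 215.0])
weight = np.array([3500.0, 3600.0, 3800.0, 3900.0, 4100.0, 4300.0, 4400.0, 4600.0])

# Design matrix: a column of ones for the intercept (beta_0) and the regressor (beta_1)
X = np.column_stack([np.ones_like(flipper_length), flipper_length])

# OLS: find the beta values that minimize the residual sum of squares (RSS)
betas, rss, *_ = np.linalg.lstsq(X, weight, rcond=None)
print("beta_0 (intercept):", betas[0])
print("beta_1 (slope):", betas[1])
```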
Bayes theorem and the linear model#
The question “Is there a relationship between two (or more) variables?” is equivalent to asking “Are the \(\beta\) parameters different from 0?”. Indeed, if you have a (linear) relationship between two variables, then by definition, a change in one variable will be associated with a change in the other, so the \(\beta\) parameter of that regressor will be different from 0. Once again, we are dealing with some randomness (stochastic processes). So if we observe two variables, such as the length of penguins’ flippers and their weight, and we find using the OLS method that the \(\beta\) weight of the flipper length regressor is different from 0, it doesn’t mean that there really is a relationship between the two, just as observing 7/10 heads when throwing a coin doesn’t mean that your coin is biased. What we really want to know is the true relationship between the two variables, which we can’t know for sure. So instead, what we can seek to know is how confident we are in the values of our \(\beta\) (and other) parameters after seeing some data. And for that, we need the Bayes theorem:
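$$P(\Theta \mid y) = \frac{P(y \mid \Theta)\,P(\Theta)}{P(y)}$$

where \(\Theta\) stands for the parameters of our model and \(y\) for the observed data.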
In the case of the coin toss, this is a bit simpler, because both \(y\) and \(\Theta\) are a single number: the number of heads observed in our experiment and the probability of getting heads. But in the case of our linear model, the \(\Theta\) parameter refers to several different parameters: each of the \(\beta\) parameters, as well as the \(\sigma\) parameter which refers to how large our error \(\epsilon\) is. Specifically, the \(\sigma^2\) parameter is the variance of the normal distribution of the error. In addition, the data \(y\) refer to several observations (one for each penguin). This implies that things get a little bit trickier when it comes to computing the likelihood and the prior.
The likelihood#
So, in our penguin example, where we have 2 \(\beta\), the \(\Theta\) of the Bayes theorem above is replaced by \(\beta_0, \beta_1, \sigma^2\), so we can rewrite the Bayes theorem as such for that specific case:

$$P(\beta_0, \beta_1, \sigma^2 \mid y) = \frac{P(y \mid \beta_0, \beta_1, \sigma^2)\,P(\beta_0, \beta_1, \sigma^2)}{P(y)}$$
In the case of the linear model, the likelihood is the probability of observing all the data \(y\) given the values of \(\beta\) and \(\sigma\). Importantly, by the very formula of the linear model, the values of \(y\) depend on the values of all the \(\beta\) combined, as well as on the value of \(\sigma\) for the error. In other words, the likelihood of \(y\) is not independent of the parameters; instead it depends jointly on the parameters of the model. Accordingly, the likelihood is not equal to the product of the likelihood of the data under each parameter independently. Instead, the likelihood is a multivariate distribution, providing the likelihood of the data \(y\) given the combination of the \(\beta\) and \(\sigma\) parameters.
What kind of distribution? A multivariate normal distribution. This is because of the main assumption of linear models: the errors should be normally and independently and identically distributed (i.i.d.). Because the error is normally distributed, so is the likelihood.
The general formula of a multivariate normal distribution is:
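$$P(\mathbf{x}) = \frac{1}{(2\pi)^{k/2}|\boldsymbol{\Sigma}|^{1/2}} \exp\left(-\frac{1}{2}(\mathbf{x} - \boldsymbol{\mu})^\top \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu})\right)$$

where \(\mathbf{x}\) is a \(k\)-dimensional vector, \(\boldsymbol{\mu}\) the vector of means, and \(\boldsymbol{\Sigma}\) the variance-covariance matrix.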
What we mean by the “general formula” is that any multivariate normal distribution follows this formula. It can however be written in more specific ways depending on the parameters it refers to. In the case of the likelihood function, it describes the probability of the data given the parameters, so we can write the following:
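$$P(y \mid \beta_0, \beta_1, \sigma^2) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(y_i - (\beta_0 + \beta_1 x_i))^2}{2\sigma^2}\right)$$

where \(n\) is the number of observations (penguins).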
How is that the same? Well, that’s because a product of exponentials can be combined into a single exponential of a sum:
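$$P(y \mid \beta_0, \beta_1, \sigma^2) = \frac{1}{(2\pi\sigma^2)^{n/2}} \exp\left(-\frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i - (\beta_0 + \beta_1 x_i))^2\right)$$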
This expression aligns with the multivariate normal distribution’s formula when we consider \(y\) as an \(n\)-dimensional vector, \(\mu\) as the vector of expected values given by the model (\(\mu = X \beta\)), and \(\Sigma\) as \(\sigma^2 I_n\), where \(I_n\) is the \(n \times n\) identity matrix.
Upon calculating the likelihood, we made two observations. First, the \(\beta_0\) and \(\beta_1\) values under which the likelihood of our observations \(y\) is highest are equivalent to the estimates we get from the OLS method. That’s in fact why finding the \(\beta\) values under which the error is minimal is useful: these are the values under which the data are most likely. And this is why we call these the Maximum Likelihood Estimates (MLE). And just as there are MLE for the \(\beta\), there is also a value of \(\sigma^2\) under which the data are most likely, i.e. there is an MLE for that parameter. You can either compute the likelihood and scan values of \(\sigma^2\) until you find the one under which the data are most likely, or you can take a shortcut and use this formula:
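$$\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_i)\right)^2$$

where \(\hat{\beta}_0\) and \(\hat{\beta}_1\) are the estimated coefficients.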
This is the formula of the variance. The hat here signifies that it is the observed variance based on the data. And as you know, this does not need to be equal to the true variance. In other words, say you observe 10 penguins, find your best-fit estimates for \(\beta_0\) and \(\beta_1\), and compute the measurement error: you don’t know if the value you get is the ground truth, only that it is what you have observed in your sample. This is yet again another parameter you can’t really know about.
But the formula above is not limited to the observed variance; it is the general formula of the variance:
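$$\sigma^2 = \frac{1}{n}\sum_{i=1}^{n}(y_i - \mu_i)^2$$

where \(\mu_i\) is the mean predicted for observation \(i\) (in our model, \(\mu_i = \beta_0 + \beta_1 x_i\)).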
This formula computes the average of the squared differences between the observed values and the values predicted by our model, which is the essence of variance in statistical terms. And for any normal distribution, this is what the \(\sigma^2\) parameter refers to: the average squared difference between a value and the mean of the distribution. And this once again makes sense: a large variance means that values further away from the mean are more likely, which means that the distribution is wider; a small variance means that values that are far from the mean are not very likely, which means that our distribution is quite tight.
And the other observation we made is that when we have two variables and our \(x\) regressor isn’t centered, the likelihood takes kind of a weird shape: there is a ridge such that the distribution is squished in one direction. Such a ridge implies that there is a correlation between the values of \(\beta_0\) and \(\beta_1\), because it means that the pairs of values of \(\beta_0\) and \(\beta_1\) under which the data are most likely fall along a specific line. In this specific case, it is a negative relationship: for larger values of \(\beta_1\), the data tend to be more likely under smaller \(\beta_0\) values, and reciprocally. In this case, this is an artifact of sorts that occurs when the regressors aren’t centered. We can fix that issue by subtracting the mean value of our regressor \(\bar{x}\) (i.e. the mean flipper length) from each penguin’s flipper length \(x_i\). More generally, it is always a good idea to center each variable independently to avoid such artifacts.
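Centering is a one-liner; here is a minimal sketch, again assuming the flipper lengths are stored in a `numpy` array as in the earlier example:

```python
import numpy as np

# Hypothetical flipper lengths (mm); replace with your own data
flipper_length = np.array([180.0, 185.0, 190.0, 195.0, 200.0, 205.0, 210.0, 215.0])

# Center the regressor: subtract the mean so that x_bar becomes 0
flipper_centered = flipper_length - flipper_length.mean()

# After centering, beta_0 is interpreted as the expected weight of a penguin
# with an average flipper length, and the ridge in the likelihood disappears
```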
Importantly, even if you have centered your variables, it can happen that there is a correlation between pairs of \(\beta\) parameters under which the data are most likely. It occurs when the predictor variables themselves are correlated, i.e., when there is multicollinearity among the predictors. In such cases, even after centering, if two or more predictors are highly correlated, the estimation of each \(\beta\) becomes less precise, and the likelihood surface shows ridges or elongations reflecting the dependencies between the coefficients.
The prior#
The prior specifies the beliefs we have about likely values of all parameters in the model: the \(\beta\) as well as \(\sigma^2\). For the \(\beta\), we opted for a normal distribution, while for the \(\sigma^2\) parameter we opted for an inverse gamma distribution, because it is only defined for positive values, which is also the case for the variance.
The inverse gamma is defined as:
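$$P(\sigma^2) = \frac{\beta^{\alpha}}{\Gamma(\alpha)} (\sigma^2)^{-\alpha - 1} \exp\left(-\frac{\beta}{\sigma^2}\right)$$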
Where \(\alpha\) and \(\beta\) are parameters that we can tweak to specify our beliefs about the likely values for the variance of the error term of our model.
In the case of the \(\beta\), we use a multivariate normal distribution, so that we can specify whether we believe that there may be a correlation between our \(\beta\) parameters. That would be the case if we believe that observing a large value for one \(\beta\) makes a large value for the other more likely, in which case we should specify a positive covariance. Alternatively, we can also specify a negative covariance between two \(\beta\) if we believe that observing a high value for one of the \(\beta\) makes a small value for the other more likely. Importantly, while for the likelihood function we would expect to observe covariance between the \(\beta\) parameters if there is some collinearity in the regressors \(x\), here this is independent of seeing any data. It would make sense that if there is a correlation between the \(\beta\), say because both regressors are driven by the same underlying thing, then the actual values of \(x\) would tend to be similar too. But in the case of the priors, this is independent of the actual values of \(x\); it only depends on our beliefs. And so the prior of the \(\beta\) parameters is specified as:
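$$P(\boldsymbol{\beta}) = \frac{1}{(2\pi)^{p/2}|\boldsymbol{\Sigma}|^{1/2}} \exp\left(-\frac{1}{2}(\boldsymbol{\beta} - \boldsymbol{\mu})^\top \boldsymbol{\Sigma}^{-1} (\boldsymbol{\beta} - \boldsymbol{\mu})\right)$$

with \(p\) the number of \(\beta\) parameters (here \(p = 2\)).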
Where:
\(\boldsymbol{\beta}\) is a vector with the values of each \(\beta\) parameter for which we want to obtain the probability
\(\boldsymbol{\mu}\) is a vector with the mean value of the prior distribution of each \(\beta\)
\(\boldsymbol{\Sigma}\) is the variance-covariance matrix. The diagonal specifies the variance of each parameter, and the off-diagonal elements represent the covariance between each pair of parameters
Keeping things straight#
I know, I know—at this point, you probably feel like you’re playing Sudoku. There are many bits and pieces, many distributions, and the same parameter \(\sigma^2\) popping up all over the place. It’s really easy to get things mixed up! So here is a summary of each of the bits and pieces that we need to move forward:
Linear model: the generative model for our data
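$$y_i = \beta_0 + \beta_1 x_i + \epsilon_i$$

where \(y_i\) is the weight of penguin \(i\) and \(x_i\) its flipper length.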
Assumption About the Error Term:
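$$\epsilon_i \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, \sigma^2)$$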
This means the errors are independently and identically distributed (i.i.d.) normal random variables with mean zero and variance \(\sigma^2\).
Bayes’ Theorem for Our Model: Since we have three parameters that we don’t know—\(\beta_0\), \(\beta_1\), and \(\sigma^2\)—and we want to know how confident we are about their true values given the data, we use Bayes’ theorem:
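$$P(\beta_0, \beta_1, \sigma^2 \mid y) = \frac{P(y \mid \beta_0, \beta_1, \sigma^2)\,P(\beta_0, \beta_1, \sigma^2)}{P(y)}$$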
Likelihood Function:
The likelihood of the data given the parameters, \(P(y \mid \beta_0, \beta_1, \sigma^2)\), defines the probability of observing the data we have observed (e.g., the weights of the penguins), given the values of the parameters. It is calculated as:
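$$P(y \mid \beta_0, \beta_1, \sigma^2) = \frac{1}{(2\pi\sigma^2)^{n/2}} \exp\left(-\frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i - (\beta_0 + \beta_1 x_i))^2\right)$$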
Here, \(n\) is the number of observations.
The parameters \(\beta_0\), \(\beta_1\), and \(\sigma^2\) are what we’re trying to estimate.
Prior Distribution: The prior distribution is defined as:
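$$P(\beta_0, \beta_1, \sigma^2) = P(\boldsymbol{\beta})\,P(\sigma^2)$$

This factorization assumes that the prior on \(\boldsymbol{\beta}\) is independent of the prior on \(\sigma^2\), which is what we have implicitly done by specifying the two priors separately.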
Where \(\boldsymbol{\beta} = \begin{bmatrix} \beta_0 \\ \beta_1 \end{bmatrix}\) is the vector of coefficients.
Prior for \(\sigma^2\):
The prior distribution of the variance parameter \(\sigma^2\) states our belief about the likely values of \(\sigma^2\), which conditions the distribution of the error term in our model. It follows an inverse gamma distribution:
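$$P(\sigma^2) = \frac{\beta^{\alpha}}{\Gamma(\alpha)} (\sigma^2)^{-\alpha - 1} \exp\left(-\frac{\beta}{\sigma^2}\right)$$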
Here:
\(\alpha\) and \(\beta\) are parameters that we can adjust to specify our belief about the likely values for the variance of the error term in our model.
\(\Gamma(\alpha)\) is the gamma function evaluated at \(\alpha\).
Prior for \(\boldsymbol{\beta}\): Finally—and this is probably where it gets confusing—the prior distribution of the \(\beta\) parameters is a multivariate normal distribution. While it’s the same kind of distribution as the likelihood (since both are normal), it’s important to note that they are separate entities; they just belong to the same family of distributions.
Variance Terms:
Because the prior is a normal distribution, we have a different set of variances (and covariances) that refer to the prior distribution of each \(\beta\) parameter.
Important Distinction:
\(\sigma^2\) in the likelihood: This is the parameter conditioning the distribution of the error term of our model, and it’s a parameter we’re trying to estimate given the data.
Variances in the prior of \(\boldsymbol{\beta}\): These reflect our belief about which values are plausible for our \(\beta\) parameters.
Key Point:
These variances are not the same. The \(\sigma^2\) conditioning the error term depends on the data \(y\), while the variances in the prior only depend on our belief about what values are plausible for \(\boldsymbol{\beta}\).
These variance terms are what we find in the variance-covariance matrix \(\boldsymbol{\Sigma}\) of our prior.
Variance-Covariance Matrix \(\boldsymbol{\Sigma}\):
The diagonal elements represent the variances of each \(\beta\) parameter.
The off-diagonal elements represent the covariances between pairs of \(\beta\) parameters, reflecting our belief about whether the \(\beta\) parameters are correlated with each other.
Prior Distribution of \(\boldsymbol{\beta}\): The prior for \(\boldsymbol{\beta}\) is given by:
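$$P(\boldsymbol{\beta}) = \frac{1}{(2\pi)^{p/2}|\boldsymbol{\Sigma}|^{1/2}} \exp\left(-\frac{1}{2}(\boldsymbol{\beta} - \boldsymbol{\mu})^\top \boldsymbol{\Sigma}^{-1} (\boldsymbol{\beta} - \boldsymbol{\mu})\right)$$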
Where:
\(\boldsymbol{\beta}\) is the vector of \(\beta\) parameters.
\(\boldsymbol{\mu}\) is the vector of prior means for the \(\beta\) parameters.
\(p\) is the number of \(\beta\) parameters (including the intercept).
\(|\boldsymbol{\Sigma}|\) is the determinant of the variance-covariance matrix \(\boldsymbol{\Sigma}\).
\(\boldsymbol{\Sigma}^{-1}\) is the inverse of the variance-covariance matrix.
Got it? Great, now we can move on to the part where problems actually start. Oh boy.