Marginal likelihood#

With everything we have seen so far, we are ready to tackle the marginal likelihood of our linear model. The marginal likelihood is the denominator of Bayes’ theorem:

\[P(\Theta|y) = \frac{P(y|\Theta)P(\Theta)}{P(y)}\]

It is equal to the integral of the numerator over all possible parameter values, which is what makes the posterior integrate to 1:

\[P(\Theta|y) = \frac{P(y|\Theta)P(\Theta)}{\int P(y|\Theta)P(\Theta)\,d\Theta}\]

By now, we know that the numerator is:

\[P(y|\Theta)P(\Theta) = \left(\frac{1}{\sqrt{2\pi\sigma^2}}\right)^n \exp\left(-\frac{1}{2\sigma^2}(y - \boldsymbol{X}\boldsymbol{\beta})^T(y-\boldsymbol{X}\boldsymbol{\beta})\right) \times \Bigg(\frac{1}{(2\pi)^{p/2}|\Sigma|^{1/2}}\exp\left(-\frac{1}{2}(\boldsymbol{\beta} - \boldsymbol{\mu})^T\Sigma^{-1}(\boldsymbol{\beta}-\boldsymbol{\mu})\right) \times \frac{b^\alpha}{\Gamma(\alpha)}(\sigma^2)^{-(\alpha+1)}\exp\left(-\frac{b}{\sigma^2}\right)\Bigg)\]

(\(\alpha\) and \(b\) are the parameters of the prior distribution of \(\sigma^2\))

That’s a big, scary formula, but that’s not a problem. Note that everything here is expressed in vector form, otherwise it would be far too long. We can plug in the various bits and pieces and compute the numerator for any values of the parameters, given our data. In fact, we can write a simple Python function to express it, simply by combining the various functions we have already created!

import numpy as np
from math import gamma

def lm_likelihood(y, X, beta, sigma):
    """
    Computes the likelihood of observing y given X for parameters beta and sigma.
    
    Parameters:
    - y : array-like of shape (n,), observed values
    - X : array-like of shape (n, p), predictor values (should include a column of ones if intercept is included in beta)
    - beta : array-like of shape (p,), regression coefficients (including intercept if X includes a column of ones)
    - sigma : float, standard deviation of the error term
    
    Returns:
    - likelihood : float, the likelihood of the observed data given the parameters
    """
    # Ensure inputs are numpy arrays
    y = np.asarray(y)
    X = np.asarray(X)
    beta = np.asarray(beta)
    
    # Check dimensions
    n = y.shape[0]
    if X.shape[0] != n:
        raise ValueError("The number of rows in X must match the length of y.")
    if X.shape[1] != beta.shape[0]:
        raise ValueError("The number of columns in X must match the length of beta.")
    
    # Calculate the predicted values using the linear model
    y_pred = X @ beta  # Matrix multiplication
    
    # Compute the squared residuals
    squared_residuals = (y - y_pred) ** 2
    
    # Calculate the likelihood
    norm_const = (1 / np.sqrt(2 * np.pi * sigma**2)) ** n
    likelihood = norm_const * np.exp(-np.sum(squared_residuals) / (2 * sigma**2))
    
    return likelihood
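

# Quick sanity check of lm_likelihood with simulated data (the numbers below are
# arbitrary and purely illustrative): the likelihood should typically be larger
# at the coefficients that generated the data than at clearly wrong ones.
rng = np.random.default_rng(42)
n_sim = 20
X_sim = np.column_stack([np.ones(n_sim), rng.normal(size=n_sim)])
true_beta = np.array([2.0, 0.5])
y_sim = X_sim @ true_beta + rng.normal(scale=1.0, size=n_sim)
print(lm_likelihood(y_sim, X_sim, true_beta, sigma=1.0))               # relatively large
print(lm_likelihood(y_sim, X_sim, np.array([5.0, -3.0]), sigma=1.0))   # much smaller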


def multivariate_normal_pdf(beta, mu, Sigma):
    """
    Computes the probability density function of a multivariate normal distribution.
    
    Parameters:
    - beta : np.ndarray
        A 1D array of shape (p,) representing the point at which to evaluate the PDF.
    - mu : np.ndarray
        A 1D array of shape (p,) representing the mean vector of the distribution.
    - Sigma : np.ndarray
        A 2D array of shape (p, p) representing the covariance matrix of the distribution.
        
    Returns:
    - float
        The value of the PDF evaluated at beta.
        
    Notes:
    The multivariate normal PDF is given by:
    
        P(beta) = (1 / ((2 * pi)^(p/2) * |Sigma|^(1/2))) * 
                  exp(-0.5 * (beta - mu)^T * Sigma^{-1} * (beta - mu))
                  
    where:
    - p is the dimensionality of beta,
    - |Sigma| is the determinant of the covariance matrix,
    - Sigma^{-1} is the inverse of the covariance matrix.
    """
    beta = np.asarray(beta)
    mu = np.asarray(mu)
    Sigma = np.asarray(Sigma)
    
    # Ensure that beta and mu are 1D arrays
    if beta.ndim != 1 or mu.ndim != 1:
        raise ValueError("beta and mu must be 1-dimensional arrays.")
    
    # Ensure that Sigma is a 2D square matrix
    if Sigma.ndim != 2 or Sigma.shape[0] != Sigma.shape[1]:
        raise ValueError("Sigma must be a 2-dimensional square matrix.")
    
    p = beta.shape[0]
    
    # Check that the dimensions match
    if mu.shape[0] != p or Sigma.shape[0] != p:
        raise ValueError("Dimensions of beta, mu, and Sigma do not match.")
    
    # Compute the determinant and inverse of Sigma
    det_Sigma = np.linalg.det(Sigma)
    if det_Sigma <= 0:
        raise ValueError("The covariance matrix Sigma must be positive definite.")
    
    inv_Sigma = np.linalg.inv(Sigma)
    
    # Compute the normalization constant
    norm_const = 1.0 / (np.power(2 * np.pi, p / 2) * np.sqrt(det_Sigma))
    
    # Compute the exponent
    diff = beta - mu
    exponent = -0.5 * np.dot(diff.T, np.dot(inv_Sigma, diff))
    
    # Compute the PDF value
    pdf = norm_const * np.exp(exponent)
    
    return pdf
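

# Quick illustration of multivariate_normal_pdf (the values below are arbitrary):
# the density is highest at the mean and decreases as we move away from it.
mu_demo = np.array([0.0, 0.0])
Sigma_demo = np.array([[1.0, 0.3],
                       [0.3, 2.0]])
print(multivariate_normal_pdf(np.array([0.0, 0.0]), mu_demo, Sigma_demo))   # at the mean
print(multivariate_normal_pdf(np.array([2.0, -1.0]), mu_demo, Sigma_demo))  # further away: smaller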


def inv_gamma_pdf(x, alpha, beta):
    """
    Computes the probability density function of the inverse gamma distribution.
    
    Parameters:
    - x : float or np.ndarray
        The value(s) at which to evaluate the PDF. Must be positive.
    - alpha : float
        The shape parameter of the inverse gamma distribution. Must be positive.
    - beta : float
        The scale parameter of the inverse gamma distribution. Must be positive.
        
    Returns:
    - float or np.ndarray
        The PDF of the inverse gamma distribution evaluated at x.
    
    Notes:
    The inverse gamma distribution PDF is given by:
    
        f(x; alpha, beta) = (beta ** alpha / gamma(alpha)) * x ** (-alpha - 1) * exp(-beta / x)
    
    where `alpha` > 0 and `beta` > 0.
    """
    return (beta ** alpha / gamma(alpha)) * x ** (-alpha - 1) * np.exp(-beta / x)
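

# Quick look at the inverse gamma density at a few (arbitrary) values of sigma^2,
# using arbitrary illustrative parameters alpha=3 and beta=4.
for s2 in [0.5, 1.0, 5.0, 20.0]:
    print(s2, inv_gamma_pdf(s2, alpha=3.0, beta=4.0))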


def bayes_numerator_lm(y, X, beta, sigma, beta_prior_mu, beta_prior_sigma, sigma_priors):
    """Computes the numerator of Bayes' theorem: the likelihood times the priors on beta and sigma."""
    likelihood = lm_likelihood(y, X, beta, sigma)
    beta_prior_proba = multivariate_normal_pdf(beta, beta_prior_mu, beta_prior_sigma)
    # sigma_priors holds the (alpha, b) parameters of the inverse gamma prior
    sigma_prior_proba = inv_gamma_pdf(sigma, sigma_priors[0], sigma_priors[1])
    return likelihood * beta_prior_proba * sigma_prior_proba

It’s a bit long, but we get there eventually. Each piece of the numerator returns a single density value reflecting how likely something is when the parameters are set to those values, and then we just need to multiply these values together. In other words, as long as the data y, the regressors X and the priors stay the same, we can pass in any values of beta and sigma and get a single value out. We can also change the data and keep everything else the same, and see how the result changes depending on the data. Everything is possible here: computing the numerator poses no major difficulty, it is just that the formula is a bit clunky and the code a bit long.
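
To make this concrete, here is a minimal sketch using made-up toy data and arbitrary priors (none of these numbers come from the penguin dataset; they are only there to show that the function returns a single number for each candidate set of parameters):

import numpy as np

# Made-up toy data: 5 observations, an intercept column and one predictor
X = np.column_stack([np.ones(5), np.array([180., 185., 190., 195., 200.])])
y = np.array([3500., 3600., 3750., 3900., 4000.])

# Arbitrary (and very vague) priors, purely for illustration
beta_prior_mu = np.array([0., 0.])
beta_prior_sigma = np.eye(2) * 1e6
sigma_priors = (2.0, 100.0)  # (alpha, b) of the inverse gamma prior

# Two candidate sets of parameters: each call returns one number
val_a = bayes_numerator_lm(y, X, np.array([-1000., 25.]), 50.0,
                           beta_prior_mu, beta_prior_sigma, sigma_priors)
val_b = bayes_numerator_lm(y, X, np.array([0., 19.]), 50.0,
                           beta_prior_mu, beta_prior_sigma, sigma_priors)
print(val_a, val_b)  # the better-fitting candidate comes out (much) larger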

The marginal likelihood and the issue of integration#

That is unfortunately not the case for the marginal likelihood. Because of the integral, the marginal likelihood cannot be computed analytically. As we saw in the case of our coin toss example, we could there manipulate the various formulae to remove the integral sign from the equation altogether: we say that we found a closed form solution for the equation. If we cannot find a closed form solution, we cannot solve the equation, meaning we cannot find the result of that equation. It is simple to understand why, as we briefly mentioned in the coin toss example. We are dealing with continuous variables here: the \(\beta\) parameters can take any value between \(-\infty\) and \(+\infty\), and \(\sigma\) any positive value. The integral means that we need to take the sum of the value of the numerator for every possible value the \(\beta\) and \(\sigma\) parameters can take. But since there is an infinity of them (\(\beta_1=0.2\), but also \(\beta_1=0.0000000001\), and so on), we cannot possibly compute it exhaustively.

But then, maybe we can find a trick, like in the case of the coin toss example, to get rid of that nasty integral? Let’s give it a try:

\[P(y) = \int\int \left(\frac{1}{\sqrt{2\pi\sigma^2}}\right)^n \exp\left(-\frac{1}{2\sigma^2}(y - \boldsymbol{X}\boldsymbol{\beta})^T(y-\boldsymbol{X}\boldsymbol{\beta})\right) \times \Bigg(\frac{1}{(2\pi)^{p/2}|\Sigma|^{1/2}}\exp\left(-\frac{1}{2}(\boldsymbol{\beta} - \boldsymbol{\mu})^T\Sigma^{-1}(\boldsymbol{\beta}-\boldsymbol{\mu})\right) \times \frac{b^\alpha}{\Gamma(\alpha)}(\sigma^2)^{-(\alpha+1)}\exp\left(-\frac{b}{\sigma^2}\right)\Bigg) \, d\boldsymbol{\beta} \, d\sigma^2\]

In this case, we cannot get rid of the integral, so we cannot compute the denominator. And it is not that I, the author of this book, failed to figure it out and that it has been solved somewhere else. Nor is it the case that mathematicians simply have not solved it yet and eventually will. It just is the case that this equation has no closed form solution, which is the same as saying that it has no analytical solution. It never will. Now, maybe you will read this book and go on to pursue a successful career as a mathematician to prove me wrong and actually solve that equation analytically. If you were to do that, you would shatter everything we know about mathematics and be crowned the smartest person who has ever lived. That is no exaggeration: it would be that big a deal.

What do we do now?#

So, we started with an easy question: “Is the length of penguins’ flippers correlated with their weight?” We wanted to use Bayes’ theorem to know how confident we should be in the value of \(\beta_1\) in our model given the data, to see how likely or unlikely it is to be 0 (or close to it), in order to answer our question. To be precise, we want to compute the probability distribution of all possible values of \(\beta_1\) given the data, so that we can know the interval of likely values of \(\beta_1\). And after all this complicated math, we arrive at a big letdown: there is a bit of Bayes’ theorem we cannot solve, so we cannot compute the posterior, and therefore we are stuck with our question.

The integral is not only a problem in the specific case of the linear model; it is a problem for many statistical models. In fact, there are probably fewer examples in which it is not a problem (such as our coin toss example) than the other way around. But lucky for us, mathematicians are stubborn, and when they cannot solve an equation directly, they spend decades and centuries figuring out tricks and workarounds. To be clear: they do not eventually find a solution to an equation previously thought unsolvable. They instead find ways to get to something that should be close enough, which they call approximations. In this specific case, there are two different families of strategies for dealing with the unsolvable equation:

  1. Computational approaches:

  • Roughly speaking, throw a lot of computational resources at the problem. The least refined way to go about it is to multiply the prior and likelihood for all combinations of the parameters (\(\beta\) and \(\sigma\)) ranging from large negative values to large positive ones and everything in between (see the sketch after this list). This is never done in practice because it would be absurdly expensive computationally speaking, even with only a few parameters to consider. Instead, fancier methods are used: so-called samplers relying on Markov chain Monte Carlo (MCMC) to generate numbers following complicated probabilistic rules.

  2. Variational approaches:

  • Solve even more complicated equations and use a very broad array of tricks to find some solvable equations that should approximate the integral. You will often hear these methods referred to as “approximate Bayes” or “approximate Bayesian inference” or things along those lines. As the name suggests, these methods are not exact, but they approximate the exact solution well enough.
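
To make the brute-force idea from point 1 concrete, and to see why it is never used in practice, here is a minimal sketch that approximates the integral by summing the numerator over a coarse grid of parameter values. It assumes the toy data, priors and bayes_numerator_lm function defined above; the grid ranges and resolutions are arbitrary choices, and the result is only a rough approximation of \(P(y)\):

# Brute-force approximation of the marginal likelihood on a coarse grid.
# With only 3 parameters (two betas and sigma) and 40 points per parameter we
# already need 40**3 = 64000 evaluations; this explodes quickly with more parameters.
beta0_grid = np.linspace(-2000, 2000, 40)
beta1_grid = np.linspace(-50, 50, 40)
sigma_grid = np.linspace(1, 200, 40)

# Volume of one grid cell, needed to turn the sum into an approximate integral
cell_volume = ((beta0_grid[1] - beta0_grid[0])
               * (beta1_grid[1] - beta1_grid[0])
               * (sigma_grid[1] - sigma_grid[0]))

marginal_likelihood_approx = 0.0
for b0 in beta0_grid:
    for b1 in beta1_grid:
        for s in sigma_grid:
            numerator = bayes_numerator_lm(y, X, np.array([b0, b1]), s,
                                           beta_prior_mu, beta_prior_sigma, sigma_priors)
            marginal_likelihood_approx += numerator * cell_volume

print(marginal_likelihood_approx)  # a (very) rough approximation of P(y)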

As the name of the book indicates, we will focus on the second approach. And to be even more specific, we will use one specific (out of many) variational approach: Variational Laplace. This approach is tailored for linear problems and assumes that the errors are Gaussian, which is the same thing as saying normally distributed. Let’s get started!