
In the quest for knowledge, science constantly pits competing ideas against each other. How do we decide which theory—which model of the world—is best supported by the evidence? A common temptation is to favor the model that fits our data most closely, but this path can be deceptive, often leading us to overly complex explanations that mistake noise for signal. The challenge, then, is to find a rational way to balance a model's accuracy against its simplicity. This fundamental problem of model selection requires a tool that can quantitatively weigh evidence and prevent us from fooling ourselves.
This article introduces model evidence, a cornerstone of the Bayesian framework that provides a principled solution to this challenge. It acts as a mathematical formalization of Occam's razor, rewarding models for their predictive power while penalizing them for unnecessary complexity. In the chapters that follow, we will explore this powerful concept in detail. The first chapter, "Principles and Mechanisms," will unpack the core mechanics of model evidence, explaining how it arises from Bayes' theorem, how it automatically prefers simpler theories, and the crucial role our initial assumptions, or priors, play in the process. Subsequently, the "Applications and Interdisciplinary Connections" chapter will showcase the remarkable versatility of model evidence, demonstrating its use in fields as diverse as cosmology, evolutionary biology, and artificial intelligence to resolve scientific debates and drive discovery.
Imagine you are a detective at the scene of a crime. You have a handful of suspects, each with a different story. Your job is to sift through the clues—the evidence—and decide which story is the most plausible. Science works in much the same way. We have competing hypotheses, which we can think of as formal “models” of how some part of the world works. And we have data, the clues left by nature. How do we rationally decide which model is best supported by the data? Is it simply the one that “fits” the data most closely? The answer, it turns out, is much more subtle and beautiful. The Bayesian framework gives us a precise, mathematical tool for this task: the model evidence.
In the Bayesian way of thinking, our belief in a model is updated as we collect evidence. The logic is captured by Bayes' theorem, applied not just to parameters but to entire models:

$$P(M_i \mid D) \;\propto\; P(D \mid M_i)\, P(M_i).$$
The term we are interested in is $P(M_i \mid D)$, the posterior probability—our updated belief in model $M_i$ after seeing the data $D$. This is proportional to two things: our prior probability $P(M_i)$, which represents our initial belief in the model before seeing the data, and a crucial quantity called the marginal likelihood or model evidence, $P(D \mid M_i)$.
The model evidence is the probability of having observed the actual data, given the model. But it's not as simple as picking the best-fitting parameters of the model and calculating the probability. Instead, it's the average probability of the data over all possible parameter settings the model allows, weighted by our prior beliefs about those parameters. Mathematically, if a model $M_i$ has parameters $\theta$, the evidence is:

$$P(D \mid M_i) \;=\; \int P(D \mid \theta, M_i)\, P(\theta \mid M_i)\, d\theta.$$
This integral is the heart of the matter. It tells us, "On the whole, how well does this model, in all its possible configurations, predict the data we actually saw?"
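To make the integral concrete, here is a minimal Python sketch; the coin-flip setting, the counts, and the Beta(2, 2) prior are illustrative assumptions, not taken from the text. It evaluates the evidence for a simple binomial model by numerically averaging the likelihood over the prior and checks the answer against the known Beta-Binomial closed form.

```python
import numpy as np
from scipy import integrate, stats
from scipy.special import betaln, comb

# Hypothetical data: 7 heads in 10 coin flips.
n, k = 10, 7

# Model: each flip is Bernoulli(theta), with a Beta(2, 2) prior on theta.
prior = stats.beta(2, 2)

def likelihood(theta):
    return stats.binom.pmf(k, n, theta)

# Model evidence: average the likelihood over the prior.
evidence, _ = integrate.quad(lambda t: likelihood(t) * prior.pdf(t), 0.0, 1.0)

# Sanity check against the Beta-Binomial closed form.
closed_form = comb(n, k) * np.exp(betaln(2 + k, 2 + n - k) - betaln(2, 2))

print(f"numerical evidence   = {evidence:.6f}")
print(f"closed-form evidence = {closed_form:.6f}")
```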
To compare two models, say $M_1$ and $M_2$, we can look at the ratio of their evidences, a quantity known as the Bayes Factor:

$$B_{12} \;=\; \frac{P(D \mid M_1)}{P(D \mid M_2)}.$$
If we start with no preference for either model (i.e., equal priors, $P(M_1) = P(M_2)$), the Bayes Factor directly tells us which model is now more probable. A Bayes Factor of 10 means the data support $M_1$ ten times more strongly than $M_2$. Biologists use this routinely. For instance, when reconstructing an evolutionary tree, they might compare a simple model of DNA mutation (like the Jukes-Cantor model) with a more complex one (like the GTR model). Or they might test whether evolution proceeds at a steady "strict clock" rate or a more variable "relaxed clock" rate. By calculating the Bayes Factor from their DNA data, they can quantitatively decide which model provides a more compelling explanation of evolutionary history.
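As a toy illustration, not the phylogenetic comparison just described (which requires specialised software), here is a hedged sketch comparing a "fair coin" model with no free parameters against a "biased coin" model with a uniform prior on the bias; the counts are again invented.

```python
from scipy import integrate, stats

# Hypothetical data: 7 heads in 10 flips.
n, k = 10, 7

# M1: the coin is exactly fair. No free parameters, so the evidence is
# simply the likelihood at theta = 0.5.
evidence_m1 = stats.binom.pmf(k, n, 0.5)

# M2: the coin has an unknown bias theta with a uniform prior on [0, 1].
# Its evidence averages the likelihood over that prior.
evidence_m2, _ = integrate.quad(lambda t: stats.binom.pmf(k, n, t), 0.0, 1.0)

print(f"P(D | M1, fair coin)   = {evidence_m1:.4f}")
print(f"P(D | M2, biased coin) = {evidence_m2:.4f}")
print(f"Bayes factor B12       = {evidence_m1 / evidence_m2:.2f}")
```

With these made-up counts the simpler fair-coin model comes out slightly ahead (a Bayes factor of roughly 1.3), a small preview of the Occam effect discussed next.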
We are often told to prefer simpler explanations over more complex ones. This principle is famously known as Occam's razor. But is this just a philosophical preference, a rule of thumb for tidy thinking? No. Astonishingly, it is a direct mathematical consequence of the definition of model evidence.
Let’s explore this with a thought experiment. Suppose we have some data points that lie perfectly on a straight line. We want to decide between two models to explain this data: a simple linear model, $M_1$, and a more complex quadratic model, $M_2$, which adds a curvature term.
Now, the complex model $M_2$ can certainly fit the data perfectly—it just needs to set its extra, quadratic parameter to zero. The simple model $M_1$ can also fit it perfectly by choosing the right slope and intercept. So, based on "goodness of fit" alone, they seem equally good. Why should we prefer $M_1$?
The model evidence provides the answer. Model $M_2$ is more flexible; it could have produced an infinite variety of curved datasets. It "spreads its bets" over all these possibilities. Because its prior probability is distributed over a much larger space of possible functions (all the parabolas), the specific probability density it assigns to the one simple, straight-line function we actually observed is very low. Model $M_1$, on the other hand, was only ever capable of producing straight lines. It made a riskier, more specific prediction. Since its prediction turned out to be correct, the model evidence rewards it. The evidence for the simpler model, $P(D \mid M_1)$, will be higher than the evidence for the more complex model, $P(D \mid M_2)$. The complex model is penalized for its unnecessary complexity. This is the Bayesian Occam's razor in action.
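A minimal sketch of this thought experiment, assuming Bayesian linear regression with zero-mean Gaussian priors on the weights; the particular slope, intercept, noise level, and prior variance are illustrative choices, not values from the text.

```python
import numpy as np
from scipy.stats import multivariate_normal

# Hypothetical data lying exactly on a straight line.
x = np.linspace(-1.0, 1.0, 20)
y = 1.0 + 2.0 * x

sigma = 0.1          # assumed observation-noise standard deviation
prior_var = 1.0      # assumed prior variance on each weight

def log_evidence(design):
    """Closed-form log marginal likelihood for Bayesian linear regression
    with a zero-mean Gaussian prior on the weights."""
    # Marginal covariance of y: noise plus prior variance pushed through the design matrix.
    cov = sigma**2 * np.eye(len(y)) + prior_var * design @ design.T
    return multivariate_normal(mean=np.zeros(len(y)), cov=cov).logpdf(y)

phi_linear = np.column_stack([np.ones_like(x), x])           # M1: intercept + slope
phi_quadratic = np.column_stack([np.ones_like(x), x, x**2])  # M2: adds curvature

print("log evidence, linear    :", log_evidence(phi_linear))
print("log evidence, quadratic :", log_evidence(phi_quadratic))
```

Both models can reproduce the line exactly, but the quadratic model's log evidence should come out a few nats lower: it spread prior probability over curves the data never required.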
This penalty isn't some mystical force; it appears explicitly in the mathematics. Let's look at a modern example from machine learning: Gaussian Process regression, a powerful technique for modeling complex functions, like the potential energy surface of a molecule.
When we use this method, we find that the log of the model evidence can be broken down into two main parts (plus a constant that is the same for every model):

$$\log P(\mathbf{y} \mid X, M) \;=\; -\tfrac{1}{2}\,\mathbf{y}^{\top} K^{-1} \mathbf{y} \;-\; \tfrac{1}{2}\log \lvert K \rvert \;-\; \tfrac{n}{2}\log 2\pi,$$

where $\mathbf{y}$ collects the $n$ observed values at the training inputs $X$, and $K$ is the model's covariance (kernel) matrix evaluated at those inputs.
Let's not worry about the scary symbols. The first term is the data-fit term. It gets better (less negative) the more closely the model's curve passes through the data points $\mathbf{y}$. This is the part that rewards a good fit.
The second term, however, is the complexity penalty. The covariance matrix $K$ describes the flexibility of the model. A very "wiggly," complex model that can bend and twist to fit anything will have a large determinant, $\lvert K \rvert$. This makes the complexity penalty a large negative number, which drags down the total evidence. In contrast, a simpler, "smoother" model has a smaller determinant, incurring a smaller penalty.
The best model is the one that finds the sweet spot, maximizing the sum of these two terms. It must be complex enough to fit the data well, but no more complex than necessary. Incredibly, this trade-off isn't something we impose. It arises naturally from the mathematics of probability—specifically, from the normalization constant of the underlying Gaussian distribution. The penalty for complexity is a fundamental feature of probabilistic inference. This principle extends to other domains, too. In linear regression, for example, the framework automatically penalizes models that include redundant or highly correlated predictor variables, as these add complexity without adding much explanatory power.
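Here is a minimal sketch of that decomposition, not tied to the molecular example in the text; the squared-exponential kernel, the sine-curve data, and the noise level are all illustrative assumptions.

```python
import numpy as np

def rbf_kernel(x1, x2, length_scale, signal_var):
    """Squared-exponential covariance between two sets of 1-D inputs."""
    sq_dist = (x1[:, None] - x2[None, :]) ** 2
    return signal_var * np.exp(-0.5 * sq_dist / length_scale**2)

def gp_log_evidence(x, y, length_scale, signal_var=1.0, noise_var=0.01):
    """Log marginal likelihood of a zero-mean GP, split into its pieces."""
    K = rbf_kernel(x, x, length_scale, signal_var) + noise_var * np.eye(len(x))
    L = np.linalg.cholesky(K)                            # K = L L^T
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))  # alpha = K^-1 y
    data_fit = -0.5 * y @ alpha                          # -1/2 y^T K^-1 y
    complexity = -np.sum(np.log(np.diag(L)))             # -1/2 log|K|
    constant = -0.5 * len(x) * np.log(2 * np.pi)
    return data_fit, complexity, data_fit + complexity + constant

# Hypothetical smooth data; compare a wiggly (short length-scale) model
# against a smoother (long length-scale) one.
x = np.linspace(0.0, 5.0, 30)
y = np.sin(x)
for ell in (0.1, 1.0):
    fit, penalty, total = gp_log_evidence(x, y, length_scale=ell)
    print(f"length scale {ell}: data fit {fit:.1f}, "
          f"determinant term {penalty:.1f}, log evidence {total:.1f}")
```

Printing the pieces shows how the data-fit and determinant terms trade off; on this smooth data the longer length scale should attain the higher total.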
The Bayesian framework is powerful, but it is not magic. The results, including the model evidence, depend on the prior probabilities we assign to the parameters. This isn't a weakness; it's a feature that forces us to be explicit about our assumptions.
Consider an experiment to test the laws of physics at the nanoscale. We bend a tiny beam and measure its deflection. Does it obey classical mechanics ($M_1$), or does it require a more complex theory with a new "length scale" parameter, $\ell$ ($M_2$)?
Wide Priors: If we have no idea what the value of $\ell$ might be, we might set a very wide prior, say one allowing it to range over many orders of magnitude in nanometers. This gives the model great flexibility, but at a cost: it incurs a large Occam penalty. The data will have to provide very strong support for a specific value of $\ell$ to overcome this penalty and favor the complex model.
Improper Priors: What if we try to be completely "uninformative" and set a uniform prior for $\ell$ over an infinite range, $p(\ell) \propto \text{const}$ on $(0, \infty)$? This is an improper prior because it cannot be normalized to integrate to one. If we try to compute the model evidence, we find the integral doesn't converge. The Bayes factor becomes arbitrary and meaningless. This teaches us a crucial lesson: for model comparison, our priors must be proper, representing a coherent state of belief.
Vague Priors: Simply making all priors very diffuse or "noninformative" is not a solution. Assigning a very wide prior to a parameter common to all models (like Young's modulus in the nanobeam example) penalizes all the models for making vague predictions. This lowers their evidence and can make it harder to distinguish between them. A prior is a scientific statement, and it should reflect genuine scientific uncertainty, not a desire to abdicate responsibility.
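A small numerical sketch of the wide-prior effect, assuming a toy one-measurement Gaussian setup rather than the nanobeam experiment itself; every number below is an illustrative assumption. As the prior range on the extra parameter grows, the evidence for the more complex model falls even though its best fit never gets worse.

```python
from scipy import stats

# Toy setup: one measurement d with Gaussian noise of scale 1 (all numbers
# are made up). The simple model predicts d = 0 exactly; the complex model
# predicts d = ell, with ell given a uniform prior on [0, width].
d, noise = 0.3, 1.0
evidence_simple = stats.norm.pdf(d, loc=0.0, scale=noise)

for width in (1.0, 10.0, 100.0, 1000.0):
    # Evidence for the complex model: the likelihood averaged over the prior.
    # For a uniform prior this integral has a closed form via the normal CDF.
    evidence_complex = (stats.norm.cdf(d / noise)
                        - stats.norm.cdf((d - width) / noise)) / width
    print(f"prior width {width:6.0f}: Bayes factor (simple / complex) = "
          f"{evidence_simple / evidence_complex:6.1f}")
```

The complex model's best-fitting value of the extra parameter never changes, yet its evidence keeps shrinking as the prior widens, exactly the Occam penalty described above.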
Let's close with a fascinating, almost philosophical, puzzle. Bayesian updating uses evidence to change our beliefs. But what if the evidence is "old"? What if it's something we've known for decades? How can a known fact provide evidence for a new theory?
Consider the question of whale evolution. We are all taught in school that whales are mammals. The probability of this fact, let's call it $E$, is essentially 1 for any educated person. Now, suppose a scientist develops a new phylogenetic model, $M_1$, that nests whales firmly within the mammals, and wants to compare it to a rival model, $M_2$, that places them elsewhere. Can they use the "old evidence" $E$ to test their model?
It seems paradoxical. If we already know $E$ is true, how can it change our beliefs? The key is to realize that the power of evidence lies not in its novelty, but in its ability to discriminate between hypotheses. The correct question to ask is: "If I didn't know that whales were mammals, how much more likely would this fact be under Model 1 than under Model 2?"
Suppose model $M_1$ predicts the mammalian features of whales with high probability ($P(E \mid M_1)$ close to 1), while model $M_2$ makes these features a bizarre coincidence of convergent evolution, predicting them with low probability ($P(E \mid M_2)$ close to 0). The Bayes Factor $P(E \mid M_1)/P(E \mid M_2)$ is then very large. The evidence strongly favors $M_1$, regardless of whether we learned it today or in the third grade. The Bayesian calculation is valid as long as our prior beliefs in the models, $P(M_1)$ and $P(M_2)$, were set hypothetically, without taking $E$ into account.
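As a purely illustrative calculation, with invented probabilities rather than numbers from any real phylogenetic analysis, suppose $P(E \mid M_1) = 0.9$ and $P(E \mid M_2) = 0.01$. Then

$$B_{12} \;=\; \frac{P(E \mid M_1)}{P(E \mid M_2)} \;=\; \frac{0.9}{0.01} \;=\; 90,$$

so merely recalling the old fact $E$ multiplies the odds in favor of $M_1$ by a factor of 90, exactly as if it had been discovered yesterday.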
This reveals the profound nature of evidence. Evidence is not just about surprise. It is about logical force. A simple, well-known fact can be extraordinarily powerful evidence if it is a natural consequence of one theory and a deep puzzle for another. The model evidence framework captures this logic perfectly, allowing us to weigh hypotheses not just by how well they fit the data, but by the coherence and parsimony of their entire explanatory structure.
Now that we have explored the machinery of model evidence, this remarkable tool that enforces Occam’s razor, we can ask the most important question of all: What is it good for? The answer, it turns out, is wonderfully broad. The principle of balancing fit against complexity is not a niche rule for statisticians; it is a universal acid that cuts across nearly every field of quantitative science. It is the scientist’s compass for navigating the vast sea of possible explanations for the world we observe. Let's go on a tour and see it in action, from the scale of the cosmos to the intricate dance of molecules within a single cell.
Cosmology is a field of grand theories. We have a fantastically successful "standard model" of the universe, called the ΛCDM model, which explains a vast array of observations with just a handful of parameters. But scientists are restless. Is this the final story? Could there be new, undiscovered physics hiding in the subtle patterns of starlight? One way to search is to propose a more complex model—for example, one where the nature of dark energy, described by a parameter $w$ (its equation of state), is not fixed but is allowed to vary.
This is a classic showdown. The more complex model, with its extra parameter, will almost certainly fit the data from supernovae and the cosmic microwave background a little bit better. The question is, is the improvement worth it? This is where model evidence steps in. The Bayes factor weighs the slightly better fit of the complex model against the "Occam penalty" for adding a new parameter we have to measure. In many real-world cosmological analyses, the simpler ΛCDM model holds its ground. The data, so far, tell us that the small improvement in fit offered by more complex models is not yet convincing enough to justify the added theoretical baggage. The universe, it seems, prefers to be simple. Model evidence gives us a principled way to make that judgment, preventing us from chasing phantoms in the noise.
The same principle that helps us weigh entire universes also helps us perfect the instruments we use to observe them. The incredible interferometers of LIGO and Virgo, which listen for the faint chirps of gravitational waves from colliding black holes, are plagued by internal noise. A strange bump in the instrument's power spectrum could be a sign of a real astrophysical event, or it could be instrumental noise, perhaps from the thermal vibration of a mirror. Sometimes, the data might be ambiguous: is a feature in the noise a single, broad peak, or two smaller, overlapping peaks? A model with two peaks has more parameters and will fit the noisy data better. But is it really two peaks? By calculating the Bayesian evidence for the single-peak versus the two-peak model, physicists can make a rational decision. This isn't just academic; correctly modeling the noise is absolutely critical to being able to subtract it and reveal the whisper of a gravitational wave hidden beneath.
The story of life is a story of change. But how does it change? Is evolution a meandering, random walk, where traits drift aimlessly over millennia? Or is it a process of optimization, where natural selection constantly pulls organisms toward an ideal form, like a marble rolling to the bottom of a bowl? These two hypotheses can be formalized as mathematical models: the first is called Brownian Motion (BM), and the second, the Ornstein-Uhlenbeck (OU) model. The OU model is more complex; it has an extra parameter representing the "pull" of selection.
When we analyze the evolution of a trait—say, the body size of mammals—across a phylogenetic tree, we can fit both models to the data. The OU model will often fit better, but the AIC or Bayes factor tells us if that better fit is substantial enough to overcome the penalty for its added complexity. In many cases, the evidence overwhelmingly favors the OU model, giving us a quantitative confirmation that natural selection, not just random drift, is a dominant force shaping the diversity of life.
This tool becomes even more powerful when we confront puzzles in the data. Imagine sequencing the DNA of three related species of songbirds. You might find that their mitochondrial DNA (a small, separate part of the genome) tells one story about who is most closely related to whom, while the main nuclear DNA tells a conflicting story. Which is correct? Perhaps neither is the whole story. We can construct competing demographic models: one that matches the mitochondrial tree and another that matches the nuclear tree. By calculating the model evidence for each, we can determine which scenario provides a more coherent explanation for the entire genomic dataset. The results can be striking, with the Bayes factor providing strong, or even very strong, evidence for one history over the other. This allows us to resolve long-standing taxonomic puzzles and even estimate the amount of historical gene flow between the populations, helping us decide if they are truly separate species or just distinct populations of the same species.
The reach of model evidence extends from the grand tapestry of evolution down to the urgent, practical world of epidemiology. During a viral outbreak, a critical question is whether the disease was introduced into a community once, followed by local spread, or if it is being repeatedly introduced from an outside source. These two hypotheses have vastly different implications for public health interventions. We can model these scenarios phylogenetically, treating "local" versus "global" as a trait that evolves on the viral family tree. A simple model allows only for a single global-to-local transition, while a more complex model allows for transitions in both directions (representing multiple introductions). By comparing the AIC scores of these models, public health officials can gain crucial, data-driven insight into the nature of the outbreak and deploy resources more effectively.
Even within a single cell, model selection guides our understanding. When studying a protein, a biochemist might want to know how quickly it degrades. A simple model would be a single exponential decay. But what if the protein exists in two different states or cellular compartments, each with its own decay rate? This would require a more complex, two-phase decay model. A simple visual inspection of the data might be misleading. But by using a tool like AIC, which penalizes the four-parameter two-phase model relative to the two-parameter simple model, a clear winner can emerge. Sometimes, the data will overwhelmingly support the more complex model, revealing hidden biological complexity that would have otherwise been missed.
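A hedged sketch of that comparison in Python, using synthetic (made-up) decay data and the standard Gaussian-error AIC formula; the rate constants and noise level are arbitrary illustrative choices.

```python
import numpy as np
from scipy.optimize import curve_fit

rng = np.random.default_rng(0)

# Synthetic protein-abundance data: a genuine two-phase decay plus noise
# (all constants are invented for illustration).
t = np.linspace(0.0, 10.0, 60)
truth = 0.7 * np.exp(-1.5 * t) + 0.3 * np.exp(-0.2 * t)
y = truth + rng.normal(scale=0.02, size=t.size)

def one_phase(t, a, k):
    return a * np.exp(-k * t)

def two_phase(t, a1, k1, a2, k2):
    return a1 * np.exp(-k1 * t) + a2 * np.exp(-k2 * t)

def aic(model, p0):
    params, _ = curve_fit(model, t, y, p0=p0, maxfev=10000)
    resid = y - model(t, *params)
    n, k = len(y), len(params)
    # Gaussian-error AIC: n*log(RSS/n) + 2k (constants common to both models dropped).
    return n * np.log(np.sum(resid**2) / n) + 2 * k

print("AIC, one-phase decay :", aic(one_phase, p0=[1.0, 0.5]))
print("AIC, two-phase decay :", aic(two_phase, p0=[0.5, 1.0, 0.5, 0.1]))
```

Lower AIC wins; on this synthetic two-phase data the four-parameter model should earn its extra parameters, mirroring the situation described above.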
The principles we've seen at work in the natural world are just as vital in the artificial worlds we create inside our computers. In quantum chemistry, simulating the behavior of a heavy atom with all its electrons is computationally prohibitive. Scientists therefore use clever approximations called Effective Core Potentials (ECPs) that replace the inner-shell electrons with a mathematical function. But what is the best mathematical form for this function? Should it just depend on the nuclear charge, $Z$? Or should it also include terms for angular momentum, $l$, or even relativistic effects?
Each of these ideas can be formulated as a different statistical model. We can generate a dataset of errors from these approximate models compared to more exact (but expensive) calculations. Then, we can compute the Bayesian evidence for each model form. This allows us to quantitatively discover which physical effects are most important to include in our approximations, leading to the development of more accurate and efficient tools for simulating the molecular world.
Perhaps the most surprising connection is to the field of artificial intelligence. When training a large neural network, a common problem is "overfitting." The network becomes so powerful that it doesn't just learn the general pattern in the data; it memorizes the specific noise and quirks of the training set. A practical trick to avoid this is called "early stopping": you monitor the network's performance on a separate validation dataset and stop the training process when that performance starts to get worse, even if the fit to the training data is still improving.
This seems like a sensible heuristic, but it has a deep and beautiful justification in Bayesian model evidence. Think of the training process. At the beginning, the network's parameters are small, and the fit to the data is poor. As training progresses, the data fit improves dramatically, and the model evidence increases. However, after a certain point, the network starts to fine-tune its parameters to capture the noise in the training data. This makes the posterior distribution of the parameters incredibly sharp and pushes the parameters to large values that are considered "unlikely" by the prior (which prefers smaller, simpler solutions). The Laplace approximation to the model evidence shows us that both this sharpening of the posterior and the departure from the prior's preferred zone create a penalty. Eventually, this penalty outweighs the tiny gains in data fit. The model evidence peaks and then begins to fall. The optimal place to stop training, from a Bayesian perspective, is right at that peak of evidence. Early stopping is not just a hack; it's an unconscious application of Occam's razor.
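For readers who want the formula behind this argument, the Laplace (Gaussian) approximation being referred to has the standard form below; the notation is ours, with $\hat{\theta}$ the posterior mode, $d$ the number of parameters, and $\mathbf{A}$ the Hessian of the negative log posterior evaluated at $\hat{\theta}$:

$$\log P(D \mid M) \;\approx\; \underbrace{\log P(D \mid \hat{\theta}, M)}_{\text{data fit}} \;+\; \underbrace{\log P(\hat{\theta} \mid M) + \tfrac{d}{2}\log 2\pi - \tfrac{1}{2}\log \lvert \mathbf{A} \rvert}_{\text{Occam factor}}.$$

The first term is the fit to the training data. The remaining terms shrink as the posterior sharpens (larger $\lvert \mathbf{A} \rvert$) and as the fitted parameters drift toward values the prior considers implausible (smaller $P(\hat{\theta} \mid M)$), which is exactly the penalty described above; the approximate evidence therefore peaks and then falls as training continues.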
Throughout this journey, we have talked about using evidence to choose the best model. But what if the evidence is equivocal? What if one model is only slightly better than another? In a high-stakes situation, like identifying a bacterial pathogen in a hospital, declaring a single "winner" might be throwing away valuable information.
This leads us to the final, and perhaps most profound, application: Bayesian Model Averaging. The posterior probabilities we calculate for each model, $P(M_i \mid D)$, are not just for ranking. They represent our updated belief about the plausibility of each hypothesis. If Model A has a clearly higher posterior probability than Model B, but Model B's is far from negligible, the most rational approach is not to discard Model B. Instead, we should make predictions by combining the predictions of both models, weighted by their respective probabilities. This is justified by the fundamental laws of probability and decision theory. Any predictive strategy based on a single model, when other models have non-trivial support, will be less accurate on average than the Bayesian model average.
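A minimal sketch of model averaging, with made-up posterior probabilities and simple Gaussian predictive distributions standing in for whichever models are being compared.

```python
from scipy import stats

# Hypothetical posterior model probabilities (they must sum to one) and each
# model's predictive distribution for some future observation.
post_prob = {"A": 0.7, "B": 0.3}
predictive = {"A": stats.norm(loc=2.0, scale=0.5),
              "B": stats.norm(loc=3.5, scale=0.8)}

def bma_pdf(x):
    """Model-averaged predictive density: a mixture weighted by P(M | D)."""
    return sum(post_prob[m] * predictive[m].pdf(x) for m in post_prob)

# The model-averaged point prediction is the probability-weighted mean.
bma_mean = sum(post_prob[m] * predictive[m].mean() for m in post_prob)
print(f"BMA point prediction : {bma_mean:.2f}")
print(f"BMA density at x = 3 : {bma_pdf(3.0):.3f}")
```

Rather than betting everything on Model A, the averaged prediction keeps Model B's contribution alive in proportion to its remaining plausibility.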
This is the ultimate expression of scientific humility. It is an acknowledgment that our knowledge is incomplete. By using model evidence not just to select, but to weigh and combine, we embrace our uncertainty and turn it into a more robust and honest understanding of the world. From the grandest theories of the cosmos to the most practical decisions in medicine and machine learning, the principle of model evidence provides a unified, rational framework for learning from data. It teaches us not only how to find a good story, but how to weigh all the plausible stories to navigate the complexities of our universe.