
Likelihood vs. Probability: The Machinery of Scientific Inference

Key Takeaways
  • Probability reasons forward from a known model to predict data, while likelihood reasons backward from observed data to evaluate the plausibility of model parameters.
  • The principle of maximum likelihood identifies the parameter values that best explain the observed data, forming a cornerstone of statistical inference.
  • In Bayesian inference, likelihood is the crucial evidence that updates prior beliefs about a parameter into a more informed posterior probability.
  • Likelihood-based methods enable the comparison of competing scientific models, naturally incorporating Occam's Razor to penalize excessive complexity.

Introduction

In the quest to understand the universe, scientists constantly move between theory and observation. We build models to describe reality, and we collect data to test, refine, and choose between them. At the core of this inferential process lie two foundational concepts: probability and likelihood. Though often used interchangeably, their distinction is critical for rigorous scientific analysis, representing the difference between prediction and inference. Misunderstanding them can lead to flawed conclusions, while mastering them unlocks a powerful, unified framework for turning data into knowledge.

This article demystifies the relationship between probability and likelihood, illustrating why one is not simply the inverse of the other. It clarifies the common confusion and reveals how these concepts work together to form the basis of modern statistical reasoning.

Our exploration will unfold across two main sections. First, in Principles and Mechanisms, we will dissect the fundamental concepts, using a simple coin-toss example to illustrate the "forward" reasoning of probability and the "inverse" reasoning of likelihood. We'll delve into the powerful principle of Maximum Likelihood Estimation and see how likelihood serves as the engine of evidence in Bayesian inference. Following this, the Applications and Interdisciplinary Connections section will showcase these principles at work, revealing how likelihood is used to reconstruct evolutionary histories, classify molecular events, characterize physical materials, and even account for uncertainty in complex biological systems. We begin by untangling the core principles that make all of this possible.

Principles and Mechanisms

Imagine you find a strange coin on the street. You flip it ten times and get seven heads. A question immediately springs to mind: is this coin fair? This simple question plunges us headfirst into the heart of scientific inference, and forces us to untangle two of the most fundamental, yet often confused, concepts in science: probability and likelihood.

Probability vs. Likelihood: A Tale of Two Directions

Let's start with a clear distinction. Probability is a "forward" process. It reasons from cause to effect. If we assume the coin is fair (the "cause," or model, is that the probability of heads, $p$, is $0.5$), we can calculate the probability of observing seven heads in ten flips (the "effect," or data). This is a standard textbook calculation. We know the model, and we are predicting the data.

Likelihood, on the other hand, is an "inverse" process. It reasons from effect back to cause. We have the data (seven heads in ten flips) and we want to make an inference about the coin's fairness, the unknown parameter $p$. We ask: "For a given value of $p$, what was the probability of getting the data we actually observed?" This quantity, viewed as a function of the parameter $p$, is the likelihood function, denoted $L(p \mid \text{data})$.

So, for our coin, the probability of getting 7 heads and 3 tails for a given $p$ is given by the binomial probability formula:

$$P(\text{data} \mid p) = \binom{10}{7} p^7 (1-p)^3$$

When we think of this as the likelihood function, $L(p \mid \text{data})$, the data is fixed (we already got 7 heads) and the parameter $p$ is the variable. We can plot this function for all possible values of $p$ from $0$ to $1$. We would find that the curve has a peak. The value of $p$ at this peak is the one that makes our observed data "most probable" or, more accurately, most likely. For seven heads in ten flips, this peak occurs at $p = 0.7$. This is the Maximum Likelihood Estimate (MLE). It's our best guess for the coin's true nature, based purely on the evidence.
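The peak can be found numerically as well as by calculus. A minimal sketch in plain Python, evaluating the binomial likelihood on a grid of candidate values of $p$ and picking the maximum:

```python
from math import comb

def likelihood(p, heads=7, flips=10):
    """Binomial likelihood L(p | data): the probability of the observed
    data (7 heads in 10 flips) viewed as a function of the parameter p."""
    return comb(flips, heads) * p**heads * (1 - p)**(flips - heads)

# Evaluate L(p) on a fine grid of candidate values and locate the peak.
grid = [i / 1000 for i in range(1001)]
mle = max(grid, key=likelihood)

print(mle)  # the peak of the likelihood curve: 0.7
```

A grid search is crude but transparent; for this one-parameter problem it finds the same answer the derivative would.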

It's crucial to understand that the likelihood function is not a probability distribution for the parameter $p$. The area under the likelihood curve doesn't have to equal one. It's a different kind of beast altogether: a measure of how well different parameter values explain the data we have in hand.
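We can check this claim directly. A rough numerical integration of the coin's likelihood curve over $[0, 1]$ gives an area of about $1/11$, nowhere near 1:

```python
from math import comb

def likelihood(p, heads=7, flips=10):
    """Binomial likelihood of 7 heads in 10 flips for a given p."""
    return comb(flips, heads) * p**heads * (1 - p)**(flips - heads)

# Riemann-sum approximation of the area under L(p) on [0, 1].
n = 100_000
area = sum(likelihood(i / n) for i in range(n)) / n

print(round(area, 4))  # about 0.0909 (= 1/11), not 1
```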

The Art of Inference: Finding the Best Explanation

This principle of maximum likelihood is the engine of modern statistical inference. Scientists are in the business of observing the universe (collecting data) and trying to infer the underlying laws and parameters that govern it.

Consider a deep-space probe sending a stream of ones and zeros back to Earth. The communication channel is noisy, meaning bits can be flipped. This "crossover probability," call it $p$, is unknown. We send a known test pattern and observe what arrives. Suppose we sent 10 bits and 3 of them were flipped. The likelihood of this observation is $L(p \mid \text{3 errors in 10 bits}) = p^3 (1-p)^7$. To estimate $p$, we find the value that maximizes this function, which turns out to be $p = 3/10 = 0.3$. This is the MLE.

But what if we have some prior knowledge? Perhaps engineers know from past missions that such channels rarely have $p$ greater than $0.2$. Bayesian inference provides a beautiful way to combine our prior knowledge with the evidence from our new data. Bayes' theorem states:

$$P(p \mid \text{data}) \propto P(\text{data} \mid p) \times P(p)$$

In words, the posterior probability of the parameter is proportional to the likelihood of the data times the prior probability of the parameter. The likelihood is still the heart of the matter; it's the component that updates our beliefs in light of new evidence. In the space probe problem, if we have prior reasons to believe that $p = 0.2$ is more probable than $p = 0.3$, the MAP (Maximum A Posteriori) estimate might end up being $0.2$, even if the new data alone points to $0.3$.
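Here is a minimal sketch of that comparison for the noisy channel. The two candidate values of $p$ and their prior weights are illustrative assumptions; the point is only that multiplying likelihood by prior can pull the answer away from the bare MLE:

```python
def likelihood(p, errors=3, bits=10):
    """Likelihood of observing 3 flipped bits out of 10 for crossover p."""
    return p**errors * (1 - p)**(bits - errors)

# Hypothetical prior from past missions: small p is much more plausible.
prior = {0.2: 0.8, 0.3: 0.2}

posterior_unnorm = {p: likelihood(p) * w for p, w in prior.items()}
map_estimate = max(posterior_unnorm, key=posterior_unnorm.get)
mle = max(prior, key=likelihood)

print(mle)           # the data alone favors 0.3
print(map_estimate)  # the prior pulls the answer to 0.2
```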

This same logic applies whether we're counting flipped bits, photons in a laser cavity, or something far more complex. The core idea remains: we write down the likelihood of our observations as a function of the unknown parameters of our model, and we seek the parameter values that make our data least surprising.

Likelihood in Action: From Genes to Genomes

The power of likelihood truly shines when we move to more complex, real-world problems. Imagine trying to map the genes in a mouse brain using a technique called spatial transcriptomics. The location of each genetic readout is encoded by a short DNA "barcode." But the sequencing machine that reads these barcodes is imperfect; it makes mistakes. Suppose we read a barcode as ACCTGA, but we know the true barcode must be one of two possibilities from a list: ACGTGA or ACCTAA. Which one is it?

We can use likelihood to make the best decision. For each candidate, we calculate the probability of observing ACCTGA if that candidate were the true one. This calculation uses a model of the sequencing errors, often summarized by quality scores for each letter. For ACGTGA, the observed sequence has one mismatch (C vs G). For ACCTAA, it also has one mismatch (G vs A). A naive approach might call it a tie. But the likelihood method is more subtle. It asks: how probable was each specific mismatch? If the 'C' was read with very low confidence (high error probability) while the 'G' was read with very high confidence (low error probability), the likelihood will favor the hypothesis that assumes the 'C' was the error. By calculating the likelihood of the observed sequence under both hypotheses, we can make a principled choice, and the ratio of the likelihoods tells us exactly how much stronger the evidence is for one over the other.
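A toy version of this calculation, assuming a simple per-base substitution error model. The per-base error probabilities below stand in for real quality scores and are invented for illustration:

```python
def read_likelihood(observed, candidate, err):
    """P(observed read | candidate is the true barcode), under a simple
    per-base error model: with probability err[i] the base at position i
    is misread as one of the three other bases, uniformly at random."""
    L = 1.0
    for obs, true, e in zip(observed, candidate, err):
        L *= (1 - e) if obs == true else e / 3
    return L

observed = "ACCTGA"
# Hypothetical error probabilities: the 'C' at position 2 was read with
# low confidence, the 'G' at position 4 with high confidence.
err = [0.001, 0.001, 0.2, 0.001, 0.001, 0.001]

L1 = read_likelihood(observed, "ACGTGA", err)  # assumes the 'C' was the error
L2 = read_likelihood(observed, "ACCTAA", err)  # assumes the 'G' was the error

print(L1 > L2)  # the likelihood ratio (about 250 here) favors ACGTGA
```

Even though both candidates are one mismatch away, the quality scores break the tie decisively.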

This same principle allows geneticists to estimate the rate of recombination between genes on a chromosome or to build complex models of how a hybrid genome is a mosaic of its two parents. In these cases, the "parameters" might be the recombination fraction $r$, the proportions of the genome from each parent, or other quantities. The likelihood becomes a multidimensional surface over all these parameters. The goal of the analysis is to find the highest peak on this complex landscape: the set of parameters that best explains the genetic data we've collected.

The Ghost in the Machine: "All Models Are Wrong, But Some Are Useful"

Here we must face a profound and humbling truth about science. The likelihood calculation, $P(\text{data} \mid \text{model, parameters})$, is always conditional on the model we assume. It doesn't give us access to absolute reality; it only tells us what is most likely within the confines of our chosen worldview. If our model is a poor reflection of reality, our likelihood-based inferences can be misleading.

A classic example comes from reconstructing evolutionary history. Imagine we have a phylogenetic tree and want to infer the traits of an ancient ancestor. A simple method, parsimony, might suggest the ancestor that requires the fewest evolutionary changes. A likelihood-based method, however, might come to a different conclusion. Why? Because the likelihood model is more sophisticated. It considers not just the number of changes, but their probability. A change happening on a very long branch (representing a long time) is more probable than a change on a very short branch. Thus, a scenario with two changes on two long branches could be more likely than a scenario with one change on a very short branch. The inference depends entirely on the model of evolution you use.
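A stripped-down numerical illustration of this point, using a two-state symmetric substitution model and arbitrary branch lengths (real phylogenetic software uses richer models such as JC69 or GTR, but the logic is the same):

```python
from math import exp

def p_change(t, rate=1.0):
    """Two-state symmetric model: probability the trait differs at the
    two ends of a branch of length t (saturates at 0.5 for long branches)."""
    return 0.5 * (1 - exp(-2 * rate * t))

def p_same(t, rate=1.0):
    return 1 - p_change(t, rate)

long_t, short_t = 2.0, 0.05  # illustrative branch lengths

# Scenario A: two changes, both on long branches (short branch unchanged).
L_A = p_change(long_t) ** 2 * p_same(short_t)
# Scenario B: one change, on the short branch (long branches unchanged).
L_B = p_same(long_t) ** 2 * p_change(short_t)

print(L_A > L_B)  # two changes on long branches beat one on a short branch
```

Parsimony counts changes (one beats two); likelihood weighs them by how probable each change actually was.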

This dependence on the model can be perilous. Suppose we are studying evolution and our model assumes that all types of DNA mutations are equally likely. In reality, some mutations (transitions) are much more common than others (transversions). If we analyze data where the evidence for a particular evolutionary tree comes from many transitions, our misspecified model will dramatically underweight this evidence, because it thinks these changes are rare and thus improbable. We might confidently draw the wrong conclusion, not because our math was wrong, but because our starting assumption—our model—was flawed. Similarly, incorrectly modeling gaps in a DNA sequence alignment can create powerful, but completely artificial, evidence that groups unrelated species together, simply because the flawed model interprets their shared gaps as a sign of close kinship.

The Ultimate Showdown: Comparing Models Themselves

This leads to the final, crucial question: if our conclusions depend so heavily on our model, how do we choose the right model? Can likelihood help us here, too?

The answer is yes, through a concept called the marginal likelihood, or Bayesian evidence. Let's say we are comparing a simple evolutionary model (like Jukes-Cantor, JC69) with a more complex one (like General Time Reversible, GTR). The GTR model has more parameters, so it's more flexible. Because of this flexibility, it can almost always find some specific set of parameters that fits the data better, achieving a higher maximum likelihood than the simpler JC69 model. This is like fitting a wiggly, high-degree polynomial to a few data points—it might hit them all perfectly, but we don't trust it to be a good general description. This is overfitting.

The marginal likelihood, $P(\text{data} \mid \text{model})$, avoids this trap. It calculates the average likelihood across all possible parameter values for a model, weighted by their prior plausibility. It asks: "On the whole, how well does this model predict the data, without cherry-picking the absolute best-fitting parameters?"

A simple model that does a decent job across its small parameter space might end up with a higher marginal likelihood than a complex model that only does a good job in one tiny, fine-tuned corner of its vast parameter space. The marginal likelihood naturally embodies Occam's Razor: it balances goodness-of-fit against model complexity. The model that provides the most robust explanation for the data—the one that is both powerful and parsimonious—wins the day.
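The coin example makes this concrete. Below, a zero-parameter model ($p$ fixed at $0.5$) is compared with a fully flexible model ($p$ free, with a uniform prior) on illustrative data of 6 heads in 10 flips. The flexible model achieves the higher best-case fit, yet the simple model wins on marginal likelihood:

```python
from math import comb

heads, flips = 6, 10  # data plausibly consistent with a fair coin

def lik(p):
    return comb(flips, heads) * p**heads * (1 - p)**(flips - heads)

# Simple model: p fixed at 0.5, no free parameters.
evidence_simple = lik(0.5)

# Complex model: p free, uniform prior on [0, 1]. Its marginal likelihood
# is the average of lik(p) over the prior (approximated on a grid).
n = 100_000
evidence_complex = sum(lik(i / n) for i in range(n + 1)) / (n + 1)

# The complex model's best-case (maximum-likelihood) fit, for comparison.
best_fit_complex = max(lik(i / 1000) for i in range(1001))

print(best_fit_complex > evidence_simple)   # flexible model fits best...
print(evidence_simple > evidence_complex)   # ...but simple model wins on evidence
```

Averaging over the whole parameter space, rather than cherry-picking its best corner, is what penalizes the extra flexibility.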

From a simple coin toss to comparing grand models of evolution, the concept of likelihood provides a single, unifying framework for scientific reasoning. It is the mathematical tool that lets us listen to the story told by our data, weigh the evidence for competing hypotheses, and build an ever-more-refined understanding of the world and its intricate mechanisms. It is the machinery of discovery.

Applications and Interdisciplinary Connections

Now that we've tinkered with the abstract machinery of probability and likelihood, let's take it out for a spin in the real world. You might be surprised where it shows up. We have seen that probability and likelihood are two sides of the same coin: probability looks forward from a known model to predict unknown data, while likelihood looks backward from known data to assess unknown models. This act of "looking backward" is the very heart of scientific discovery.

It is the ghost in the machine that lets us turn noisy data into knowledge. From decoding the secrets of our own DNA to peering into the heart of a semiconductor, the principle is the same: let the data vote on which story is the most plausible. We are about to see how this one powerful idea becomes a universal key, unlocking secrets in fields that seem worlds apart.

The Great Detective: Reconstructing History from Its Echoes

Much of science is a form of history, an attempt to reconstruct what happened in the past from the clues left behind in the present. Likelihood is the detective's magnifying glass.

Imagine an archaeologist who unearths an artifact with an ambiguous symbol. From prior knowledge of the culture, they believe it could mean 'house' or 'river'. Then, a dating test places the artifact in a specific historical period. We know from vast collections that artifacts from this period are, say, ten times more likely to bear a 'house' symbol than a 'river' symbol. This ratio is a statement about likelihoods, $P(\text{evidence} \mid \text{hypothesis})$. Bayes' theorem tells us how to use this likelihood to update our initial belief. The dating result doesn't prove the symbol means 'house', but it provides evidence that dramatically increases the posterior probability of that hypothesis. This is the basic engine of inference: evidence, weighted by likelihood, refines belief.
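In odds form, this update is one line of arithmetic. Assuming, for illustration, even prior odds between 'house' and 'river':

```python
# Odds form of Bayes' theorem for the archaeologist's symbol.
prior_odds = 1.0          # house : river, before the dating test (assumed 1:1)
likelihood_ratio = 10.0   # 'house' symbols are 10x as likely in this period

posterior_odds = prior_odds * likelihood_ratio
posterior_prob_house = posterior_odds / (1 + posterior_odds)

print(round(posterior_prob_house, 3))  # 0.909
```

A single piece of evidence with a 10:1 likelihood ratio moves a 50% belief to roughly 91%.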

Now, let's scale this up. The ultimate historical record is written in the language of DNA. The genomes of living organisms are like an ancient library, a collection of stories written over billions of years, but a library where the books have typos, ripped pages, and even paragraphs copied from other volumes. Likelihood is the tool we use to read this garbled history.

Consider an evolutionary puzzle. The family tree of four bacterial species—the "species tree"—tells us that species A and B are close cousins, and so are C and D. But when a biologist sequences a particular gene, its own family tree screams that A is actually cousins with C, and B with D! What gives? Did something funny happen in evolution? There are several competing stories, or models, to explain this discordance:

  1. Incomplete Lineage Sorting (ILS): This is a simple mix-up. Just by chance, the ancestral gene variants didn't sort out in the way the species did.
  2. Horizontal Gene Transfer (HGT): The gene was "stolen" from another lineage. For instance, an ancestor of C might have transferred the gene directly to an ancestor of A.
  3. Duplication and Loss: An ancient gene duplication occurred, creating two copies. Over time, different species lost different copies, creating a misleading family tree for the remaining genes.

How do we decide? We can't rerun history. But we can use likelihood. For each of these three scenarios, we can build a detailed mathematical model and ask: how likely is it that we would observe our conflicting gene data if this scenario were true? Likelihood becomes the judge in a scientific courtroom. It might turn out that the probability of seeing our data under the HGT scenario is vastly higher than under the other two. In that case, we conclude that HGT is the best-supported explanation. This is the power of likelihood-based model selection.

We can go even deeper. We can use likelihood not just to reconstruct the shape of the tree of life, but to understand the very process of evolution that grew it. For instance, did the evolution of wings allow insects to diversify into many more species? We can construct two competing models of evolution: one where the rates of speciation ($\lambda$) and extinction ($\mu$) are the same for all insects, and another, the BiSSE model, where winged insects have their own rates ($\lambda_1, \mu_1$) and wingless insects have theirs ($\lambda_0, \mu_0$). We then calculate the likelihood of the actual, observed phylogenetic tree of insects under both models. We are, in essence, asking the data, "Are you more plausible if we assume that wings are a 'key innovation' that changes the rules of the evolutionary game?"

This framework is incredibly powerful, even when our knowledge is shrouded in uncertainty. Consider the evolution happening inside a cancer tumor. By sequencing individual cells, we get a snapshot of the present, but the past—the evolutionary tree connecting these cells—is unknown. If we want to test the hypothesis that the cells which spread to other parts of the body (metastasis) all came from a single rogue lineage, we face a conundrum: there are thousands of possible family trees! A brute-force approach is hopeless. But the logic of likelihood provides an elegant way out. We can calculate the likelihood of our sequencing data for each possible tree, and then simply sum the posterior probabilities of all the trees where the metastatic cells form a single family, or clade. This gives us the total posterior probability of our hypothesis, elegantly averaging over our uncertainty about the true tree.

A Lens for the Invisible: Characterizing the Here and Now

Likelihood is not only for looking into the deep past; it is also a powerful lens for characterizing the hidden machinery of the present.

Imagine a molecular biologist studying how cells repair broken DNA. A double-strand break is a catastrophic event, and cells have several repair kits they can use. The quick-and-dirty "duct tape" method, called c-NHEJ, is fast but often leaves small scars (insertions or deletions). A more elaborate method, MMEJ, uses tiny stretches of matching sequence (microhomology) to stitch the ends together, often creating larger deletions. A third, high-fidelity pathway, Homologous Recombination (HR), uses an intact copy of the DNA as a template to perform a perfect repair, sometimes leaving behind a tell-tale "templated insertion".

When we sequence a repaired region, we see the scar but not the process that made it. But we can become molecular detectives. Based on the known tendencies of each pathway, we can build a probabilistic model for the kinds of scars each one produces. When we observe a new repair—say, a medium-sized deletion with a long microhomology and no templated insertion—we can calculate its likelihood under the c-NHEJ model, the MMEJ model, and the HR model. The pathway that assigns the highest likelihood to our observation is our prime suspect. This is a classic classification problem, and likelihood is the engine that drives it.
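A toy likelihood classifier for repair scars makes the logic explicit. The per-pathway feature probabilities below are invented for illustration; a real classifier would estimate them from well-characterized repair events:

```python
# Probability that each pathway produces each scar feature (made-up values).
pathway_models = {
    "c-NHEJ": {"medium_del": 0.2, "long_mh": 0.05, "no_templated_ins": 0.9},
    "MMEJ":   {"medium_del": 0.5, "long_mh": 0.7,  "no_templated_ins": 0.95},
    "HR":     {"medium_del": 0.1, "long_mh": 0.1,  "no_templated_ins": 0.3},
}

# The observed scar: medium deletion, long microhomology, no templated insertion.
observed_features = ["medium_del", "long_mh", "no_templated_ins"]

def scar_likelihood(pathway):
    # Features treated as independent given the pathway (a naive-Bayes sketch).
    L = 1.0
    for f in observed_features:
        L *= pathway_models[pathway][f]
    return L

prime_suspect = max(pathway_models, key=scar_likelihood)
print(prime_suspect)  # MMEJ assigns the observed scar the highest likelihood
```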

This principle of using likelihood to infer hidden parameters extends far beyond biology, into the heart of physics and materials science. A fundamental property of a semiconductor is its "band gap" ($E_g$), a quantum-mechanical parameter that determines the colors of light it can absorb. Theory provides us with a model that predicts the material's absorption spectrum as a function of its band gap, $E_g$. The problem is, we don't know $E_g$. What we can do is measure the material's absorbance at various light energies. The principle of maximum likelihood then gives us a clear instruction: the best estimate for the true band gap is the value of $E_g$ that makes the data we actually measured the most probable outcome. We can imagine sliding the value of $E_g$ up and down. For each value, our theoretical curve shifts, and the likelihood of our data points changes. We stop when we find the peak of the likelihood function. Furthermore, we can use this same framework to compare entirely different physical theories. If one theory posits that the absorption follows a power law with exponent $m_1$ and another suggests $m_2$, we can calculate the total marginal likelihood for each theory. This allows the data itself to vote on which physical model provides a more plausible description of reality.
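A sketch of that "slide $E_g$ until the likelihood peaks" procedure on simulated data. The square-root absorption law, the noise level, and the "true" band gap are all illustrative assumptions; with Gaussian measurement noise, maximizing the likelihood reduces to minimizing squared error:

```python
import random

random.seed(0)  # reproducible simulated measurements

def model(E, Eg, c=1.0, m=0.5):
    """Hypothetical absorption law: A(E) = c * (E - Eg)^m above the gap,
    zero below it."""
    return c * max(0.0, E - Eg) ** m

# Simulate noisy absorbance data from a "true" band gap of 1.5 (arbitrary units).
true_Eg, sigma = 1.5, 0.02
energies = [1.0 + 0.05 * i for i in range(41)]  # photon energies 1.0 .. 3.0
data = [model(E, true_Eg) + random.gauss(0, sigma) for E in energies]

def sum_sq_error(Eg):
    # Gaussian noise: maximizing likelihood = minimizing squared residuals.
    return sum((a - model(E, Eg)) ** 2 for E, a in zip(energies, data))

grid = [1.0 + 0.001 * i for i in range(1001)]  # candidate band gaps 1.0 .. 2.0
Eg_hat = min(grid, key=sum_sq_error)

print(abs(Eg_hat - true_Eg) < 0.1)  # the likelihood peak lands near the true gap
```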

Perhaps the most beautiful applications come from using likelihood to deconvolve a signal from a sea of noise. Your immune system generates a staggering diversity of antibody receptors by randomly shuffling and joining different gene segments (V, D, and J). This process is intentionally messy, adding random non-templated nucleotides at the junctions to increase diversity. When we sequence these receptors, the biological messiness is compounded by somatic mutations and technical sequencing errors. Suppose we observe a particular receptor sequence and want to infer whether a single random nucleotide was inserted at a junction. This is like trying to hear a single whisper in a hurricane. By building a detailed likelihood model, a generative story that accounts for every known source of randomness, from the V(D)J recombination to somatic hypermutation to the final sequencing errors, we can calculate the likelihood of our observed sequence under two hypotheses: $H_0$ (no insertion) and $H_1$ (one insertion). This allows us to peer through the fog of randomness and make a rigorous inference about the hidden event that occurred inside a single B-cell long ago.

The Wisdom and Honesty of Uncertainty

The logic of likelihood even permeates our attempts to rank and rate. Who is the best Go player in the world? We cannot measure "skill" directly; we can only observe game outcomes. We can, however, model skill as a hidden parameter, $\theta$. We might propose a model where the probability of Alice beating Bob is a function of the difference in their skills, $\theta_A - \theta_B$. After a tournament, we have a set of game outcomes. We can then test a hypothesis, such as "The skills are $(\theta_A, \theta_B, \theta_C) = (S, 0, -S)$," by calculating how likely the observed wins and losses were under this specific assignment of skills. The set of skill parameters that maximizes this likelihood is our best estimate. This is the core idea behind rating systems like Elo, which have been extended in modern machine learning to power everything from movie recommendations to online ad placement.
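The skill-inference step can be sketched in a few lines. The tournament record below is invented, and the logistic win-probability model is the standard Bradley-Terry/Elo-style assumption:

```python
from math import exp, log

def p_win(diff):
    """Logistic model: probability that the higher-rated player wins,
    as a function of the skill difference."""
    return 1 / (1 + exp(-diff))

# Hypothetical tournament record: (winner, loser) pairs among A, B, C.
games = ([("A", "B")] * 3 + [("B", "A")] +
         [("B", "C")] * 3 + [("C", "B")] +
         [("A", "C")] * 3 + [("C", "A")])

def log_likelihood(S):
    # Hypothesized skills: (theta_A, theta_B, theta_C) = (S, 0, -S).
    skill = {"A": S, "B": 0.0, "C": -S}
    return sum(log(p_win(skill[w] - skill[l])) for w, l in games)

grid = [i / 100 for i in range(301)]  # candidate skill spreads S = 0.00 .. 3.00
S_hat = max(grid, key=log_likelihood)

print(0.5 < S_hat < 1.0)  # the best-supported spread is moderate, not extreme
```

Too small an $S$ cannot explain why A usually wins; too large an $S$ makes the observed upsets nearly impossible. The likelihood peak balances the two.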

Finally, the most profound feature of likelihood-based inference is perhaps its honesty. What happens when the data simply cannot tell two stories apart? In genetics, this is a common problem. Due to the mechanics of inheritance, two genes that are physically close to each other on a chromosome are almost always inherited together. They are in high Linkage Disequilibrium (LD). If a mutation in this region is associated with a disease, it can be statistically almost impossible to tell which of the two genes is the true culprit, because their presence in the population is nearly identical.

A naive method might fail spectacularly here, but a principled likelihood analysis does something remarkable. It does not flip a coin or declare one gene the winner. Instead, it honestly reports that the evidence is ambiguous. It does this by calculating a "Posterior Inclusion Probability" (PIP) for each variant—the posterior probability that it is the causal one. In a case of perfect LD, the likelihood of the data is identical whether we assume the first or the second gene is the cause. The posterior probability gets split between them. The sum of their PIPs, our posterior expected number of causal variants, might be close to 1, correctly telling us, "there is one cause in this region." But the individual PIPs might each be close to 0.5, telling us, "but the data doesn't give me any reason to prefer one over the other". Likelihood does not invent information that isn't in the data. It faithfully and quantitatively reports the uncertainty that remains.
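Numerically, the situation looks like this. The likelihood values below are placeholders; what matters is only that they are identical under the two hypotheses, as they are in perfect LD:

```python
# Two causal hypotheses with equal priors; in perfect LD the data
# likelihood is the same under either one (values are illustrative).
likelihoods = {"variant_1_causal": 3.2e-8, "variant_2_causal": 3.2e-8}
priors = {"variant_1_causal": 0.5, "variant_2_causal": 0.5}

unnorm = {h: likelihoods[h] * priors[h] for h in likelihoods}
Z = sum(unnorm.values())
pip = {h: unnorm[h] / Z for h in unnorm}  # posterior inclusion probabilities

print(pip["variant_1_causal"], pip["variant_2_causal"])  # 0.5 0.5
```

The posterior mass is split evenly: the analysis reports one cause in the region but refuses to say which variant it is, because the data cannot tell.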

From ancient symbols to quantum mechanics, from the evolution of life to the evolution of cancer, we see the same principle at work. Likelihood is the universal translator between our theories and our observations. It is the engine of Bayesian inference, the arbiter of scientific models, and the tool for estimating the unseeable. Whether we are a physicist probing a crystal, a biologist deciphering a genome, or an archaeologist interpreting a fragment of the past, we are all, in the end, asking the same fundamental question: "Of all the stories I can imagine, which one makes the world I see the least surprising?" The principle of likelihood gives us a rigorous, unified, and beautiful way to find the answer.