
Likelihood-based Inference

Key Takeaways
  • Likelihood reframes statistical questions from predicting data with a known model (probability) to evaluating competing models based on observed data (plausibility).
  • The Likelihood Principle asserts that all evidence from an experiment is contained solely in the likelihood function of the observed data, rendering stopping rules irrelevant.
  • Likelihood serves as a unifying bridge, forming the core data-driven component in both frequentist (via MLE) and Bayesian (via Bayes' theorem) inference.
  • The framework provides principled solutions for complex data challenges, including nuisance parameters, dependent data, and missing data under MAR assumptions.

Introduction

In the quest to understand the world, science constantly grapples with uncertainty. While probability allows us to predict the chance of future outcomes based on a known model, researchers often face the inverse problem: using observed data to infer the underlying model of the world. This challenge of learning from evidence is the domain of statistical inference, and at its heart lies a powerful and elegant concept: likelihood. Likelihood-based inference provides a unified and versatile framework for quantifying evidential support, estimating unknown parameters, and comparing scientific hypotheses. It addresses the fundamental gap between observing a phenomenon and understanding the process that generated it.

This article delves into the theory and application of likelihood-based inference across two comprehensive chapters. In the first chapter, Principles and Mechanisms, we will explore the philosophical shift from probability to plausibility, define the Maximum Likelihood Estimate (MLE), and examine the profound and controversial Likelihood Principle. We will see how likelihood acts as a bridge between the frequentist and Bayesian paradigms and provides principled methods for handling complex issues like nuisance parameters and missing data. Following this theoretical foundation, the second chapter, Applications and Interdisciplinary Connections, will showcase likelihood in action. We will journey through its application in diverse fields like biostatistics, ecology, and phylogenetics, seeing how it tames dependent data, overcomes the challenge of incomplete datasets, and serves as a critical engine in the modern pursuit of causal inference.

Principles and Mechanisms

A New Perspective: From Probability to Plausibility

Let’s begin our journey with a simple question, the kind that forms the bedrock of gambling and games of chance. If you have a fair coin, what is the probability of flipping it ten times and getting exactly seven heads? This is a question about probability. Given a known model of the world (a fair coin, where the probability of heads, $\theta$, is $0.5$), we predict the chances of future data. The math is straightforward, and it gives us a single number.

But science often faces the reverse problem. We don’t know the true model of the world; we are trying to figure it out. Imagine we flip a mysterious coin ten times and we observe seven heads. We have no prior guarantee that the coin is fair. Our question is no longer "what is the probability of this outcome?" but rather, "what does this outcome tell us about the coin?" This is a question of inference.

This is where the idea of likelihood enters, and it represents a profound, almost philosophical shift in perspective. To construct the likelihood function, we take the very same formula we used for probability, but we turn it on its head. The data—our observation of seven heads and three tails—is now considered fixed. The parameter—the coin’s intrinsic probability of landing heads, $\theta$—is now the variable.

The likelihood function, written as $L(\theta; \text{data})$, asks: "For any given hypothesis about the coin (e.g., for a specific value of $\theta$), what was the probability of observing the data we actually saw?"

For our coin, the probability of getting $k=7$ heads in $n=10$ tosses is given by the binomial formula: $P(\text{data} \mid \theta) = \binom{10}{7}\theta^7(1-\theta)^3$. This, viewed as a function of $\theta$, is our likelihood function. Let’s try a few values for $\theta$:

  • If the coin were fair ($\theta = 0.5$), the probability of our outcome would have been $\binom{10}{7}(0.5)^7(0.5)^3 \approx 0.117$.
  • If the coin were biased towards heads with $\theta = 0.7$, the probability would be $\binom{10}{7}(0.7)^7(0.3)^3 \approx 0.267$.
  • If the coin were biased with $\theta = 0.9$, the probability would be $\binom{10}{7}(0.9)^7(0.1)^3 \approx 0.057$.

Comparing these values, the hypothesis that $\theta=0.7$ makes our observed data more plausible than the hypothesis that $\theta=0.5$. In fact, if we plot this function across all possible values of $\theta$ from 0 to 1, we’d see it has a peak. The value of $\theta$ at this peak is the one that makes our data most plausible. This is the celebrated Maximum Likelihood Estimate (MLE). For our coin, the MLE is exactly $\hat{\theta} = 7/10 = 0.7$. It is the value of the parameter that best explains the data we have.
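These calculations are easy to reproduce. The sketch below (plain Python; `math.comb` supplies the binomial coefficient) evaluates the likelihood on a grid of candidate values for $\theta$ and confirms that it peaks at the MLE, $0.7$:

```python
from math import comb

def likelihood(theta, k=7, n=10):
    """Binomial likelihood of k heads in n tosses, viewed as a function of theta."""
    return comb(n, k) * theta**k * (1 - theta)**(n - k)

# Evaluate the likelihood on a fine grid of candidate values for theta.
grid = [i / 1000 for i in range(1001)]
mle = max(grid, key=likelihood)

print(round(likelihood(0.5), 3))  # ≈ 0.117
print(round(likelihood(0.7), 3))  # ≈ 0.267
print(round(mle, 3))              # 0.7 — the maximum likelihood estimate
```

Grid search is crude but instructive; for the binomial model the MLE $k/n$ can also be found in closed form by setting the derivative of the log-likelihood to zero.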

It is crucial to understand that likelihood is not probability. The likelihood function does not tell us the probability that $\theta$ is $0.7$. It is a measure of evidential support. The height of the curve at any point $\theta$ measures how well that specific parameter value explains the data. The absolute height doesn’t mean much on its own, but the relative heights are everything. We can confidently say that the data provides more support for $\theta=0.7$ than for $\theta=0.5$ because the likelihood ratio $L(0.7)/L(0.5)$ is greater than one.

This focus on ratios leads to a remarkable simplification. Notice the term $\binom{10}{7}$ in our formula. This number tells us how many ways we can get 7 heads in 10 tosses, but its value doesn't depend on $\theta$. When we compare the likelihood of two different values of $\theta$, this constant term appears in both the numerator and the denominator of the ratio, so it cancels out. For the purpose of inference about $\theta$, it's irrelevant. This reveals a deep truth: all the information about $\theta$ is contained in the shape of the function, specifically in the part that depends on $\theta$. This is why we often write the likelihood using a proportionality sign, $L(\theta; \text{data}) \propto \theta^7(1-\theta)^3$, capturing the essential kernel of the function. Multiplying a likelihood function by any constant that doesn't depend on the parameter doesn't change the location of the peak or any of the likelihood ratios, and therefore, it doesn't change our inference.

The Likelihood Principle: Nothing but the Truth

The idea that only the shape of the likelihood function matters leads to a powerful and controversial principle: the Likelihood Principle (LP). It states that all the evidence about a parameter $\theta$ from an experiment is contained in the likelihood function for the data that were actually observed. Any other aspect of the experiment, such as the outcomes that could have happened but didn't, or the intentions of the experimenter, is irrelevant.

This might sound like common sense, but it stands in stark opposition to some of the most common methods in traditional statistics. Let’s explore this with a classic scenario drawn from medical research.

Imagine two research teams are evaluating a new antiviral therapy. The probability that a patient responds to the therapy is $\theta$.

  • Team 1 uses a fixed sample size design. They decide to enroll exactly $n=20$ patients and observe how many respond. They find that $k=8$ patients respond.
  • Team 2 uses a sequential design. They decide to keep enrolling patients until they observe exactly $r=8$ responses. It turns out this takes a total of $N=20$ patients.

The raw data is identical in both cases: a sequence of 20 patients containing 8 responses and 12 non-responses. But the experimental plans—the stopping rules—were completely different. Should this difference in intention affect our conclusion about the drug's effectiveness, $\theta$?

Let’s look at the likelihoods.

  • For Team 1, the number of responses follows a Binomial distribution. The likelihood is $L_1(\theta) = \binom{20}{8}\theta^8(1-\theta)^{12}$.
  • For Team 2, the total number of patients follows a Negative Binomial distribution. The likelihood is $L_2(\theta) = \binom{19}{7}\theta^8(1-\theta)^{12}$.

Notice something amazing? Both likelihood functions are proportional to the same core expression: $\theta^8(1-\theta)^{12}$. They only differ by a constant factor ($\binom{20}{8}$ vs. $\binom{19}{7}$) that doesn't involve $\theta$.

According to the Likelihood Principle, since the likelihood functions are proportional, the evidence about $\theta$ is identical. Both teams should draw the exact same conclusions. The stopping rule is irrelevant.
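The proportionality is easy to check numerically. This small sketch (plain Python) evaluates both likelihoods and verifies that their ratio is the same constant, $\binom{20}{8}/\binom{19}{7} = 2.5$, at every value of $\theta$:

```python
from math import comb

def binom_lik(theta):
    """Team 1's likelihood: fixed n = 20 patients, k = 8 responses observed."""
    return comb(20, 8) * theta**8 * (1 - theta)**12

def negbinom_lik(theta):
    """Team 2's likelihood: enroll until r = 8 responses, which took N = 20 patients."""
    return comb(19, 7) * theta**8 * (1 - theta)**12

# The theta-dependent kernel is identical, so the ratio of the two
# likelihoods is the same constant for every value of theta.
ratios = [binom_lik(t) / negbinom_lik(t) for t in (0.2, 0.4, 0.6, 0.8)]
print(all(abs(r - 2.5) < 1e-12 for r in ratios))  # True
print(comb(20, 8) / comb(19, 7))                  # 2.5, the constant factor
```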

This is a radical idea because it directly challenges frequentist methods like p-values and significance testing. A p-value is the probability of observing your data or something more extreme, calculated under a null hypothesis. The definition of "more extreme" depends on the sample space—the set of all possible outcomes. Since the two teams had different sampling plans, their sample spaces are different, and their calculated p-values will be different! A frequentist might conclude that the evidence is different, while a practitioner of the Likelihood Principle would insist it is the same.

This extends to the notorious problem of "optional stopping". If a researcher repeatedly analyzes data as it accumulates, stopping only when the p-value drops below a threshold like $0.05$, frequentist theory states this dramatically inflates the chance of a false positive. To a frequentist, the stopping rule is paramount. To someone following the LP, the reason for stopping is irrelevant; the only thing that matters is the likelihood function of the final data, however it was obtained.

A Unifying Framework: The Bridge Between Schools of Thought

The likelihood function is not just a tool for one school of statistics; it sits at the heart of both major inferential paradigms, acting as a bridge between them. Let’s see how it operates in frequentist and Bayesian inference.

Imagine a biostatistician is modeling the number of adverse events per patient-month, which follows a Poisson distribution with an unknown rate parameter $\theta$.

In frequentist inference, $\theta$ is considered a fixed, unknown constant. The goal is to use the data to pinpoint this value. The primary tool is the likelihood function, $L(\theta; \text{data})$. The best guess for $\theta$ is the Maximum Likelihood Estimate, $\hat{\theta}_{\text{MLE}}$, which is the peak of this function. All other frequentist tools—confidence intervals, hypothesis tests—are built around the properties of this estimator over hypothetical repetitions of the experiment. The likelihood function tells us what the data has to say, which we then use to construct procedures with desirable long-run properties.

In Bayesian inference, the philosophy is different. We treat the unknown parameter $\theta$ as a random variable. Before we see any data, we have some pre-existing beliefs about $\theta$, which are captured in a prior distribution, $p(\theta)$. When we collect data, we use Bayes' theorem to update our beliefs. The result is a posterior distribution, $p(\theta \mid \text{data})$, which represents our updated state of knowledge. The engine that drives this update is the likelihood function. The relationship is beautifully simple:

$p(\theta \mid \text{data}) \propto L(\theta; \text{data}) \times p(\theta)$

Posterior is proportional to Likelihood times Prior.

The likelihood function acts as a filter, taking our prior beliefs and re-weighting them according to how well each possible value of $\theta$ explains the observed data. Regions of the parameter space where the likelihood is high get their prior belief amplified; regions where the likelihood is low get suppressed.

This formulation shows that Bayesian inference naturally adheres to the Likelihood Principle. Since the data only enters the calculation through the likelihood function, two experiments with proportional likelihoods will, given the same prior, result in identical posterior distributions. The stopping rule, which doesn't change the likelihood, also doesn't change the Bayesian conclusion.
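This update is mechanical enough to sketch in a few lines of plain Python. The flat prior below is an illustrative choice; the point is that Team 1's and Team 2's likelihoods, which differ only by a constant, yield the same posterior once normalised:

```python
# Grid-based Bayesian update for the antiviral example: posterior ∝ likelihood × prior.
grid = [(i + 0.5) / 500 for i in range(500)]   # candidate values of theta in (0, 1)
prior = [1.0] * len(grid)                       # a flat prior over the grid

def posterior(constant):
    """Multiply the prior by a likelihood that differs only by 'constant', then normalise."""
    unnorm = [constant * t**8 * (1 - t)**12 * p for t, p in zip(grid, prior)]
    total = sum(unnorm)
    return [u / total for u in unnorm]

post_team1 = posterior(125970)   # binomial constant, C(20,8)
post_team2 = posterior(50388)    # negative binomial constant, C(19,7)

# The constants cancel in the normalisation: the posteriors are identical.
same = all(abs(a - b) < 1e-12 for a, b in zip(post_team1, post_team2))
print(same)  # True

# With a flat Beta(1,1) prior the posterior is Beta(9,13), mean 9/22 ≈ 0.409.
print(round(sum(t * p for t, p in zip(grid, post_team1)), 3))
```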

Taming Complexity: Nuisance Parameters and Missing Data

Real-world scientific models are rarely as simple as a single coin flip. They often involve many parameters, not all of which are of interest. For example, when studying a biomarker, we might be interested in its average level, $\theta$, but to model the data correctly, we also need to account for its variability, or variance, $\eta$. Here, $\theta$ is our parameter of interest, and $\eta$ is a nuisance parameter. It's a nuisance because we don't care about its value directly, but we can't ignore it if we want to make a correct inference about $\theta$.

How can likelihood theory handle this? One elegant solution is the profile likelihood. The idea is clever and intuitive. We create a new, simplified likelihood function just for our parameter of interest, $\theta$. For each possible value of $\theta$ we are considering, we ask: "What value of the nuisance parameter $\eta$ makes the data most likely, given this value of $\theta$?" We plug this best-case value of $\eta$ back into the full likelihood function. The result is a function of $\theta$ alone, $L_p(\theta)$, which has "profiled out" the nuisance parameter. This new function can be treated like a regular one-parameter likelihood for making inferences about $\theta$.
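To make this concrete, here is a toy sketch (plain Python; the measurements are made up) that profiles out the variance of a normal model. For each candidate mean $\theta$, the nuisance variance is replaced by its conditional MLE, $\hat{\eta}(\theta) = \frac{1}{n}\sum_i (x_i - \theta)^2$, before the function is maximised:

```python
from math import log, pi

data = [4.2, 5.1, 3.8, 4.9, 5.4, 4.6]   # illustrative biomarker measurements
n = len(data)

def profile_loglik(theta):
    """Profile log-likelihood for a normal mean: the nuisance variance eta is
    replaced, for each theta, by its conditional MLE mean((x - theta)^2)."""
    eta_hat = sum((x - theta) ** 2 for x in data) / n
    return -0.5 * n * (log(2 * pi * eta_hat) + 1)

grid = [3 + i / 1000 for i in range(3001)]   # candidate means in [3, 6]
theta_hat = max(grid, key=profile_loglik)

print(round(theta_hat, 3))          # ≈ the sample mean
print(round(sum(data) / n, 3))
```

The profiled function peaks at the sample mean, exactly where the full two-parameter likelihood peaks in $\theta$.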

This power to handle complexity is even more striking when we face one of the most persistent challenges in data analysis: missing data. When data points are missing, simply ignoring them can lead to severely biased conclusions. Likelihood theory provides a principled framework for understanding when and how we can handle missing data correctly.

The key lies in the missingness mechanism, which is the process that determines why data is missing. The full likelihood for a dataset with missing values must account for both the data-generating process (parameterized by $\theta$) and the missingness mechanism (parameterized by a nuisance parameter $\psi$). The observed-data likelihood is then found by integrating over all possibilities for the missing values.

A magical simplification occurs if the data are Missing At Random (MAR). This technical term has a simple meaning: the probability of a value being missing depends only on the information we have observed, not on the missing value itself. For instance, in a longitudinal study, a patient might be more likely to miss a follow-up visit if their previously observed blood pressure was high. Under MAR (and an additional technical condition called parameter distinctness), the observed-data likelihood miraculously factorizes into two separate parts: one part involving only $\theta$ and the observed data, and a second part involving only $\psi$ and the missingness pattern. Because they are separate, we can simply maximize the first part to make inferences about $\theta$ and completely ignore the missingness mechanism. The mechanism is said to be ignorable. This provides the theoretical foundation for powerful techniques like multiple imputation.

The situation changes dramatically if the data are Missing Not At Random (MNAR). This occurs when the probability of missingness depends on the unobserved value itself. For example, if patients with extremely high (and unmeasured) viral loads are too sick to attend their clinic appointment, the missingness depends on the very information that is missing. In this case, the likelihood function does not factorize. The parameters of our scientific model, $\theta$, become entangled with the parameters of the missingness model, $\psi$, inside the integral over the missing data. The mechanism is non-ignorable. We cannot simply ignore it; we must explicitly and correctly model the missingness process to get a valid answer. Likelihood theory thus provides a sharp, clear warning sign, distinguishing situations where we can proceed with relative ease from those that require extreme care.

When Data Is Too Good: Rescuing Likelihood

Is the principle of maximum likelihood infallible? Not quite. Sometimes, the data can be, in a sense, too good for the model, causing the MLE to behave strangely. A classic example occurs in logistic regression, a common tool for modeling binary outcomes like disease presence or absence.

Suppose we discover a biomarker that perfectly separates healthy individuals from sick ones: every sick person has a biomarker value above a certain threshold, and every healthy person has a value below it. This is a researcher's dream! But if you try to fit a standard logistic regression model to this data, you'll find that the maximum likelihood estimate for the biomarker's effect is infinite. The likelihood function keeps climbing as you propose a stronger and stronger effect, never reaching a finite peak. The MLE doesn't exist.

This is where the flexibility of the likelihood framework shines. When the raw likelihood leads to an absurd conclusion, we can augment it. This is the idea behind penalized likelihood. We modify the log-likelihood function by adding a penalty term that expresses a preference for more "reasonable" parameter values.

  • Ridge regression adds a penalty that discourages very large parameter values, effectively putting a leash on them and pulling them back towards zero. This ensures the penalized likelihood function has a unique, finite peak.
  • Firth's regression uses a more sophisticated penalty, derived from a concept called the Jeffreys prior. This penalty not only solves the separation problem but also has the desirable property of reducing the bias of the estimates, especially in small samples.

These penalized methods are not ad-hoc fixes. They are principled extensions of the likelihood framework, demonstrating its capacity to adapt to challenging situations. They show that likelihood is not a rigid dogma but a powerful and versatile language for reasoning with data, a language that allows us to express not only what the data says, but also what constitutes a sensible answer. From its simple, intuitive core to its profound philosophical consequences and its powerful applications in complex modern problems, the principle of likelihood offers a unifying and beautiful perspective on the art of scientific inference.

Applications and Interdisciplinary Connections

In the previous chapter, we became acquainted with the formal machinery of likelihood. We saw it as a mathematical device for estimating parameters, a sort of reverse-engineering process where we ask: "Given the data we have collected, what is the most plausible version of the world, according to our model?" But to leave it at that would be like learning the rules of chess and never playing a game. The true power and beauty of the likelihood principle are not in its definition, but in its application. It is a universal solvent for a vast array of scientific problems, a common language spoken by researchers in fields that might otherwise seem worlds apart.

In this chapter, we will go on a journey to see this principle in action. We will see how this single, elegant idea allows us to build richer, more honest models of the world—models that can handle the messy, complex, and incomplete nature of real data. We will see how it provides not just answers, but a complete framework for scientific reasoning, from comparing competing ideas to bridging the daunting gap between correlation and causation.

The Art of Taming Complexity: Handling Dependent Data

A simplifying assumption in many introductory statistics lessons is that our data points are independent and identically distributed—like a series of coin flips, where one outcome has no bearing on the next. The real world, of course, is rarely so tidy. Observations are often entangled in intricate webs of dependence, and ignoring these connections is not just sloppy; it can lead to profoundly wrong conclusions. Likelihood-based inference provides a powerful and principled way to model this interdependence, to write it into the story our model tells.

Consider a modern clinical trial for a new drug. We don't just measure a patient once; we track their response over weeks or months. These repeated measurements on the same person are obviously not independent. A patient who responds well in week one is likely to be doing better than average in week two. Their measurements are correlated. A naive analysis that treats each measurement as a separate data point would be foolish—it would overestimate the amount of information we have and produce misleadingly small error bars.

A more sophisticated approach, made possible by the likelihood framework, is to use what is known as a Linear Mixed Model (LMM). Instead of just modeling the average response over time, we write down a richer, two-part story. One part describes the average trend for all patients, and the other describes how each individual patient's trajectory varies around that average. These individual variations are modeled as "random effects." By writing down the full likelihood for this two-part story, we can simultaneously estimate the overall treatment effect and the nature of the within-subject correlation. This method, by respecting the data's true structure, gives us the most precise and accurate estimate of the drug's effect. It is a direct application of the generalized Gauss-Markov theorem, ensuring our estimator is the best possible among a wide class.

This same principle extends from dependencies in time to dependencies in space. Imagine an ecologist studying the impact of a new road on bird abundance across a landscape. The abundance at one site is rarely independent of its neighbors. Favorable conditions might stretch across a whole patch of forest, creating positive spatial autocorrelation. Again, a simple regression would be invalid. But we can use likelihood to specify a model where the unobserved error at one location is explicitly a function of the errors at neighboring locations. Models like the Spatial Autoregressive (SAR) or Conditional Autoregressive (CAR) model do precisely this. They use a spatial weights matrix ($W$) to define which sites are neighbors, and the likelihood function includes a parameter that quantifies the strength of this spatial dependence. In fitting this model, we not only get a valid test of the road's impact, but we also learn something about the spatial scale of the ecological processes at play.

The dependencies can be even more abstract. Think of the grandest dependency of all: the tree of life. Species are not independent entities; they are related through a shared evolutionary history. Modern phylogenetics uses likelihood to reconstruct this history. The method of Maximum Likelihood in this context involves calculating the probability of observing the DNA sequences of today's species, given a proposed evolutionary tree and a stochastic model of how DNA mutates over time. The likelihood is calculated over the entire tree, summing up probabilities across all possible, unobserved ancestral sequences. The tree topology and branch lengths that maximize this likelihood represent our best hypothesis for how these species evolved and are related. From microbiology to zoology, likelihood allows us to turn sequence data into a story about deep time, all by explicitly modeling the dependencies inherent in evolution.

The Triumph Over Imperfection: Dealing with Missing Data

Real-world science is a battle against imperfection, and one of the most common imperfections is missing data. Patients drop out of studies, equipment fails, survey respondents skip questions. A naive response is to simply analyze the "complete cases," throwing away any subject with even a single missing value. This is not only wasteful but can be catastrophically biased.

Here again, likelihood-based inference offers a solution of remarkable elegance. The key lies in the reason why the data are missing. In many situations, the missingness is what we call Missing At Random (MAR). This doesn't mean the data are missing for no reason at all; it means the probability of a value being missing can depend on other information we have observed, but not on the unobserved value itself. For example, in a clinical trial, a patient might be more likely to miss a visit if their previously measured health status was poor.

Under the MAR assumption, likelihood-based methods work as if by magic. Because the likelihood function is derived from a model for the full joint distribution of a subject's data, we can write down the likelihood for just the part we observed, and this "observed-data likelihood" is sufficient. Maximizing it provides valid, unbiased, and consistent estimates of our parameters, without any need to explicitly model the missingness mechanism itself. The model effectively uses the relationships it learns from the completers to make a statistically valid inference in the presence of dropouts.

This property is a cornerstone of modern biostatistics, allowing for valid analysis of longitudinal studies with dropouts. It is a profound demonstration of the power of specifying a complete probabilistic model. Of course, the MAR assumption itself might be wrong. What if patients drop out for reasons related to how poorly they feel at that moment, a value we failed to record? This is called Missing Not At Random (MNAR). Even here, the likelihood framework provides a path forward. It serves as the foundation for sensitivity analyses, where we can build models that include an explicit parameter for the MNAR effect (a $\delta$-adjustment) and see how our conclusions change as we vary our assumption about the degree of departure from MAR.

Beyond Estimation: A Framework for Scientific Reasoning

The power of likelihood extends far beyond simply finding the best value for a parameter. It provides a comprehensive toolkit for scientific inquiry.

First, it gives us a universal method for comparing competing scientific hypotheses in the form of the Likelihood Ratio Test (LRT). Suppose we have a simple model of the world and a more complex one that contains the simple one as a special case. Is the extra complexity justified? The LRT provides a direct answer. We calculate the maximum likelihood for both models. The ratio of these likelihoods tells us how much more plausible the data are under the more complex model. This ratio can be converted into a test statistic that, under the null hypothesis that the simpler model is true, follows a known $\chi^2$ distribution. This allows us to make a principled decision. For instance, ecologists can use this method to test if the species turnover ($\beta$ diversity) differs between two regions by comparing a model that forces the regions to be the same against a model that allows them to differ.
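A sketch of such a test (plain Python; the two regions' counts are invented, and the $\chi^2$ tail probability for one degree of freedom is computed via the complementary error function, $P(\chi^2_1 > x) = \operatorname{erfc}(\sqrt{x/2})$):

```python
from math import log, erfc, sqrt

# Two regions: successes out of trials (hypothetical counts).
k_a, n_a = 30, 100
k_b, n_b = 45, 100

def loglik(k, n, p):
    """Binomial log-likelihood kernel (the constant terms cancel in the ratio)."""
    return k * log(p) + (n - k) * log(1 - p)

# H0 (simple model): one shared probability, estimated from the pooled data.
p_pool = (k_a + k_b) / (n_a + n_b)
l0 = loglik(k_a, n_a, p_pool) + loglik(k_b, n_b, p_pool)

# H1 (complex model): each region gets its own MLE.
l1 = loglik(k_a, n_a, k_a / n_a) + loglik(k_b, n_b, k_b / n_b)

# LRT statistic: twice the log of the likelihood ratio; here df = 1,
# because the complex model has one extra free parameter.
lrt = 2 * (l1 - l0)
p_value = erfc(sqrt(lrt / 2))   # chi-square (df = 1) upper tail probability

print(round(lrt, 2))    # ≈ 4.82
print(p_value < 0.05)   # True — the extra parameter is justified at the 5% level
```

For more than one degree of freedom one would use a proper chi-square survival function rather than this erfc identity.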

Second, the likelihood framework contains the seeds of its own critique. Any model rests on assumptions—for example, that the effect of an environmental toxin increases linearly with its concentration. But what if the effect is U-shaped? A good statistical framework should give us tools to check our assumptions. Likelihood does just this. The score function, which is the derivative of the log-likelihood, can be used to construct powerful model diagnostics. These tests can detect subtle departures from our model's assumptions, such as an incorrect functional form for a covariate, allowing us to build better, more accurate models. This is an essential practice in fields like epidemiology when analyzing matched data from nested case-control studies.

Finally, and perhaps most profoundly, likelihood-based models are an indispensable engine in the quest to move from mere association to causal inference. By itself, a model that estimates the probability of an outcome $Y$ given a treatment $T$ and covariates $X$, i.e., $p(Y \mid X, T)$, is purely associational. It does not, and cannot, tell us that $T$ causes $Y$. However, within a larger logical framework built on causal assumptions (like conditional exchangeability and positivity), such models become crucial tools. In modern epidemiology and medical data science, likelihood models are used to estimate propensity scores—the probability of receiving a treatment given a patient's characteristics. These scores are then used in methods like inverse probability weighting to create a pseudo-population in which confounding has been removed, allowing for the estimation of causal effects from observational data, as is done with Marginal Structural Models. In the Bayesian paradigm, which builds directly upon the likelihood, prior distributions can be used to formally encode causal knowledge—for example, by forcing the coefficient of a variable known to be a post-treatment collider to zero, thereby excluding it from the model in a principled way.

When the Full Story is Too Complex

What happens when the web of dependencies in our data is so intricate that writing down the full, correct likelihood becomes computationally impossible or theoretically intractable? This is a common challenge with high-dimensional spatial or genetic data. Here, a brilliant and pragmatic extension of the likelihood principle comes to our aid: composite likelihood.

The idea is to construct a pseudo-likelihood by multiplying the likelihoods of smaller, overlapping, and more manageable pieces of the data. For instance, instead of the joint likelihood of all 1,000 measurements on a subject, we might multiply the bivariate likelihoods for all possible pairs of measurements. This product is not a true likelihood because the pairs are not independent. However, maximizing this composite likelihood still yields consistent estimators. We lose some statistical efficiency compared to the full (but unavailable) maximum likelihood estimator, but we gain tractability. Composite likelihood offers a robust alternative to methods like Generalized Estimating Equations (GEE) and can be preferable when one desires the "automatic" handling of MAR data that is characteristic of likelihood methods, but without the burden of specifying the full, complex joint distribution.
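A toy sketch of a pairwise composite likelihood (plain Python; the data, the exchangeable normal model, and the known correlation $\rho = 0.5$ are all illustrative assumptions). The "pieces" are bivariate normal log-densities, summed over all pairs of measurements:

```python
from math import log, pi
from itertools import combinations

data = [1.8, 2.4, 2.1, 2.9, 2.6, 2.2]   # correlated measurements on one subject
rho = 0.5                                # pairwise correlation, assumed known

def pair_loglik(x, y, mu):
    """Log-density of a bivariate normal with common mean mu, unit variances,
    and correlation rho — one manageable 'piece' of the full joint model."""
    a, b = x - mu, y - mu
    return (-log(2 * pi) - 0.5 * log(1 - rho**2)
            - (a * a - 2 * rho * a * b + b * b) / (2 * (1 - rho**2)))

def composite_loglik(mu):
    """Pairwise composite log-likelihood: multiply (here, sum the logs of)
    the pieces over all pairs, ignoring that the pairs overlap."""
    return sum(pair_loglik(x, y, mu) for x, y in combinations(data, 2))

grid = [1 + i / 1000 for i in range(2001)]   # candidate means in [1, 3]
mu_cl = max(grid, key=composite_loglik)

print(round(mu_cl, 3))                   # ≈ the sample mean
print(round(sum(data) / len(data), 3))
```

In this exchangeable toy model the composite estimate coincides with the sample mean, matching the full MLE; in richer models the two differ, and the composite version trades some efficiency for tractability.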

A Unified View

Our journey is at an end. We began with likelihood as a simple tool for parameter estimation. We have seen it blossom into a philosophy of modeling. We've witnessed it tame the complex dependencies found in our own bodies, across landscapes, and through the vastness of evolutionary time. We have seen its almost magical ability to deliver valid inferences from incomplete data. We have seen it serve as a rigorous framework for testing hypotheses, diagnosing our own models, and as a critical component in the search for causal understanding. From the bedside to the biosphere, the likelihood principle provides a unified, powerful, and deeply beautiful language for learning from data. It allows us to tell ever more honest and nuanced stories about the world, and to know just how much to believe them.