
Likelihood and Maximum Likelihood Estimation

Key Takeaways
  • Maximum Likelihood Estimation (MLE) is a method for estimating model parameters by finding the values that maximize the probability of observing the given data.
  • Using the log-likelihood simplifies optimization by converting products into sums, allowing the use of calculus or numerical algorithms to find the maximum.
  • The power of MLE stems from its connection to information theory, as it effectively minimizes the Kullback-Leibler (KL) divergence between the model and the empirical data distribution.
  • MLE serves as a foundational tool across sciences, enabling inference in fields like genetics, finance, machine learning, and paleontology by fitting models to data.

Introduction

In science and everyday life, we are constantly faced with a fundamental challenge: how do we draw reliable conclusions from incomplete or noisy data? When we observe a pattern, how can we deduce the underlying process that created it? This question of reverse-engineering reality from observation is the cornerstone of scientific discovery. Without a systematic framework for this task, we would be left adrift in a sea of ad-hoc guesses and intuition.

This article introduces the principle of likelihood and its powerful extension, Maximum Likelihood Estimation (MLE), a unified and rigorous philosophy for learning from data. It provides the answer to the question: "Given the data we have seen, what is the most plausible story about the world that could have generated it?"

We will embark on a journey to understand this pivotal concept. In the first chapter, "Principles and Mechanisms," we will dissect the core logic of MLE, starting with simple intuitions and building up to the mathematical machinery of log-likelihood and optimization that makes it a universal tool. We will explore why this method is not just a clever trick, but a deeply principled approach with desirable statistical properties. Following this, the chapter on "Applications and Interdisciplinary Connections" will demonstrate the breathtaking scope of likelihood in action, revealing how this single idea helps scientists decode the genome, model financial markets, navigate spacecraft, and even reconstruct Earth's ancient history. By the end, you will see how likelihood serves as a common language for inference across the sciences.

Principles and Mechanisms

Imagine a friend hands you a strange, lopsided coin and asks you to figure out its bias. You have no idea if it's fair or not. This is a classic problem of inference: we have some data (the outcomes of coin flips), and we want to infer the properties of the underlying process that generated it. How should we go about making our best guess?

This is the central question that the principle of likelihood was designed to answer. Instead of asking, "Assuming the coin is fair, what's the chance of getting seven heads in ten flips?", we turn the question on its head: "Given that we observed seven heads in ten flips, what is the most plausible or likely bias of the coin?" The answer your intuition screams—seven out of ten, or 0.7—is not just a good guess; it's the very heart of one of the most powerful ideas in all of science.

What's the Most Plausible Story? The Heart of Likelihood

Let's make this more concrete. Suppose we flip that strange coin just once, and it comes up heads. We want to estimate the probability of getting heads, which we'll call $p$. What's our best guess for $p$? If we had to bet, most of us would bet on $p=1$. It's a wild guess based on tiny evidence, but it's the most plausible one. If it had come up tails, we'd have guessed $p=0$.

The principle of Maximum Likelihood Estimation (MLE) formalizes this intuition. We write down a function, the likelihood function, which represents the probability of seeing our data, but we view it as a function of the unknown parameter, $p$. For a single Bernoulli trial (a single event with two outcomes), the probability of getting an outcome $x$ (where we'll say $x=1$ for heads and $x=0$ for tails) is given by $P(x; p) = p^x (1-p)^{1-x}$.

When we observe an outcome, say $x=1$, the likelihood of the parameter $p$ is $L(p; x=1) = p^1(1-p)^{1-1} = p$. To find the "most likely" $p$, we just need to find the value of $p$ that maximizes this function. Obviously, the function $L=p$ is maximized when $p$ is as large as possible, so our estimate $\hat{p}$ is 1. If we had observed tails ($x=0$), the likelihood would be $L(p; x=0) = 1-p$, which is maximized when $p=0$. The mathematics perfectly captures our intuition.

From Intuition to a Universal Tool: The Log-Likelihood Function

This is simple enough for one coin flip, but what about many? If we flip the coin $N$ times and observe $k$ heads, the probability of this specific sequence of outcomes (the likelihood) is proportional to $p^k(1-p)^{N-k}$. Finding the maximum of this function isn't too hard, but you can imagine that for more complex models, we'd be dealing with the product of many, many small numbers—a recipe for mathematical headaches and numerical errors.

Here, we employ a beautiful mathematical trick that is central to the practice of MLE. Since the natural logarithm, $\ln(z)$, is a strictly increasing function, maximizing a function $L$ is equivalent to maximizing $\ln(L)$. Taking the log transforms our messy product into a tidy sum. This new function, $\ell = \ln(L)$, is called the log-likelihood.

For our series of coin flips, the log-likelihood becomes:

$\ell(p) = \ln\left(p^k(1-p)^{N-k}\right) = k \ln(p) + (N-k) \ln(1-p)$

(We can ignore the constant binomial coefficient $\binom{N}{k}$ because it doesn't depend on $p$ and thus doesn't affect where the maximum is.)

Now, we can use a basic tool from calculus: to find the maximum of a function, we take its derivative, set it to zero, and solve.

$\frac{d\ell}{dp} = \frac{k}{p} - \frac{N-k}{1-p} = 0$

Solving this simple equation for $p$ gives us the maximum likelihood estimate:

$\hat{p} = \frac{k}{N}$

There it is, in all its glory. The most likely value for the coin's bias is simply the proportion of heads we observed. This wonderfully intuitive result emerges not from a guess, but from a powerful and universally applicable mathematical procedure.
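As a sanity check, a brute-force search over candidate biases recovers the same answer as the calculus. A minimal Python sketch (the data of 7 heads in 10 flips and the grid resolution of 0.001 are illustrative choices):

```python
import math

def log_likelihood(p, k, n):
    """Log-likelihood of bias p after observing k heads in n flips."""
    return k * math.log(p) + (n - k) * math.log(1 - p)

# Grid search over candidate biases for 7 heads in 10 flips
k, n = 7, 10
grid = [i / 1000 for i in range(1, 1000)]
p_hat = max(grid, key=lambda p: log_likelihood(p, k, n))
print(p_hat)  # 0.7, matching the closed-form k/N
```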

Likelihood in a World of Complexity

The true beauty of MLE is that this same fundamental procedure works for an astonishing variety of problems, far beyond simple coin flips.

What if we have a system with more than two outcomes? Imagine a controller that can choose one of four modes, and we observe the counts for each mode over 40 cycles as $(17, 9, 4, 10)$. What are the most likely probabilities $\{p_1, p_2, p_3, p_4\}$ for these modes? The log-likelihood is now a function of four variables: $\ell = 17\ln(p_1) + 9\ln(p_2) + 4\ln(p_3) + 10\ln(p_4)$. We need to maximize this, subject to the constraint that $p_1+p_2+p_3+p_4=1$. Using a standard technique for constrained optimization called the method of Lagrange multipliers, we find that the result is just as intuitive as before: the MLE for each probability is its observed frequency, $\hat{p}_i = n_i / N$. So, $\hat{p}_1 = 17/40$, $\hat{p}_2 = 9/40$, and so on. This principle holds even if the probabilities are linked by some underlying model, for instance, if a theory proposes that $p_1 = p_2 = \theta$ and $p_3 = 1-2\theta$. The machinery of MLE gracefully handles this constraint and finds the most likely value of $\theta$ based on the data.
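A quick numerical check of the four-mode example (a sketch; the uniform distribution is just one arbitrary competitor to compare against):

```python
import math

counts = [17, 9, 4, 10]
N = sum(counts)
p_hat = [c / N for c in counts]  # MLE under the sum-to-one constraint

def log_lik(probs, counts):
    """Multinomial log-likelihood, ignoring the constant coefficient."""
    return sum(n * math.log(p) for n, p in zip(counts, probs))

# Any other valid probability vector scores lower than the observed frequencies
alternative = [0.25, 0.25, 0.25, 0.25]
assert log_lik(p_hat, counts) > log_lik(alternative, counts)
print(p_hat)  # [0.425, 0.225, 0.1, 0.25]
```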

What about continuous measurements, like the lifetime of a laser diode or the price of a stock? The principle is exactly the same. We simply replace the probability mass function with a probability density function (PDF) when we write down our likelihood. For example, if we believe our data $x_1, \dots, x_n$ come from a log-normal distribution, we can write down the log-likelihood function for its parameters $\mu$ and $\sigma$. When we carry out the maximization for $\mu$ (assuming $\sigma$ is known), we find another beautifully simple result: the best estimate for $\mu$, the mean of the underlying normal distribution, is simply the average of the logarithms of our data points:

$\hat{\mu} = \frac{1}{n} \sum_{i=1}^{n} \ln(x_i)$

Again and again, the MLE principle takes a complex scenario, applies a standard procedure, and returns an answer that is both elegant and deeply intuitive.
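This estimator is easy to verify by simulation. A sketch with assumed values $\mu = 1.5$ and $\sigma = 0.4$ (any values would do):

```python
import math
import random

random.seed(0)
mu_true, sigma = 1.5, 0.4
# Simulated "lifetimes" drawn from a log-normal distribution
data = [random.lognormvariate(mu_true, sigma) for _ in range(10_000)]

# MLE for mu (sigma known): the average of the logs of the data
mu_hat = sum(math.log(x) for x in data) / len(data)
print(round(mu_hat, 2))  # close to the assumed mu_true
```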

Climbing the Likelihood Hill: Optimization and Machine Learning

Let's step back and look at the big picture. The task of finding the maximum likelihood estimate is, fundamentally, an optimization problem. We can visualize the log-likelihood function as a landscape of hills and valleys. Our goal is to find the coordinates of the highest peak.

For many problems, we can find this peak using calculus. To be sure we've found a peak and not a valley or a saddle point, we can use the second derivative test. For a model with multiple parameters, like finding the slope $\alpha$ and intercept $\beta$ of a line in a regression model, this "second derivative" takes the form of a matrix called the Hessian. By analyzing the properties of the Hessian matrix at our solution, we can confirm that we have indeed found a local maximum.

This framing of MLE as optimization provides a powerful bridge to the world of computer science and, in particular, machine learning. When a data scientist "trains" a classification model, they are typically trying to minimize a "loss function." One of the most common and effective loss functions is the cross-entropy loss. It turns out that this is nothing more than the negative of the log-likelihood function. So, when you hear about an AI model being trained to minimize cross-entropy, what it's really doing is climbing the likelihood hill to find the model parameters that make the observed data most plausible.

But what happens when we can't solve for the peak with a simple algebraic formula? This is often the case. For logistic regression, a cornerstone of modern statistics and machine learning, setting the derivative of the log-likelihood to zero yields a system of non-linear equations with no general closed-form solution. This doesn't mean the principle has failed. It just means we need a different way to climb the hill. Instead of jumping to the peak, we use numerical optimization algorithms. A common approach is gradient ascent (or its cousin, gradient descent, for minimizing loss), where we start at a random point on the landscape and take small steps in the direction of the steepest ascent—the direction given by the gradient of the log-likelihood function. We repeat this process until we converge on a peak.
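A minimal sketch of gradient ascent on the logistic-regression log-likelihood, using synthetic data with assumed true weights of 2.0 and −1.0 (the learning rate and iteration count are arbitrary; a production fit would use a library optimizer):

```python
import math
import random

random.seed(1)

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

# Synthetic one-feature data generated with assumed true weights w=2.0, b=-1.0
xs = [random.uniform(-3, 3) for _ in range(500)]
ys = [1 if random.random() < sigmoid(2.0 * x - 1.0) else 0 for x in xs]

# Gradient ascent on the (average) log-likelihood
w, b, lr = 0.0, 0.0, 0.1
for _ in range(2000):
    gw = sum((y - sigmoid(w * x + b)) * x for x, y in zip(xs, ys)) / len(xs)
    gb = sum(y - sigmoid(w * x + b) for x, y in zip(xs, ys)) / len(xs)
    w += lr * gw  # small step in the direction of steepest ascent
    b += lr * gb
print(round(w, 1), round(b, 1))  # should land near the assumed true weights
```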

Why is Likelihood So Special? The Deeper Connections

By now, you can see that MLE is an incredibly versatile and practical tool. But is it just a clever computational recipe, or is there a deeper reason for its success? The answer is a profound one.

It can be shown that maximizing the likelihood is mathematically equivalent to minimizing the Kullback-Leibler (KL) divergence from the empirical distribution of our data to the distribution proposed by our model. The KL divergence is a concept from information theory that measures how one probability distribution differs from a second, reference distribution. In essence, it quantifies the "information loss" or "surprise" when using the model to represent reality. Therefore, performing MLE is not just finding the "most likely" parameter; it is an attempt to find the model that is informationally closest to the world as represented by our data.
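The equivalence is easy to see numerically in the coin example: minimizing the KL divergence from the empirical distribution (7 heads in 10 flips) over a grid of Bernoulli models lands on exactly the same value as the MLE. A sketch:

```python
import math

def kl(q, p):
    """KL divergence between Bernoulli(q) (empirical) and Bernoulli(p) (model)."""
    return q * math.log(q / p) + (1 - q) * math.log((1 - q) / (1 - p))

q = 0.7  # empirical heads frequency: 7 heads in 10 flips
grid = [i / 1000 for i in range(1, 1000)]
p_min_kl = min(grid, key=lambda p: kl(q, p))
print(p_min_kl)  # 0.7 — the same answer MLE gives
```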

This fundamental property gives rise to the remarkable statistical properties that make MLE the gold standard for estimation. Under general conditions, MLEs are:

  1. Consistent: As you collect more and more data, the maximum likelihood estimate is guaranteed to converge to the true, underlying parameter value. The more you see, the closer to the truth you get. (We must be careful, as this property can break if the likelihood landscape is pathological, for example, having multiple persistent peaks that can "trap" the estimator.)
  2. Asymptotically Efficient: For large sample sizes, no other well-behaved estimator is more precise. The MLE makes the most out of the information contained in your data, yielding estimates with the smallest possible variance. This is why for complex models, such as ARMA models used in time series analysis, MLE is generally preferred over simpler but less efficient techniques like the method of moments.

The principle of likelihood, then, is far more than a mere estimation technique. It is a unified philosophy for learning from data, connecting statistics to information theory and optimization. It begins with a simple, intuitive question—"What's the most plausible story?"—and leads us to a powerful, robust, and theoretically elegant framework for revealing the hidden parameters that govern the world around us.

Applications and Interdisciplinary Connections

We have spent some time with the machinery of likelihood, but a machine is only as good as what it can build. Is this principle of maximum likelihood just an elegant piece of statistical theory, a toy for mathematicians? The truth, as is so often the case in science, is far more surprising and beautiful. The principle of likelihood is not a toy; it is a universal key. It unlocks secrets hidden in the jitter of atoms, the chaotic dance of the stock market, the intricate patterns of our own DNA, and even the faint echoes of ancient cataclysms.

In this chapter, we will embark on a journey across the sciences to see this one powerful idea at work, again and again. In each new land, we will find scientists asking the same simple, profound question: "Of all the possible stories we could tell, which one makes the data we actually see the most probable?"

The Invisible World Made Visible

Let's begin with a question that seems simple: what is temperature? You might say it's what a thermometer reads. But what is it? We know that a gas is a swarm of frantic little particles, a chaotic ballet of microscopic collisions. Temperature, it turns out, is not a property of a single particle. You cannot ask about the temperature of one molecule, any more than you can ask about the 'wetness' of one molecule of water. Temperature is a statistical property of the entire collective; it is a measure of the average kinetic energy of the particles.

But we can't see the "true" temperature. We can't interview every particle and ask its speed. So how could we ever know it? Suppose an experimentalist manages to trap a small sample of a gas and measure the speeds of a handful of particles. The measurements might be all over the place. What can be done with this list of speeds? This is where likelihood steps onto the stage. The laws of physics, specifically the Maxwell-Boltzmann distribution, provide a mathematical story for how particle speeds should be distributed for any given temperature $T$. For a high temperature, the story predicts a wide range of high speeds. For a low temperature, it predicts a cluster of slower speeds.

Our list of measured speeds is a single page torn from this storybook. The principle of maximum likelihood tells us to try on every possible value of the temperature $T$ and, for each one, calculate the probability, or likelihood, of having observed our specific list of speeds. The value of $T$ that makes our data most plausible is our best estimate. It is the temperature that the universe, through our data, is whispering to us. Remarkably, when you work through the mathematics, this maximum likelihood estimate for the temperature turns out to be directly proportional to the average of the squares of the measured speeds—a beautiful confirmation of the idea that temperature is tied to the average kinetic energy of the particles. In this way, likelihood allows us to take a few glimpses of the microscopic world and infer a fundamental property of the invisible, macroscopic whole.
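A simulation makes this concrete. The sketch below assumes a helium-like particle mass and a true temperature of 300 K (both illustrative values); each measured speed is the magnitude of a 3D Gaussian velocity, which is exactly the Maxwell-Boltzmann speed distribution, and the MLE is $\hat{T} = m\,\overline{v^2}/(3k_B)$:

```python
import math
import random

random.seed(2)
kB = 1.380649e-23          # Boltzmann constant, J/K
m, T_true = 6.6e-27, 300.0 # assumed particle mass (~helium) and true temperature
s = math.sqrt(kB * T_true / m)  # per-component velocity spread

# Simulated measured speeds: magnitudes of 3D Gaussian velocities
speeds = [math.sqrt(sum(random.gauss(0, s) ** 2 for _ in range(3)))
          for _ in range(5000)]

# MLE for T from the Maxwell-Boltzmann likelihood: proportional to mean squared speed
T_hat = m * sum(v * v for v in speeds) / len(speeds) / (3 * kB)
print(round(T_hat))  # close to the assumed 300 K
```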

Decoding the Blueprints of Life

Now let's jump from the world of physics to the core of biology. Inside every living cell is a library of instructions—the genome—written on long molecules called chromosomes. The story of how likelihood helps us read this library is one of the great triumphs of modern science.

A fundamental question in genetics is about location. If two genes are close together on the same chromosome, they tend to be inherited together. If they are far apart, a process called recombination can swap pieces of chromosomes, making them appear to be inherited independently. The probability of this swap, the recombination fraction $r$, is a measure of the distance between the genes. To build a map of the genome, we need to estimate these distances.

How is it done? In a classic experiment, we can perform a "testcross," breeding organisms and simply counting the different types of offspring. Some offspring will have the parental combination of traits, while others will be "recombinant." Each offspring is a roll of the dice, and the probability of getting a recombinant is $r$. Given the counts of recombinant and parental offspring, what is our best estimate for the distance $r$? You might guess it's simply the proportion of recombinant offspring you saw, and you would be right. But why is that the right answer? Maximum likelihood gives us the rigorous foundation. We write down the likelihood of observing our specific counts as a function of $r$, and the value of $r$ that maximizes this likelihood is, indeed, the simple, intuitive ratio of recombinants to the total.

But science is rarely so simple. What if the gene we're interested in—say, one that influences susceptibility to a disease—is itself invisible? We can't see it directly. This is the problem of finding a Quantitative Trait Locus (QTL). We may, however, have visible "signposts" nearby on the chromosome, called genetic markers. The situation is like trying to find a hidden treasure buried somewhere between two known landmarks. Here, likelihood performs a truly beautiful trick through a computational strategy called the Expectation-Maximization (EM) algorithm. We start with a wild guess about the parameters of our model.

  1. The 'E' Step (Expectation): Given our current guess, we calculate the probability for each individual that their hidden gene is of one type or another. It's a "soft" assignment, not a hard choice. For individual #1, we might say, "There's a 0.7 chance the hidden gene is type A, and a 0.3 chance it's type B."
  2. The 'M' Step (Maximization): We then update our model parameters by finding the values that have the maximum likelihood, assuming our probabilistic assignments from the E-step are correct.

We repeat these two steps, E-M-E-M..., and with each iteration, we are guaranteed to climb higher up the hill of likelihood, converging on the best possible estimates for the genetic effects, even though we never saw the gene itself. It is an "educated guessing" machine, powered by the engine of likelihood.
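The same E-M loop can be sketched on a simpler stand-in for the hidden gene: a two-component Gaussian mixture with unit variances, where the hidden component label plays the role of the unobserved genotype (all numbers below are illustrative):

```python
import math
import random

random.seed(3)
# Synthetic data: a hidden label picks one of two Gaussians (means 0 and 4,
# mixing weight 0.3) — these "true" values are assumptions for the demo
data = [random.gauss(0.0, 1.0) if random.random() < 0.3 else random.gauss(4.0, 1.0)
        for _ in range(2000)]

def normal_pdf(x, mu):
    return math.exp(-0.5 * (x - mu) ** 2) / math.sqrt(2 * math.pi)

# EM for a two-component Gaussian mixture (unit variances, for simplicity)
pi, mu1, mu2 = 0.5, -1.0, 1.0  # wild initial guesses
for _ in range(100):
    # E-step: soft responsibility of component 1 for each data point
    r = [pi * normal_pdf(x, mu1) /
         (pi * normal_pdf(x, mu1) + (1 - pi) * normal_pdf(x, mu2))
         for x in data]
    # M-step: maximize the expected log-likelihood under those responsibilities
    pi = sum(r) / len(data)
    mu1 = sum(ri * x for ri, x in zip(r, data)) / sum(r)
    mu2 = sum((1 - ri) * x for ri, x in zip(r, data)) / sum(1 - ri for ri in r)
print(round(pi, 2), round(mu1, 1), round(mu2, 1))  # near 0.3, 0.0, 4.0
```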

Of course, finding a possible location is one thing; being confident about it is another. How do we know we haven't just been fooled by random chance? Again, likelihood provides the answer in the form of a hypothesis test. We compare two stories: Story A says the genes are unlinked ($r = 0.5$), and Story B says they are linked at the distance we estimated ($\hat{r}$). The ratio of the likelihoods, $L(\hat{r})/L(0.5)$, tells us how much more probable our data are under Story B than under Story A. For historical and practical reasons, geneticists take the base-10 logarithm of this ratio to get a "LOD score" (logarithm of the odds). A LOD score of 3.0, a long-standing benchmark in human genetics, means the data are $10^3 = 1000$ times more likely under the hypothesis of linkage. This isn't just an arbitrary number; it was a wise convention established to prevent scientists from chasing ghosts in the vastness of the genome, where false leads are abundant.
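Computing a LOD score takes only a few lines. A sketch with hypothetical counts of 14 recombinants among 100 offspring:

```python
import math

# Hypothetical testcross: 14 recombinants out of 100 offspring
n, rec = 100, 14
r_hat = rec / n  # MLE of the recombination fraction

def log10_lik(r):
    """Base-10 log-likelihood of the counts (constant coefficient omitted)."""
    return rec * math.log10(r) + (n - rec) * math.log10(1 - r)

lod = log10_lik(r_hat) - log10_lik(0.5)  # LOD score for linkage vs. r = 0.5
print(round(lod, 2))  # well above the 3.0 benchmark for these counts
```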

The principle scales to the grandest questions in biology. When we compare traits across different species—say, brain size versus body size in mammals—we face a new problem: species are not independent data points. A human and a chimpanzee are more similar than a human and a kangaroo because they share a more recent common ancestor. Their shared history on the "Tree of Life" creates statistical non-independence. The likelihood framework is powerful enough to handle this. Instead of assuming our data points are independent, we build a model where the covariance between them is itself a function of their shared evolutionary history on a phylogenetic tree. Likelihood then allows us to estimate not only the relationship between brain and body size, but also the strength of the "phylogenetic signal" itself—a parameter that tells us how strongly shared ancestry influences the trait. This is Phylogenetic Generalized Least Squares (PGLS), a cornerstone of modern evolutionary biology, and it allows us to correctly interpret the patterns of life while respecting its deep, interconnected history.

Taming the Chaos

From the quiet unfolding of evolution, let's turn to the buzzing, dynamic processes that govern our world in real time.

Consider the stock market. The price of a stock appears to be a "random walk," a dizzying and unpredictable journey. Is there any order to be found? In finance, a common model for a stock's price is Geometric Brownian Motion, which describes the price evolution in terms of two hidden parameters: a drift $\mu$, which represents the average trend or growth rate, and a volatility $\sigma$, which represents the magnitude of the random fluctuations or "risk." Neither is directly visible. All we have is a sequence of past prices. By looking at the sequence of log-returns (the logarithms of successive price ratios, approximately the percentage changes in price), we can write down a likelihood function for the observed data in terms of $\mu$ and $\sigma$. Finding the parameters that maximize this likelihood gives us our best estimate of the underlying trend and risk of the asset, turning a chaotic history into actionable insight.
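Under Geometric Brownian Motion the log-returns are independent Gaussians, so the MLE reduces to their sample mean and variance. A sketch with assumed parameters (an annual drift of 0.08, volatility of 0.2, and daily time steps); note the $\sigma^2/2$ term that links the mean log-return back to the drift $\mu$:

```python
import math
import random

random.seed(4)
mu, sigma, dt = 0.08, 0.2, 1 / 252  # assumed annual drift, volatility; daily steps
n = 100_000                          # a long simulated price history

# Simulated daily log-returns of Geometric Brownian Motion
rets = [(mu - 0.5 * sigma ** 2) * dt + sigma * math.sqrt(dt) * random.gauss(0, 1)
        for _ in range(n)]

# Gaussian MLE for the log-returns: sample mean and (biased) sample variance
mean = sum(rets) / n
var = sum((r - mean) ** 2 for r in rets) / n
sigma_hat = math.sqrt(var / dt)
mu_hat = mean / dt + 0.5 * sigma_hat ** 2
print(round(sigma_hat, 2), round(mu_hat, 2))  # near the assumed values
```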

This idea of using likelihood to peer inside a dynamic system is incredibly general. Imagine ecologists studying the classic "dance" of predators and prey, like foxes and rabbits. They can write down a simple mathematical model—a set of differential equations—that describes how the populations should change over time based on birth rates, death rates, and the rate at which foxes eat rabbits. But these rates are unknown. The data they collect—perhaps yearly counts of each animal—are sparse and noisy. How can they find the underlying parameters of their ecological model? They can use the likelihood function to connect their idealized model to the messy data. For any set of parameters, their model predicts a "perfect" trajectory of populations. The likelihood calculates the probability of seeing the actual, noisy data given that perfect trajectory. By searching for the parameters that maximize this likelihood, they find the version of their story that best explains the real world. The very same logic applies to a chemical engineer modeling a reaction in a vat or a pharmacologist modeling how a drug is processed in the human body.

Perhaps the most spectacular application of this idea is in guidance and control. Imagine trying to navigate a spacecraft to Mars. You have a model based on Newton's laws that tells you where the spacecraft should be. You also have noisy sensor readings from antennas on Earth that tell you where the spacecraft seems to be. The celebrated Kalman filter is an algorithm that brilliantly combines these two pieces of information to produce an optimal, continually updated estimate of the spacecraft's true state. But where does likelihood fit in? The connection is deep. At each step, the filter makes a prediction. The difference between this prediction and the next noisy measurement is called the "innovation." The Kalman filter shows that these innovations are statistically independent. Therefore, the likelihood of the entire sequence of measurements can be broken down into the product of the likelihoods of each individual innovation. To find the parameters of the underlying model (like the effect of solar wind or the efficiency of the thrusters), one can maximize this likelihood, which is calculated on the fly by the Kalman filter itself. What began as a tool for static estimation now becomes the heart of a real-time system for tracking, learning, and navigating through a dynamic world.
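The innovation decomposition can be sketched in one dimension: a random-walk state observed through noise, with the Kalman filter accumulating the log-likelihood one innovation at a time. This is a toy model with assumed noise variances, not a spacecraft-grade filter:

```python
import math
import random

random.seed(5)
# Toy system: 1D random-walk state, observed through additive noise
q_true, r_true = 0.1, 0.5  # assumed process and measurement noise variances
x, ys = 0.0, []
for _ in range(2000):
    x += random.gauss(0, math.sqrt(q_true))
    ys.append(x + random.gauss(0, math.sqrt(r_true)))

def neg_log_lik(q, r):
    """Negative log-likelihood of the data via the Kalman filter's innovations."""
    m, P, nll = 0.0, 1.0, 0.0
    for y in ys:
        P += q                 # predict: state variance grows by process noise
        S = P + r              # innovation variance
        v = y - m              # innovation: measurement minus prediction
        nll += 0.5 * (math.log(2 * math.pi * S) + v * v / S)
        K = P / S              # Kalman gain
        m += K * v             # update state estimate
        P *= (1 - K)           # update state variance
    return nll

# Maximizing likelihood = minimizing nll; the true noise parameters
# should score far better than a badly wrong guess
assert neg_log_lik(q_true, r_true) < neg_log_lik(1.0, 0.05)
```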

Choosing Between Stories

So far, we have used likelihood to estimate unknown numbers within a single story, a single model of the world. But its power goes even further: it can help us choose between entirely different stories.

Let's travel back in time to one of Earth's great cataclysms, the mass extinction at the end of the Permian period, which wiped out over 90% of marine species. Paleontologists dig up fossils and, using geological dating methods, estimate the last-known appearance of many different species. Each date has a degree of uncertainty. A central question is about the tempo of the extinction. Was it a single, catastrophic event, like a massive asteroid impact (a "single-pulse" model)? Or was it a more complex, drawn-out affair with multiple waves of death, perhaps from massive volcanic eruptions (a "two-pulse" model)?

These are two fundamentally different stories. How do we ask the data which story it prefers? We can use likelihood. For the single-pulse model, we find the single extinction time $\tau$ that maximizes the likelihood of the observed fossil dates. For the two-pulse model, which is a type of mixture model, we find the two extinction times $\tau_1$ and $\tau_2$ (and their relative importance $\pi$) that together maximize the likelihood.

Naturally, the two-pulse model, being more complex, will almost always fit the data better. It has more knobs to turn. Is the improvement in fit genuine, or is the model just "overfitting" the noise? This is where a likelihood-based tool called the Akaike Information Criterion (AIC) comes in. The AIC is defined as $\mathrm{AIC} = 2k - 2\ln(\hat{L})$, where $\hat{L}$ is the maximized likelihood and $k$ is the number of parameters in the model. AIC rewards a model for a high likelihood score but penalizes it for each parameter it uses. It is a principled form of Occam's Razor. We can compute the AIC for the single-pulse story and the two-pulse story. The story with the lower AIC is the one that provides the most explanatory power for the least amount of complexity. In this way, likelihood allows us to perform a kind of forensic science on the history of life itself, weighing the evidence for competing narratives of our planet's past.
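A tiny worked example of the AIC trade-off, using a coin rather than fossils (the counts are hypothetical): Story A says the coin is fair, with zero free parameters; Story B fits a free bias, with one:

```python
import math

# Hypothetical data: 52 heads in 100 flips of a coin we suspect is fair
k_heads, n = 52, 100

# Story A: fair coin (p = 0.5 fixed, zero free parameters)
loglik_A = n * math.log(0.5)

# Story B: free bias p, fitted by MLE (one free parameter)
p_hat = k_heads / n
loglik_B = k_heads * math.log(p_hat) + (n - k_heads) * math.log(1 - p_hat)

aic_A = 2 * 0 - 2 * loglik_A
aic_B = 2 * 1 - 2 * loglik_B

# Story B always fits at least as well, but AIC's complexity
# penalty favors the simpler story for this data
print(round(aic_A, 1), round(aic_B, 1))  # lower AIC wins: Story A
```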

From the fleeting existence of a subatomic particle to the epic saga of life on Earth, from the hidden logic of our genes to the guiding intelligence in our machines, the principle of likelihood provides a single, coherent, and profoundly beautiful framework for reasoning in the face of uncertainty. It is the engine of scientific inference, consistently turning raw observation into genuine understanding.