
The scientific process often involves comparing competing models to determine which best explains our data. In the realm of Bayesian inference, this comparison is formally handled by a crucial quantity known as the marginal likelihood, or model evidence. This value serves as the ultimate score for a model's performance, but its calculation presents a significant computational hurdle, often requiring the solution of a difficult high-dimensional integral.
This challenge has spurred the creation of various estimation techniques. Among the most well-known, and infamous, is the harmonic mean estimator (HME), which promises a nearly "free" estimate from standard MCMC output. However, its alluring simplicity conceals deep-seated statistical problems that can invalidate scientific conclusions. This article dissects the harmonic mean estimator, offering a comprehensive look at why it fails. First, the chapter on "Principles and Mechanisms" will uncover the elegant identity it is based on and the statistical fallacies, such as infinite variance, that make it unreliable. Subsequently, in "Applications and Interdisciplinary Connections," we will explore the real-world consequences of this instability for scientific model selection, connect its failure to concepts from other fields, and briefly touch upon more robust alternatives. By understanding why this popular method fails so spectacularly, we gain a deeper appreciation for the principles of sound statistical inference.
In our journey to understand the world through the lens of Bayesian inference, we often find ourselves needing to choose between competing theories or models. Which model better explains the data we've observed? The Bayesian answer lies in a quantity of profound importance: the marginal likelihood, or model evidence. For a given model $M$ and data $y$, the evidence $p(y \mid M)$ tells us the probability of seeing that data under that model, averaged over all possible parameter values $\theta$. It is the ultimate arbiter in model comparison.
The evidence also plays a crucial role as the normalizing constant in Bayes' theorem, ensuring that the posterior distribution is a proper probability distribution that integrates to one.
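Written out, Bayes' theorem places the evidence in the denominator, and the evidence itself is the integral that makes the numerator normalize (standard notation: $M$ the model, $y$ the data, $\theta$ the parameters):

$$p(\theta \mid y, M) = \frac{p(y \mid \theta, M)\, p(\theta \mid M)}{p(y \mid M)}, \qquad p(y \mid M) = \int p(y \mid \theta, M)\, p(\theta \mid M)\, d\theta .$$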
Calculating this quantity, however, is often a formidable challenge, requiring us to solve a high-dimensional integral. This has led scientists to devise clever methods to estimate it from simulations. One of the most famous, or perhaps infamous, of these is the Harmonic Mean Estimator. Its story is a fascinating lesson in the interplay between mathematical elegance, statistical reality, and the hidden pitfalls of computation.
At first glance, the harmonic mean estimator seems like a stroke of genius. It emerges from a simple, beautiful rearrangement of Bayes' theorem. Let's start with an identity that comes directly from the definition of expectation. The expected value of the reciprocal likelihood $1/p(y \mid \theta)$, when averaged over the posterior distribution $p(\theta \mid y)$ (suppressing the model index $M$ from here on), is:

$$\mathbb{E}_{p(\theta \mid y)}\!\left[\frac{1}{p(y \mid \theta)}\right] = \int \frac{1}{p(y \mid \theta)}\, p(\theta \mid y)\, d\theta .$$
Now, if we substitute the definition of the posterior, $p(\theta \mid y) = p(y \mid \theta)\, p(\theta) / p(y)$, something wonderful happens:

$$\int \frac{1}{p(y \mid \theta)} \cdot \frac{p(y \mid \theta)\, p(\theta)}{p(y)}\, d\theta = \frac{1}{p(y)} \int p(\theta)\, d\theta .$$
Since the prior distribution $p(\theta)$ must integrate to 1, we are left with an astonishingly simple result:

$$\mathbb{E}_{p(\theta \mid y)}\!\left[\frac{1}{p(y \mid \theta)}\right] = \frac{1}{p(y)} .$$
This identity is the heart of the harmonic mean estimator (HME). It tells us that to estimate the reciprocal of the evidence, $1/p(y)$, we just need to average the reciprocal likelihoods over our posterior samples. Since modern Bayesian analysis relies on MCMC methods that provide thousands of samples from the posterior, this seems like we're getting the evidence estimate for free! We take our posterior samples $\theta^{(1)}, \dots, \theta^{(N)}$, compute the sample mean of the reciprocal likelihoods, and then take the reciprocal of that result to get our estimate of $p(y)$:

$$\hat{p}(y)_{\mathrm{HM}} = \left[\frac{1}{N} \sum_{i=1}^{N} \frac{1}{p(y \mid \theta^{(i)})}\right]^{-1} .$$
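In code, the recipe really is this short. Below is a toy sketch: we assume a one-observation conjugate normal model so that exact posterior draws are available without MCMC, and all names are our own illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy conjugate model (our choice, for illustration): theta ~ N(0, 1) prior,
# y | theta ~ N(theta, 1) likelihood, one observation y. The posterior is
# then N(y/2, 1/2), so we can draw "posterior samples" directly.
y = 0.8
theta = rng.normal(y / 2.0, np.sqrt(0.5), size=5_000)

def log_lik(t):
    """log p(y | theta) for the N(theta, 1) likelihood."""
    return -0.5 * np.log(2.0 * np.pi) - 0.5 * (y - t) ** 2

# The HME recipe: average the reciprocal likelihoods, then invert.
recip = np.exp(-log_lik(theta))    # 1 / p(y | theta_i)
z_hat = 1.0 / recip.mean()         # harmonic mean estimate of p(y)
```

For this model the exact evidence is $p(y) = \mathcal{N}(y;\, 0,\, 2)$, so the estimate can be checked directly, run after run.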
This seems almost too good to be true. And as we're about to discover, it is.
The first sign of trouble appears when we consider whether the HME is an unbiased estimator. We know that the sample mean of the reciprocal likelihoods, $\bar r = \frac{1}{N} \sum_{i=1}^{N} 1/p(y \mid \theta^{(i)})$, is an unbiased estimator of $1/p(y)$. But the HME is not $\bar r$; it is $1/\bar r$.
Here we encounter a fundamental rule of statistics: the expectation of a non-linear function of a random variable is not the function of its expectation. That is, $\mathbb{E}[g(X)] \neq g(\mathbb{E}[X])$ in general, unless $g$ is a straight line.
The function here is $g(x) = 1/x$. This function is not linear; it's a curve. Specifically, for positive values, it is a convex function—it curves upwards like a bowl. A beautiful mathematical result known as Jensen's inequality tells us that for any convex function $g$, $\mathbb{E}[g(X)] \geq g(\mathbb{E}[X])$.
Applying this with $g(x) = 1/x$ and $X = \bar r$, the sample mean of the reciprocal likelihoods, we find:

$$\mathbb{E}\!\left[\frac{1}{\bar r}\right] \;\geq\; \frac{1}{\mathbb{E}[\bar r]} \;=\; p(y) .$$
This reveals that the harmonic mean estimator is systematically biased upwards. The extent of this bias, it turns out, is related to the variability of $\bar r$. A Taylor expansion shows that the bias is approximately proportional to the variance of $\bar r$. More variability means more bias. This is our first clue that the stability of this estimator might be a serious concern.
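Writing $\bar r$ for the sample mean of the reciprocal likelihoods and $\mu = \mathbb{E}[\bar r] = 1/p(y)$, a second-order Taylor (delta-method) expansion makes this bias explicit; this is a standard expansion, sketched here for completeness:

$$\mathbb{E}\!\left[\frac{1}{\bar r}\right] \;\approx\; \frac{1}{\mu} + \frac{\operatorname{Var}(\bar r)}{\mu^{3}} \;=\; p(y)\left(1 + p(y)^{2}\, \operatorname{Var}(\bar r)\right),$$

so the upward bias is driven directly by the variance of $\bar r$.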
The bias is a problem, but the true disaster lies in the variance. Let's think about what we're averaging: the quantity $1/p(y \mid \theta)$. The posterior distribution is, by its nature, concentrated in regions where the parameters make the data probable—that is, where the likelihood is high.
However, the posterior distribution doesn't just vanish everywhere else. It often has "tails" that extend into regions of the parameter space that are plausible under the prior but are a very poor fit to the data. In these regions, the likelihood can be extraordinarily small. What happens when our MCMC sampler, in its random walk through the parameter space, takes a step into one of these regions?
If $p(y \mid \theta)$ is, say, $e^{-1000}$, its reciprocal is $e^{1000}$—an astronomically large number. A single sample from such a region can produce a value so enormous that it completely overwhelms the sum of all the other "normal" values. This makes the sample average, and thus the final estimate, wildly unstable.
This isn't just a hypothetical worry. It's a mathematical certainty in many common situations. We can prove that for a simple model, like estimating the mean of a normal distribution with a normal prior, the variance of the reciprocal likelihood is finite only if the prior is more concentrated (has a smaller variance) than the posterior. This is a condition we almost never want or meet in practice; we usually want our priors to be less informative than our data. In most realistic scenarios, the quantity we are trying to average has infinite variance.
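To see where this condition comes from, write out the second moment of the reciprocal likelihood under the posterior (a standard calculation, sketched here; we write $Z = p(y)$ for the evidence):

$$\mathbb{E}_{p(\theta \mid y)}\!\left[\frac{1}{p(y \mid \theta)^{2}}\right] = \int \frac{1}{p(y \mid \theta)^{2}} \cdot \frac{p(y \mid \theta)\, p(\theta)}{Z}\, d\theta = \frac{1}{Z} \int \frac{p(\theta)}{p(y \mid \theta)}\, d\theta .$$

For a $\mathcal{N}(\theta, \sigma^{2})$ likelihood and a $\mathcal{N}(0, \tau^{2})$ prior, the final integrand is proportional to $\exp\!\left(-\frac{\theta^{2}}{2\tau^{2}} + \frac{(y - \theta)^{2}}{2\sigma^{2}}\right)$, which is integrable only when $\tau^{2} < \sigma^{2}$: the prior must be tighter than the likelihood, the opposite of the diffuse priors we normally choose.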
An estimator based on a quantity with infinite variance is a statistical nightmare. It means that no matter how many samples we collect, the estimate will never settle down. It will always be subject to catastrophic jumps caused by rare, extreme events.
Living with an infinite-variance estimator means that all our usual statistical intuitions and tools fail us.
The Central Limit Theorem (CLT), the bedrock of so much of statistics, tells us that the average of many independent random variables will have a distribution that looks like a nice, bell-shaped Gaussian curve. But the CLT has a crucial requirement: the variables must have finite variance. When the variance is infinite, the CLT no longer applies. The distribution of our estimator doesn't converge to a well-behaved Gaussian. Instead, it converges (if at all) to a much wilder, heavy-tailed object known as an alpha-stable distribution. This means that extreme, nonsensical estimates are not just possible; they are an inherent feature of the estimator's distribution.
This breakdown has severe practical consequences. Our standard methods for calculating confidence intervals or standard errors for our estimates are based on the CLT. With the HME, these methods are invalid. They give us a false sense of certainty.
Even our diagnostic tools for MCMC simulations can be compromised. For instance, the Effective Sample Size (ESS) is a standard metric used to assess the efficiency of an MCMC sampler. It tells us how many independent samples our correlated MCMC chain is worth. But the calculation of ESS relies on the existence of autocorrelations, which are defined in terms of variance. If the variance is infinite, the very concept of autocorrelation becomes undefined. Our diagnostic tools are built on an assumption that the HME violates, rendering them useless for assessing its stability.
The harmonic mean estimator is fundamentally broken at its core. But in practice, the way we generate our samples using MCMC can make the situation even worse.
MCMC samplers produce a correlated sequence of draws. Positive autocorrelation means that the sampler tends to stay in the same neighborhood of the parameter space for several steps before moving on. If a slow-mixing sampler wanders into one of those treacherous low-likelihood regions, it might get "stuck" there for a while, generating not just one but a whole cluster of catastrophic values for $1/p(y \mid \theta)$. This dramatically inflates the variability of the estimate in a finite number of samples.
This problem is particularly severe with poor model parameterizations that create challenging geometries for the sampler. A famous example is "Neal's funnel," a hierarchical model where a scale parameter is strongly coupled to other parameters. A naive ("centered") parameterization creates a funnel-shaped posterior that is notoriously difficult for standard MCMC algorithms to explore, leading to high autocorrelation. A clever reparameterization ("non-centered") can break these dependencies and allow the sampler to mix much more freely. While better mixing can reduce the practical, finite-sample damage by preventing the sampler from getting stuck, it's important to remember that it cannot fix the HME's fundamental infinite-variance problem. It only makes a terrible situation slightly less terrible.
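The funnel and its fix can be sketched in a few lines of simulation. This is an illustrative sketch, not a full MCMC run: the distributions follow Neal's funnel, but the variable names are our own.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Centered parameterization: x depends directly on the scale exp(v/2),
# producing the funnel geometry that traps samplers at small v.
v = rng.normal(0.0, 3.0, n)                    # log-scale parameter
x_centered = rng.normal(0.0, np.exp(v / 2.0))  # scale varies per draw

# Non-centered parameterization: sample a standard normal, then rescale.
# The sampler then sees two nearly independent standard coordinates
# instead of a funnel, which typically mixes far better.
x_raw = rng.normal(0.0, 1.0, n)
x_noncentered = np.exp(v / 2.0) * x_raw
```

Both constructions target the same joint distribution; only the geometry the sampler has to navigate changes.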
There is one final, beautiful, and ironic twist to this story. As we've seen, the HME requires us to sum numbers like $e^{-\ell_i}$, where $\ell_i = \log p(y \mid \theta^{(i)})$ is the log-likelihood of sample $i$. If a likelihood is tiny, its log-likelihood $\ell_i$ is a large negative number, and $-\ell_i$ becomes a large positive number. A direct computation of $e^{-\ell_i}$ would instantly cause a numerical overflow on a computer.
To solve this, programmers use a clever technique called the log-sum-exp trick. It's a way of reformulating the calculation entirely in the logarithmic domain, using a constant shift to ensure that no intermediate value ever overflows. This trick allows us to compute the final value of the harmonic mean estimator with high numerical precision, regardless of the magnitude of the numbers involved.
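Here is a minimal sketch of the trick applied to the HME: everything stays in log space, and the largest term is factored out before exponentiating. The function name and interface are our own.

```python
import numpy as np

def log_hme(log_lik):
    """Log of the harmonic mean estimate of the evidence, computed
    stably via the log-sum-exp trick:

        log z_hat = -( logsumexp(-log_lik) - log N )
    """
    neg = -np.asarray(log_lik, dtype=float)  # -log p(y | theta_i)
    m = neg.max()                            # shift so the largest term is e^0
    log_sum = m + np.log(np.sum(np.exp(neg - m)))
    return -(log_sum - np.log(len(neg)))
```

Even log-likelihoods around $-10{,}000$ pass through without overflow; the arithmetic is rock solid, which is exactly the irony discussed next.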
And here lies the irony. We can build a perfectly stable, numerically robust machine to calculate the value of the harmonic mean estimator. But the number this machine produces is, due to the estimator's infinite statistical variance, essentially meaningless. We have succeeded in building a precision instrument to measure a phantom. The elegance of the numerical solution only serves to highlight the depth of the statistical fallacy. The harmonic mean estimator is a cautionary tale, a beautiful idea that shatters upon contact with reality, teaching us a profound lesson about the difference between mathematical identities and reliable statistical tools.
Having journeyed through the principles of the harmonic mean estimator, you might be left with a feeling of unease. We have a tool that seems simple on the surface but is built upon a mathematical identity that is, as we've seen, treacherously unstable. You might ask, "So what? Where does this abstract statistical curiosity actually cause trouble?" The answer, it turns out, is that it appears in a domain of profound importance: the scientific method itself. The story of the harmonic mean estimator is a fantastic lesson in the philosophy of science, a cautionary tale about the tools we use to weigh evidence, and a beautiful illustration of how different branches of science and mathematics connect in unexpected ways.
Before we dive into the world of Bayesian model selection, let's consider a much more familiar problem: choosing the fastest route for a daily commute. Imagine you have two possible routes, Route A and Route B. You want to figure out which route, on average, has the shorter travel time. The obvious thing to do is to time yourself for, say, 100 days on each route and compute the average time for each.
Now, let's think about the speed on these routes. Over a fixed distance, the total travel time is the sum of the times for each segment, and each segment's time is distance divided by speed, so averaging travel times amounts to averaging reciprocals of speeds. The harmonic mean of your speeds gives you the true average speed over the whole journey. Estimating an average of reciprocals in this way is closely analogous to the problem faced by the harmonic mean estimator.
Suppose Route A is a consistent, if slightly slower, city road, while Route B is a fast highway that, on very rare occasions, has a catastrophic, multi-hour standstill due to an accident. For 99 days, Route B is much faster. But on one day, a massive delay occurs. When you compute your average travel time for Route B, that single, disastrous day might add so much time to the total that the average makes it look far worse than Route A, even though it's usually better.
This is the heart of the harmonic mean estimator's problem, translated into everyday life. The estimator is trying to average a quantity—the reciprocal of the likelihood—that is like the travel time on our highway. Most of the time, its value is small and well-behaved. But it has the potential for rare, astronomically large values, just like the travel time on a day with a massive traffic jam. These "heavy tails," as statisticians call them, mean the average can be completely dominated by a single, unlucky event. A naive sample average is no longer a reliable guide. An estimator built on this principle can lead you to make the wrong decision—choosing the wrong route—based on your data. The distribution of these catastrophic delays can often be described by something like a Pareto distribution, which has a "heavy tail" where the variance can be infinite even if the mean is finite.
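This travel-time intuition is easy to reproduce in a small simulation. The sketch below uses a Pareto distribution with tail index 1.5, so the mean exists but the variance is infinite; note that NumPy's `pareto` draws are shifted, hence the `+ 1.0`.

```python
import numpy as np

rng = np.random.default_rng(42)

# Pareto with tail index alpha = 1.5: finite mean (alpha > 1) but
# infinite variance (alpha <= 2). True mean = alpha / (alpha - 1) = 3.
alpha = 1.5
samples = rng.pareto(alpha, size=1_000_000) + 1.0

# Running means over ever-larger prefixes: instead of settling near 3,
# they lurch every time a rare, huge value arrives.
checkpoints = [10**k for k in range(2, 7)]
running_means = [samples[:n].mean() for n in checkpoints]
```

Across seeds the running means keep jumping when a giant draw lands, exactly like the one disastrous commute dominating the average.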
Now, let's return from the highway to the laboratory. One of the central tasks in science is comparing competing theories or models. Given a set of data, which model provides a better explanation? Bayesian statistics offers a formal, quantitative answer through a quantity called the marginal likelihood, or model evidence, which we denote by $Z$. You can think of $Z$ as a score that a model receives based on how well it predicts the data we actually observed. It naturally penalizes models that are overly complex—a principle often called Occam's razor.
To compare two models, $M_1$ and $M_2$, we compute the ratio of their scores, $B_{12} = Z_1 / Z_2$. This ratio is called the Bayes factor. If $B_{12}$ is much larger than 1, the evidence strongly favors model $M_1$ over $M_2$.
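Writing $Z_i = p(y \mid M_i)$ for the evidence of model $M_i$, the standard identity linking the Bayes factor to model probabilities is:

$$\frac{p(M_1 \mid y)}{p(M_2 \mid y)} \;=\; \frac{Z_1}{Z_2} \cdot \frac{p(M_1)}{p(M_2)} \;=\; B_{12} \cdot \frac{p(M_1)}{p(M_2)} ,$$

so the Bayes factor is precisely the multiplier that the data contribute to our prior odds between the models.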
The problem is that calculating $Z$ involves a difficult integral over all possible parameters of a model. The harmonic mean estimator (HME) presents itself as a seductively simple way to compute this integral using samples from the model's posterior distribution—samples that are often already available from other parts of a Bayesian analysis. And here is where our travel-time analogy becomes a serious scientific problem. The HME is our unreliable route timer, and the marginal likelihood is the crucial quantity we need to judge scientific theories.
When we use the HME to calculate the Bayes factor, we are compounding the instability. We are taking a ratio of two potentially unreliable numbers. If either one of the HME estimates for $Z_1$ or $Z_2$ has infinite variance, the variance of their ratio, the estimated Bayes factor, will also be infinite. This means our judgment between two scientific theories could be wildly wrong, swinging dramatically based on the random chance of our simulation. It's like trying to determine which of two runners is faster by using two broken stopwatches. Interestingly, the mathematics shows that if the estimates for the two models are positively correlated (for instance, if they share parameters), it can sometimes reduce the variance of the ratio, but this is a subtle detail in a generally grim picture. The bottom line is that using the HME for model selection can lead to catastrophic errors in scientific inference.
Why does the HME's underlying variable, the reciprocal likelihood $1/p(y \mid \theta)$, have such heavy tails? The instability arises from a fundamental tension between the prior distribution, which represents our beliefs about a parameter before seeing the data, and the likelihood, which tells us what parameter values are plausible given the data.
The HME is calculated by averaging $1/p(y \mid \theta)$ over samples drawn from the posterior distribution. The posterior is a compromise between the prior and the likelihood. Sometimes, the posterior distribution will still assign a small but non-zero probability to regions of the parameter space where the likelihood is incredibly small. These are regions that the data tells us are very implausible, but which our prior beliefs haven't completely ruled out. When our simulation happens to draw a sample from such a region, $p(y \mid \theta)$ is tiny, and its reciprocal becomes astronomically large, poisoning the average.
This happens in many common situations, most obviously whenever the prior is deliberately vague or diffuse relative to the likelihood, which, as we saw earlier, is exactly the regime in which we usually want to work.
If we have a sick patient, the first step is diagnosis. How can we tell if our HME is suffering from this instability? The answer connects us to fascinating ideas from other corners of statistics.
One powerful idea comes from the field of robust statistics, which designs methods that are not overly affected by outliers. We can ask: how much does our final estimate change if we remove just one of our posterior samples? For a well-behaved estimator, the change should be tiny. For the HME, a single "outlier" sample can flip the result entirely. This sensitivity can be formalized by the influence function, which measures the infinitesimal effect of adding a new data point to a sample. The influence function for the HME is unbounded, meaning a single point can have an arbitrarily large effect. This provides a concrete diagnostic: if we see that our estimate is completely dominated by one or two samples, we know we are in trouble.
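This leave-one-out check is easy to sketch. The function below is an illustrative diagnostic of our own devising in this form, not a standard library routine: it reports the largest relative change in the HME when a single posterior sample is deleted.

```python
import numpy as np

def hme_jackknife_sensitivity(log_lik):
    """Leave-one-out sensitivity of the harmonic mean estimate.

    Returns the largest relative change in the estimated evidence when
    one posterior sample is removed; a huge value means a single sample
    is dominating the estimate."""
    recip = np.exp(-np.asarray(log_lik, dtype=float))  # 1 / p(y | theta_i)
    n = len(recip)
    full = 1.0 / recip.mean()                # estimate from all samples
    total = recip.sum()
    loo = (n - 1) / (total - recip)          # estimate with sample i removed
    return np.max(np.abs(loo - full) / full)
```

On well-behaved draws the sensitivity is near zero; one extreme sample sends it through the roof, which is the unbounded influence function made visible.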
Another beautiful connection is to extreme value theory (EVT), the branch of statistics that deals with, well, extreme events—like 100-year floods or stock market crashes. EVT tells us how to characterize the "tail" of a distribution. For distributions with heavy tails, like the Pareto distribution from our travel-time analogy, the tail's behavior can be summarized by a single number, the tail index $\alpha$.
This index tells us everything about the moments of the distribution. If $\alpha > 2$, the variance is finite and things are well-behaved. If $1 < \alpha \leq 2$, the mean is finite, but the variance is infinite—this is the highway with disastrous-but-not-infinitely-long delays. Our average will eventually converge, but excruciatingly slowly and unreliably. If $\alpha \leq 1$, even the mean is infinite, and our estimate will never converge at all. We can actually estimate $\alpha$ from our posterior samples of $1/p(y \mid \theta)$ using methods like the Hill estimator. If our estimated $\hat\alpha$ is less than or equal to 2, a loud alarm bell should go off. The Central Limit Theorem, the bedrock theorem that gives us our familiar bell-curve uncertainties, no longer applies.
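The Hill estimator itself is only a few lines. Below is a minimal sketch of our own; the user must choose $k$, the number of upper order statistics to use, and the data must be positive.

```python
import numpy as np

def hill_tail_index(x, k):
    """Hill estimator of the tail index alpha from the k largest values:

        alpha_hat = k / sum_{i=1}^{k} log( X_(n-i+1) / X_(n-k) )

    A small alpha_hat (<= 2) signals infinite variance."""
    x = np.sort(np.asarray(x, dtype=float))
    top = x[-k:]            # the k largest observations
    threshold = x[-k - 1]   # the (k+1)-th largest, used as the threshold
    return k / np.sum(np.log(top / threshold))
```

Applied to the reciprocal likelihoods $1/p(y \mid \theta^{(i)})$, an estimate $\hat\alpha \leq 2$ is the alarm bell described above.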
The story is not all doom and gloom. The failure of a simple tool inspires the invention of better ones.
The HME's flaw stems from averaging with respect to the posterior distribution. What if we averaged with respect to a different distribution? This is the idea behind importance sampling. It turns out that we can estimate the same marginal likelihood by drawing samples from a cleverly chosen "proposal" distribution and weighting them appropriately. If we design this proposal distribution to have lighter tails than the problematic HME integrand, we can construct an estimator for that has finite variance and is wonderfully stable, even in cases where the HME is guaranteed to fail. Other stable methods, like Chib's method or approximations like WBIC, also provide reliable alternatives that avoid this pathology.
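To make the contrast concrete, here is an importance-sampling sketch on a toy conjugate model where the evidence is known exactly; the model, the proposal, and all names are our illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(7)
y = 1.0  # one observation

def npdf(x, mu, var):
    """Normal density N(x; mu, var)."""
    return np.exp(-(x - mu) ** 2 / (2.0 * var)) / np.sqrt(2.0 * np.pi * var)

# Model: theta ~ N(0, 1) prior, y | theta ~ N(theta, 1) likelihood.
# The evidence is then available in closed form: p(y) = N(y; 0, 2).
z_exact = npdf(y, 0.0, 2.0)

# Importance sampling: draw from a proposal q whose tails are at least
# as heavy as the integrand's, then average the weights L * prior / q.
theta = rng.normal(0.5, 1.0, 100_000)  # proposal q = N(0.5, 1)
weights = npdf(y, theta, 1.0) * npdf(theta, 0.0, 1.0) / npdf(theta, 0.5, 1.0)
z_hat = weights.mean()
```

Because the proposal here is wider than the posterior, the weights are bounded, the estimator has finite variance, and `z_hat` sits close to `z_exact`, precisely the behavior the HME cannot guarantee.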
What if we are stuck with the HME framework? Can we make it more robust? Yes! Instead of calculating one giant, vulnerable average, we can split our samples into many small batches. We calculate the average within each batch, and then—this is the crucial step—we take the median of these batch averages. The median is famously robust to outliers. If a few of our batches are "contaminated" by an extreme value, the median simply ignores them. This "median-of-means" estimator provides a much more stable result by preventing a few rogue samples from hijacking the entire estimate. A similar idea is to use a "truncated" estimator, where we simply cap any extreme values before averaging, effectively giving them less power.
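A minimal sketch of the median-of-means idea (our own implementation; the batch count is a tuning choice):

```python
import numpy as np

def median_of_means(values, n_batches=10):
    """Robust mean estimate: split the data into batches, average each
    batch, then take the median of the batch means. A few contaminated
    batches cannot drag the median away from the typical value."""
    values = np.asarray(values, dtype=float)
    batches = np.array_split(values, n_batches)
    return float(np.median([b.mean() for b in batches]))
```

For example, one huge outlier among a hundred ordinary values ruins the plain mean but leaves the median-of-means untouched, since only one batch mean is contaminated.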
The story of the harmonic mean estimator is thus a perfect microcosm of the scientific process. We start with a simple, intuitive idea, discover its deep and dangerous flaws through analysis and analogy, connect these flaws to broader principles from other fields, and finally, driven by the failure, we invent better, more reliable tools. It reveals the beautiful, interconnected nature of statistical thinking and serves as a powerful reminder that in science, as in life, understanding the limitations of our tools is the first step toward true discovery.