
Improper Priors

Key Takeaways
  • The search for a universally "uninformative" prior is futile, as even seemingly neutral flat priors introduce strong, often arbitrary, assumptions.
  • Improper priors are mathematically invalid for Bayesian model comparison, as they result in arbitrary and meaningless marginal likelihoods and Bayes factors.
  • In models with non-identifiable parameters, a proper prior can act as a regularizer, making an otherwise unsolvable estimation problem solvable.
  • Thoughtfully chosen priors are essential tools for integrating existing scientific knowledge and physical constraints into a statistical model, leading to more robust and meaningful inferences.

Introduction

In the practice of science, we constantly update our understanding of the world by integrating new evidence with existing knowledge. Bayesian inference offers a formal framework for this process, structuring it as a conversation between our prior beliefs and the data we collect. However, a crucial and often misunderstood element of this framework is the 'prior' itself. The desire for objectivity can lead researchers to use so-called 'uninformative' or 'improper' priors, a choice fraught with hidden assumptions and significant statistical dangers. This article tackles this critical issue head-on. The first chapter, "Principles and Mechanisms," will dissect the mechanics of Bayesian inference, revealing why the quest for a truly uninformative prior is futile, how priors interact with model identifiability, and why improper priors render model comparison impossible. Following this theoretical foundation, the second chapter, "Applications and Interdisciplinary Connections," will demonstrate how these principles are applied in practice, showing how carefully constructed priors serve as essential tools for solving complex problems in fields from neuroscience to evolutionary biology.

Principles and Mechanisms

In our journey to understand the world, we often find ourselves in a conversation between our prior beliefs and the evidence we gather. Bayesian inference provides the formal language for this conversation, with Bayes' theorem as its grammatical core. The introduction has set the stage, showing us that this framework is a powerful tool for scientific reasoning. Now, we shall pull back the curtain and examine the machinery itself. What are these "priors" really? How do they work? And what happens when we handle them carelessly? We will discover that while they offer profound power, they also harbor subtle traps for the unwary.

The Treacherous Quest for "Uninformativeness"

When we first encounter Bayesian analysis, a noble instinct often takes hold: we want to be objective. We want to "let the data speak for themselves." This desire leads us to seek a so-called uninformative prior, a starting belief that is perfectly neutral and imposes no assumptions. The most obvious candidate for such a prior seems to be a flat line: assign equal probability to every possible value of our parameter. If we're estimating a parameter θ, we might simply say p(θ) ∝ 1.

But this seemingly simple idea is like trying to draw a perfect, distortion-free map of the spherical Earth onto a flat piece of paper. It's impossible. A map can preserve angles (like a Mercator projection) or areas, but not both. Similarly, a prior that is "flat" for a parameter θ is not flat for θ² or ln(θ). If we believe a reaction rate, λ, is equally likely to be between 1 and 2 as it is to be between 99 and 100, we are implicitly saying that a 100% increase is just as likely as a 1% increase. Our "uninformative" prior has smuggled in a very strong belief: it favors large absolute changes. A truly neutral stance on a scale parameter like a rate would treat a doubling from 1 to 2 the same as a doubling from 50 to 100. This corresponds to a flat prior on the logarithm of the rate, which is equivalent to a prior p(λ) ∝ 1/λ.
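This reparameterization trap is easy to verify with a few lines of arithmetic. The sketch below (a toy calculation; the interval endpoints are the ones from the text) compares the unnormalized mass each prior assigns to the intervals in question:

```python
from math import log, isclose

def flat_mass(a, b):
    """Unnormalized mass of a flat prior, p(lam) proportional to 1, on [a, b]."""
    return b - a

def log_flat_mass(a, b):
    """Unnormalized mass of the scale prior p(lam) proportional to 1/lam on [a, b].
    The integral of 1/lam from a to b is log(b/a), i.e. flat in log(lam)."""
    return log(b / a)

# Flat prior: a doubling (1 -> 2) and a 1% increase (99 -> 100) get equal mass.
assert flat_mass(1, 2) == flat_mass(99, 100)

# Log-flat prior: the two doublings (1 -> 2 and 50 -> 100) get equal mass,
# while the 1% interval (99 -> 100) gets far less.
assert isclose(log_flat_mass(1, 2), log_flat_mass(50, 100))
assert log_flat_mass(99, 100) < 0.05 * log_flat_mass(1, 2)
```

Which prior is "neutral" thus depends entirely on the scale you consider natural for the parameter.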

This brings us to a more dangerous beast: the improper prior. If our parameter can take any positive value (like a rate, which can't be negative but has no fixed upper limit), a flat prior stretches over an infinite domain. If you try to calculate the total probability, you integrate a constant from zero to infinity and get an infinite result. This "distribution" doesn't integrate to one, so it's not a true probability distribution at all. It is a mathematical abstraction, a sort of ghost of a distribution.

Even if we avoid the infinite by putting hard, finite bounds on our flat prior—say, assuming a substitution rate in a phylogenetic model must be between 0 and 100—we run into a different problem. The number 100 is completely arbitrary. As pointed out in a critique of a phylogenetic analysis, if we change our units of time from millions of years to years, all our rate parameters change by a factor of a million. A prior that was once Uniform(0, 100) might now need to be Uniform(0, 0.0001). A prior that depends so whimsically on our choice of units cannot be a fundamental representation of our knowledge, or lack thereof. The quest for a universally "uninformative" prior is a siren's song; a better goal is to choose priors that are transparent about their assumptions and robust to arbitrary choices like units of measurement.

When the Data Cannot Speak for Itself: Priors and Identifiability

Let's play a simple game. I tell you that I have two numbers, θ₁ and θ₂, and their sum is exactly 10. Now, tell me: what is the value of θ₁?

You can't. Is it 5? 9? -2.7? For any value of θ₁ you choose, you can find a corresponding θ₂ that makes the statement true. There are infinitely many solutions lying on the line θ₁ + θ₂ = 10. This is the essence of non-identifiability. It's a situation where the data—in this case, the sum being 10—are insufficient to pin down a unique value for the parameters.

This exact scenario occurs frequently in scientific modeling. In a simple engineering calibration, we might only be able to measure the sum of two components' contributions. In a biological model of gene expression, measuring only the final steady-state protein level tells us about the ratio of the synthesis rate (k_syn) to the degradation rate (k_deg), but it cannot disentangle the two individual rates. In these cases, the likelihood function—the part of Bayes' theorem that represents the voice of the data—doesn't have a single peak. Instead, it forms a long, narrow "ridge" of parameter combinations that are all equally compatible with what we've observed.

What happens when we try to do Bayesian inference here? This is where the nature of our prior becomes critical.

If we stubbornly cling to our "uninformative" ideal and use improper flat priors for both θ₁ and θ₂, we're in trouble. Bayes' theorem tells us to multiply the likelihood ridge by our flat prior. The result? A posterior distribution that is also a flat ridge extending to infinity. We haven't learned anything more about the individual parameters, and worse, our posterior is improper—it contains an infinite amount of probability! The calculation has failed to produce a valid answer.

But now, what if I add a new piece of information to our game? "By the way," I say, "I have good reason to believe that θ₂ is very close to 6." We can encode this belief as a proper, informative prior on θ₂—perhaps a sharp Gaussian distribution centered at 6. Suddenly, everything changes. The puzzle is solved. The likelihood tells us θ₁ + θ₂ ≈ 10, and our prior tells us θ₂ ≈ 6. The inescapable conclusion is that θ₁ must be approximately 4. By providing information about one parameter, the prior has allowed us to identify the other. The posterior distribution for θ₁ becomes a perfectly well-behaved, proper Gaussian centered at 4.
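A small grid calculation makes this concrete. The sketch below uses invented numbers: an observation telling us θ₁ + θ₂ ≈ 10 with noise σ = 0.5, and a proper Gaussian prior θ₂ ~ N(6, 0.5²), with the prior on θ₁ left flat within the grid:

```python
import numpy as np

sigma, tau = 0.5, 0.5                  # likelihood noise and prior width (illustrative)
grid = np.linspace(-10, 20, 601)
T1, T2 = np.meshgrid(grid, grid, indexing="ij")

# Likelihood: a ridge along theta1 + theta2 = 10
likelihood = np.exp(-(10 - T1 - T2) ** 2 / (2 * sigma**2))

# A proper, informative prior on theta2 alone
prior_t2 = np.exp(-(T2 - 6) ** 2 / (2 * tau**2))

posterior = likelihood * prior_t2
marginal_t1 = posterior.sum(axis=1)    # marginalize out theta2 numerically

mean_t1 = (grid * marginal_t1).sum() / marginal_t1.sum()
# Analytically, theta1 | data ~ N(10 - 6, sigma^2 + tau^2) = N(4, 0.5)
assert abs(mean_t1 - 4.0) < 0.05
```

With flat priors on both parameters, the same grid would show the ridge running off the edges unchanged; the single proper prior is what collapses it to a peak at θ₁ ≈ 4.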

This is a profound result. The prior is not just some philosophical mumbo-jumbo; it is a mathematical tool that can make an otherwise unsolvable problem solvable. It acts as a regularizer, taming the wild uncertainty that arises from non-identifiability. When a model's parameters are weakly identified by the data, a proper prior ensures a well-behaved posterior. The underlying uncertainty doesn't magically vanish; it is revealed in the posterior's shape. For instance, the posterior might still be an elongated ridge, but a proper prior ensures the ridge fades away at the extremes, containing a finite, interpretable amount of probability. In more complex scenarios, like modeling gene family evolution where birth and death rates nearly cancel out, the most effective strategy is often to re-parameterize the model itself, aligning the new parameters with the identifiable (net change) and weakly identifiable (total turnover) directions suggested by the data.

Comparing Worlds: Why Improper Priors Wreck Model Selection

So far, we have been estimating parameters within a single, assumed model of the world. But science often involves a grander task: comparing entirely different models, or different "worlds." Is a simple lock-and-key model sufficient to explain this enzyme's binding, or do we need a more complex induced-fit model? Is the rate of evolution constant across a gene, or does it vary from site to site?

To answer such questions, Bayesians compute a quantity called the marginal likelihood, also known as the evidence for the model. Its definition is simple but its meaning is deep:

p(Data | Model) = ∫ p(Data | θ, Model) p(θ | Model) dθ

In plain English, the marginal likelihood is the predictive performance of the model. It's the probability of having seen our actual data, averaged over every possible parameter setting the model could have, with each setting weighted by its prior probability.

This integral performs a beautiful and automatic version of Occam's razor. A simple model with few parameters has a small parameter space. Its prior probability is concentrated. If it fits the data reasonably well, its average score (the marginal likelihood) will be respectable. A complex model, however, has a vast parameter space. To be a valid probability distribution, its prior must be spread thinly over this huge volume. For this complex model to get a high score, it needs to not only fit the data well, but fit it exceptionally well in a small region of its parameter space to overcome the low average score from all the other parameter settings that don't fit well. Complexity is automatically penalized.
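A toy comparison makes the Occam penalty visible. In the sketch below (all numbers invented), one observation y ~ N(θ, 1) is scored under two models that differ only in how widely their uniform prior on θ is spread; the marginal-likelihood integral has a closed form in this case:

```python
from math import erf, sqrt

def Phi(z):
    """Standard normal CDF."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def evidence(y, a, b):
    """Marginal likelihood of one observation y ~ N(theta, 1) under a
    Uniform(a, b) prior on theta: (Phi(b - y) - Phi(a - y)) / (b - a)."""
    return (Phi(b - y) - Phi(a - y)) / (b - a)

y = 0.5
simple_model  = evidence(y, -1, 1)       # concentrated prior
complex_model = evidence(y, -100, 100)   # prior spread thinly over a vast space

# Both models can fit y, but the complex one pays for its spread-out prior:
assert simple_model > 50 * complex_model
```

Both priors contain the region where the data fit well; the wide one is penalized purely for wagering prior mass on parameter settings that predict the data badly.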

Now for the knockout punch. What happens if the prior, p(θ | Model), is improper? The integral for the marginal likelihood involves multiplying by a function that doesn't have a finite total area. The result is either infinite or, worse, depends on a completely arbitrary constant. If you try to compare two models, M₁ and M₂, by calculating the ratio of their marginal likelihoods (the Bayes factor), you end up with a ratio of two arbitrary constants. The answer is meaningless. It's like asking which of two infinite rooms is larger. There is no sensible answer.
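The arbitrariness shows up directly if we try to "approximate" an improper flat prior by truncating it at a bound L (a toy calculation, same invented setup as before). The evidence then tracks the arbitrary choice of L, and any Bayes factor built from it inherits that arbitrariness:

```python
from math import erf, sqrt

def Phi(z):
    """Standard normal CDF."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def evidence_flat(y, L):
    """Evidence for one observation y ~ N(theta, 1) under a flat prior on theta
    truncated to [-L, L], a stand-in for an improper flat prior."""
    return (Phi(L - y) - Phi(-L - y)) / (2 * L)

y = 0.5
e10, e1000 = evidence_flat(y, 10), evidence_flat(y, 1000)

# The "evidence" shrinks roughly like 1/L as the bound grows, so the answer
# depends on a number we chose arbitrarily:
assert e10 / e1000 > 90
```

Letting L go to infinity does not rescue the calculation; it just drives the evidence to zero at a rate set by the arbitrary normalization.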

This is not a minor technicality; it is a catastrophic failure. It means that improper priors cannot be used for Bayesian model comparison. Full stop.

This has crucial implications for widely used tools. Many scientists use criteria like the Bayesian Information Criterion (BIC) to compare models, often believing it to be a prior-free shortcut. This is a dangerous illusion. BIC is, in fact, an approximation to the log marginal likelihood, and its derivation is only valid under specific assumptions, including that the priors are proper and their properties don't depend on the sample size. When you use improper priors, the theoretical link between BIC and Bayesian model choice is severed.

The lesson is clear. When moving from estimating parameters to comparing models, priors transform from a helpful regularizer into a non-negotiable, central component of the calculation. If we want to ask which of several competing theories is better supported by the data, we must commit to proper priors. Moreover, for the comparison to be fair, these priors must be chosen thoughtfully. The most principled approaches involve designing priors—often through hierarchical structures—that ensure different models make comparable predictions about observable data before seeing the evidence, thus placing them on a level playing field for a fair contest. The prior is not a nuisance to be swept under the rug, but an honest and explicit statement of the assumptions that frame our scientific questions.

Applications and Interdisciplinary Connections

We have spent some time learning the formal rules of Bayesian inference—the grammar of priors, likelihoods, and posteriors. But learning grammar is not the same as appreciating poetry. The real magic happens when we see these tools in action, not as abstract mathematical formulas, but as a way of thinking that helps scientists solve deep and fascinating puzzles across the universe of knowledge.

In this chapter, we will embark on a journey through different scientific disciplines. We will see that a 'prior' is not merely a subjective starting guess; it is a powerful device for encoding physical constraints, existing scientific knowledge, and a fundamental sense of "reasonableness" into our models. This becomes especially crucial when our data, by themselves, are ambiguous—when they whisper several different stories at once.

The Scientist's Dilemma: Tangled Parameters

Imagine you are a detective investigating a crime. You have a set of clues—the data—but they seem to point to two different suspects with equal plausibility. The clues are ambiguous. In science, this common predicament is known as a problem of identifiability. It means the data are not sufficient to tell the difference between one potential explanation (one set of parameter values) and another. Many different combinations of our model's parameters can produce the exact same observable outcome, leaving us stuck in a thicket of possibilities. As we will now see, this theme of "tangled parameters" appears again and again, in nearly every corner of science.

Unraveling the Secrets of the Synapse

Let’s begin our journey inside the brain. Communication between two neurons occurs at a specialized junction called a synapse. This communication happens in discrete packets, or "quanta," of neurotransmitters. When we observe that a synapse has become stronger, we face a classic puzzle: Is it because the neuron now has more potential release sites (n), or is it because the probability of release (p) at each existing site has increased?

With a limited amount of noisy experimental data, these two possibilities—a change in n or a change in p—are notoriously tangled. A naive statistical analysis might not only fail to distinguish between them but could even lead to nonsensical conclusions, such as a negative electrical response from a single packet of neurotransmitters, which is physically impossible.

Here, Bayesian priors act as our voice of reason. We can build a model that respects the fundamental biology of the system. We can instruct our model, "The response to a single packet, the quantal size q, must be positive." We can tell it, "The number of release sites n must be a positive integer." And we can gently nudge the release probability p away from the absurd extremes of being exactly zero or exactly one, which are unlikely in a dynamic biological system. This isn't cheating; it's embedding fundamental knowledge into our statistical machinery. By doing so, we can make sensible inferences about n and p where we otherwise could not.
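The tangling is easy to see in the classic binomial quantal model, where the mean evoked response is n·p·q. The sketch below (illustrative parameter values, not real recordings) shows two quite different synapses that are nearly indistinguishable from noisy data:

```python
from math import isclose

def mean_response(n, p, q):
    """Mean evoked response under the binomial quantal model:
    n release sites, release probability p, quantal size q."""
    return n * p * q

def var_response(n, p, q):
    """Variance of the evoked response under the same model."""
    return n * p * (1 - p) * q**2

# Two very different synapses predict exactly the same mean response...
assert isclose(mean_response(10, 0.3, 1.0), mean_response(6, 0.5, 1.0))

# ...and their variances (2.1 vs 1.5) overlap easily once recording noise
# is added, so n and p stay tangled without extra information.
assert abs(var_response(10, 0.3, 1.0) - var_response(6, 0.5, 1.0)) < 1.0
```

Priors that pin q to positive values and keep p away from 0 and 1 shrink the set of parameter combinations the data must adjudicate between.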

Measuring the Breath of a Stream

Let's now leave the brain and visit a forest stream. An ecologist wants to measure the stream's "metabolism"—how much oxygen is produced by photosynthesis during the day (Gross Primary Production, or GPP) and how much is consumed by all the organisms living in it (Ecosystem Respiration, or R). The strategy is to measure the dissolved oxygen concentration over a full 24-hour cycle.

But what if it’s a dark, overcast day, and the stream is in a deeply shaded part of the forest? The light level, I(t), barely changes throughout the day. Photosynthesis is driven by light, so we can model its rate as P(t) = αI(t), where α is a light-use efficiency. Respiration, R, is assumed to be roughly constant over the day. If I(t) is nearly constant at some mean level Ī, then the oxygen data can really only tell us about the net effect, the combination αĪ − R. The model cannot untangle the contribution of photosynthesis from that of respiration. Any estimated increase in photosynthetic efficiency (α) can be almost perfectly canceled out by an equivalent increase in the estimate for respiration (R). Our parameters are, once again, tangled.
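The confounding can be verified in a toy version of the model (gas exchange omitted, light held perfectly constant, all numbers invented): two different physiologies with the same net rate αĪ − R produce identical oxygen trajectories.

```python
import numpy as np

I_bar = 100.0                      # near-constant light level (invented units)
t = np.linspace(0, 24, 97)         # hours through one day

def oxygen(alpha, R, O2_start=8.0):
    """Dissolved-O2 trajectory under dO2/dt = alpha*I(t) - R with I(t) = I_bar
    held constant (gas exchange omitted for this sketch)."""
    return O2_start + (alpha * I_bar - R) * t

traj_a = oxygen(alpha=0.010, R=0.5)   # net rate: 1.0 - 0.5 = 0.5
traj_b = oxygen(alpha=0.020, R=1.5)   # net rate: 2.0 - 1.5 = 0.5
assert np.allclose(traj_a, traj_b)    # identical data from different physiology
```

Only outside information, such as a temperature-based prior on R, can separate trajectories that the data render identical.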

The solution is to bring in outside scientific knowledge through priors. We know from basic biochemistry that respiration rates are temperature-dependent. We can encode this relationship into the prior for R. We also know from a vast body of literature that the photosynthetic efficiency α for aquatic plants falls within a plausible range. By incorporating this hard-won ecological wisdom into our priors, we provide the extra information needed to break the statistical deadlock and separately measure the stream's inhalation and exhalation.

The Ghost in the Genes: Heritability and Evolution

The problem of tangled parameters haunts the fields of genetics and evolutionary biology with particular vigor. Consider a trait that is either present or absent, like survival from a particular disease. Quantitative geneticists often imagine an underlying, unobservable continuous trait called "liability." If an individual's liability crosses a certain threshold, they show the trait. The total variation in this liability comes from genes (the additive genetic variance, V_A) and from environmental and other non-genetic factors (the residual variance, V_R).

From observing only the binary outcome (e.g., survived or died), we can never determine the absolute values of V_A and V_R. If we double both V_A and V_R (rescaling the whole latent scale, thresholds included), the underlying probability of crossing the threshold doesn't change. The overall scale of the latent liability is fundamentally unidentifiable from the data alone. The standard solution in many statistical packages is to simply fix the scale by setting one of the variances to a constant, for instance, by assuming V_R = 1. This is, in essence, an extremely strong and rigid prior! A more explicit Bayesian approach can handle this more gracefully. For example, we can re-parameterize the model and place a prior directly on the quantity we actually care about and which is identifiable: the heritability on the liability scale, h² = V_A / (V_A + V_R).
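The scale invariance is easy to check numerically. In the sketch below (invented numbers), multiplying the variances by a constant c, and the latent mean by √c to rescale the whole latent scale, leaves both the observable probability and the heritability untouched:

```python
from math import erf, sqrt, isclose

def Phi(z):
    """Standard normal CDF."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def h2(VA, VR):
    """Heritability on the liability scale."""
    return VA / (VA + VR)

def p_affected(mu, VA, VR, threshold=0.0):
    """P(liability > threshold) with liability ~ N(mu, VA + VR)."""
    return 1.0 - Phi((threshold - mu) / sqrt(VA + VR))

# Rescale the entire latent scale by c (variances by c, means by sqrt(c)):
c = 7.0
p_orig   = p_affected(mu=1.0, VA=0.4, VR=0.6)
p_scaled = p_affected(mu=sqrt(c) * 1.0, VA=c * 0.4, VR=c * 0.6)

# The observable probability is unchanged...
assert isclose(p_orig, p_scaled)
# ...and so is the heritability, which is why h2 is the identifiable quantity:
assert isclose(h2(0.4, 0.6), h2(c * 0.4, c * 0.6))
```

Placing the prior directly on h² means the prior speaks only about the quantity the data can actually inform.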

This style of thinking—building models with hierarchical levels of parameters—is immensely powerful when studying evolution in action. Imagine an experiment with multiple, replicate lines of plants all undergoing artificial selection for the same trait. We want to measure the realized heritability (h²) by observing the response to selection. However, in any population of finite size, gene frequencies jiggle randomly from one generation to the next due to a process called genetic drift. This drift creates random, line-specific deviations in the response to selection. A simple model that ignores this effect and pools all the data together will be misled.

A Bayesian hierarchical model, however, provides a beautiful solution. It can treat each replicate line as a variation on a common theme. It is designed to simultaneously estimate the global heritability h² that is common to all lines, while also estimating the magnitude of the random "drift noise" (σ_b²) that makes each line unique. The model "borrows strength" across all the lines to get a more robust estimate of the big picture, a prime example of how structured priors allow us to dissect complex processes with multiple sources of variation.
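A minimal sketch of the "borrowing strength" idea, using empirical-Bayes shrinkage as a simple stand-in for the full hierarchical model (all numbers invented): each noisy per-line estimate is pulled toward the grand mean in proportion to how much of its spread looks like sampling noise.

```python
import numpy as np

rng = np.random.default_rng(2)
n_lines, n_obs, noise_sd = 8, 5, 0.1

# Noisy per-line estimates of realized heritability (invented simulation):
true_h2 = 0.4 + rng.normal(0, 0.05, n_lines)           # line-level "drift" deviations
obs = true_h2[:, None] + rng.normal(0, noise_sd, (n_lines, n_obs))

line_means = obs.mean(axis=1)
grand_mean = line_means.mean()
se2 = noise_sd**2 / n_obs                              # sampling variance of a line mean

# Empirical-Bayes partial pooling: shrink each line mean toward the grand mean.
between_var = max(line_means.var(ddof=1) - se2, 1e-9)
w = between_var / (between_var + se2)                  # 0 = full pooling, 1 = no pooling
shrunk = grand_mean + w * (line_means - grand_mean)

# Every shrunken estimate lies between its raw line mean and the grand mean:
assert np.all(np.abs(shrunk - grand_mean) <= np.abs(line_means - grand_mean) + 1e-12)
assert 0.0 <= w <= 1.0
```

A full hierarchical model does this pooling and the estimation of σ_b² jointly, with the posterior carrying the uncertainty in both.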

Clocks, Trees, and the Deep Past

The challenge of estimating when different species diverged millions of years ago is another area rife with identifiability problems. The number of genetic differences observed between two species depends on the product of their divergence time (T) and the rate of mutation (r). Without some external information, it is impossible to separate rate from time. A faster rate over a shorter time looks identical to a slower rate over a longer time.

A powerful solution is to use a hierarchical model across many different genes. We may not know the specific mutation rate r_ℓ for any single gene ℓ, but we can reasonably assume that all these gene-specific rates are drawn from some common distribution. If we can anchor this distribution—for instance, by assuming its average rate μ₀ is known from other calibrations—we can break the confounding and estimate the divergence time T. These hierarchical models also have a wonderful stabilizing property known as "shrinkage" or "partial pooling." Estimates for the rates of individual genes are gently pulled toward the overall average rate. This prevents a single gene with a bizarrely high or low number of mutations from throwing off our entire evolutionary timeline.
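The rate-time confounding can be stated in one line of code. In a deliberately minimal Poisson sketch (invented numbers), the expected substitution count is sites × r × T, so only the product r·T ever enters the likelihood:

```python
import math

def loglike(k, rate, time, sites=1000):
    """Poisson log-likelihood for k substitutions across `sites` sites;
    the expected count is sites * rate * time (toy model, invented numbers)."""
    lam = sites * rate * time
    return k * math.log(lam) - lam - math.lgamma(k + 1)

k = 50
# A fast clock over a short time is indistinguishable from a slow clock
# over a long time, because only rate * time enters the likelihood:
assert math.isclose(loglike(k, rate=1e-3, time=50.0),
                    loglike(k, rate=5e-4, time=100.0))
```

Nothing in the data can break this tie; only an external anchor on the rate distribution (or a fossil calibration on time) can.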

This framework allows us to ask even more sophisticated questions. For instance, do life history traits like body mass affect the rate of molecular evolution? We can build a model where the evolutionary rate on each branch of the tree of life depends on the inferred body mass of the ancestor that lived along that branch. But once again, we must be careful. We first have to anchor the system in absolute time, either by using the fossil record or by analyzing "heterochronous" data where DNA sequences have been sampled at different points in time (such as with ancient DNA or fast-evolving viruses). We also have to be honest about our uncertainty. If our body mass data are noisy and we ignore this measurement error, we will systematically underestimate the true effect of body mass on evolutionary rates—a classic statistical pitfall known as attenuation bias. Priors and hierarchical models give us the tools to confront all of these thorny issues.

Taming the Overfitting Beast with Regularization

Sometimes the problem is not that our model is too simple, but that it is too flexible. When trying to reconstruct the history of biodiversity—the rate of speciation minus the rate of extinction—over geological time, we might allow the rate to change freely over many small time intervals. With so much freedom, the model can start to "overfit" the data; that is, it begins to trace the random, noisy jiggles of our single reconstructed phylogenetic tree rather than capturing the true, smooth underlying historical trend.

Priors come to the rescue as a form of regularization. We can design priors that implement an Occam's razor, guiding the inference toward a simpler, more plausible explanation. For instance, we can use a prior that penalizes models with too many abrupt shifts in the rate of diversification. Or we can use a prior that encourages smoothness by penalizing large jumps in the rate between adjacent time intervals. Whether it's a Laplace prior that encourages sparsity or a Gaussian Markov Random Field that encourages smoothness, the principle is the same: priors allow us to build more robust models that capture the signal without getting lost in the noise.
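A minimal sketch of the smoothness-prior idea (invented data, not a real diversification analysis): under a Gaussian random-walk (GMRF-style) prior on adjacent time bins, the MAP estimate minimizes a squared-error fit plus a penalty on jumps, which reduces to a single linear solve.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
truth = np.where(np.arange(n) < n // 2, 1.0, 3.0)   # one genuine rate shift
y = truth + rng.normal(0, 0.8, n)                   # noisy per-interval estimates

# MAP estimate under a Gaussian random-walk prior on adjacent bins:
# minimize ||y - x||^2 + lam * sum_i (x[i+1] - x[i])^2, a single linear solve.
D = np.diff(np.eye(n), axis=0)                      # first-difference operator
lam = 25.0
x_map = np.linalg.solve(np.eye(n) + lam * D.T @ D, y)

# The regularized curve jumps far less between adjacent bins than the raw data,
# while still tracking the one real shift:
assert np.abs(np.diff(x_map)).mean() < 0.3 * np.abs(np.diff(y)).mean()
```

Swapping the squared penalty for an absolute-value (Laplace) penalty would instead favor a few sharp shifts over many small ones, the sparsity-encouraging alternative mentioned above.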

A Universal Calculus for Uncertainty

From the microscopic firing of a neuron to the grand sweep of evolutionary history, the challenges of ambiguity, confounding, and overfitting are universal in science. We have seen how carefully chosen priors—often in the form of a hierarchical model—provide a powerful way to solve these problems by integrating external knowledge and imposing reasonable constraints.

But even without strong prior knowledge, the Bayesian framework provides something invaluable: a coherent calculus for uncertainty. Imagine a systems biologist integrating data from two different experiments: RNA-sequencing to measure messenger RNA (mRNA) levels, and mass spectrometry to measure protein levels. Each technology has its own characteristic sources of noise and variability. The goal is to determine if the protein-to-mRNA ratio is changing between two experimental conditions. This is a question about a derived quantity that depends on both measurements. The Bayesian machinery allows us to build a single model that incorporates all the pieces: the true (but unknown) mRNA levels, the true protein levels, the ratio connecting them, and the known measurement error from each experiment. The framework then automatically and correctly propagates all sources of uncertainty through the calculation. The final posterior distribution gives us a complete and honest picture of what we know and, just as importantly, how well we know it.
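The propagation step can be sketched with simple Monte Carlo (all numbers invented; the Gaussian draws stand in for posterior samples from a full model of each platform's measurement error):

```python
import numpy as np

rng = np.random.default_rng(1)
draws = 100_000

# Posterior-style draws for each measured quantity (invented numbers):
mrna_A    = rng.normal(100, 5, draws)
protein_A = rng.normal(500, 25, draws)
mrna_B    = rng.normal(100, 5, draws)
protein_B = rng.normal(800, 40, draws)

# Derived quantity: fold change in the protein-to-mRNA ratio between conditions
fold = (protein_B / mrna_B) / (protein_A / mrna_A)

lo, hi = np.percentile(fold, [2.5, 97.5])   # uncertainty propagated automatically
assert lo > 1.0   # the ratio has credibly increased in this toy example
```

Every source of noise fed into the draws shows up, correctly combined, in the width of the interval on the derived quantity; no error-propagation formulas need to be derived by hand.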

A prior is not a crutch or a fudge factor. It is a tool for reasoning, a way to weave together disparate strands of knowledge into a single tapestry of inference, and a leash to keep our models honest and focused on the signal. It is a vital part of the modern scientist's toolkit for navigating the complex and uncertain world we all seek to understand.