
In Bayesian statistics, our ability to update beliefs in light of new evidence hinges on a crucial starting point: the prior distribution. While subjective priors are powerful when expert knowledge is available, a fundamental question arises when it is not: how do we choose a prior that expresses genuine ignorance and lets the data speak for itself? The quest for such an "objective" or "non-informative" prior is fraught with paradoxes, as seemingly simple assumptions can hide unintended biases. This article addresses this challenge by exploring a principled solution.
The first chapter, "Principles and Mechanisms," delves into the theoretical foundation of the Jeffreys prior, explaining the problem of reparameterization invariance and how Sir Harold Jeffreys used Fisher information to solve it. We will see how this elegant principle provides a consistent rule for generating priors. Following this, the "Applications and Interdisciplinary Connections" chapter demonstrates the prior's practical utility across diverse fields, from fundamental physics and clinical medicine to engineering and astrophysics, revealing it as a unifying concept in scientific reasoning.
In our journey into the world of Bayesian reasoning, we’ve arrived at a critical juncture. We have Bayes' theorem, a magnificent engine for updating our beliefs. But to start this engine, we need fuel: a prior distribution. This prior represents our state of knowledge—or ignorance—before we see any data. If we have strong, expert knowledge, we can encode it into a "subjective" prior, like a seasoned physician who has a well-founded hunch about a new drug's efficacy. But what if we don't? What if we want to approach a problem with as few preconceptions as possible, to let the data speak for itself?
This is the quest for an objective, or non-informative, prior. The name is a bit of a misnomer; any prior contains some information. A better term might be a "reference prior"—a standard, default starting point. The most obvious-seeming choice is to simply assume all possibilities are equally likely. If we don't know a parameter's value, we might assign a flat, uniform probability to everything. But as we'll see, this seemingly simple idea is a slippery eel.
Imagine we are trying to determine a physical property of a material, but we're not sure whether to describe it by its resistance, $R$, or its conductance, $G = 1/R$. We are completely ignorant about both. A natural first step might be to assign a uniform prior to the resistance, say over some large range: $p(R) = \text{const}$ for $0 < R \le R_{\max}$. This flat line seems to say, "I have no preference for any particular value of $R$."
But what does this imply about our belief in the conductance, $G$? The laws of probability tell us how to change variables: a probability density for $G$ must be related to the density for $R$ by $p_G(G) = p_R(R)\left|\frac{dR}{dG}\right|$. Since $R = 1/G$, the derivative is $\frac{dR}{dG} = -1/G^2$. So, our prior for conductance becomes $p_G(G) \propto 1/G^2$.
Look at what happened! Our state of "total ignorance" about resistance, $p(R) = \text{const}$, has magically transformed into a very specific, highly informed belief about conductance, $p(G) \propto 1/G^2$. This new prior is far from flat; it says that very small values of conductance are vastly more likely than large ones. Our supposed objectivity was an illusion, entirely dependent on the arbitrary label—resistance or conductance—we chose for our parameter.
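This change-of-variables trap is easy to demonstrate numerically. The sketch below draws resistance samples from a flat prior (the range 1–100 Ω is an arbitrary illustration) and inspects the implied distribution of conductance:

```python
import numpy as np

rng = np.random.default_rng(0)

# "Total ignorance" about resistance: R uniform on (1, 100) ohms.
R = rng.uniform(1.0, 100.0, size=1_000_000)
G = 1.0 / R  # implied conductance samples, G in (0.01, 1.0) siemens

# If our ignorance about G were genuine, these two G-intervals would not
# differ by a factor of ~50.  But the implied density on G is ~ 1/G^2,
# piling almost all of the mass onto small conductances.
low = np.mean((G > 0.01) & (G < 0.02))   # narrow sliver near G = 0
high = np.mean((G > 0.5) & (G < 1.0))    # much wider interval near G = 1
print(low, high)  # ~0.505 vs ~0.010
```

Half of the samples land in a conductance interval only 0.01 siemens wide, while the entire upper half of the conductance range receives about one percent of the mass.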
This is a deep and troubling paradox. If our expression of ignorance depends on the language we use to describe the problem, then it is not true ignorance at all. We need a more principled approach, a method for defining a prior that gives consistent, objective results no matter how we parameterize the problem. We need a prior that is reparameterization invariant.
The solution to this puzzle came from the brilliant mind of Sir Harold Jeffreys, who built upon the work of another giant, Sir Ronald Fisher. Jeffreys's idea was to stop thinking about the parameter's space as a simple, flat line and to start thinking about its geometry. He realized that the statistical model itself—the likelihood function—defines a kind of "landscape" for the parameter. Some regions of this landscape are steep and rugged, while others are flat and gentle.
This landscape's curvature is measured by a quantity called Fisher Information, denoted $I(\theta)$. Intuitively, Fisher information tells you how much information a single piece of data is expected to provide about the unknown parameter $\theta$.
Think of it this way: imagine you are trying to find the value of a parameter $\theta$ by looking at the log-likelihood function, which peaks at the most likely value. If the peak is incredibly sharp and narrow, like a needle, then even a tiny bit of data lets you pinpoint the parameter with great precision. The information is high. If the peak is broad and rounded, like a gentle hill, the data is less helpful; a wide range of parameter values are almost equally plausible. The information is low. Mathematically, the Fisher Information is the negative expected value of the second derivative of the log-likelihood, $I(\theta) = -\mathbb{E}\left[\frac{\partial^2}{\partial\theta^2}\log p(x \mid \theta)\right]$: a measure of its expected curvature.
This quantity is our ruler. It's a property of the model itself, telling us how sensitive our inferences are to changes in the parameter's value at different locations in its space.
Here is Jeffreys's masterstroke. He proposed a prior distribution that is proportional to the square root of the Fisher information:

$$\pi(\theta) \propto \sqrt{I(\theta)}$$
Why this specific form? Because it possesses the magic property of invariance we've been searching for. When you change parameters, say from $\theta$ to a new parameter $\phi$, the Fisher information transforms according to a precise rule: $I(\phi) = I(\theta)\left(\frac{d\theta}{d\phi}\right)^2$. Now, if we take the square root of both sides, we get $\sqrt{I(\phi)} = \sqrt{I(\theta)}\left|\frac{d\theta}{d\phi}\right|$.
This is exactly the same way a probability density function transforms! So, by defining our prior as being proportional to $\sqrt{I(\theta)}$, we ensure that the rule itself remains unchanged no matter which parameterization we use. A Jeffreys prior for resistance, when transformed, yields the Jeffreys prior for conductance. The paradox is resolved.
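This invariance can be verified symbolically. As a sketch (using sympy, and taking the exponential distribution as a concrete model), we derive the Jeffreys prior for the rate $\lambda$, push it through the change of variables to the mean $\theta = 1/\lambda$, and check that deriving the prior directly in $\theta$ gives the same result:

```python
import sympy as sp

x = sp.Symbol('x', positive=True)
lam = sp.Symbol('lambda', positive=True)
theta = sp.Symbol('theta', positive=True)

# Exponential model, rate parameterization: p(x|lambda) = lambda*exp(-lambda*x)
loglik = sp.log(lam) - lam * x
I_lam = sp.integrate(-sp.diff(loglik, lam, 2) * lam * sp.exp(-lam * x),
                     (x, 0, sp.oo))          # Fisher information: 1/lambda^2
jeffreys_lam = sp.sqrt(I_lam)                # Jeffreys prior: 1/lambda

# Transform that prior to theta = 1/lambda via the Jacobian |d lambda/d theta|
prior_theta = sp.simplify(jeffreys_lam.subs(lam, 1 / theta)
                          * sp.Abs(sp.diff(1 / theta, theta)))

# Derive the Jeffreys prior directly in the mean parameterization
loglik_theta = -sp.log(theta) - x / theta
I_theta = sp.integrate(-sp.diff(loglik_theta, theta, 2)
                       * (1 / theta) * sp.exp(-x / theta), (x, 0, sp.oo))
print(prior_theta, sp.sqrt(sp.simplify(I_theta)))  # both 1/theta
```

The transformed prior and the directly derived prior coincide, exactly as the square-root rule promises.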
The intuition is beautiful. The Jeffreys prior is uniform not over raw parameter values but over statistically distinguishable models. Where the Fisher information is high, nearby parameter values produce noticeably different data, so that stretch of parameter space contains many distinguishable hypotheses and receives correspondingly more prior mass; where the information is low, nearby values are nearly indistinguishable and the prior spreads itself thin. It is a humble prior: it allocates belief according to what the data could actually resolve, ensuring our posterior is driven by the evidence, not by our choice of labels.
Let's see this principle in action. The functional form of the Jeffreys prior depends entirely on the statistical model.
The Coin Flip (Bernoulli Parameter $p$): What is our prior belief about the probability $p$ of a coin coming up heads? Calculating the Fisher information for a Bernoulli trial gives $I(p) = \frac{1}{p(1-p)}$. The Jeffreys prior is therefore:

$$\pi(p) \propto \frac{1}{\sqrt{p(1-p)}} = p^{-1/2}(1-p)^{-1/2}$$
This is a specific type of Beta distribution, the $\text{Beta}(1/2, 1/2)$. What does it look like? It's a U-shaped curve, piling up probability near $p = 0$ and $p = 1$. This might seem strange—why favor unfair coins? The reason is subtle. It's much harder to distinguish a coin with $p = 0.50$ from one with $p = 0.51$ than it is to distinguish a coin with $p = 0.01$ from one with $p = 0.02$, whose heads arrive twice as often. The data resolves the parameter more finely near the extremes, so the prior places more of its mass there. For a roll of a $k$-sided die, this beautifully generalizes to the Dirichlet distribution with all parameters equal to $1/2$, where the prior is proportional to $\prod_{i=1}^{k} p_i^{-1/2}$.
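A two-line check with scipy shows the U-shape and confirms that, despite its unbounded density at the edges, the prior is proper:

```python
from scipy.stats import beta

# The Jeffreys prior for a Bernoulli probability is Beta(1/2, 1/2).
jeffreys = beta(0.5, 0.5)

# U-shaped: dense near the extremes, thinnest at p = 1/2.
print(jeffreys.pdf(0.01), jeffreys.pdf(0.5), jeffreys.pdf(0.99))

# Proper and symmetric: exactly half the mass lies below p = 1/2.
print(jeffreys.cdf(0.5))  # 0.5
```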
Location and Scale: A profound distinction emerges when we consider different types of parameters.
Location Parameters: These parameters, like the mean $\mu$ of a normal distribution or the location of a Gumbel distribution, specify a position along the number line. For many such parameters, the Fisher information turns out to be constant—the data is equally informative regardless of where the distribution is centered. This leads to a Jeffreys prior $\pi(\mu) \propto 1$. This is the flat prior we first guessed, but now it stands on a solid theoretical foundation. It expresses ignorance about location. Such a prior is often improper, meaning it doesn't integrate to a finite number over its infinite domain $(-\infty, \infty)$. This isn't a flaw; it's a feature, correctly capturing complete uncertainty over an unbounded range.
Scale Parameters: These parameters describe magnitude or scale, like the failure rate of a laser diode or the standard deviation of a population. Ignorance about scale is different from ignorance about location. A change from a scale of 1 to 2 feels like the same "proportional" jump as a change from 100 to 200. This suggests we should be uniform on a logarithmic scale. The Jeffreys prior naturally discovers this. For the rate parameter $\lambda$ of an exponential distribution, $I(\lambda) = 1/\lambda^2$, so the prior is $\pi(\lambda) \propto 1/\lambda$. Similarly, for a Poisson rate parameter $\lambda$, the prior is $\pi(\lambda) \propto 1/\sqrt{\lambda}$. For a pure scale parameter like $\sigma$, the prior is generally $\pi(\sigma) \propto 1/\sigma$. This is also called a "log-uniform" prior, and it is the mathematical expression of scale invariance.
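The "uniform on a logarithmic scale" claim is easy to check numerically: under $\pi(\sigma) \propto 1/\sigma$, any two intervals with the same ratio of endpoints carry the same prior mass. A quick sketch:

```python
import numpy as np
from scipy.integrate import quad

# Unnormalized Jeffreys prior for a pure scale parameter: pi(sigma) = 1/sigma
prior = lambda s: 1.0 / s

# The "octave" (1, 2) and the "octave" (100, 200) carry identical mass,
# because the integral of 1/sigma over (a, 2a) equals ln 2 for every a > 0.
mass_1_2, _ = quad(prior, 1.0, 2.0)
mass_100_200, _ = quad(prior, 100.0, 200.0)
print(mass_1_2, mass_100_200, np.log(2.0))  # all ~0.6931
```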
What happens when we are ignorant about more than one parameter at once, like both the mean and the standard deviation of a Normal distribution? The concept extends, but with fascinating new subtleties. We now have a Fisher Information Matrix, and the Jeffreys prior is proportional to the square root of its determinant: $\pi(\boldsymbol{\theta}) \propto \sqrt{\det I(\boldsymbol{\theta})}$.
For a Normal distribution $\mathcal{N}(\mu, \sigma^2)$, a direct calculation reveals that the joint Jeffreys prior is:

$$\pi(\mu, \sigma) \propto \frac{1}{\sigma^2}$$
What's remarkable is that for the very different, heavy-tailed Cauchy distribution, the joint Jeffreys prior for its location and scale parameters also turns out to be $\pi(\mu, \sigma) \propto 1/\sigma^2$. This points to a deeper structure shared by location-scale families.
But here lies a final, important lesson. If we had derived the Jeffreys priors for $\mu$ (assuming $\sigma$ was known) and for $\sigma$ (assuming $\mu$ was known) separately, we would have gotten $\pi(\mu) \propto 1$ and $\pi(\sigma) \propto 1/\sigma$. Their product is $1/\sigma$, which is not the same as the joint prior $1/\sigma^2$ we found. What does this mean? It means that constructing an "objective" prior in multiple dimensions is not as simple as treating each parameter in isolation. The full Jeffreys rule, using the determinant, respects the joint geometry of the parameter space and remains the gold standard for invariance.
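The determinant calculation can be verified symbolically. A sketch with sympy, computing each expectation via the substitution $x = \mu + \sigma z$ with $z$ standard normal:

```python
import sympy as sp

x = sp.Symbol('x', real=True)
mu = sp.Symbol('mu', real=True)
sigma = sp.Symbol('sigma', positive=True)
z = sp.Symbol('z', real=True)

# Log-likelihood of one observation from N(mu, sigma^2)
loglik = (-sp.log(sigma) - sp.Rational(1, 2) * sp.log(2 * sp.pi)
          - (x - mu) ** 2 / (2 * sigma ** 2))

def info(a, b):
    """Fisher information entry -E[d^2 loglik / (da db)]."""
    integrand = (-sp.diff(loglik, a, b)).subs(x, mu + sigma * z)
    return sp.integrate(integrand * sp.exp(-z**2 / 2) / sp.sqrt(2 * sp.pi),
                        (z, -sp.oo, sp.oo))

I = sp.Matrix([[info(mu, mu), info(mu, sigma)],
               [info(sigma, mu), info(sigma, sigma)]])
joint = sp.simplify(sp.sqrt(I.det()))  # full Jeffreys prior: sqrt(2)/sigma^2
separate = 1 / sigma                   # product of the one-at-a-time priors
print(I, joint, separate)
```

The matrix comes out diagonal, $\operatorname{diag}(1/\sigma^2,\ 2/\sigma^2)$, so the joint prior is $\propto 1/\sigma^2$ while the naive product of marginals is only $1/\sigma$.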
The Jeffreys prior is not a panacea. It can be difficult to compute, and its interpretation requires care. But its profound contribution is that it grounds the choice of a "non-informative" prior in a fundamental principle: invariance to the language of our description. It provides a principled, if not always perfect, starting point, allowing us to proceed with Bayesian analysis in a way that is as objective as possible, letting the story told by our data take center stage.
We have spent some time getting to know the Jeffreys prior, this curious recipe for choosing a state of ignorance. We derived it from a rather abstract concept, the Fisher information, and admired its mathematical elegance—particularly its invariance. But what good is it? Does this formal rule actually help us interrogate nature and make sense of our measurements, or is it merely a beautiful piece of mathematics, disconnected from the messy reality of scientific discovery?
The answer, as we shall see, is that this principle is an extraordinarily potent tool. It appears again and again, from the most fundamental acts of counting to the sophisticated modeling at the frontiers of science. Let's take a journey through some of these applications. We will find that the Jeffreys prior is not just a formula, but a deep principle for reasoning under uncertainty that unifies disparate fields of inquiry.
Let's start with the simplest possible experiment: a binary outcome. A coin lands heads or tails, a user clicks a button or doesn't, a quantum particle is in state $|0\rangle$ or state $|1\rangle$. We perform $n$ trials and observe $k$ "successes." Our goal is to estimate the unknown, underlying probability of success, $p$.
Common sense might suggest the best estimate is simply $\hat{p} = k/n$. But what if we have very little data? If we flip a coin once and it comes up heads, should we conclude the probability of heads is 1? Surely not. Our prior beliefs—or lack thereof—must play a role.
If we adopt the Jeffreys prior for this Bernoulli process, which we know is a $\text{Beta}(1/2, 1/2)$ distribution, it is as if we are starting the experiment with "half a success" and "half a failure" already in the bank. After observing $k$ successes and $n - k$ failures, our new best guess for $p$—the posterior mean—becomes:

$$\hat{p} = \frac{k + \tfrac{1}{2}}{n + 1}$$
This result is remarkable. Notice how it gracefully handles the edge cases. If we see 0 successes in $n$ trials, the estimate isn't 0, but $\frac{1/2}{n+1}$. If we see $n$ successes, the estimate isn't 1, but $\frac{n + 1/2}{n+1}$. The prior gently pulls our estimate away from the certainty of 0 or 1, acknowledging that we have only seen a finite amount of data.
This approach is more than a theoretical curiosity; it has real consequences. In the world of tech, A/B testing compares different versions of a website. In medicine, clinical trials assess the efficacy of a new drug. In physics, experimenters measure the probability of a quantum event. In all these cases, especially with limited data, the choice of prior matters. Comparing the Jeffreys prior to the seemingly "obvious" uniform prior (which corresponds to starting with one success and one failure) reveals subtle differences in predictions about future events.
Interestingly, the two priors give the exact same answer only in a state of perfect symmetry: when the number of successes is exactly half the number of trials, $k = n/2$. In this special case, the data perfectly balances, and the different starting assumptions of the priors wash out. When the data is lopsided, the priors pull the result in slightly different directions, reminding us that our initial state of "ignorance" is not a uniquely defined concept.
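The sketch below (plain Python, using the Beta posterior means stated above) makes the comparison concrete:

```python
# Posterior means for a Bernoulli p after k successes in n trials.
# Jeffreys prior Beta(1/2, 1/2) -> posterior Beta(k + 1/2, n - k + 1/2);
# uniform prior  Beta(1, 1)     -> posterior Beta(k + 1,   n - k + 1).

def jeffreys_mean(k, n):
    return (k + 0.5) / (n + 1.0)

def uniform_mean(k, n):
    return (k + 1.0) / (n + 2.0)

# One flip, one head: neither prior lets the estimate jump to certainty.
print(jeffreys_mean(1, 1), uniform_mean(1, 1))    # 0.75 vs 0.666...

# Zero successes in 10 trials: the estimate stays strictly above zero.
print(jeffreys_mean(0, 10), uniform_mean(0, 10))  # ~0.045 vs ~0.083

# The two priors agree exactly only at perfect balance, k = n/2.
print(jeffreys_mean(5, 10), uniform_mean(5, 10))  # both 0.5
```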
Nature, of course, is not just about counting. It is also about measuring continuous quantities: the temperature of a gas, the thickness of a material, the voltage from a sensor. These measurements are often plagued by noise, which we typically model with a Normal (or Gaussian) distribution, characterized by a mean $\mu$ (the "true" value) and a variance $\sigma^2$ (the measurement uncertainty or process variability).
What is the "uninformative" thing to say about $\mu$ and $\sigma$ before we've made a measurement? Here, the Jeffreys prior provides a profound insight. As noted in the previous section, for a Normal distribution with both parameters unknown, the joint Jeffreys prior is:

$$\pi(\mu, \sigma) \propto \frac{1}{\sigma^2}$$
This prior is derived from the geometry of the two-parameter model and ensures reparameterization invariance.
When we apply this principle, something wonderful happens. Suppose we are characterizing a new quantum dot temperature sensor and we take $n$ measurements. We want to know the true temperature $\mu$. After applying Bayes' rule with the Jeffreys prior, we can integrate away our uncertainty in the unknown noise $\sigma$. The resulting posterior distribution for $\mu$ is not a Gaussian! It is a Student's t-distribution. The uncertainty in the scale parameter $\sigma$ has "inflated" the tails of our probability distribution for the mean $\mu$, making us appropriately more cautious about its value. This is the very same family of distributions that a frequentist statistician arrives at through a different line of reasoning for the t-test. The Bayesian framework, guided by the Jeffreys prior, derives it from first principles of information and invariance.
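A sketch with made-up sensor readings shows the effect. (One detail the prose glosses over: the joint Jeffreys prior $1/\sigma^2$ yields a Student-t with $n$ degrees of freedom, while the common reference prior $1/\sigma$ yields the classical $n-1$; the code uses $n-1$ to match the frequentist t-interval.)

```python
import numpy as np
from scipy import stats

# Hypothetical sensor readings; both mu and sigma are unknown.
x = np.array([301.2, 300.8, 301.5, 300.9, 301.1, 301.4])
n, xbar, s = len(x), x.mean(), x.std(ddof=1)

# Marginalizing sigma out of the posterior leaves a Student-t for mu,
# centered at the sample mean with scale s/sqrt(n).
post = stats.t(df=n - 1, loc=xbar, scale=s / np.sqrt(n))
lo, hi = post.interval(0.95)

# A Gaussian posterior that pretends sigma = s is known is narrower:
glo, ghi = stats.norm(loc=xbar, scale=s / np.sqrt(n)).interval(0.95)
print((lo, hi), (glo, ghi))
```

The heavier t tails widen the credible interval, which is exactly the extra caution that not knowing $\sigma$ should buy.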
The variance itself is often a quantity of great interest. In a semiconductor fabrication plant, the stability of the process is paramount. Engineers want to know and control the variance in the thickness of a silicon dioxide layer. Using the same Jeffreys prior, we can derive a posterior distribution for $\sigma^2$. This allows us to construct a "credible interval"—a direct probabilistic statement like, "Given our data, there is a 95% probability that the true process variance lies between these two values." This provides a clear, actionable assessment of process stability.
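Under the joint Jeffreys prior $\pi(\mu, \sigma) \propto 1/\sigma^2$, integrating out $\mu$ leaves an inverse-gamma posterior for the variance, $\sigma^2 \mid \text{data} \sim \text{InvGamma}(n/2,\ S/2)$, where $S$ is the sum of squared deviations from the sample mean. A sketch with invented thickness data:

```python
import numpy as np
from scipy import stats

# Hypothetical oxide-thickness measurements (nm); purely illustrative numbers.
x = np.array([10.2, 9.8, 10.5, 10.1, 9.9, 10.3, 10.0, 10.2])
n = len(x)
S = np.sum((x - x.mean()) ** 2)

# Posterior for the variance under the joint Jeffreys prior 1/sigma^2:
post = stats.invgamma(a=n / 2, scale=S / 2)

# Central 95% credible interval: a direct probability statement about sigma^2
lo, hi = post.interval(0.95)
print(f"95% credible interval for the variance: ({lo:.4f}, {hi:.4f})")
```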
Another fundamental process in nature is the random arrival of events: a radioactive nucleus decaying, a photon striking a detector, a machine component failing. These are often described by a Poisson process, governed by a single rate parameter $\lambda$.
For this Poisson rate, the Jeffreys prior is $\pi(\lambda) \propto 1/\sqrt{\lambda}$. This seemingly simple form has profound implications, especially in the "sparse-count regime" where events are rare. Imagine a physicist searching for a hypothetical rare particle decay. An experiment is run for a time $T$, and zero events are observed. What can be said about the decay rate $\lambda$? While the most likely value is zero, the Jeffreys prior yields a proper posterior distribution that doesn't vanish. It allows the physicist to calculate an upper limit on the rate, a statement of the form: "We are 95% certain the decay rate is no more than $\lambda_{95}$." This ability to reason coherently from a lack of evidence is crucial in searches for new physics, and the Jeffreys prior provides a principled way to do it.
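The arithmetic is only a few lines: with zero counts in live time $T$, the $\lambda^{-1/2}$ prior combines with the likelihood $e^{-\lambda T}$ to give a proper $\text{Gamma}(1/2,\ \text{rate}\ T)$ posterior, and its 95th percentile is the upper limit. A sketch (the value of $T$ is made up; the invariant result is $\lambda_{95} T \approx 1.92$):

```python
from scipy import stats

T = 1000.0  # hypothetical live time of the experiment (seconds)

# Jeffreys prior ~ lambda^(-1/2), times the zero-count likelihood exp(-lambda*T),
# gives the posterior: lambda | data ~ Gamma(shape 1/2, rate T).
post = stats.gamma(a=0.5, scale=1.0 / T)

lam95 = post.ppf(0.95)   # 95% Bayesian upper limit on the decay rate
print(lam95, lam95 * T)  # lam95 * T ~ 1.92, independent of T
```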
The flip side of the Poisson process is the exponential distribution, which models the waiting time between events or the lifetime of a component. For an exponential distribution with mean lifetime $\theta$, the Jeffreys prior is $\pi(\theta) \propto 1/\theta$—once again, the classic scale-invariant prior.
Here, we find a beautiful and deep connection between the Bayesian and frequentist schools of thought. A reliability engineer might use a frequentist "Uniformly Most Powerful (UMP)" test to decide if a component's mean lifetime exceeds a specification $\theta_0$. This involves calculating a test statistic from the data and seeing if it falls into a "rejection region." A Bayesian engineer, using the Jeffreys prior, would instead calculate the posterior probability that $\theta > \theta_0$ and reject the null hypothesis if this probability exceeds some threshold. It turns out that for the exponential model, these two procedures can be made identical. The frequentist's rejection region corresponds exactly to the Bayesian's decision for a specific posterior probability threshold. It is as if they are speaking two different languages to describe the same logical reality. The Jeffreys prior reveals a hidden unity in the principles of scientific inference.
So far, we have seen the Jeffreys prior at work in relatively simple, canonical models. But its true power lies in its generality. We don't need a pre-compiled list of priors for different problems. We can derive the appropriate prior for any well-defined statistical model by calculating the Fisher information.
Consider a chemical engineer studying a reaction described by a differential equation with an unknown rate constant $k$. Measurements of the concentration of species A are taken over time, corrupted by Gaussian noise. This is a complex, non-linear model. Yet, by writing down the likelihood function and calculating the Fisher information, one can derive the Jeffreys prior for the rate constant $k$. It is a custom-built prior, perfectly tailored to the structure of this specific physical model and measurement process.
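As a sketch of how such a custom prior is assembled, assume (hypothetically) first-order kinetics $c(t) = c_0 e^{-kt}$ observed at fixed times with known Gaussian noise $\sigma$. The Fisher information is then the squared sensitivity of the model curve to $k$, summed over the observation times and scaled by $1/\sigma^2$:

```python
import numpy as np

# Hypothetical setup: c(t) = c0*exp(-k*t), measured at times t with i.i.d.
# Gaussian noise of known standard deviation sigma.
c0, sigma = 1.0, 0.05
t = np.linspace(0.5, 10.0, 20)

def fisher_info(k):
    """I(k) = (1/sigma^2) * sum over times of (dc/dk)^2,
    with sensitivity dc/dk = -c0 * t * exp(-k*t)."""
    sens = -c0 * t * np.exp(-k * t)
    return np.sum(sens ** 2) / sigma ** 2

def jeffreys_unnormalized(k):
    return np.sqrt(fisher_info(k))

# Not flat, and not 1/k: the prior tracks where the data can resolve k.
for k in (0.1, 0.5, 2.0):
    print(k, jeffreys_unnormalized(k))
```

For large $k$ the concentration decays before most samples are taken, the sensitivities shrink, and the prior correctly thins out; the same recipe extends to multiple unknowns (say, $c_0$ as well) via the information matrix and its determinant.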
Or let's fly to the stars. An astrophysicist analyzes the light curve from a distant star as an exoplanet passes in front of it. The shape of this "transit" dip depends on several parameters, like the ratio of the planet's radius to the star's radius ($R_p/R_\star$) and how centrally it crosses the star (the impact parameter, $b$). The relationship between these physical parameters and the observable light curve is non-linear. By computing the Fisher information matrix for this multi-parameter model, the astrophysicist can derive a joint Jeffreys prior $\pi(R_p/R_\star,\, b)$. This prior provides an objective, invariant starting point for inference in a complex parameter space, ensuring the scientific conclusions about the planet's size are robust and not an artifact of an arbitrary parameterization.
Our tour is complete. We started by counting simple successes and ended by sizing alien worlds. Through it all, a single, powerful idea—the Jeffreys prior, born from the geometry of information—provided a principled path for reasoning from data. It is not a panacea; the choice of a prior in Bayesian analysis remains a deep and sometimes contentious subject. But the Jeffreys prior stands as a landmark, a default option grounded not in subjective belief, but in the very mathematical structure of the problem at hand. It embodies the principle that our method for expressing ignorance should be intrinsically linked to our capacity to learn. It is a beautiful testament to the profound and often surprising unity of scientific thought.