
Many fundamental relationships in nature and science are governed by a simple, strict rule: direct proportionality. In these systems, zero input yields zero output, and doubling the cause exactly doubles the effect. Modeling such a relationship requires a specialized statistical tool that honors this constraint. Standard linear regression is too flexible, allowing for an intercept that may not be physically meaningful. This introduces a critical question: how do we properly model, interpret, and validate a relationship that we know, from first principles, must pass through zero?
This is the domain of regression through the origin (RTO), a linear model constrained to pass through the point (0,0). While seemingly a minor adjustment, this constraint fundamentally alters the model's mathematical properties and interpretation. This article provides a comprehensive guide to this powerful but often misunderstood method. In the "Principles and Mechanisms" chapter, we will dissect the statistical engine of RTO, from the derivation of its slope to the unique challenges of measuring its goodness-of-fit and the severe consequences of its misapplication. Then, in the "Applications and Interdisciplinary Connections" chapter, we will see these principles in action, exploring how RTO provides critical insights in diverse fields, from materials physics and analytical chemistry to the study of evolutionary biology.
Imagine a law of nature that is ruthlessly simple. No matter what, if you have zero of one thing, you get zero of another. Double the cause, and you exactly double the effect. There are no starting fees, no baseline offsets. This is the world of direct proportionality, a concept that is as elegant as it is strict. Many fundamental relationships in science and engineering live in this world. For instance, an ideal resistor with zero current flowing through it will have zero voltage across it—this is the heart of Ohm's Law. Stretching an ideal spring requires zero force for zero extension. This is the world we enter when we study regression through the origin (RTO).
Unlike a standard linear regression that is free to cross the vertical axis wherever it pleases, the RTO model is anchored, or constrained, to pass through the point $(0,0)$. This isn't just a minor tweak; it fundamentally changes the character of our analysis, leading to some beautiful simplicities and some dangerous traps. The model we build is deceptively simple to write down:

$$y_i = \beta x_i + \varepsilon_i, \qquad i = 1, \dots, n.$$
Here, $y_i$ is our observation, $x_i$ is the variable we control or measure, $\varepsilon_i$ is the inevitable random noise or error, and $\beta$ is the single, all-important parameter we want to discover: the constant of proportionality.
Suppose we have a scatter of data points that we believe should follow such a proportional law. How do we draw the best possible line through the origin to represent them? The principle of least squares, our trusted guide in regression, still applies. Imagine a line pivoting around the origin like the hand of a clock. For each possible angle (slope), we can measure how far the line misses each data point vertically. These misses are our residuals. We then square each of these vertical distances and add them all up. The "best" line is the one that makes this total sum of squares as small as possible.
When we turn the crank of calculus on this minimization problem, a wonderfully straightforward formula for our estimated slope, which we call $\hat{\beta}$, emerges:

$$\hat{\beta} = \frac{\sum_{i=1}^{n} x_i y_i}{\sum_{i=1}^{n} x_i^2}.$$
At first glance, this might look like a jumble of sums. But there's a lovely intuition here. This formula is essentially a weighted average of the individual slopes, $y_i / x_i$. The points with larger $x$ values get a heavier weight—proportional to $x_i^2$, in fact. This makes perfect sense! A point far from the origin acts like a long lever, giving us a much more stable and reliable indication of the true slope than a point huddled near the origin, where even a small amount of random noise ($\varepsilon_i$) can send its individual slope swinging wildly. This idea of a point's influence is called leverage, and we shall return to it.
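To make the weighted-average view concrete, here is a minimal numeric sketch in Python (the data values are purely illustrative): the closed-form RTO slope and the $x_i^2$-weighted average of the point-wise slopes $y_i / x_i$ come out identical.

```python
# Illustrative data for a roughly proportional relationship (values made up).
x = [1.0, 2.0, 4.0, 8.0]
y = [1.1, 1.9, 4.2, 7.8]

# Closed-form least-squares slope for regression through the origin.
beta_hat = sum(xi * yi for xi, yi in zip(x, y)) / sum(xi**2 for xi in x)

# The same number, rewritten as a weighted average of the point-wise
# slopes y_i / x_i, with weights proportional to x_i**2.
weights = [xi**2 for xi in x]
weighted_avg = sum(w * (yi / xi) for w, xi, yi in zip(weights, x, y)) / sum(weights)

print(beta_hat, weighted_avg)  # the two computations agree exactly
```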
Once we have our estimate, we naturally want to know how good it is. Two key properties stand out. First, our estimator is unbiased. This means that if we were to repeat our experiment countless times, the average of all our calculated $\hat{\beta}$ values would land squarely on the true, unknown $\beta$. Our method doesn't systematically aim too high or too low. Second, the variance of our estimator—a measure of how much the $\hat{\beta}$ values would jitter around the true value in these repeated experiments—is also beautifully simple:

$$\operatorname{Var}(\hat{\beta}) = \frac{\sigma^2}{\sum_{i=1}^{n} x_i^2},$$
where $\sigma^2$ is the variance of our measurement errors. This formula tells us something profound about experimental design. To get a very precise estimate of $\beta$ (a small variance), we have two levers to pull: reduce the noise in our measurements (decrease $\sigma^2$), or, more practically, make the denominator $\sum x_i^2$ as large as possible by using predictor values that are far from zero.
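Both properties can be checked by simulation. A small Monte Carlo sketch (all parameter values are made up) shows the average of the $\hat{\beta}$ values landing on the true slope and their spread matching $\sigma^2 / \sum x_i^2$:

```python
import random

random.seed(1)

# Hypothetical true model: y = 2 x + noise with sd 0.4.
beta_true, sigma = 2.0, 0.4
x = [1.0, 2.0, 3.0, 4.0, 5.0]
sxx = sum(xi**2 for xi in x)  # = 55

estimates = []
for _ in range(5000):
    y = [beta_true * xi + random.gauss(0.0, sigma) for xi in x]
    estimates.append(sum(xi * yi for xi, yi in zip(x, y)) / sxx)

mean_est = sum(estimates) / len(estimates)
var_est = sum((b - mean_est)**2 for b in estimates) / len(estimates)

print(mean_est, var_est)  # compare with beta_true = 2.0 and sigma**2 / sxx
```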
In a standard regression with an intercept, the residuals—the differences between the actual and predicted values—are constructed to have a simple sum of zero. The intercept term's job is to adjust the line up or down until the positive and negative misses cancel out.
But in the world of RTO, there is no intercept to perform this balancing act. The least squares criterion only imposes one condition on the residuals $e_i = y_i - \hat{\beta} x_i$: that they be, on the whole, uncorrelated with the predictor. Mathematically, this is the orthogonality condition $\sum_i x_i e_i = 0$. This ensures our line has the correct tilt. However, the simple sum of the residuals, $\sum_i e_i$, is not forced to be zero. It can, and usually will, be some non-zero number. This might seem like a minor technicality, but it has dramatic and often misunderstood consequences.
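A quick numeric sketch (illustrative data) shows both facts at once: the residuals are exactly orthogonal to the predictor, yet their plain sum does not vanish.

```python
# Illustrative data (made up for the demonstration).
x = [1.0, 2.0, 3.0, 4.0]
y = [1.5, 1.8, 3.6, 3.9]

beta_hat = sum(xi * yi for xi, yi in zip(x, y)) / sum(xi**2 for xi in x)
residuals = [yi - beta_hat * xi for xi, yi in zip(x, y)]

# Orthogonality: sum of x_i * e_i is zero (up to floating point).
orthogonality = sum(xi * ei for xi, ei in zip(x, residuals))

# But the plain sum of residuals need not be zero without an intercept.
residual_sum = sum(residuals)

print(orthogonality, residual_sum)
```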
The most famous of these consequences concerns the coefficient of determination, or $R^2$. In standard regression, $R^2$ tells us the "proportion of variance explained" by our model, and it's neatly confined between 0 and 1. This works because of a tidy mathematical identity: the total variation in $y$ (SST) is perfectly partitioned into the variation explained by the model (SSR) and the unexplained residual variation (SSE).
Because the sum of residuals is not zero in RTO, this beautiful identity, $\mathrm{SST} = \mathrm{SSR} + \mathrm{SSE}$, breaks down. If you blindly use the standard formula for $R^2$ with an RTO model, you can get bizarre results, including a negative $R^2$! A negative $R^2$ simply means that your model, forced through the origin, is a worse predictor of the $y$ values than just using their simple average, $\bar{y}$.
To properly measure the goodness of fit for an RTO model, we must use an uncentered definition of variation. We rely on a different identity that does hold for RTO models: $\sum_i y_i^2 = \sum_i \hat{y}_i^2 + \sum_i e_i^2$. This partitions the total uncentered sum of squares into a part due to the regression and a part due to error. This leads to the uncentered $R^2$:

$$R^2_{\text{uncentered}} = \frac{\sum_i \hat{y}_i^2}{\sum_i y_i^2} = 1 - \frac{\sum_i e_i^2}{\sum_i y_i^2}.$$
This quantity is always between 0 and 1 and correctly reflects the proportion of the total uncentered variation captured by the model. The moral is clear: software will happily report a standard $R^2$, but for an RTO model, you are in danger of being misled unless you understand which $R^2$ you are looking at and why.
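The following sketch (with data deliberately chosen so that the origin-constrained fit is worse than the sample mean) shows the naive centered $R^2$ going negative while the uncentered version stays in $[0, 1]$ and the uncentered identity holds exactly.

```python
# Data where y is nearly constant, so forcing the line through the
# origin fits worse than simply using the mean (values are made up).
x = [1.0, 2.0, 3.0, 4.0]
y = [5.0, 5.1, 4.9, 5.2]

beta_hat = sum(xi * yi for xi, yi in zip(x, y)) / sum(xi**2 for xi in x)
y_hat = [beta_hat * xi for xi in x]
sse = sum((yi - fi)**2 for yi, fi in zip(y, y_hat))

# Naive centered R^2: can be negative for an RTO model.
y_bar = sum(y) / len(y)
sst_centered = sum((yi - y_bar)**2 for yi in y)
r2_naive = 1 - sse / sst_centered

# Uncentered R^2: always between 0 and 1 for an RTO model.
r2_uncentered = sum(fi**2 for fi in y_hat) / sum(yi**2 for yi in y)

print(r2_naive, r2_uncentered)
```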
Finding the best estimate is only the beginning. We also want to quantify our uncertainty. How confident are we that the true $\beta$ isn't zero? To do this, we need to construct a pivotal quantity—a standardized version of our estimator whose distribution we know, regardless of the true parameter values.
If we were lucky enough to know the true error variance $\sigma^2$, we could form a standard normal (Z) statistic:

$$Z = \frac{\hat{\beta} - \beta}{\sigma / \sqrt{\sum_i x_i^2}} \sim N(0, 1).$$
This follows directly from the mean and variance of $\hat{\beta}$ that we found earlier. In the real world, however, $\sigma^2$ is almost never known. We must estimate it from the data using our residuals. The correct estimator for the error variance is $\hat{\sigma}^2 = \frac{1}{n-1} \sum_i e_i^2$. Notice the denominator: $n - 1$. We started with $n$ data points but used up one "degree of freedom" to estimate the single parameter $\beta$. This is a crucial difference from a standard simple regression with an intercept, which estimates two parameters ($\beta_0$ and $\beta_1$) and thus has $n - 2$ degrees of freedom for error.
By substituting our estimate $\hat{\sigma}^2$ for the true $\sigma^2$ in the pivotal quantity, we arrive at a statistic that follows a t-distribution with $n - 1$ degrees of freedom:

$$T = \frac{\hat{\beta} - \beta}{\hat{\sigma} / \sqrt{\sum_i x_i^2}} \sim t_{n-1}.$$
This T-statistic is the workhorse for building confidence intervals and testing hypotheses about $\beta$ in nearly all practical applications. From this, one can also construct an F-test to assess the overall significance of the regression, which for this simple one-parameter model is equivalent to the t-test.
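A stdlib-only sketch of the whole inference recipe follows. The data are illustrative, and the critical value $t_{0.975,\,3} \approx 3.182$ is taken from a t-table rather than computed:

```python
import math

# Illustrative data (made up).
x = [1.0, 2.0, 3.0, 4.0]
y = [1.5, 1.8, 3.6, 3.9]
n = len(x)

sxx = sum(xi**2 for xi in x)
beta_hat = sum(xi * yi for xi, yi in zip(x, y)) / sxx
residuals = [yi - beta_hat * xi for xi, yi in zip(x, y)]

# Error variance estimate with n - 1 degrees of freedom (one parameter fitted).
sigma2_hat = sum(ei**2 for ei in residuals) / (n - 1)
se_beta = math.sqrt(sigma2_hat / sxx)

# T-statistic for H0: beta = 0, and a 95% confidence interval.
t_stat = beta_hat / se_beta
t_crit = 3.182  # tabulated 97.5% quantile of t with 3 df (n = 4 here)
ci = (beta_hat - t_crit * se_beta, beta_hat + t_crit * se_beta)

print(t_stat, ci)
```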
We've mentioned that points far from the origin have more say in determining the slope. Let's make this precise. The leverage of a data point, $h_{ii}$, measures how much its own $y$ value influences its own predicted value $\hat{y}_i$. For the RTO model, the formula for leverage is remarkably elegant:

$$h_{ii} = \frac{x_i^2}{\sum_{j=1}^{n} x_j^2}.$$
This tells the whole story. A point's influence depends only on the magnitude of its own $x$ value squared, relative to the sum of all the squared $x$ values. A point at $x = 10$ has 100 times the leverage of a point at $x = 1$. It literally has more pull on the regression line. This simple formula is a powerful diagnostic tool, immediately telling us which observations are dominating our results.
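Leverage takes one line of code. A sketch with illustrative $x$ values:

```python
# Illustrative predictor values (made up); leverage depends only on x.
x = [1.0, 2.0, 5.0, 10.0]

sxx = sum(xi**2 for xi in x)
leverage = [xi**2 / sxx for xi in x]  # h_ii = x_i^2 / sum_j x_j^2

print(leverage)
```

As expected, the leverages sum to 1 (the model has one parameter), and the point at $x = 10$ carries exactly 100 times the leverage of the point at $x = 1$.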
The RTO model is a tool of great power and simplicity, but it rests entirely on one crucial assumption: the true relationship really does pass through the origin. What happens if we are wrong? What if the true relationship has a non-zero intercept, with $\beta_0 \neq 0$, but we foolishly fit an RTO model?
The consequences are severe. Our fitted line is now being asked to perform an impossible task: pass through the origin while also trying to fit data that has a different "natural" starting point. The result is a biased estimate of the slope $\beta$. The line gets twisted away from its true slope in an effort to compromise.
Even more insidiously, our estimate of the error variance, $\hat{\sigma}^2$, becomes positively biased. The residuals from the misspecified model are systematically inflated because the model is fundamentally wrong for the data. The expected value of our variance estimate is not the true variance $\sigma^2$, but something larger:

$$E[\hat{\sigma}^2] = \sigma^2 + \frac{\beta_0^2}{n-1} \left( n - \frac{\left(\sum_i x_i\right)^2}{\sum_i x_i^2} \right).$$
This bias term depends on the size of the true intercept $\beta_0$ that we wrongly ignored. This means our statistical tests will be unreliable; we'll underestimate the precision of our coefficients and may fail to see significant relationships.
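A small Monte Carlo sketch (all parameter values are made up) makes both pathologies visible: when the true model has an intercept but we fit RTO anyway, the slope estimate is pulled away from the true value and the variance estimate lands well above the true error variance.

```python
import random

random.seed(0)

# The TRUE model has an intercept, but we will wrongly fit RTO.
beta0_true, beta1_true, sigma = 2.0, 1.5, 0.5
x = [float(i) for i in range(1, 11)]
sxx = sum(xi**2 for xi in x)

slopes, variance_estimates = [], []
for _ in range(2000):
    y = [beta0_true + beta1_true * xi + random.gauss(0.0, sigma) for xi in x]
    beta_hat = sum(xi * yi for xi, yi in zip(x, y)) / sxx
    residuals = [yi - beta_hat * xi for xi, yi in zip(x, y)]
    slopes.append(beta_hat)
    variance_estimates.append(sum(e**2 for e in residuals) / (len(x) - 1))

mean_slope = sum(slopes) / len(slopes)
mean_var = sum(variance_estimates) / len(variance_estimates)

print(mean_slope, mean_var)  # compare with beta1_true = 1.5 and sigma**2 = 0.25
```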
The lesson is humbling and profound. The decision to force a regression through the origin cannot be made lightly. It is not a statistical choice to be optimized, but a scientific hypothesis that must be justified by theory or prior knowledge. Before ever fitting an RTO model, one must first look at the data. A simple scatter plot is your most honest friend. If the cloud of points does not look like it's aimed squarely at the origin, then forcing your line to go there is an act of statistical violence, and the distorted results will be the evidence of the crime.
Now that we have explored the mathematical machinery of regression through the origin, we can embark on a more exciting journey: to see where this seemingly simple idea appears in the real world. We will find that forcing a line through the point $(0,0)$ is far more than a statistical convenience; it is a profound scientific statement about the nature of the system being studied. It is a declaration that a true zero exists, that proportionality is fundamental. As we travel from the flow of rivers to the glow of molecules and the grand sweep of evolution, we will see how this single constraint becomes a powerful tool for discovery, a diagnostic for experimental error, and a window into the deep symmetries of nature.
Our first stop is the physical sciences, the most natural home for regression through the origin. Here, many of our most fundamental laws are statements of direct proportionality. Consider a simple, intuitive scenario: the relationship between rainfall and river flow. It stands to reason that if there is zero additional rainfall from a storm, there will be zero additional discharge in a nearby river. The baseline flow might continue, but the change is zero. A model relating the peak discharge $Q$ to the total rainfall $R$ as $Q = \beta R$ is therefore not an approximation, but a statement of physical reality. By setting the intercept to zero, we are not simplifying our model; we are making it more accurate by baking in a piece of a priori knowledge. The slope, $\beta$, then takes on a clear physical meaning: it is the coefficient of discharge, a single number that encapsulates the complex hydrology of the watershed, telling us how many cubic meters of water flow per second for every millimeter of rain that falls.
This principle extends to far more abstract realms. Let's journey from a river basin into the heart of a crystal. In materials physics, X-ray diffraction is a primary tool for determining the atomic structure of solids. When X-rays pass through a crystal, they diffract off the planes of atoms, creating a unique pattern. The spacing between these atomic planes, $d$, is governed by a beautiful equation derived from the geometry of the crystal lattice. For any cubic crystal (like table salt or aluminum), this relationship is:

$$\frac{1}{d^2} = \frac{h^2 + k^2 + l^2}{a^2}.$$
Here, $(h, k, l)$ are the Miller indices—a set of integers that identify a specific family of planes—and $a$ is the lattice parameter, the length of one side of the fundamental cubic unit cell.
Look closely at this equation. If we define a new variable $y = 1/d^2$ and another variable $x = h^2 + k^2 + l^2$, the law becomes $y = (1/a^2)\,x$. This is a perfect regression through the origin! When a materials scientist measures a series of diffraction peaks, they obtain a set of $d$ values. By calculating the corresponding values of $1/d^2$ and identifying the integer sums $h^2 + k^2 + l^2$ for each peak, they can plot $y$ versus $x$. The data points must lie on a straight line passing through the origin. The slope of this line is not just a correlation; it is $1/a^2$. By fitting a regression through the origin, the scientist can obtain a highly precise estimate of the material's fundamental lattice parameter, a cornerstone of its physical identity. Here, RTO is not an assumption, but a direct consequence of the geometry of space itself.
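A noiseless sketch of the fitting step (the lattice parameter and peak list are illustrative, chosen to resemble an fcc metal such as aluminum):

```python
import math

# Hypothetical cubic crystal with lattice parameter a = 4.05 angstroms.
a_true = 4.05
miller_sums = [3, 4, 8, 11, 12]  # h^2 + k^2 + l^2 for fcc-allowed reflections

# "Measured" d-spacings from the cubic relation d = a / sqrt(h^2 + k^2 + l^2).
d_values = [a_true / math.sqrt(n) for n in miller_sums]

# Regression through the origin of y = 1/d^2 against x = h^2 + k^2 + l^2.
xs = [float(n) for n in miller_sums]
ys = [1.0 / d**2 for d in d_values]
slope = sum(xi * yi for xi, yi in zip(xs, ys)) / sum(xi**2 for xi in xs)

# The slope is 1/a^2, so the lattice parameter is recovered as:
a_est = 1.0 / math.sqrt(slope)

print(a_est)  # recovers 4.05 exactly in this noiseless sketch
```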
Let's move from the natural world to the world we create in the laboratory. In analytical chemistry and biology, a common task is to measure the concentration of a substance. This is often done by observing a signal—like the absorbance of light or the intensity of fluorescence—that is proportional to the concentration. To do this, we create a calibration curve using standards of known concentration. A natural first thought might be to model this with a regression through the origin, assuming that zero concentration yields zero signal. But is this always true?
This question forces us to think carefully about our experiment. Imagine a state-of-the-art biological assay using RNA sequencing to measure gene expression. To ensure accuracy, researchers add known amounts of synthetic RNA molecules, called ERCC spike-ins, to their samples. The number of sequencing "reads" for a spike-in should be proportional to the amount added. If we are careful, we can make this a true RTO problem. We run a "blank" sample containing no spike-ins and measure its background signal. By subtracting this blank signal from all our other measurements, we are defining our response variable $y$ as the excess signal above background. By this very definition, when the input amount $x$ is zero, the expected value of $y$ is also zero. Here, regression through the origin is the correct and most powerful model, a testament to a well-designed experiment.
However, the world is often messier. Consider a chemist using a spectrophotometer to measure an antioxidant in a sports drink. They prepare a "blank" sample, which contains the drink matrix but none of the target antioxidant. When they measure its absorbance, they find it is not zero. Perhaps the drink's coloring agents or other ingredients absorb a little bit of light at that wavelength. This non-zero reading for a zero-concentration sample is invaluable information. It tells us that a regression through the origin is the wrong model.
If we were to force the line through $(0,0)$, we would introduce a systematic bias into all our measurements. The correct approach is to fit a standard linear model, $y = \beta_0 + \beta_1 x + \varepsilon$. The intercept $\beta_0$ now has a clear physical meaning: it is the background signal from the sample matrix. Ignoring this intercept is not simplifying; it's ignoring a part of reality. The same issue arises in other contexts: flow cytometers have electronic offsets, and cells themselves have natural autofluorescence. In these cases, a non-zero intercept is not a nuisance but a diagnostic, revealing a fundamental aspect of the measurement system. The lesson is clear: we must not assume a zero origin. We must test for it. The humble intercept becomes a gatekeeper, telling us whether our simple assumption of direct proportionality holds true.
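A stdlib sketch of that gatekeeper test (the absorbance readings are made up, and the critical value $t_{0.975,\,4} \approx 2.776$ is taken from a t-table): fit with an intercept, then test whether the intercept differs from zero.

```python
import math

# Hypothetical calibration data: concentrations and absorbances with a
# matrix background of roughly 0.10 (values are made up).
x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
y = [0.11, 0.29, 0.52, 0.70, 0.91, 1.12]
n = len(x)

# Ordinary least squares WITH an intercept.
x_bar, y_bar = sum(x) / n, sum(y) / n
sxx = sum((xi - x_bar)**2 for xi in x)
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
b0 = y_bar - b1 * x_bar

# Residual standard deviation with n - 2 degrees of freedom (two parameters).
sse = sum((yi - (b0 + b1 * xi))**2 for xi, yi in zip(x, y))
s = math.sqrt(sse / (n - 2))

# t-test for H0: intercept = 0.
se_b0 = s * math.sqrt(1.0 / n + x_bar**2 / sxx)
t_b0 = b0 / se_b0
t_crit = 2.776  # tabulated 97.5% quantile of t with 4 df

print(b0, t_b0)
```

If `t_b0` exceeds the critical value, the background is real and an RTO model would bias every concentration read off the curve.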
Our final journey takes us to the most abstract, and perhaps most beautiful, application of regression through the origin: the study of evolution. When biologists compare traits across different species—say, brain size versus body mass—they face a difficult problem. Two closely related species, like a chimpanzee and a bonobo, are similar not because they independently evolved their traits, but because they inherited them from a recent common ancestor. They are not independent data points.
To solve this, Joseph Felsenstein developed a brilliant method called Phylogenetically Independent Contrasts (PICs). Instead of comparing species, this method compares the evolutionary divergences that occurred at each branching point in the tree of life. For each node in the tree, we calculate a "contrast" by taking the difference in a trait's value between the two descending lineages and standardizing it by the amount of evolutionary time that has passed. This yields a set of $n - 1$ statistically independent data points from $n$ species.
Now, suppose we have contrasts for two traits, say log(brain mass) contrasts $C_Y$ and log(body mass) contrasts $C_X$. We want to see if they are evolutionarily correlated. We regress $C_Y$ on $C_X$. And here is the key: this regression must be forced through the origin. Why?
There are two converging reasons. The first is intuitive: under the standard model of trait evolution (a "random walk" or Brownian motion), changes are undirected. At any branching point, there's no a priori reason for a trait to increase or decrease. Therefore, the expected value of any contrast is zero. If we have a regression model where the expected value of the predictor is zero, the model only makes sense if the intercept is also zero. Zero evolutionary change in body mass should correspond to an expectation of zero evolutionary change in brain mass.
The second reason is deeper and more elegant, rooted in symmetry. When we calculate a contrast at a node with descendants A and B, our choice to calculate $A - B$ versus $B - A$ is completely arbitrary. If we flip the order, the contrast for every trait simply flips its sign: $C$ becomes $-C$. Our statistical conclusion cannot depend on this arbitrary choice. Let's see what happens to a regression with an intercept, $C_Y = \alpha + \beta C_X$. If we flip the signs, the same evolutionary event would be described by $-C_Y = \alpha + \beta(-C_X)$. If our model is to be consistent, both equations must hold. Adding them together gives $0 = 2\alpha$, which implies that the intercept $\alpha$ must be exactly zero. The only model that respects the inherent symmetry of the problem is a regression through the origin.
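A numeric sketch of the symmetry argument (the contrast values are made up): flipping the arbitrary sign of any contrast pair leaves the RTO slope untouched, while a fit with an intercept gives a different answer for the same evolutionary events.

```python
# Hypothetical standardized contrasts for two traits.
cx = [0.4, -0.2, 0.7, -0.5]
cy = [0.35, -0.15, 0.8, -0.45]

def rto_slope(xs, ys):
    """Regression-through-the-origin slope: sum(xy) / sum(x^2)."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x**2 for x in xs)

def ols_fit(xs, ys):
    """Ordinary least squares with an intercept: returns (intercept, slope)."""
    n = len(xs)
    xb, yb = sum(xs) / n, sum(ys) / n
    b1 = sum((x - xb) * (y - yb) for x, y in zip(xs, ys)) / \
         sum((x - xb)**2 for x in xs)
    return yb - b1 * xb, b1

# Flip the arbitrary direction of the FIRST contrast only.
cx_flipped = [-cx[0]] + cx[1:]
cy_flipped = [-cy[0]] + cy[1:]

slope_a = rto_slope(cx, cy)
slope_b = rto_slope(cx_flipped, cy_flipped)  # identical: each product x*y and x^2 is unchanged

fit_a = ols_fit(cx, cy)
fit_b = ols_fit(cx_flipped, cy_flipped)      # intercept and slope both shift

print(slope_a, slope_b, fit_a, fit_b)
```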
This is a stunning result. The constraint comes not from a simple physical argument, but from the logical necessity of invariance. By respecting this symmetry, we unlock immense explanatory power. The slope of the PIC regression is no mere correlation; it is an estimate of the evolutionary allometric exponent, a fundamental parameter describing how two traits evolve in concert over millions of years. It reveals the deep patterns of correlated evolution, not just the superficial patterns seen among species today. And, just as in our chemistry example, the intercept can be used as a diagnostic. If we fit an intercept anyway and find it to be significantly non-zero, it tells us that our underlying model of evolution might be wrong.
From physics to biology, from concrete to abstract, the regression through the origin is a unifying thread. It is a hypothesis of perfect proportionality, a badge of a well-controlled experiment, and a consequence of deep symmetry. To understand when and why to use it is to engage with the very foundations of scientific modeling.