
The Four Commandments of OLS: Understanding the Assumptions of Linear Regression

SciencePedia
Key Takeaways
  • The Gauss-Markov theorem states that OLS is the Best Linear Unbiased Estimator (BLUE) only when its core assumptions about linearity and error structure are met.
  • Violations like heteroscedasticity and autocorrelation do not bias coefficient estimates but render standard errors invalid, leading to faulty confidence intervals.
  • Multicollinearity among predictors dramatically inflates the variance of coefficient estimates, making individual effects unstable and difficult to interpret.
  • Endogeneity, a correlation between predictors and the error term, is a critical violation that causes OLS estimates to be both biased and inconsistent.

Introduction

Ordinary Least Squares (OLS) regression is a cornerstone of data analysis, prized for its simplicity and power. It acts like a trusted detective's tool, capable of uncovering the underlying relationships within a complex set of clues—our data. However, the reliability of OLS depends on a set of fundamental rules or assumptions. When these rules are followed, OLS provides the clearest and most precise answer possible. The problem many practitioners face is applying this tool without a deep understanding of its operating conditions, leading to interpretations that can be misleading or dangerously overconfident.

This article bridges that knowledge gap by moving beyond a simple checklist of rules. We will reframe these assumptions as a dialogue with our data, where violations are not failures, but important discoveries that point toward a more complex reality. You will learn to diagnose and interpret what your model is telling you when its foundational assumptions are not met.

The journey begins in our first chapter, ​​Principles and Mechanisms​​, where we will explore the Gauss-Markov theorem and the promise of the "Best Linear Unbiased Estimator" (BLUE). We will detail the four core commandments of OLS and investigate what happens when the ideal world breaks down, examining the causes and consequences of heteroscedasticity, autocorrelation, and multicollinearity. From there, the second chapter, ​​Applications and Interdisciplinary Connections​​, will take these statistical concepts into the real world, demonstrating their profound relevance in fields from chemistry and biology to finance and climate science, showing that a mastery of OLS assumptions is essential for any thoughtful scientist or data detective.

Principles and Mechanisms

Imagine you're a detective trying to solve a mystery. You have a set of clues—your data—and you're trying to figure out the underlying relationship between them. Ordinary Least Squares (OLS) regression is one of your most trusted tools. It's simple, elegant, and often surprisingly powerful. But like any powerful tool, it operates on a set of rules. If the situation respects these rules, OLS gives you a wonderfully clear answer. If the rules are broken, it can still give you an answer, but it might be misleading, like a compass near a large magnet.

Our mission in this chapter is to understand these rules—not as a list of commandments to be memorized, but as a conversation with our data. We will explore what these rules are, why they matter, and what happens when they are bent or broken. You will see that violations of these assumptions are not failures, but clues—often pointing toward a deeper, more interesting truth about the world we are trying to model.

The Promise of BLUE: The Ideal World of Gauss and Markov

At the heart of OLS lies a beautiful piece of mathematics known as the ​​Gauss-Markov theorem​​. You can think of it as a guarantee. It promises that if your model and data adhere to a few specific conditions, the OLS estimator for your model's coefficients isn't just a good estimator—it's the ​​Best Linear Unbiased Estimator​​, or ​​BLUE​​.

Let's quickly unpack that acronym. It's a mouthful, but each word is a promise:

  • ​​Estimator​​: It's a procedure for estimating an unknown truth (the true coefficients) from your data.
  • ​​Linear​​: The estimator is a linear combination of your observed outcomes. This means it's a simple, well-behaved mathematical object.
  • Unbiased: On average, your estimate will be right. If you could repeat your experiment many times, the average of all your OLS estimates would converge on the true value. The method doesn't have a systematic tendency to aim too high or too low.
  • ​​Best​​: This is the kicker. Among all possible linear and unbiased estimators, OLS is the one with the smallest variance. This means its estimates are the most tightly clustered around the true value. It's the most precise, the most efficient.

So, what are these "magic" conditions that grant us this remarkable BLUE property? They are the foundational assumptions of OLS.

The Four Commandments of OLS

These assumptions define the ideal world in which OLS is king. They are about the structure of your model and, more importantly, about the nature of the "errors"—the part of your data that the model can't explain. Let's call this the model's "surprise" component, ε.

  1. Linearity in Parameters: The model must be a linear combination of its coefficients. A model like Y = β₀ + β₁X is linear. A model like Y = β₀ + β₁X² is also linear in the parameters β₀ and β₁, which is what matters. This assumption ensures our problem has a straightforward algebraic structure.

  2. Zero Conditional Mean of Errors: The expected value of the error term, for any given value of your predictors, must be zero. In symbols, E[ε | X] = 0. This is a profound statement. It means your model doesn't make systematic mistakes. It's not, for example, consistently underestimating the outcome for high values of X and overestimating it for low values of X. On average, the surprises cancel out, everywhere.

  3. ​​Spherical Errors (Homoscedasticity and No Autocorrelation):​​ This is a two-part assumption about the "shape" of the errors.

    • ​​Homoscedasticity (Constant Variance):​​ The variance of the errors is the same across all levels of the predictors. Imagine firing a shotgun at a target. Homoscedasticity means the spread of the pellet holes is the same whether you're aiming at the top, bottom, left, or right of the target. The level of random noise is constant.
    • ​​No Autocorrelation:​​ The error for one observation is not correlated with the error for another. Each surprise is a fresh surprise. It doesn't depend on the surprise that came before it. The errors don't have a "memory."
  4. ​​No Perfect Multicollinearity:​​ The model's predictors cannot be perfectly linearly related. Each predictor must bring some unique information to the table. You can't have one predictor that is just a multiple of another, or one that can be perfectly calculated from a combination of the others. If you did, it would be impossible for the model to disentangle their individual effects.

When these four assumptions hold, the Gauss-Markov theorem guarantees that OLS is BLUE. But what happens when we step out of this ideal world?
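
To make the BLUE promise concrete, here is a small simulation (an illustrative sketch with made-up numbers, not from the original text): in an ideal Gauss-Markov world with independent, constant-variance errors, repeating the "experiment" many times shows the OLS estimates clustering around the true coefficients with no systematic bias.

```python
import numpy as np

rng = np.random.default_rng(0)
n, reps = 100, 2000
beta0, beta1 = 2.0, 0.5            # hypothetical true coefficients

x = rng.uniform(0, 10, n)          # fixed design, reused across replications
X = np.column_stack([np.ones(n), x])

estimates = []
for _ in range(reps):
    eps = rng.normal(0, 1.0, n)    # i.i.d. errors: homoscedastic, uncorrelated
    y = beta0 + beta1 * x + eps
    b, *_ = np.linalg.lstsq(X, y, rcond=None)  # OLS fit via least squares
    estimates.append(b)

mean_b = np.array(estimates).mean(axis=0)
print(mean_b)  # averages land close to (2.0, 0.5): OLS is unbiased
```

The point of the loop is the "unbiased" promise: any single estimate misses the truth, but the average over many hypothetical repetitions does not.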

The Unsteady Rumble: When Error Variance Changes (Heteroscedasticity)

The assumption of homoscedasticity—constant error variance—is one of the first to crumble in real-world data. When it fails, we have heteroscedasticity. Visually, it's one of the easiest violations to spot. If you plot your model's residuals (eᵢ = yᵢ − ŷᵢ) against its predicted values, instead of a random horizontal band, you see a cone or fan shape. The vertical spread of the errors changes as the predicted value changes.

This isn't some abstract statistical artifact; it happens everywhere.

  • An analytical chemist measures the concentration of a compound. At high concentrations, the signal is strong, but the random fluctuations in the instrument are also larger. The error variance increases with concentration.
  • An engineer uses two different instruments to measure a physical constant. One instrument is simply more precise than the other. If you pool the data, the measurement errors from the two instruments will have different variances.
  • An economist models income based on education. There's much more variability in income among people with PhDs than among people who finished high school. The error variance increases with education level.

So, what's the consequence of this changing noise level? Here's the truly surprising part: even with heteroscedasticity, your OLS estimates for the model's coefficients are still ​​unbiased​​. On average, they still point to the right answer! The problem isn't with the estimate itself, but with our confidence in it.

OLS, being "unweighted," assumes the noise level is the same everywhere. It calculates an average level of noise across all your data. In regions where the true noise is low, OLS overestimates it. In regions where the true noise is high, OLS underestimates it.

This leads to a dangerous deception. Imagine our chemist has an unknown sample with a very high concentration. The true measurement error in this region is large. But the standard OLS formula for the confidence interval uses the averaged error, which is smaller. The result? The calculated confidence interval will be ​​artificially narrow​​. We become overconfident in our result, reporting a precision that is simply not true. It's like thinking you're walking a wide, sturdy bridge when in fact it's a fraying rope. While visual inspection is a great first step, formal tools like the ​​Breusch-Pagan test​​ can give a definitive statistical verdict on whether homoscedasticity is violated.
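
The Breusch-Pagan idea can be sketched in a few lines of NumPy (a simplified, illustrative version, not a substitute for a library implementation): regress the squared residuals on the predictors and compare n·R² against a chi-squared distribution. The data below are simulated so that the noise grows with x.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n = 500
x = rng.uniform(1, 10, n)
y = 3.0 + 0.8 * x + rng.normal(0, 0.3 * x)   # noise scales with x: heteroscedastic

# Fit OLS and collect residuals
X = np.column_stack([np.ones(n), x])
resid = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]

# Breusch-Pagan: regress squared residuals on the predictors;
# under homoscedasticity, n * R^2 follows a chi-squared law (1 df here)
e2 = resid ** 2
fitted = X @ np.linalg.lstsq(X, e2, rcond=None)[0]
r2 = 1 - ((e2 - fitted) ** 2).sum() / ((e2 - e2.mean()) ** 2).sum()
lm = n * r2
p_value = stats.chi2.sf(lm, df=1)
print(lm, p_value)   # large LM statistic, tiny p-value: homoscedasticity rejected
```

In practice one would reach for a packaged implementation (statsmodels ships one), but the hand-rolled version makes the mechanics of the test transparent.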

The Lingering Echo: When Errors Remember the Past (Autocorrelation)

The second part of the "spherical errors" assumption is that errors are independent. When this is violated, especially in data collected over time, we have ​​autocorrelation​​. The errors have a memory.

A classic example comes from finance. You model a stock's daily return based on the market's return. After fitting the model, you notice that a positive residual (the stock did better than the model predicted) is often followed by another positive residual. A negative surprise is followed by a negative surprise. The errors are not independent; they are "sticky". The surprise from yesterday lingers into today.

This seemingly simple statistical pattern can be a profound clue about the underlying reality. Consider a chemist studying a reaction over time. She assumes the temperature is constant and fits a simple first-order kinetic model. After fitting the model, she calculates the Durbin-Watson statistic and finds strong evidence of positive autocorrelation—the residuals drift slowly and systematically, not randomly.

What's happening? The statistical violation is a symptom of a physical reality. Perhaps the reaction vessel is slowly cooling down. Since the reaction rate depends on temperature (via the Arrhenius equation), the "true" rate constant k is not constant but is drifting over time. By forcing a constant-k model onto a changing system, we are left with a "memory" in the errors. The autocorrelation isn't a nuisance; it's a discovery! It's telling us our physical model is incomplete.

Much like heteroscedasticity, autocorrelation does not bias the OLS coefficient estimates. However, it severely distorts our standard errors, typically making them much smaller than they truly are. This again leads to overconfidence, invalidating our hypothesis tests and confidence intervals. We think we've measured a parameter with great precision when, in fact, our uncertainty is much larger.
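
The Durbin-Watson statistic mentioned above is easy to compute by hand. As a hedged illustration (simulated data, invented numbers): the statistic sits near 2 for independent residuals and falls well below 2 when the errors have a positive "memory."

```python
import numpy as np

rng = np.random.default_rng(2)
n = 300

# AR(1) errors: each surprise carries 80% of the previous surprise
e = np.zeros(n)
for t in range(1, n):
    e[t] = 0.8 * e[t - 1] + rng.normal(0, 1.0)

x = np.linspace(0, 10, n)
y = 1.0 + 2.0 * x + e
X = np.column_stack([np.ones(n), x])
resid = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]

# Durbin-Watson: sum of squared successive differences over sum of squares.
# Roughly 2 * (1 - rho); with rho = 0.8 it lands far below 2.
dw = np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2)
print(dw)
```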

The Tangled Web: When Predictors Overlap (Multicollinearity)

The final assumption we'll explore is that of no perfect multicollinearity. Each predictor should bring some new information. But what if they don't? What if two predictors are highly correlated? This is ​​multicollinearity​​.

Imagine you are trying to estimate the separate effects of two correlated traits on an organism's fitness—say, beak length and beak depth in finches. Since birds with long beaks also tend to have deep beaks, the two predictors are tangled together. When you run a regression, the model has a hard time telling how much of the effect on fitness is due to length versus depth. It's like trying to determine the individual contributions of two collaborating singers who always sing in harmony.

The consequences are not about bias—the estimates are still unbiased. The problem is about precision and stability. This is beautifully illustrated by looking at what happens when we add an irrelevant but correlated variable to a model. Suppose the true model is y = β₁x₁ + ε. Now, we naively add a correlated predictor x₂ (that has no true effect) and fit y = γ₁x₁ + γ₂x₂ + ν. The variance of our estimate for the coefficient of x₁ gets inflated by a factor of 1/(1 − r₁₂²), where r₁₂ is the correlation between x₁ and x₂.

This is the famous Variance Inflation Factor (VIF). Think about its implications. If the correlation r₁₂ is 0.9, the variance of your estimate for γ₁ will be 1/(1 − 0.81) ≈ 5.26 times larger than the variance of the estimate for β₁ in the simple model! Your estimate becomes dramatically less precise. The standard error explodes, confidence intervals widen, and the estimate itself can swing wildly with small changes to the data. You lose the ability to make firm statements about the effect of x₁.
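
This variance-inflation arithmetic can be verified numerically. The sketch below (illustrative, with simulated predictors) computes the VIF the standard way—regress x₁ on the other predictor and take 1/(1 − R²)—and confirms it matches 1/(1 − r₁₂²), about 5.26 at a correlation of 0.9.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 1000
x1 = rng.normal(0, 1, n)
# Construct x2 with correlation ~0.9 to x1
x2 = 0.9 * x1 + np.sqrt(1 - 0.9 ** 2) * rng.normal(0, 1, n)

r12 = np.corrcoef(x1, x2)[0, 1]

# VIF for x1: regress x1 on x2 (plus intercept), then 1 / (1 - R^2)
X2 = np.column_stack([np.ones(n), x2])
fitted = X2 @ np.linalg.lstsq(X2, x1, rcond=None)[0]
r2 = 1 - ((x1 - fitted) ** 2).sum() / ((x1 - x1.mean()) ** 2).sum()
vif = 1 / (1 - r2)
print(r12, vif)   # r12 near 0.9, VIF near 1/(1 - 0.81) ≈ 5.26
```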

In complex models, like the polynomial models used in evolutionary biology to study selection, we can distinguish between essential multicollinearity (a real biological correlation between traits) and nonessential multicollinearity (a statistical artifact from including terms like x and x² in the same model). While we can't do anything about the essential kind (it's how the world works), we can often fix the nonessential kind through simple data transformations like mean-centering, which can stabilize our estimates.
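
Mean-centering's effect on nonessential multicollinearity is easy to demonstrate (a toy simulation, not from the text): for a strictly positive predictor, x and x² are nearly collinear, but centering x before squaring removes most of that correlation.

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.uniform(1, 10, 500)     # strictly positive predictor

# Raw x and x^2 are almost perfectly correlated...
r_raw = np.corrcoef(x, x ** 2)[0, 1]

# ...but centering x before squaring breaks the "nonessential" overlap
xc = x - x.mean()
r_centered = np.corrcoef(xc, xc ** 2)[0, 1]

print(r_raw, r_centered)   # roughly 0.98 versus something near zero
```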

The Map and the Territory

The assumptions of OLS are not a bureaucratic checklist. They are a set of lenses for interrogating reality. They form the "map" of an ideal world where our statistical tools work perfectly. When we find that our data—the "territory"—does not match the map, we have not failed. We have made a discovery.

A cone-shaped residual plot tells us that the nature of randomness in our system is more complex than we thought. A lingering, autocorrelated residual pattern can point to a missing dynamic in our scientific model. A tangled web of correlated predictors forces us to be more humble about our ability to isolate and quantify individual causes. By understanding these principles, we move from being mere users of a black-box tool to being thoughtful, critical scientists and detectives of data.

Applications and Interdisciplinary Connections

Now that we have explored the theoretical underpinnings of Ordinary Least Squares (OLS), we might be tempted to think of them as abstract rules in a statistician's handbook. But nothing could be further from the truth. These assumptions are not mere technicalities; they are the very points of contact between our neat mathematical models and the messy, glorious complexity of the real world. Violating them is not just a statistical faux pas; it can lead us to fundamentally misunderstand how nature works.

Let us embark on a journey through different scientific disciplines to see these principles in action. We'll see that understanding the OLS assumptions is not about memorizing a checklist, but about developing a deeper intuition for the structure of the world we are trying to measure—from the dance of molecules to the evolution of life, from the ebb and flow of markets to the health of our planet.

The Assumption of Linearity: Bending the World to Fit a Ruler

The most basic premise of OLS is that we are fitting a straight line to our data. But what if nature doesn't speak in straight lines? Often, it doesn't. In chemistry, the relationship between temperature and the rate of a chemical reaction is famously exponential, described by the Arrhenius equation: k = A·exp(−Eₐ/(RT)). If we naively plot the reaction rate constant k against temperature T and try to fit a line, we'll get a nonsensical result.

Here, a little cleverness saves the day. By taking the natural logarithm of the equation, we transform it into ln k = ln A − (Eₐ/R)·(1/T). Suddenly, we have a linear relationship not between k and T, but between ln k and 1/T. By fitting a straight line to these transformed variables, we can use the familiar power of OLS to extract profound physical quantities like the activation energy (Eₐ) of the reaction. This is a beautiful example of not forcing the world into our model, but transforming our view of the world so that our model becomes a useful tool.
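
A quick numerical sketch of this linearization (with made-up values for A and Eₐ, and noise-free data for clarity) shows how an ordinary straight-line fit recovers the activation energy:

```python
import numpy as np

R = 8.314           # gas constant, J/(mol*K)
A_true = 1.0e10     # hypothetical pre-exponential factor
Ea_true = 75_000.0  # hypothetical activation energy, J/mol

T = np.linspace(300, 400, 8)              # temperatures in kelvin
k = A_true * np.exp(-Ea_true / (R * T))   # Arrhenius rate constants

# Linearize: ln k = ln A - (Ea/R) * (1/T), then fit a straight line
slope, intercept = np.polyfit(1 / T, np.log(k), 1)
Ea_est = -slope * R
A_est = np.exp(intercept)
print(Ea_est, A_est)   # recovers 75000 and 1e10 from the transformed fit
```

With real measurements the points would scatter around the line, and the slope's standard error would translate directly into an uncertainty on Eₐ.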

However, sometimes the problem is that we fail to recognize a non-linear relationship that is staring us in the face. In a complex system like Earth's climate, the effect of rising CO₂ concentrations on global temperature may not be perfectly linear. There could be "tipping points" or accelerating effects. If we estimate a simple model of Temperature vs. CO₂, we might be ignoring a crucial quadratic term (Cₜ²). Omitting this term when it is truly part of the data-generating process leads to a misspecified model, and our estimate for the linear effect will itself be biased and misleading. The world is often curvy, and our linear models must be used with wisdom and a constant eye for what we might be missing.

The Assumption of Independence: When Data Points Are Not Strangers

One of the most subtle but powerful assumptions of OLS is that each of our data points is an independent piece of information. This means that the "error" or "surprise" in one measurement tells us nothing about the error in another. When this assumption holds, each new data point adds fresh, unadulterated information. But what happens when our data points are not strangers, but are related in some way?

The most intuitive example is data collected over time. Imagine a biologist tracking the abundance of a protein in a cell hour by hour after a stimulus. The measurement at hour two is not a complete surprise, given the measurement at hour one; biological processes have continuity. If the protein level was unexpectedly high at hour one, it's likely to be higher than average at hour two as well. This connection between errors over time is called ​​autocorrelation​​. If we ignore it, OLS will still give us an unbiased estimate of the trend, but it will be overconfident. Because the data points are not truly independent, we have less unique information than we think. This leads to standard errors that are too small and a false sense of certainty about our findings.

This problem of non-independence is not limited to time. Consider a network of financial institutions. The health and risk of one bank are not independent of the health of the banks it is connected to. A shock to one bank can spill over to its neighbors through the network. If we model the risk of a bank based on its characteristics, the unobserved factors (the error term) for one bank will be correlated with the errors of its neighbors. This is a form of spatial or network autocorrelation. To ignore it is to miss the systemic nature of risk.

Perhaps the most profound example of non-independence comes from the grand sweep of evolutionary history. When we compare traits across different species—say, body mass and running speed—we cannot treat each species as an independent data point. A lion and a tiger are more similar to each other than either is to a kangaroo, because they share a more recent common ancestor. This shared ancestry means their traits are not independent observations from nature's grand experiment. A simple OLS regression might show a spurious correlation that is merely an artifact of evolutionary history. To address this, biologists use methods like Phylogenetic Generalized Least Squares (PGLS), which explicitly model the non-independence of species based on the "family tree" of life—the phylogeny. It is a stunning reminder that the history of our data matters.

The Assumption of Constant Variance: A World of Uneven Uncertainty

OLS assumes that the amount of random scatter, or "noise," around the true relationship line is the same everywhere. This property is called homoscedasticity. But in many real-world systems, the level of uncertainty is not constant. This condition of non-constant variance is called heteroscedasticity.

Think about the relationship between household income and electricity consumption. We might expect richer households to use more electricity, on average. But they also have far more discretionary ways to use electricity—pools, hot tubs, extensive air conditioning, charging multiple electric vehicles. A lower-income household's consumption is more tightly constrained by basic necessities. As a result, the variability or unpredictability of electricity usage is likely to be much larger for high-income households. If we plot this data, the cloud of points will fan out, becoming wider as income increases. This is a classic violation of the constant variance assumption.

We see the same pattern in financial markets. In a simple model of a stock's return (like Amazon's) versus the market's return (the S&P 500), the error term represents the stock-specific "surprise" not explained by the market. On calm trading days, these surprises are typically small. But during periods of high market volatility, all sorts of dramatic, firm-specific news can break, and the magnitude of these surprises tends to increase. The variance of the error term is conditional on the state of the market, which again violates the homoscedasticity assumption. In both the economics and finance examples, the consequence is the same: our coefficient estimates are still unbiased, but our standard errors are wrong, leading to faulty hypothesis tests and confidence intervals. We are misjudging the certainty of our own conclusions.

The Perils of Collinearity and Endogeneity

Finally, we arrive at two of the most critical assumptions, those governing the relationship between our explanatory variables themselves, and between them and the unseen world of the error term.

First, OLS assumes our explanatory variables are not perfectly redundant. This is the ​​no perfect multicollinearity​​ assumption. In practice, the bigger problem is near-perfect multicollinearity. Imagine building a model in medicinal chemistry to predict a drug's effectiveness based on its molecular properties, or "descriptors". It's common for two descriptors, like a molecule's size and its weight, to be very highly correlated. If we include both in the model, OLS has a difficult time disentangling their individual effects. Is it the size or the weight that matters? Since they move together, the model can't tell. The result is that the coefficient estimates for both variables can become wildly unstable and have enormous standard errors, making them uninterpretable, even if the model as a whole has good predictive power.

The last, and arguably most treacherous, issue is endogeneity, which violates the zero-conditional-mean assumption (E[ε | X] = 0). This assumption says that our explanatory variables must not be correlated with the unobserved error term. When they are, our OLS estimates become biased and inconsistent—they don't just become imprecise, they point in the wrong direction, even with infinite data. A classic cause is omitted variable bias, as we saw in the climate change example. If we omit solar cycles from a model of temperature vs. CO₂, and solar cycles are correlated with CO₂, then the effect of solar cycles gets absorbed into the error term, which is now correlated with our CO₂ variable.
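
Omitted-variable bias can be demonstrated directly in a simulation (all coefficients here are invented for illustration): when a relevant variable z is dropped from the model and z is correlated with x, the OLS slope on x converges not to its true value but to the true value plus a bias term.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 100_000            # large sample: bias, not sampling noise, drives the error

# True model: y depends on both x and an omitted variable z
z = rng.normal(0, 1, n)
x = 0.7 * z + rng.normal(0, 1, n)   # x is correlated with z
y = 1.0 * x + 2.0 * z + rng.normal(0, 1, n)

# Regress y on x alone: z is absorbed into the error term (endogeneity)
X = np.column_stack([np.ones(n), x])
b = np.linalg.lstsq(X, y, rcond=None)[0]

# The slope converges to beta_x + beta_z * cov(x, z) / var(x), not to 1.0
expected = 1.0 + 2.0 * np.cov(x, z)[0, 1] / x.var()
print(b[1], expected)   # both near 1.94, far from the true slope of 1.0
```

Note that more data does not help: the estimate is converging confidently to the wrong number, which is exactly what "inconsistent" means.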

An even more subtle example comes from finance. Suppose we are studying the impact of a public news announcement on a stock's price. But what if there is insider trading? Traders with private information will act before the public announcement, causing price movements that are not yet explained by the public news. These movements become part of the error term. But since both the insider trading and the eventual public news are driven by the same underlying information, our regressor (the news) is now correlated with the error term. OLS will give a biased account of the news's true impact, because it can't separate the effect of the public information from the preceding private information.

The Wisdom of a Modeler

This tour of the sciences might seem like a litany of depressing problems. But the message is one of empowerment. By understanding where and how the assumptions of our simple linear model can fail, we become better scientists. We learn to check our residuals, to think critically about where our data comes from, and to appreciate that sometimes, the right answer is that OLS is simply the wrong tool for the job. For count data like the number of patents a company files, the inherent nature of the data (non-negative integers, variance related to the mean) violates OLS assumptions so fundamentally that a different approach, like Poisson regression, is a much better starting point.

The art and science of modeling is not about finding the one "true" model, for such a thing rarely exists. It is about a conversation between our theories and the data, a conversation in which OLS is a powerful but demanding participant. It demands that we think deeply about the structure of our problem—about linearity, independence, variance, and causality. By honoring these demands, we move beyond just fitting lines and closer to genuine understanding.