
Can a tool designed for predicting continuous values, like linear regression, be repurposed for a sorting task like classification? This question initiates a journey into one of machine learning's most instructive thought experiments. While the immediate answer involves a simple and elegant mathematical trick, this approach is fraught with fundamental flaws. This article explores the paradoxical nature of using linear regression for classification, treating it not as a recommended technique, but as a "distorted lens" that reveals the deep, unifying principles of modern data science.
The reader will first explore the core "Principles and Mechanisms" of this method. This includes understanding the mechanics of least-squares classification, its critical weaknesses regarding outliers and uncalibrated outputs, and why models like logistic regression are generally superior. We will also uncover its surprising second act in high-dimensional regimes, where it connects to the frontier concepts of double descent and implicit regularization. Following this, the "Applications and Interdisciplinary Connections" chapter broadens the perspective, using the model's failures to illustrate fundamental concepts in dimensionality reduction, the importance of feature scaling, and even the societal implications of algorithmic fairness. By pushing a simple tool to its limits, we gain a richer understanding of the entire machine learning landscape.
Let's begin with a question so simple it feels almost foolish: can we use a tool for drawing lines—linear regression—to solve a problem about sorting things into boxes—classification? Imagine you have data points belonging to two classes, say, "Class 0" and "Class 1". Regression is designed to predict continuous numbers, like temperature or price. Classification is about predicting a discrete label. How could one possibly do the job of the other?
The most direct approach is to simply pretend the class labels are numbers. We can assign the value $0$ to every point in Class 0 and the value $1$ to every point in Class 1. Now, we have a set of points $(x_i, y_i)$, and we can ask our familiar friend, linear regression, to find the best line (or hyperplane in higher dimensions) that fits this data. The model takes the form $f(x) = w^\top x + b$, where $w$ is a vector of weights and $b$ is an intercept. The goal is to find the parameters that minimize the sum of squared differences between the predicted scores $f(x_i)$ and the numerical labels $y_i$.
Once we have this line, how do we make a classification? A natural rule suggests itself: if the model's output for a new point is closer to 1 than to 0, we'll predict Class 1. If it's closer to 0, we'll predict Class 0. The decision boundary, the point of perfect indecision, would be where the score is exactly halfway: $f(x) = 1/2$.
Alternatively, we could have labeled our classes as $-1$ and $+1$. In this case, the natural decision boundary is where the score is zero: $f(x) = 0$. A positive score means we lean towards class $+1$, and a negative score means we lean towards class $-1$. This setup, as we will see, has some rather elegant properties. In either case, the decision boundary is a straight line (or a flat plane), a linear classifier.
This method, which we can call least-squares classification, has a straightforward mathematical engine. The task is to minimize the total squared error, an objective function we can write as $L(w, b) = \sum_{i=1}^{n} \left( w^\top x_i + b - y_i \right)^2$. Using the tools of calculus, we can find the exact parameters that minimize this loss. The solution is found by setting the gradient of the loss function to zero, which gives rise to a famous set of linear equations known as the normal equations.
In a compact matrix form, if we augment our feature matrix with a column of ones (let's call it $\tilde{X}$) and stack our parameters into a single vector $\tilde{w} = (b, w)$, the normal equations are simply:
$$\tilde{X}^\top \tilde{X} \, \tilde{w} = \tilde{X}^\top y.$$
If the matrix $\tilde{X}^\top \tilde{X}$ is invertible, we can solve for $\tilde{w}$ directly:
$$\tilde{w} = \left( \tilde{X}^\top \tilde{X} \right)^{-1} \tilde{X}^\top y.$$
This gives us a closed-form, analytical solution. No iterative searching required; it's a one-shot calculation. If $\tilde{X}^\top \tilde{X}$ is not invertible (which can happen if some features are redundant), a solution still exists and can be found using the Moore-Penrose pseudoinverse, which gives the solution with the smallest possible norm.
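The whole pipeline fits in a few lines of NumPy. This is a minimal sketch with invented toy data (two Gaussian blobs); the pseudoinverse handles both the invertible and the rank-deficient case:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two well-separated Gaussian blobs, labeled 0 and 1 (toy data for illustration).
X0 = rng.normal([-1.0, -1.0], 0.5, size=(50, 2))
X1 = rng.normal([+1.0, +1.0], 0.5, size=(50, 2))
X = np.vstack([X0, X1])
y = np.concatenate([np.zeros(50), np.ones(50)])

# Augment with a column of ones so the intercept rides along in one vector.
Xt = np.hstack([np.ones((100, 1)), X])

# The pseudoinverse gives the least-squares solution of the normal equations
# (and the minimum-norm one when X'X is singular).
w = np.linalg.pinv(Xt) @ y

# Classify by proximity: predict Class 1 when the score is at least 1/2.
pred = (Xt @ w >= 0.5).astype(float)
accuracy = (pred == y).mean()
```

A quick sanity check is that `w` really satisfies the normal equations: `Xt.T @ Xt @ w` equals `Xt.T @ y` up to floating-point error.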
So, we have a simple method with an elegant, exact solution. The story should end here, right? A triumph of simplicity! But Nature, as she often does, has a few surprises in store. When we look closer, we find that this simple idea has some profound, and instructive, flaws.
The greatest virtue of least-squares regression—its mathematical simplicity—stems from its loss function: the sum of squared errors. But this is also its greatest weakness. By squaring the error, we give immense power to points that are far from the regression line. A point that is twice as far from the line contributes four times the error. A point ten times as far contributes one hundred times the error. The model becomes obsessed with placating these distant, demanding points.
Now, imagine this in our classification context. Suppose we have a set of well-behaved data points that are nicely separated. Our least-squares classifier finds a perfectly reasonable boundary. Then, a single new point arrives. It is far away from the other data (a "high-leverage" point), and due to some error, it's given the wrong label. For instance, imagine our initial data suggests a line with a modest positive slope. The new point sits far out along the feature axis, where the model would expect a strongly positive score. But what if it's mislabeled with the opposite class?
The squared error for this one point is enormous. To reduce this gigantic penalty, the least-squares method will do something drastic: it will tilt the entire line, sacrificing the good fit on all the other points just to get closer to this one troublesome outlier. A perfectly good classifier with a healthy positive slope can be violently tilted towards a slope near zero, or even a negative one, by a single, mislabeled, high-leverage point.
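A tiny one-dimensional experiment makes the tilting effect concrete. This is a sketch with invented data: six clean points labeled by sign, plus one distant mislabeled point:

```python
import numpy as np

# One-dimensional data with -1/+1 labels, perfectly consistent with a rising line.
x = np.array([-3.0, -2.0, -1.0, 1.0, 2.0, 3.0])
y = np.sign(x)                      # -1 on the left, +1 on the right

def fit_line(x, y):
    """Ordinary least squares for y ~ slope * x + intercept."""
    A = np.column_stack([x, np.ones_like(x)])
    (slope, intercept), *_ = np.linalg.lstsq(A, y, rcond=None)
    return slope, intercept

slope_clean, _ = fit_line(x, y)

# One distant, mislabeled point: it sits at x = 30 but carries label -1.
x_bad = np.append(x, 30.0)
y_bad = np.append(y, -1.0)
slope_bad, _ = fit_line(x_bad, y_bad)

# The single outlier drags the slope from clearly positive to roughly zero
# (here it even dips slightly negative), wrecking the classifier.
```

One point out of seven is enough to flatten the entire fit, because its squared error dominates the objective.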
This extreme sensitivity makes least-squares classification a brittle and unreliable method. It lacks robustness. It's like a political system where the person who shouts the loudest gets all the attention. Other methods, like logistic regression, use a gentler loss function that doesn't panic so much about outliers. Robust regression methods even use "redescending" loss functions, like Tukey's biweight loss, where the influence of a point actually decreases and eventually drops to zero once its error becomes too large. The model essentially learns to ignore points that are pathologically inconsistent with the rest of the data.
A second, more subtle issue arises. The output of our least-squares classifier, $f(x)$, is just a raw score. We decided to use $1/2$ as a threshold, but is a score of, say, $0.7$ meaningfully different from one of $0.9$? Does a score of $0.7$ mean the point has a $70\%$ chance of being in Class 1? Not at all. The scores are not calibrated probabilities.
This is where logistic regression enters the stage as the protagonist. Instead of fitting a line directly to the 0/1 labels, logistic regression models the probability of belonging to a class. It is a discriminative model; it doesn't make any assumptions about how the data is distributed, unlike generative models like Linear Discriminant Analysis (LDA) which assume the data in each class comes from a Gaussian distribution.
Logistic regression proposes that the logarithm of the odds of being in Class 1 is a linear function of $x$:
$$\log \frac{P(y = 1 \mid x)}{1 - P(y = 1 \mid x)} = w^\top x + b.$$
By solving for the probability $P(y = 1 \mid x)$, we get the famous sigmoid function, $P(y = 1 \mid x) = \frac{1}{1 + e^{-(w^\top x + b)}}$. The beauty of this formulation is that its output is always between 0 and 1 and, when trained properly by maximizing the likelihood of the data, it yields a model that is probability-calibrated. This means a predicted probability of $0.8$ can be interpreted as an $80\%$ confidence that the point belongs to Class 1. This is a far more useful and interpretable output than the arbitrary score from least squares.
The superiority of a probabilistic output becomes even clearer when we consider how to evaluate our models. A simple metric like accuracy can be dangerously misleading, especially with imbalanced classes. If a disease affects only $1\%$ of the population, a trivial classifier that always predicts "healthy" will be $99\%$ accurate, but it's completely useless because it has zero recall for the sick patients. This is the accuracy paradox. A probabilistic classifier allows us to use more nuanced evaluation metrics like the Brier score or the Area Under the ROC/PR Curve, which assess the quality of the probabilities themselves, not just the final hard classification.
Interestingly, the machinery to fit a logistic regression model, which seems so different from least squares, is secretly connected. The most common algorithm, Iteratively Reweighted Least Squares (IRLS), solves for the logistic regression parameters by solving a sequence of Weighted Least Squares problems, where the weights are cleverly derived from the model's own variance at each step.
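The IRLS connection can be sketched in a few lines. The following is a minimal illustrative implementation, not production code: the toy data, the small ridge term for numerical stability, and the weight floor are all my own choices:

```python
import numpy as np

def sigmoid(z):
    # Clipping guards against overflow in exp for extreme scores.
    return 1.0 / (1.0 + np.exp(-np.clip(z, -30.0, 30.0)))

def logistic_irls(X, y, n_iter=15):
    """Fit logistic regression by Iteratively Reweighted Least Squares.

    X : (n, d) design matrix (include a column of ones for the intercept).
    y : (n,) labels in {0, 1}.
    """
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = sigmoid(X @ w)
        s = np.maximum(p * (1.0 - p), 1e-9)   # weights = the model's own variance
        z = X @ w + (y - p) / s               # the "working response"
        # One weighted least-squares solve: min_w sum_i s_i * (z_i - x_i . w)^2
        XtS = X.T * s
        w = np.linalg.solve(XtS @ X + 1e-8 * np.eye(X.shape[1]), XtS @ z)
    return w

# Overlapping toy classes (invented for illustration).
rng = np.random.default_rng(0)
A = rng.normal([-1.0, -1.0], 1.5, size=(100, 2))
B = rng.normal([+1.0, +1.0], 1.5, size=(100, 2))
X = np.column_stack([np.ones(200), np.vstack([A, B])])
y = np.concatenate([np.zeros(100), np.ones(100)])

w = logistic_irls(X, y)
probs = sigmoid(X @ w)
accuracy = ((probs >= 0.5) == y).mean()
```

Each iteration is literally a weighted least-squares problem; only the weights $s_i = p_i(1 - p_i)$ change from step to step.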
So far, the story seems to be a cautionary tale: don't use regression for classification. But modern machine learning has taught us that this simple idea has a surprising and profound second act. This happens when we venture into the strange world of high dimensions, where the number of features $d$ is much larger than the number of data points $n$ ($d \gg n$).
In this "overparameterized" regime, our intuitions from classical statistics break down. With more dimensions than data points, a linear model has so much freedom that it can perfectly fit any set of labels. With probability one, you can find a weight vector that produces exactly the desired output for every single training point ($f(x_i) = y_i$ for all $i$). This is a manifestation of the curse of dimensionality. It seems like the ultimate recipe for overfitting—the model has just memorized the training data.
The model that does this with the smallest possible parameter norm is called the minimum-norm interpolator. Classical theory would predict that such a model, which achieves zero training error by brute force, should generalize terribly to new data. And indeed, as the number of features $d$ approaches the number of samples $n$, the test error blows up towards infinity.
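The minimum-norm interpolator is exactly what the pseudoinverse computes. A minimal sketch with invented random data shows it fitting arbitrary labels perfectly, and shows that every other interpolating solution is longer:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 20, 200                           # many more features than samples
X = rng.normal(size=(n, d))
y = rng.choice([-1.0, 1.0], size=n)      # completely arbitrary labels

# Among all w with X @ w = y, the pseudoinverse picks the shortest one.
w = np.linalg.pinv(X) @ y
train_residual = np.max(np.abs(X @ w - y))   # essentially zero: perfect interpolation

# Any other interpolator = w + (something in the null space of X), and is longer.
v = rng.normal(size=d)
v_null = v - np.linalg.pinv(X) @ (X @ v)     # project v onto the null space of X
w_other = w + v_null                          # still interpolates the labels exactly
```

Because `w` lies in the row space of `X` and `v_null` is orthogonal to it, `w_other` has strictly larger norm while fitting the training data just as perfectly.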
But something magical happens. As we continue to add features, pushing $d$ far beyond $n$, the test error, after peaking, starts to decrease again! This phenomenon is known as double descent. It turns out that in the massively overparameterized regime, not all perfect solutions are created equal.
The final piece of the puzzle lies not in the model, but in the algorithm we use to train it. The simple algorithm of gradient descent, starting from zero weights, has a hidden bias. It doesn't just find any solution that fits the data; it implicitly guides the model towards a very special kind of solution, one with remarkable properties.
When we use gradient descent on the squared error loss (our original least-squares classifier), stopping the training process early has an effect equivalent to $\ell_2$ (Ridge) regularization. It preferentially learns the "simple" patterns in the data (associated with large singular values of the data matrix) and shrinks the components corresponding to more complex, noisy patterns. Early stopping acts as a built-in defense against overfitting.
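One visible symptom of this implicit regularization: when gradient descent starts from zero, the weight norm grows monotonically towards the unregularized least-squares solution, so stopping early keeps the weights small, just as an $\ell_2$ penalty would. A sketch with invented noisy data (the step size and checkpoints are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(7)
n, d = 50, 30
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + rng.normal(scale=2.0, size=n)   # noisy regression targets

eta = 1e-3          # small step size keeps gradient descent stable here
w = np.zeros(d)
snapshots = {}
for t in range(1, 5001):
    w -= eta * (X.T @ (X @ w - y))   # gradient of 0.5 * ||Xw - y||^2
    if t in (50, 5000):
        snapshots[t] = w.copy()

w_ls = np.linalg.pinv(X) @ y         # the fully converged least-squares solution

# Early stopping = smaller norm: ||w_50|| < ||w_5000|| ~ ||w_ls||.
norm_early = np.linalg.norm(snapshots[50])
norm_late = np.linalg.norm(snapshots[5000])
```

By iteration 5000 the iterate has essentially reached the least-squares solution; at iteration 50 it is a shrunken version of it, with the large-singular-value directions learned first.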
When we use gradient descent on the logistic loss, something even more astonishing occurs. As the algorithm runs, the norm of the weight vector grows towards infinity. But the direction of the weight vector converges to the unique solution of a hard-margin Support Vector Machine (SVM)! The algorithm implicitly seeks out the decision boundary that has the maximum possible geometric margin from the data points of both classes.
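Both halves of this claim (the diverging norm and the stabilizing direction) are easy to watch numerically. A sketch with invented separable data; the step size and checkpoints are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(2)

# Linearly separable two-class data with labels in {-1, +1}.
X = np.vstack([rng.normal([-2.0, -2.0], 0.5, size=(30, 2)),
               rng.normal([+2.0, +2.0], 0.5, size=(30, 2))])
y = np.concatenate([-np.ones(30), np.ones(30)])

def grad(w):
    """Gradient of the mean logistic loss (1/n) sum_i log(1 + exp(-y_i x_i . w))."""
    margins = y * (X @ w)
    return -(X * (y / (1.0 + np.exp(margins)))[:, None]).mean(axis=0)

w = np.zeros(2)
checkpoints = {}
for t in range(1, 20001):
    w -= 0.1 * grad(w)
    if t in (1000, 20000):
        checkpoints[t] = w.copy()

norm_early = np.linalg.norm(checkpoints[1000])
norm_late = np.linalg.norm(checkpoints[20000])
dir_early = checkpoints[1000] / norm_early
dir_late = checkpoints[20000] / norm_late
# The norm keeps growing without bound, but the direction barely moves:
# it is converging to the hard-margin SVM direction.
```

Running it longer only inflates the norm further; the unit vector `w / ||w||` is what settles down.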
So, the very algorithm we use to find the solution imparts a form of implicit regularization, a hidden wisdom that steers the overparameterized model towards a solution that generalizes well. The choice of loss function—squared error versus logistic loss—imprints a fundamentally different character on this implicit bias.
What began as a simple, naive idea—using regression for classification—has taken us on a journey through the core principles of modern machine learning. We discovered its flaws—sensitivity to outliers and lack of probabilistic grounding—which led us to appreciate the elegance of logistic regression. But then, in the high-dimensional world, we found that this simple idea, when paired with a simple algorithm, contains hidden depths, connecting to regularization, max-margin classifiers, and the surprising frontier of the double descent phenomenon. The "flawed" method, it turns out, was a wonderful teacher all along.
Now that we’ve explored the machinery of using linear regression for classification, you might be left with a nagging question. We’ve seen it’s a bit of a clumsy tool, theoretically flawed and often outperformed by methods actually designed for the task. So, why did we bother? Why spend time on an idea that seems, at first glance, to be a “wrong” use of a tool?
The answer, and the reason this journey is so worthwhile, is that by pushing a simple tool beyond its intended purpose, we uncover a breathtaking landscape of connections. We start to see the deep, unifying principles that tie together seemingly disparate fields of statistics, machine learning, and even social science. Studying the failures and quirks of linear regression as a classifier is like using a distorted lens; it reveals the hidden light paths and fundamental structures of the world of data that a perfect lens would simply focus without comment. In this chapter, we’ll embark on an exploration of these connections, applications, and consequences.
Let's first confront the obvious. A linear model draws a straight line (or a flat plane in higher dimensions). A classification task requires drawing a boundary, which might be curved, twisted, or even broken into pieces. What happens when the boundary we need simply isn't a line?
Consider the famous "exclusive-or" (XOR) problem. Imagine a dataset where the label is "true" if either feature $x_1$ is high or feature $x_2$ is high, but not both. This creates a checkerboard pattern of classes. A single straight line is utterly powerless to separate these classes; no matter how you draw it, you will always make a substantial number of errors. A flexible model like a decision tree can easily solve this by making two simple, axis-aligned cuts, effectively isolating the four regions. This is the most fundamental limitation: a linear model is only suitable when the classes are, in fact, linearly separable.
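On the four canonical XOR points, least squares doesn't just do badly, it gives up entirely. A minimal sketch:

```python
import numpy as np

# The four XOR points: the label is 1 exactly when the two features differ.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y = np.array([0.0, 1.0, 1.0, 0.0])

# Least-squares fit with an intercept.
A = np.column_stack([np.ones(4), X])
w, *_ = np.linalg.lstsq(A, y, rcond=None)

scores = A @ w
# Both feature coefficients come out zero and every score is exactly 0.5:
# the best-fitting line has no opinion at all, so no threshold can separate the classes.
```

The fitted plane is the constant $1/2$, because neither feature is linearly correlated with the XOR label.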
But the awkwardness runs deeper than just geometry. It goes to the very heart of how we measure "success." In regression, we typically measure how well our line fits the data using a metric like $R^2$, the "proportion of variance explained." The goal is to minimize the squared distance between our predictions and the true values. But what does "variance" even mean for a binary, yes/no outcome?
If we apply linear regression directly to a binary target—a setup called the Linear Probability Model (LPM)—we often find ourselves in a strange situation. The calculated $R^2$ might be incredibly low, say $0.2$, suggesting a terrible fit. Yet, if we use the model's output to make classifications, the accuracy might be quite reasonable. This happens because $R^2$ is answering the wrong question. It's telling us we're doing a poor job of predicting the exact values of $0$ and $1$, which is a strange goal to begin with. A proper classification model, like logistic regression, is optimized using likelihood, a measure of how well the model's predicted probabilities explain the observed outcomes—a much more natural fit for the problem. Comparing the paltry adjusted $R^2$ from an LPM to a more meaningful pseudo-$R^2$ from logistic regression on the same data often reveals that the regression framework is simply measuring the wrong thing.
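The low-$R^2$, decent-accuracy dissonance is easy to reproduce. A sketch with invented data where the true class probability follows a logistic curve:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 2000
x = rng.normal(size=n)
# Ground truth: P(y = 1 | x) rises smoothly with x (a logistic curve).
p = 1.0 / (1.0 + np.exp(-x))
y = (rng.random(n) < p).astype(float)

# Linear Probability Model: plain OLS of the 0/1 labels on x.
A = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(A, y, rcond=None)
yhat = A @ beta

r2 = 1.0 - np.sum((y - yhat) ** 2) / np.sum((y - y.mean()) ** 2)
accuracy = ((yhat >= 0.5) == (y == 1.0)).mean()
# r2 is unimpressive even though thresholding the very same fit classifies decently.
```

The squared-error yardstick penalizes the model for not predicting the unpredictable label noise; the classification yardstick only cares which side of the threshold each point lands on.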
This leads to a final, critical issue: the outputs of an LPM are not probabilities. A line can easily shoot past $1$ or dip below $0$. What does a predicted "probability" of $1.3$ or even $-0.2$ mean? They are uncalibrated and nonsensical. A well-calibrated classifier, by contrast, provides outputs that can be trusted as true probabilities: if it predicts a 70% chance of rain, it should actually rain on about 70% of the days it makes that prediction. Metrics like Expected Calibration Error (ECE) are designed to measure this trustworthiness, and on this front, models built for classification shine while the LPM typically fails.
The mismatch between regression and classification can be understood through a beautiful analogy involving dimensionality reduction—the art of simplifying complex data. Imagine you have data with many features, and you want to reduce it to just one or two dimensions to make it easier to work with.
One of the most famous tools for this is Principal Component Analysis (PCA). PCA is, in its soul, a regression-minded algorithm. It looks at the cloud of data points and asks: "In which direction does this cloud vary the most?" It finds the axes of maximum variance and projects the data onto them. This is often exactly what you want for a regression task, as the directions of high variance are frequently the ones that contain the most information about the outcome.
But what about for classification? The goal of classification isn't to explain variance; it's to find separation between groups. What if the crucial information that distinguishes two classes lies along a direction of very low variance? PCA, with its unsupervised, regression-minded eyes, would see this direction as unimportant "noise" and discard it. It would be like trying to find a whispered conversation in a noisy room by only listening to the loudest sounds—you'd miss the signal entirely.
A supervised tool like Linear Discriminant Analysis (LDA), by contrast, is classification-minded. It explicitly looks for the direction that best separates the means of the classes, while simultaneously minimizing the variance within each class. It doesn't care if that direction is "loud" or "quiet" in terms of overall variance; it only cares if it's discriminative.
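The "loud but useless vs. quiet but discriminative" contrast can be staged directly. A sketch with invented data: the classes differ only along a low-variance axis, while a high-variance axis carries no class information at all:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 500
# Huge shared variance along axis 0; the classes differ only along axis 1.
noise = rng.normal(scale=[10.0, 0.5], size=(2 * n, 2))
shift = np.vstack([np.tile([0.0, -1.0], (n, 1)), np.tile([0.0, +1.0], (n, 1))])
X = noise + shift
y = np.concatenate([np.zeros(n), np.ones(n)])

# PCA: the top principal component is the leading right-singular vector.
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
pca_dir = Vt[0]

# Two-class LDA direction: Sw^{-1} (mu1 - mu0), normalized.
mu0, mu1 = X[y == 0].mean(axis=0), X[y == 1].mean(axis=0)
Sw = np.cov(X[y == 0].T) + np.cov(X[y == 1].T)
lda_dir = np.linalg.solve(Sw, mu1 - mu0)
lda_dir /= np.linalg.norm(lda_dir)

# PCA picks the loud, uninformative axis 0; LDA picks the quiet, discriminative axis 1.
```

Projecting onto `pca_dir` mixes the classes completely; projecting onto `lda_dir` separates them cleanly.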
This powerful contrast serves as a perfect allegory for our main topic. Using linear regression for classification is like using PCA for classification-oriented dimensionality reduction. It imposes a regression objective—minimizing squared error, a variance-like quantity—onto a problem whose true goal is class separation. Sometimes this works by coincidence, but when the directions of variation and separation diverge, the approach can fail spectacularly.
Here is where our journey takes a turn from criticism to appreciation. By comparing the mathematics of regression and classification, we uncover profound similarities that reveal a deep unity in the world of statistical modeling.
Consider the concept of model "confidence." In a multiclass classification model, we might have a "temperature" parameter, $T$. When $T$ is low, the model's predicted probabilities become very sharp and "confident" (e.g., 99% for one class, tiny fractions for others). When $T$ is high, the probabilities become soft and "uncertain," closer to a uniform guess. Lowering the temperature makes the loss function's landscape steeper and more curved, which can make optimization trickier.
Is there an echo of this in linear regression? Amazingly, yes. In a probabilistic view of regression, we often assume the data points are scattered around the true line with some Gaussian (normal) noise of variance $\sigma^2$. This $\sigma^2$ is a measure of our uncertainty about the data. If $\sigma^2$ is small, we believe the data is very precise and lies close to the line. If $\sigma^2$ is large, we believe the data is noisy.
The beautiful connection is this: the role of $T$ in classification is mathematically analogous to the role of $\sigma^2$ in regression. Decreasing the noise variance $\sigma^2$ in regression is like decreasing the temperature $T$ in classification. Both actions signal higher confidence in the data, and both have the exact same effect of increasing the curvature of the loss function. This isn't a mere coincidence; it's a sign that the fundamental trade-offs between confidence, uncertainty, and optimization difficulty are shared across these different domains.
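A brief sketch of the shared mathematics makes the analogy precise (constants suppressed; the binary-logistic case stands in for the multiclass one):

```latex
% Regression: Gaussian noise with variance \sigma^2 gives the negative log-likelihood
-\log p(y \mid x, w) \;=\; \frac{1}{2\sigma^2}\,\bigl(y - w^\top x\bigr)^2 + \text{const},
\qquad
\nabla_w^2 \;=\; \frac{1}{\sigma^2}\, x x^\top .

% Classification: a temperature-scaled sigmoid p = \sigma\!\bigl(w^\top x / T\bigr)
% divides the logits by T, so the cross-entropy loss has curvature
\nabla_w^2 \;=\; \frac{p\,(1 - p)}{T^2}\, x x^\top .
```

In both expressions the Hessian is the same rank-one matrix $x x^\top$ scaled by the confidence parameter: shrinking $\sigma^2$ or shrinking $T$ steepens the loss landscape in exactly the same way.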
Another fascinating parallel emerges when we consider reweighting samples. In regression, Weighted Least Squares (WLS) assigns each observation a weight inversely proportional to its noise variance, so that unreliable, high-variance measurements count for less in the fit. In survey statistics and causal inference, Inverse Probability Weighting (IPW) assigns each observation a weight inversely proportional to its probability of being sampled, so that under-represented parts of the population count for more.
In both cases, we are reweighting data points. But the logic is inverted. WLS down-weights unreliable points to improve precision. IPW up-weights under-observed points to improve accuracy (reduce bias). This comparison highlights the subtle but crucial differences in the goals of regression (estimation efficiency) and classification-related population inference (bias correction).
The conceptual connections we've discussed have very real practical consequences. Consider the mundane task of feature scaling. Should you standardize your features to have zero mean and unit variance before feeding them to a model?
The answer depends entirely on the model. For a decision tree, which only cares about the ordering of values within a feature, scaling is irrelevant. But for many linear models, it is absolutely critical. An unregularized linear or logistic regression is, perhaps surprisingly, immune to scaling. The model can simply adjust its coefficients to compensate perfectly. However, the moment we introduce regularization—a vital technique for preventing overfitting by penalizing large coefficient values—scaling becomes paramount.
A standard penalty term, $\lambda \sum_j w_j^2$, treats all coefficients equally. But if feature $x_1$ is measured in meters (ranging from 0 to 1000) and feature $x_2$ is a 0/1 indicator, any coefficient for $x_1$ will naturally be much smaller than the coefficient for $x_2$ needs to be to have a comparable effect. The regularization term, blind to this fact, will unfairly penalize the model for using feature $x_2$. Standardizing the features puts them on a level playing field, allowing the regularization to do its job properly. This applies equally to regularized linear regression and other popular linear classifiers like Support Vector Machines (SVMs).
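A small ridge-regression sketch makes the unfairness visible. The data, the true coefficients ($0.002$ per meter, $2.0$ for the indicator), and the penalty strength are all invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 200
meters = rng.uniform(0.0, 1000.0, n)              # feature on a huge scale
flag = rng.integers(0, 2, n).astype(float)        # 0/1 indicator feature
# Both features carry a real effect on the target.
y = 0.002 * meters + 2.0 * flag + rng.normal(scale=0.1, size=n)

def ridge(X, y, lam):
    """Centered ridge regression: solve (Xc'Xc + lam*I) w = Xc'yc."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    return np.linalg.solve(Xc.T @ Xc + lam * np.eye(X.shape[1]), Xc.T @ yc)

X = np.column_stack([meters, flag])
w_raw = ridge(X, y, lam=100.0)            # penalty applied on the raw scales

scale = X.std(axis=0)
w_std = ridge(X / scale, y, lam=100.0)    # same penalty, standardized features

flag_effect_raw = w_raw[1]                # heavily shrunk below the true 2.0
flag_effect_std = w_std[1] / scale[1]     # back in original units: far less distorted
```

On the raw scales, the indicator's large coefficient absorbs nearly all of the penalty while the meters coefficient is left almost untouched; after standardization, both effects are shrunk proportionally.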
Finally, let's revisit the idea of a probabilistic output. While the simple LPM fails to produce valid probabilities, more sophisticated regression frameworks can. A Bayesian linear regression model, for example, doesn't just output a single prediction; it can output an entire predictive distribution. This distribution tells us not only the most likely outcome but also the full range of possibilities and our uncertainty about them. This is immensely powerful. In a field like medical diagnostics or finance, the cost of an error is not symmetric. A false negative (missing a disease) can be far more catastrophic than a false positive (a needless follow-up test). By using a regression framework that quantifies uncertainty, we can move beyond simple classification and into the realm of risk-based decision making, where we can set decision thresholds based not just on the predicted outcome, but on the potential costs of being wrong.
We end our tour in a place that might seem far from the mathematics of lines and planes: the domain of ethics and fairness. The models we build are not abstract entities; they are increasingly used to make high-stakes decisions about people's lives—in hiring, loan applications, and criminal justice.
What happens when we apply a simple linear model to data from a society where historical biases are present? Suppose the statistical properties of our features—say, income and credit history—have different distributions for different demographic groups due to systemic inequities. A linear model, being a purely mathematical creature, will learn a decision boundary based on the pooled data. Because the input distributions differ, the model's predictions will almost certainly not be independent of the sensitive group attribute $A$. For example, the model might have a different positive prediction rate for two groups, $P(\hat{Y} = 1 \mid A = a) \neq P(\hat{Y} = 1 \mid A = b)$, violating a fairness criterion known as Demographic Parity.
This is not a malicious act by the algorithm; it is a direct mathematical consequence of its sensitivity to the statistics of its inputs. The field of algorithmic fairness grapples with this challenge. One proposed strategy involves preprocessing the data itself, applying transformations to align the feature distributions across different groups before the model ever sees them. The goal is to create a "fairer" representation of the data, in which a downstream classifier is less likely to perpetuate or amplify existing societal biases.
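Both the parity violation and the preprocessing remedy can be sketched in a few lines. Everything here is invented for illustration: the group-dependent income distributions, the "approve above the pooled mean" rule, and the per-group z-score as the alignment transform:

```python
import numpy as np

rng = np.random.default_rng(6)
n = 5000
group = rng.integers(0, 2, n)   # sensitive attribute A (two groups)
# Systemic inequity: the income feature is distributed differently per group.
income = rng.normal(loc=np.where(group == 0, 60.0, 45.0), scale=10.0)

# A group-blind linear rule trained on pooled data: approve above the pooled mean.
approve = income > income.mean()
rate0 = approve[group == 0].mean()
rate1 = approve[group == 1].mean()
parity_gap = abs(rate0 - rate1)          # large gap: Demographic Parity is violated

# Preprocessing remedy: align the feature distributions across groups
# (here, a per-group z-score) before any downstream rule sees the data.
income_aligned = income.copy()
for g in (0, 1):
    mask = group == g
    income_aligned[mask] = (income[mask] - income[mask].mean()) / income[mask].std()
approve2 = income_aligned > income_aligned.mean()
gap_after = abs(approve2[group == 0].mean() - approve2[group == 1].mean())
```

The rule itself never looks at the group attribute, yet the gap in approval rates is large; aligning the distributions closes it. Whether such alignment is the right notion of fairness for a given application is, of course, a substantive question that mathematics alone cannot settle.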
This brings us full circle. Our exploration of the seemingly simple, "wrong" idea of using linear regression for classification has led us from geometry and metrics to deep structural analogies between different model families, and finally to some of the most pressing ethical questions of our time. It teaches us that the most profound lessons are often learned not when our tools work perfectly, but when we push them to their limits and carefully study how and why they break.