
Least Squares Problems

Key Takeaways
  • Least squares solves overdetermined systems by finding the solution $\hat{\mathbf{x}}$ that minimizes the squared error, which geometrically corresponds to projecting the data vector onto the model's column space.
  • The normal equations ($A^T A \hat{\mathbf{x}} = A^T \mathbf{b}$) provide a direct algebraic solution but are numerically unstable because they square the problem's condition number.
  • Methods like QR factorization and the Singular Value Decomposition (SVD) offer numerically stable solutions by avoiding the formation of $A^T A$, with SVD being the most robust for handling ill-conditioned or rank-deficient problems.
  • The least squares principle is a foundational concept with applications far beyond curve fitting, including machine learning, recommender systems, and state estimation in navigation via the Kalman filter.

Introduction

In nearly every field of science and engineering, we are faced with a common challenge: extracting a clear signal from noisy data. Whether tracking a planet, modeling a biological process, or analyzing market trends, our measurements are rarely perfect, and we often have far more data than we have parameters in our theoretical model. This leads to overdetermined systems of equations—systems that have no exact solution. The central question then becomes: if a perfect fit is impossible, what is the "best possible" fit? This gap, between the desire for an exact answer and the reality of messy data, is where the theory of least squares problems provides a powerful and elegant solution.

This article explores the fundamental concepts behind least squares, transforming an intuitive problem into a precise mathematical framework. It is structured to guide you from the foundational theory to its powerful real-world impact. The first section, "Principles and Mechanisms," will unpack the geometric beauty of least squares as a projection, derive the famous normal equations, and confront the practical dangers of numerical instability, leading us to more robust algorithms. Following this, the "Applications and Interdisciplinary Connections" section will showcase the remarkable versatility of this method, demonstrating how the same core idea empowers everything from curve fitting and machine learning to spacecraft navigation.

Principles and Mechanisms

The Problem of the Best Fit

Imagine you are an engineer tracking a small object flying through the air. You've collected a few data points of its height at different times, but the measurements are a bit noisy, maybe due to air currents or imperfections in your measurement device. You have a theoretical model in mind, perhaps a simple quadratic equation like $y(t) = c_0 + c_1 t + c_2 t^2$, but you don't know the values of the coefficients $c_0, c_1, c_2$. Your task is to find the "best" coefficients that describe the object's flight path.

If you had exactly three data points, you could try to find a quadratic curve that passes precisely through all of them. This would give you a system of three linear equations in three unknowns. But what happens when you have four, five, or a hundred data points? You'd have a hundred equations but still only three unknown coefficients. This is an overdetermined system. Unless you are extraordinarily lucky and all your data points fall perfectly on a single quadratic curve, there is simply no choice of $c_0, c_1, c_2$ that will satisfy all the equations simultaneously. The system $A\mathbf{x} = \mathbf{b}$, where $\mathbf{x} = (c_0, c_1, c_2)^T$, has no solution.

So, what do we do? We give up on perfection. Instead of demanding an exact fit, we look for a fit that is "good enough"—or, even better, the "best possible" fit. This requires us to define what we mean by "best". A wonderfully effective and mathematically beautiful idea is to find the coefficients that minimize the sum of the squares of the errors. For each data point $(t_i, y_i)$, the error, or residual, is the vertical distance between our model's prediction, $p(t_i) = c_0 + c_1 t_i + c_2 t_i^2$, and the actual measurement, $y_i$. We want to minimize the total squared error:

$$\sum_{i=1}^{m} (y_i - p(t_i))^2$$

If we assemble our measurements into a vector $\mathbf{b}$ and our model's predictions into a vector $A\mathbf{x}$, this sum of squares is nothing more than the squared length of the residual vector, $\mathbf{r} = \mathbf{b} - A\mathbf{x}$. In the language of linear algebra, we are trying to solve the least squares problem:

$$\min_{\mathbf{x}} \|\mathbf{b} - A\mathbf{x}\|_2$$

This approach seeks the vector $\mathbf{x}$ that makes $A\mathbf{x}$ as close as possible to $\mathbf{b}$. It's a compromise, but it's the best possible compromise in the Euclidean sense.
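As a concrete sketch (the noisy measurements here are made up purely for illustration), the quadratic-fit problem can be assembled and solved in a few lines of NumPy, with `np.linalg.lstsq` doing the minimization:

```python
import numpy as np

# Hypothetical noisy height measurements (t_i, y_i); illustrative numbers only
t = np.array([0.0, 0.5, 1.0, 1.5, 2.0])
y = np.array([1.1, 2.9, 3.8, 3.2, 1.2])

# Design matrix for y(t) = c0 + c1*t + c2*t^2: one row per measurement,
# one column per basis function (1, t, t^2)
A = np.column_stack([np.ones_like(t), t, t**2])

# Five equations, three unknowns: no exact solution, so minimize ||y - A c||_2
c, residual_ss, rank, sing_vals = np.linalg.lstsq(A, y, rcond=None)
print(c)  # the best-fit coefficients (c0, c1, c2)
```

More data points just add more rows to $A$; the number of unknowns stays at three.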

It's crucial to understand what makes this a linear least squares problem. It is not because the model function is a straight line (our example is a parabola!). It's because the model is linear in its parameters. A model like $y = c_1 x + c_2 x^2$ is a linear combination of the unknown coefficients $c_1$ and $c_2$. In contrast, a model like $y = c_1 (x - c_2)^2$ is non-linear in its parameters, because $c_2$ sits inside the squared term, entangled with $c_1$. This seemingly small change makes the problem vastly more complex to solve. For the rest of our discussion, we will focus on the elegant world of linear least squares.

The Geometry of a Solution

The true beauty of the least squares problem is revealed not through algebra, but through geometry. Let's think about the matrix $A$. The product $A\mathbf{x}$ is a linear combination of the columns of $A$. As we try every possible vector $\mathbf{x}$ of coefficients, the resulting vector $A\mathbf{x}$ traces out a subspace within the higher-dimensional space where our data lives. This subspace is called the column space or range of $A$, denoted $\operatorname{range}(A)$. You can think of it as the "universe of possibilities" for our model—it contains every possible prediction our model could ever make.

Now, picture our vector of observations, $\mathbf{b}$. If a perfect solution existed, $\mathbf{b}$ would lie inside $\operatorname{range}(A)$. But we've already established that in an overdetermined system, it almost certainly doesn't. Our data vector $\mathbf{b}$ is floating somewhere outside our model's universe.

The least squares problem, $\min \|\mathbf{b} - A\mathbf{x}\|_2$, now has a stunningly simple geometric interpretation: find the point in the subspace $\operatorname{range}(A)$ that is closest to the point $\mathbf{b}$.

How do you find the closest point on a plane to a point floating above it? You drop a perpendicular! This point, the "shadow" of $\mathbf{b}$ on the subspace, is its orthogonal projection. Let's call this projection $\hat{\mathbf{b}}$. This $\hat{\mathbf{b}}$ is the best approximation to $\mathbf{b}$ that our model can possibly produce.

The vector connecting $\mathbf{b}$ to its projection $\hat{\mathbf{b}}$ is the residual vector, $\mathbf{r} = \mathbf{b} - \hat{\mathbf{b}}$. By the very nature of an orthogonal projection, this residual vector must be orthogonal (perpendicular) to the entire subspace $\operatorname{range}(A)$. This is the cornerstone of least squares, the orthogonality principle. It tells us that for the best-fit solution, the errors are not just small; they are uncorrelated with the model's predictions in a very specific way.

Finding the Magic Parameters: The Normal Equations

This geometric insight gives us a powerful algebraic tool. If the residual vector $\mathbf{r} = \mathbf{b} - A\hat{\mathbf{x}}$ must be orthogonal to the entire column space of $A$, it must be orthogonal to each and every column of $A$. We can express this condition for all columns at once using the transpose of $A$:

$$A^T \mathbf{r} = \mathbf{0}$$

Substituting $\mathbf{r} = \mathbf{b} - A\hat{\mathbf{x}}$, we get:

$$A^T (\mathbf{b} - A\hat{\mathbf{x}}) = \mathbf{0}$$

A little rearrangement gives us the celebrated ​​normal equations​​:

$$A^T A \hat{\mathbf{x}} = A^T \mathbf{b}$$

We have transformed an unsolvable rectangular system $A\mathbf{x} = \mathbf{b}$ into a solvable square system. Because the orthogonal projection always exists and is unique, we know that a least squares solution $\hat{\mathbf{x}}$ is guaranteed to exist.

But is the solution $\hat{\mathbf{x}}$ unique? This depends on the matrix $A^T A$. If the columns of our original matrix $A$ are linearly independent (meaning none of them can be written as a combination of the others), then $A^T A$ is invertible, and we can find a unique solution: $\hat{\mathbf{x}} = (A^T A)^{-1} A^T \mathbf{b}$. This is the case in most well-posed fitting problems.
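A quick numerical sketch of the normal equations in action, on a random well-conditioned system (for such matrices, forming $A^T A$ is harmless), also lets us verify the orthogonality principle directly:

```python
import numpy as np

# A random, well-conditioned tall system: 100 equations, 3 unknowns
rng = np.random.default_rng(0)
A = rng.standard_normal((100, 3))
b = rng.standard_normal(100)

# Solve the normal equations A^T A x = A^T b directly
# (acceptable here because this A is well-conditioned)
x_hat = np.linalg.solve(A.T @ A, A.T @ b)

# Orthogonality principle: the residual is perpendicular to every column of A
r = b - A @ x_hat
print(np.abs(A.T @ r).max())  # numerically ~ 0
```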

However, if the columns of $A$ are linearly dependent (the matrix is rank-deficient), then $A^T A$ is singular and not invertible. In this case, there are infinitely many solutions for $\hat{\mathbf{x}}$! If you find one particular solution $\hat{\mathbf{x}}_p$, then any vector of the form $\hat{\mathbf{x}}_p + \mathbf{z}$, where $\mathbf{z}$ is any vector in the null space of $A$ (i.e., $A\mathbf{z} = \mathbf{0}$), is also a perfect least squares solution. The set of all solutions forms a line, a plane, or a higher-dimensional affine space. But here is a wonderful fact: even though there are infinitely many solutions for the coefficients $\hat{\mathbf{x}}$, they all produce the exact same best-fit vector. The projection $\hat{\mathbf{b}} = A\hat{\mathbf{x}}$ is unique, even if the coefficients that generate it are not.
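This is easy to see on a toy rank-deficient matrix (constructed here so its third column is the sum of the first two): different coefficient vectors, identical projection.

```python
import numpy as np

# A deliberately rank-deficient matrix: the third column equals col1 + col2
A = np.array([[1., 0., 1.],
              [0., 1., 1.],
              [1., 1., 2.],
              [2., 0., 2.]])
b = np.array([1., 2., 2., 1.])

# lstsq (via SVD) returns the minimum-norm solution among infinitely many
x_p, *_ = np.linalg.lstsq(A, b, rcond=None)

# z = (1, 1, -1) lies in the null space of A, so x_p + z is another solution
z = np.array([1., 1., -1.])
x_other = x_p + z

print(np.allclose(A @ z, 0))              # True: z is a null-space vector
print(np.allclose(A @ x_p, A @ x_other))  # True: same projection b_hat
```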

The Perils of Reality: Numerical Instability

The normal equations look like a triumph of mathematics—and they are. But in the world of finite-precision computers, they hide a dark secret. Let's consider an experiment where we measure a quantity at very closely spaced times, say $t = 100.0, 101.0, 102.0$. If we fit a line $y = c_0 + c_1 t$, the columns of our matrix $A$ will be $(1, 1, 1)^T$ and $(100, 101, 102)^T$. These two vectors are almost pointing in the same direction; they are nearly linearly dependent. The matrix $A$ is said to be ill-conditioned.

When we form the matrix $A^T A$, this ill-conditioning gets much, much worse. A measure of this is the condition number, $\kappa(A)$, which you can think of as a "wobbliness amplifier". It tells you how much the solution $\mathbf{x}$ might change for a small wobble in the input data $\mathbf{b}$. A large condition number means your problem is sensitive, and small computer rounding errors can lead to huge errors in the final answer.

The devastating mathematical fact is that the condition number of the normal equations matrix is the square of the original:

$$\kappa(A^T A) = (\kappa(A))^2$$

So if your original matrix was already a bit wobbly, with $\kappa(A) = 10{,}000$, the matrix you actually have to solve with, $A^T A$, will be extremely wobbly, with $\kappa(A^T A) = 100{,}000{,}000$. For the simple time-series example above, the condition number of $A^T A$ is a monstrous $1.561 \times 10^8$. Solving the normal equations directly is like performing surgery with a shaky hand—a recipe for disaster.
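You can check the squaring effect on the exact matrix from the text with `np.linalg.cond`:

```python
import numpy as np

# The nearly-collinear design matrix from the text: fit y = c0 + c1*t
t = np.array([100.0, 101.0, 102.0])
A = np.column_stack([np.ones_like(t), t])

k_A = np.linalg.cond(A)          # condition number of A (2-norm)
k_AtA = np.linalg.cond(A.T @ A)  # condition number of the normal-equations matrix

print(f"cond(A)     = {k_A:.3e}")    # roughly 1.2e4
print(f"cond(A^T A) = {k_AtA:.3e}")  # roughly 1.6e8, i.e. cond(A)**2
```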

A More Elegant Path: Orthogonalization and SVD

So, how can we solve the least squares problem without this catastrophic loss of precision? The key is to avoid forming $A^T A$ and instead work more directly with the elegant geometry of the problem.

The QR Factorization

The trouble with the columns of $A$ is that they might be a "bad" basis for the column space—they might be nearly parallel. The QR factorization is a procedure (like the Gram-Schmidt process) that takes the columns of $A$ and finds an equivalent orthonormal basis—a set of perfectly perpendicular, unit-length vectors that span the exact same subspace. This factorization writes $A = QR$, where $Q$ is a matrix whose columns are this nice new orthonormal basis, and $R$ is an upper-triangular matrix that records how the old basis relates to the new one.

Since the columns of $Q$ are an orthonormal basis for $\operatorname{range}(A)$, the projection of $\mathbf{b}$ onto this space is easily written as $\hat{\mathbf{b}} = QQ^T\mathbf{b}$. The least squares problem becomes solving $A\hat{\mathbf{x}} = QR\hat{\mathbf{x}} = QQ^T\mathbf{b}$. By multiplying by $Q^T$ on the left and using the fact that $Q^TQ = I$, this simplifies beautifully to:

$$R\hat{\mathbf{x}} = Q^T\mathbf{b}$$

This is a simple triangular system that is easily solved for $\hat{\mathbf{x}}$. The vector $Q^T\mathbf{b}$ has a lovely geometric meaning: its components are the coordinates of the projection $\hat{\mathbf{b}}$ in our new, pristine orthonormal basis. Most importantly, the condition number of $R$ is the same as that of the original matrix, $\kappa(R) = \kappa(A)$. We have completely sidestepped the squaring of the condition number.
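A small sketch of the QR route, using NumPy's reduced QR factorization on an ill-conditioned fit like the one above (the measurement values are invented):

```python
import numpy as np

# The same kind of ill-conditioned line fit, solved via QR
t = np.array([100.0, 101.0, 102.0, 103.0])
b = np.array([5.00, 5.10, 5.19, 5.31])
A = np.column_stack([np.ones_like(t), t])

Q, R = np.linalg.qr(A)  # reduced QR: Q is 4x2 with orthonormal columns

# R x = Q^T b is a tiny triangular system; A^T A is never formed
x_hat = np.linalg.solve(R, Q.T @ b)

print(np.allclose(Q.T @ Q, np.eye(2)))                           # True
print(np.allclose(x_hat, np.linalg.lstsq(A, b, rcond=None)[0]))  # True
```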

The Singular Value Decomposition (SVD)

If QR factorization is a skilled craftsman's tool, the Singular Value Decomposition (SVD) is the ultimate master key of linear algebra. It provides the deepest possible insight into a matrix. SVD factors any matrix $A$ into three other matrices: $A = U\Sigma V^T$. Here, $U$ and $V$ are orthogonal matrices whose columns provide perfect orthonormal bases for the four fundamental subspaces associated with $A$. The matrix $\Sigma$ is diagonal, and its diagonal entries, the singular values, tell you the "strength" or "importance" of each dimension of the space.

For least squares, SVD is the perfect tool. It tells you the true rank of your matrix by counting the number of non-zero singular values. It gracefully handles rank-deficient problems. Most beautifully, it gives you the solution directly. For a rank-deficient problem where infinitely many solutions exist, SVD automatically gives you the one that is "most natural": the solution vector $\hat{\mathbf{x}}$ that has the smallest possible length. This is known as the minimum-norm solution.

In practice, SVD's power lies in its robustness. When a singular value is extremely small (but not exactly zero due to floating-point errors), we can treat it as zero by setting a tolerance. This prevents division by a tiny number and stabilizes the solution, effectively finding the best solution in a lower-rank approximation of the problem. This is the most stable and reliable way known to solve least squares problems, taming even the nearly-collinear cases that are so dangerous for the normal equations.
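The truncation idea can be sketched directly (the matrix below is contrived so that one column is almost a combination of the others; the `1e-8` tolerance is an illustrative choice, not a universal rule):

```python
import numpy as np

# A nearly rank-deficient matrix: the third column is col1 + col2 plus
# a tiny perturbation standing in for floating-point noise
rng = np.random.default_rng(1)
B = rng.standard_normal((50, 2))
A = np.column_stack([B, B[:, 0] + B[:, 1] + 1e-12 * rng.standard_normal(50)])
b = rng.standard_normal(50)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
print(s)  # the last singular value is tiny: numerically zero

# Treat singular values below a tolerance as zero before "inverting" Sigma
tol = 1e-8 * s[0]
s_inv = np.where(s > tol, 1.0 / s, 0.0)
x_hat = Vt.T @ (s_inv * (U.T @ b))  # minimum-norm solution of truncated problem

# numpy's pseudoinverse applies the same cutoff via its rcond argument
print(np.allclose(x_hat, np.linalg.pinv(A, rcond=1e-8) @ b))  # True
```

Without the cutoff, dividing by the `~1e-12` singular value would blow the solution up by twelve orders of magnitude.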

From a simple desire to fit a line to noisy data, we have journeyed through the geometry of high-dimensional spaces, uncovered the algebraic power of the normal equations, faced the practical dangers of numerical instability, and arrived at elegant and robust algorithms like QR and SVD. This progression, from an intuitive need to a beautiful geometric picture to a powerful computational reality, is a hallmark of the deep unity of mathematics and science.

Applications and Interdisciplinary Connections

We've spent some time exploring the gears and levers of the least squares method, seeing it as a geometric projection or an algebraic solution. But a tool is only as good as the problems it can solve. And it turns out that this one idea—finding the "best" approximate solution to a system that has no exact solution—is one of the most versatile and powerful tools in the entire toolbox of science. It’s not just a mathematical curiosity; it’s a fundamental principle for reasoning in the face of uncertainty and complexity. Let’s go on a journey to see where this simple-sounding idea takes us. You might be surprised by the destinations.

The Art of Fitting Curves: From Lines to Life Sciences

Perhaps the most classic use of least squares is one you've already guessed: drawing the best possible line through a scatter of data points. But the world is rarely so linear. What if the relationship we are trying to model is more complex? Suppose we are tracking the concentration of a new medicine in a patient's bloodstream over time. The data might suggest a curve that decays exponentially. A physicist might be tracking a damped oscillation. The beauty of the least squares framework is that it isn't limited to straight lines. We can fit polynomials, exponentials, sinusoids—practically any functional form we can write down.

This process is the bread and butter of experimental science. You have a theoretical model, and you want to find the parameters of that model that best match your hard-won data. Sometimes, this is straightforward. If our model is a linear combination of basis functions—the way $p(x) = c_0 + c_1 x + c_2 x^2$ is a linear combination of $1, x, x^2$—then the problem, no matter how wiggly the resulting curve, boils down to a linear least squares problem.

But life often gives us data with a bit more character. What if we know that one of our measurements is far more reliable than the others? Maybe it was taken with a superior instrument or under ideal conditions. It would be foolish to treat all data points as equals. And we don't have to! We can introduce the concept of weighted least squares, where we tell our algorithm to 'pay more attention' to the trustworthy points by assigning a larger weight to their squared errors. This allows us to incorporate our expert knowledge about the data's quality directly into the mathematical formulation, leading to a more robust and accurate model fit.
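A minimal sketch of weighted least squares, with made-up data where the last point is assumed to come from the better instrument: scaling each row by the square root of its weight reduces the problem to ordinary least squares.

```python
import numpy as np

# Made-up data; the last point is assumed far more trustworthy
t = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([0.1, 1.2, 1.8, 3.4])
A = np.column_stack([np.ones_like(t), t])

w = np.array([1.0, 1.0, 1.0, 10.0])  # weight = how much we trust each point

# Minimizing sum_i w_i * (y_i - (Ax)_i)^2 is ordinary least squares
# applied to rows scaled by sqrt(w_i)
sw = np.sqrt(w)
x_weighted, *_ = np.linalg.lstsq(A * sw[:, None], y * sw, rcond=None)
x_plain, *_ = np.linalg.lstsq(A, y, rcond=None)

# The weighted fit passes noticeably closer to the trusted point
print(abs(y[3] - A[3] @ x_weighted) < abs(y[3] - A[3] @ x_plain))  # True
```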

The real fun begins when the model is truly non-linear in its parameters, like the function $f(x; a, b, c) = a e^{bx} + c$. Here, the parameters $a$ and $b$ are tangled together in a way that resists a simple linear setup. You might think we need a whole new theory. But no! In a beautiful display of 'using what you know,' we can tackle this problem iteratively. We start with a guess for the parameters, and then we do something clever: we approximate our complicated non-linear model with a linear one that is valid in the immediate neighborhood of our guess. This reduces the problem to a familiar linear least squares problem, which we solve to find a small correction to our parameters. We update our guess and repeat the process, each time linearizing and re-solving, inching closer and closer to the best-fit solution. This powerful technique, a cornerstone of optimization known as the Gauss-Newton method, allows us to use our trusted linear least squares machinery to conquer a vast wilderness of non-linear problems. This very method is used in fields like pharmacokinetics to determine how a drug is absorbed and eliminated by the body, fitting complex exponential decay models to concentration data and helping to design effective dosing regimens.
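A bare-bones Gauss-Newton sketch for exactly this exponential model, fitted to synthetic data generated from known parameters (all numbers, including the initial guess and iteration cap, are illustrative; production code would add step damping, as in Levenberg-Marquardt):

```python
import numpy as np

def model(p, x):
    a, b, c = p
    return a * np.exp(b * x) + c

# Synthetic "concentration vs. time" data from known parameters plus small noise
rng = np.random.default_rng(2)
x = np.linspace(0.0, 4.0, 30)
p_true = np.array([2.0, -0.8, 0.5])
y = model(p_true, x) + 0.001 * rng.standard_normal(x.size)

p = np.array([1.5, -0.5, 0.0])  # initial guess
for _ in range(50):
    r = y - model(p, x)  # residuals at the current guess
    a, b, c = p
    # Jacobian of the model w.r.t. (a, b, c): the local linearization
    J = np.column_stack([np.exp(b * x),
                         a * x * np.exp(b * x),
                         np.ones_like(x)])
    # One *linear* least squares problem per iteration: J dp ~ r
    dp, *_ = np.linalg.lstsq(J, r, rcond=None)
    p = p + dp
    if np.linalg.norm(dp) < 1e-12:
        break

print(p)  # close to p_true = (2.0, -0.8, 0.5)
```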

Beyond the Data Points: Generalization, Regularization, and Machine Learning

So far, we've focused on finding a curve that describes the data we have. But modern science, especially in the age of 'big data' and machine learning, is often more concerned with prediction. We want a model that not only fits the data we've seen but also generalizes to make accurate predictions about data we haven't seen. Here, the world of least squares gets even more interesting, and we begin to see its connections to some of the most advanced ideas in computer science.

Let's consider a seemingly simple task: classification. We have data points belonging to one of two classes, say, 'tumor is malignant' (let's call it 1) or 'benign' (call it 0). Can we use least squares to build a predictor? A naive approach would be to fit a linear model to this data, hoping the output will be close to 1 for the malignant class and 0 for the benign. But there's a catch! Least squares doesn't know our outputs are supposed to be probabilities. It's an unconstrained optimizer, free to do what it must to minimize squared error. As a result, it can cheerfully predict a 'probability' of 1.5 or -0.2, which is nonsense. This failure is wonderfully instructive: it teaches us that choosing the right tool for the job is critical. The squared error loss of least squares is ill-suited for the statistics of binary outcomes, which is why methods like logistic regression were invented.
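The failure is easy to reproduce on a toy one-dimensional example (the data are invented; the far-out point at $x = 8$ is what drags the line out of range):

```python
import numpy as np

# Toy classification data: labels are 0 ("benign") or 1 ("malignant"),
# with one far-out example at x = 8
x = np.array([0.0, 1.0, 2.0, 3.0, 8.0])
y = np.array([0.0, 0.0, 1.0, 1.0, 1.0])

A = np.column_stack([np.ones_like(x), x])
c, *_ = np.linalg.lstsq(A, y, rcond=None)

preds = A @ c
print(preds.round(3))  # the prediction at x = 8 exceeds 1: not a valid probability
```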

However, this doesn't mean least squares is out of the game. We can force its hand. We can solve a constrained least squares problem, telling the algorithm: 'Find the best-fit parameters, but under the strict condition that all your predictions for the training data must lie between 0 and 1'. This turns our simple problem into a more complex optimization known as a Quadratic Program, but it respects the physical reality of the problem. This idea of adding constraints is a powerful theme that extends the reach of least squares into complex engineering design and control.

Now for something truly modern. How does a streaming service recommend movies you might like? At its core is a giant, mostly empty matrix of users and their ratings for movies. The challenge is to predict the missing entries. This is the 'matrix completion' problem. We can approach this by assuming that people's tastes aren't random; there are underlying factors, or 'latent features,' like a preference for comedies, action movies, or a particular director. We can try to model the full rating matrix $R$ as the product of two much thinner matrices, $U$ and $V^T$, where $U$ represents the users' affinities for these latent features and $V$ represents the movies' expression of them. How do we find $U$ and $V$? You guessed it: we set up a gigantic least squares problem! We seek the matrices $U$ and $V$ such that their product $UV^T$ best matches the ratings we do know. To prevent the model from becoming absurdly complex and just 'memorizing' the data, we add a penalty term—a form of Tikhonov regularization—that keeps the entries of $U$ and $V$ from getting too large. This elegant formulation, minimizing a combination of squared error and a regularization term, is at the heart of many modern recommender systems.
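The Tikhonov penalty itself fits neatly into the machinery we already have. A minimal sketch on random data (the value of `lam` is an arbitrary modeling choice): minimizing $\|\mathbf{b} - A\mathbf{x}\|^2 + \lambda\|\mathbf{x}\|^2$ is just ordinary least squares on an augmented system.

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((40, 5))
b = rng.standard_normal(40)
lam = 0.5  # regularization strength; a modeling choice, not derived from data

# Minimize ||b - Ax||^2 + lam*||x||^2 by stacking sqrt(lam)*I as extra
# "equations" that pull every coefficient toward zero
A_aug = np.vstack([A, np.sqrt(lam) * np.eye(5)])
b_aug = np.concatenate([b, np.zeros(5)])
x_reg, *_ = np.linalg.lstsq(A_aug, b_aug, rcond=None)

# Agrees with the closed form (A^T A + lam*I)^{-1} A^T b
x_closed = np.linalg.solve(A.T @ A + lam * np.eye(5), A.T @ b)
print(np.allclose(x_reg, x_closed))  # True

# The penalty shrinks the solution relative to the unregularized fit
x_plain, *_ = np.linalg.lstsq(A, b, rcond=None)
print(np.linalg.norm(x_reg) < np.linalg.norm(x_plain))  # True
```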

Finding Your Way: Least Squares in Navigation and Control

Let's change gears and look at one of the most profound and beautiful applications of least squares: finding your way in the world. From a GPS in your phone to a rover navigating the surface of Mars, the problem is always the same: you have a model of how you think you're moving, but it's imperfect. You also have noisy measurements from sensors—GPS satellites, star trackers, wheel odometers. How do you fuse these two streams of information to get the best possible estimate of your true state (position and velocity)?

This is the domain of the Kalman filter, one of the crowning achievements of 20th-century engineering. And at its heart lies a least squares problem. At each moment in time, the filter has a prior belief about the state, represented by a mean and a covariance matrix (which quantifies its uncertainty). Then, a new measurement arrives. The Kalman update step is nothing more than solving a weighted least squares problem to find the state that best reconciles the prior belief with the new measurement. The 'data' to be fit are the prior state and the new measurement. The 'weights' are the inverses of their respective uncertainties (covariances). Information you are more certain about gets a higher weight. The solution to this small optimization problem gives you the new, updated state estimate, which is provably the best possible estimate under the modeling assumptions. It is a stunning realization: the dynamic, recursive process of tracking and navigation can be seen as a sequence of static, optimal 'best compromise' solutions. This deep connection reveals the least squares principle not just as a curve-fitter, but as a fundamental rule for optimal information fusion.
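The idea can be stripped down to a single scalar fusion step (the numbers are illustrative): the standard Kalman update formula and the inverse-variance weighted least squares problem give the same answer.

```python
import numpy as np

# One scalar "update step": fuse a prior estimate with a new measurement.
# Illustrative numbers; variances quantify how much we trust each source.
x0, P = 10.0, 4.0   # prior state estimate and its variance (uncertain)
z,  R = 12.0, 1.0   # new measurement and its variance (more trusted)

# Classic Kalman update
K = P / (P + R)                # Kalman gain
x_new = x0 + K * (z - x0)      # updated estimate
P_new = (1.0 - K) * P          # updated (smaller) variance

# The same answer drops out of a tiny weighted least squares problem:
# min_x  (x - x0)^2 / P  +  (x - z)^2 / R
A = np.array([[1.0], [1.0]])
b = np.array([x0, z])
W = np.diag([1.0 / P, 1.0 / R])
x_wls = np.linalg.solve(A.T @ W @ A, A.T @ W @ b)[0]

print(x_new, x_wls)  # both are ~11.6: pulled toward the more certain source
```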

This perspective is grounded in the geometry of least squares. When we solve $A\mathbf{x} = \mathbf{b}$ in the least squares sense, we are projecting the vector $\mathbf{b}$ onto the column space of $A$. The solution $A\hat{\mathbf{x}}$ is the closest point in the model space to our data. The leftover part, the residual $\mathbf{r} = \mathbf{b} - A\hat{\mathbf{x}}$, is orthogonal to that space. It's the part of the data our model simply cannot explain. The Kalman filter, in a sense, is doing this in a statistical space, finding a projection that minimizes uncertainty.

A Universal Language

Finally, the language of least squares is not even confined to real numbers. In fields like signal processing and quantum mechanics, we work with complex numbers. AC circuits are analyzed with complex phasors, and quantum states are described by complex wavefunctions. The least squares principle extends seamlessly to this world, simply by replacing the transpose with the conjugate transpose. This allows us to solve estimation problems in these domains with the same conceptual framework, demonstrating again the profound generality of the core idea.
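In NumPy this extension costs nothing extra: `lstsq` handles complex matrices directly, and the orthogonality principle holds with the conjugate transpose (random complex data below, purely for illustration).

```python
import numpy as np

# Least squares over the complex numbers: the theory carries over with the
# conjugate transpose A^H in place of A^T
rng = np.random.default_rng(4)
A = rng.standard_normal((20, 3)) + 1j * rng.standard_normal((20, 3))
b = rng.standard_normal(20) + 1j * rng.standard_normal(20)

x_hat, *_ = np.linalg.lstsq(A, b, rcond=None)

# Orthogonality principle, complex form: A^H r = 0
r = b - A @ x_hat
print(np.abs(A.conj().T @ r).max())  # numerically ~ 0
```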

From fitting a simple line to data, to recommending movies, to navigating spacecraft, the least squares principle is a golden thread running through science and engineering. It is far more than a numerical algorithm; it is a philosophy for dealing with the messy, overdetermined, and noisy reality we inhabit. It gives us a clear and powerful way to extract meaningful signals from noise, to make the best possible inferences from limited data, and to build models that work. It is a testament to the power of a simple, beautiful mathematical idea to make sense of a complex world.