Best Approximation

Key Takeaways
  • The principle of best approximation is defined by orthogonality, which requires that the error between an object and its approximation be perpendicular to the approximation subspace.
  • The choice of norm (e.g., $L_2$ for least squares, $L_\infty$ for minimax) fundamentally changes the definition of "best" and affects the solution's uniqueness and computational cost.
  • In linear algebra, the Singular Value Decomposition (SVD) provides the theoretically optimal low-rank approximation for a matrix, a cornerstone of modern data compression.
  • Best approximation acts as a unifying concept, framing problems in diverse fields like quantum mechanics, signal processing, and control theory as geometric projection problems.

Introduction

In nearly every field of study, we face a common challenge: reality is overwhelmingly complex, while our models must be simple enough to work with. How do we replace a complicated function, a noisy dataset, or a massive matrix with a more manageable substitute without losing its essential character? This leads to a crucial question: what makes an approximation the "best" one possible? The theory of best approximation provides a powerful and elegant answer, establishing a universal geometric framework for finding the optimal representation of an object within a given set of constraints. This article uncovers this fundamental principle, revealing how a single idea connects abstract mathematics to practical problems in science and technology. The first chapter, "Principles and Mechanisms," will lay the theoretical groundwork, introducing the core concept of orthogonality and exploring how different ways of measuring error lead to different kinds of "best" solutions. Subsequently, "Applications and Interdisciplinary Connections" will demonstrate the astonishing reach of this theory, showing how it is used to compress digital images, find signals in noisy data, and even describe the laws of quantum mechanics.

Principles and Mechanisms

Imagine you are standing in a large, flat field, and somewhere above you, a bird is hovering in the air. What is the point on the ground directly beneath the bird? You would instinctively drop a pebble and see where it lands. The line the pebble travels, from the bird to the ground, is the shortest possible path. It is, in a word, perpendicular to the ground. This simple, intuitive idea—that the shortest distance from a point to a plane involves a perpendicular line—is the very heart of what we mean by "best approximation." It's a concept of profound beauty and power that extends far beyond simple geometry, reaching into the abstract worlds of functions, data, and even the laws of physics.

The Power of Orthogonality

Let's formalize our intuition. The ground is a flat plane, a "subspace." The bird is a point outside that subspace. The point on the ground closest to the bird is its ​​orthogonal projection​​. The key feature is that the "error"—the line segment connecting the bird to its projection on the ground—is perpendicular, or ​​orthogonal​​, to every possible line you could draw on the ground itself. If it weren't, you could always move the point on the ground slightly to get a little closer to the bird, and it wouldn't be the "best" approximation anymore.

This orthogonality principle is the master key. But how can we apply it to something more complex, like approximating a complicated function with a simpler one? For example, how can we approximate the elegant curve of $f(x) = x^2$ with a straight line? Or the function $h(x) = \sqrt{x}$ with just a single constant value?

To do this, we need a way to talk about functions being "perpendicular." We need to generalize the idea of the dot product, which we use for geometric vectors, to the world of functions. This generalization is called an inner product. For two functions, $f(x)$ and $g(x)$, defined on an interval, say from 0 to 1, a very common inner product is defined by the integral:

$$\langle f, g \rangle = \int_0^1 f(x)\,g(x)\,dx$$

This might look a bit abstract, but the idea is simple. Instead of multiplying corresponding components and adding them up (like in a dot product), we multiply the functions' values at every single point $x$ and "add them all up" using an integral. With this tool, we can now say two functions $f$ and $g$ are "orthogonal" if their inner product is zero: $\langle f, g \rangle = 0$. The "length" or norm of a function becomes $\|f\| = \sqrt{\langle f, f \rangle} = \sqrt{\int_0^1 f(x)^2\,dx}$. Minimizing the distance between two functions, $\|f - g\|$, is now equivalent to minimizing this integral of the squared error, hence the name "least squares" approximation.
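As a quick sanity check, this inner product can be evaluated numerically. The sketch below (plain Python with SciPy's `quad`; the helper names are our own, chosen for illustration) verifies that the constant function 1 and the mean-centered function $x - \frac{1}{2}$ are orthogonal on $[0, 1]$:

```python
import numpy as np
from scipy.integrate import quad

def inner(f, g, a=0.0, b=1.0):
    """Inner product <f, g> = integral of f(x) g(x) over [a, b]."""
    val, _ = quad(lambda x: f(x) * g(x), a, b)
    return val

one = lambda x: 1.0
centered = lambda x: x - 0.5           # x minus its average value on [0, 1]

print(inner(one, centered))            # ~0: the two functions are orthogonal
print(np.sqrt(inner(one, one)))        # the "length" of the constant function 1 is 1
```

The same two-line `inner` helper works for any pair of functions on the interval, which is exactly why the geometric language transfers so cleanly.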

Now, let's go back to approximating $h(x) = \sqrt{x}$ with a constant function, $g(x) = c$. We are looking for the best constant $c$. The space of all constant functions is our "subspace" (a simple, one-dimensional line). According to the orthogonality principle, the error, $e(x) = \sqrt{x} - c$, must be orthogonal to our entire subspace. Since the subspace is spanned by the single function $u(x) = 1$, this simply means the error must be orthogonal to $u(x) = 1$.

$$\langle \sqrt{x} - c, 1 \rangle = 0 \implies \int_0^1 (\sqrt{x} - c)\,dx = 0$$

Solving this simple equation gives $c = \int_0^1 \sqrt{x}\,dx = \frac{2}{3}$. This is a beautiful result! The best constant approximation to a function in this "least squares" sense is simply its average value over the interval. You have projected the rich, curving function $\sqrt{x}$ onto the space of constants, and its shadow is its average value.
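This projection is easy to verify numerically. A minimal sketch (again using SciPy's `quad`; the `sq_err` helper is our own, just for illustration) recovers $c = 2/3$ and confirms that it beats nearby constants in the least-squares sense:

```python
import numpy as np
from scipy.integrate import quad

# Orthogonality: <sqrt(x) - c, 1> = 0  =>  c = integral of sqrt(x) over [0, 1]
c, _ = quad(np.sqrt, 0.0, 1.0)
print(c)   # 0.666..., i.e. 2/3, the average value of sqrt(x)

def sq_err(const):
    """Integrated squared error of the constant approximation."""
    val, _ = quad(lambda x: (np.sqrt(x) - const) ** 2, 0.0, 1.0)
    return val

# The average value beats nearby constants in the least-squares sense
print(sq_err(c) < sq_err(c + 0.05) and sq_err(c) < sq_err(c - 0.05))  # True
```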

When we want to approximate a function like $f(x) = x^2$ with a more complex subspace, like the space of all linear functions $p(x) = a_0 + a_1 x$, the principle remains the same. The error, $x^2 - (a_0 + a_1 x)$, must be orthogonal to every function in the linear subspace. Since the subspace is spanned by the basis functions $\{1, x\}$, it's enough to require the error to be orthogonal to each basis function separately:

$$\langle x^2 - (a_0 + a_1 x),\, 1 \rangle = 0$$
$$\langle x^2 - (a_0 + a_1 x),\, x \rangle = 0$$

These are called the normal equations. They give us a system of two linear equations for the two unknown coefficients, $a_0$ and $a_1$. Solving them yields the unique best linear approximation. For $f(x) = x^2$ on $[0,1]$, this turns out to be the line $p^*(x) = x - \frac{1}{6}$. This orthogonality condition is a universal requirement for any best approximation in an inner product space, regardless of the basis you choose for your subspace.
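For this small example the normal equations can be written out in closed form and solved directly. A minimal sketch (the Gram-matrix entries $\langle 1,1\rangle = 1$, $\langle 1,x\rangle = \frac{1}{2}$, $\langle x,x\rangle = \frac{1}{3}$ and the right-hand sides $\langle x^2,1\rangle = \frac{1}{3}$, $\langle x^2,x\rangle = \frac{1}{4}$ are computed by hand):

```python
import numpy as np

# Gram matrix of the basis {1, x} under <f, g> = integral of f g over [0, 1]
G = np.array([[1.0, 1/2],
              [1/2, 1/3]])
# Right-hand side: <x^2, 1> and <x^2, x>
b = np.array([1/3, 1/4])

a0, a1 = np.linalg.solve(G, b)
print(a0, a1)   # -1/6 and 1: the best linear fit is p*(x) = x - 1/6
```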

A Matter of Taste: Which "Best" is Best?

So far, we have been using a specific way to measure distance, the one that comes from the integral inner product, called the $L_2$ norm. This norm is wonderful because it creates a "strictly convex" space; its "unit sphere" is perfectly round like a basketball, with no flat spots or corners. This geometric property guarantees that for any function, there is always one, and only one, best approximation in a given (closed) subspace.

But is minimizing the average squared error always what we want? What if we are designing a machine part and we care most about the worst-case error? We don't want the part to fail at its weakest point, even if the error is small everywhere else. In this case, we would want to minimize the maximum absolute error. This gives rise to a different way of measuring distance, the $L_\infty$ norm (or uniform norm):

$$\|f - g\|_\infty = \sup_{x} |f(x) - g(x)|$$

This changes the game entirely. The "unit sphere" in an $L_\infty$ space is not a sphere at all, but a hypercube. It has flat faces and sharp corners. Because of these flat spots, it's possible for a function to have multiple, or even infinitely many, best approximations! For instance, when approximating the odd function $f(t) = 4t$ on $[-1, 1]$ with an even function, not only is the zero function a best approximation, but a whole family of tent-shaped functions also achieves the same minimum worst-case error. Uniqueness, which we took for granted in $L_2$, is no longer guaranteed.
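The two norms really do crown different winners, even in the simplest setting of constant approximations to $\sqrt{x}$ on $[0, 1]$: the least-squares best is the mean $2/3$, while the minimax best is the midrange $(0 + 1)/2 = 1/2$. A quick grid-based check (our own sketch, not a library routine):

```python
import numpy as np

x = np.linspace(0.0, 1.0, 100001)
f = np.sqrt(x)

def max_err(c):
    """Worst-case (uniform-norm) error of the constant approximation c."""
    return np.max(np.abs(f - c))

# The L2-best constant (the mean, 2/3) is not the minimax-best constant
# (the midrange, 1/2), which equalizes the error at both endpoints.
print(max_err(2/3))   # about 0.667, attained at x = 0
print(max_err(0.5))   # 0.5, the smallest possible worst-case error
```

Each norm answers a different question, so each picks a different "best" constant.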

The choice of norm is not just a mathematical curiosity; it can be dictated by the physics of the problem. In fields like solid mechanics and heat transfer, problems are often described by minimizing a quantity called "energy." This naturally defines an energy norm. For a vibrating string or a heated plate, nature's solution, the actual physical state, is the one that minimizes this energy. Remarkably, the mathematical technique used to solve these problems, the Galerkin method, automatically finds the best approximation in this specific energy norm. The Galerkin orthogonality condition, $a(u - u_h, v_h) = 0$, is a restatement of our core principle, where $a(\cdot, \cdot)$ is the energy inner product. This leads to a beautiful Pythagorean-like identity for the energy norm:

$$\|u - w_h\|_a^2 = \|u - u_h\|_a^2 + \|u_h - w_h\|_a^2$$

This equation, which holds because of symmetry and Galerkin orthogonality, shows with perfect clarity that the error $\|u - u_h\|_a$ is minimized when the approximation is $u_h$. The best approximation under one norm is generally not the same as under another. A concrete calculation shows that for the function $u(x) = x(1-x)$, the best approximation using piecewise linear functions is different depending on whether we measure "best" with the $L_2$ norm or the energy ($H^1$) norm. The "best" answer depends entirely on the question you ask.
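The identity follows from a one-line expansion. Writing the error as $(u - u_h) + (u_h - w_h)$ and expanding the squared energy norm gives

```latex
\|u - w_h\|_a^2
  = \|(u - u_h) + (u_h - w_h)\|_a^2
  = \|u - u_h\|_a^2 + 2\,a(u - u_h,\, u_h - w_h) + \|u_h - w_h\|_a^2 ,
```

and since $u_h - w_h$ lies in the approximation space, Galerkin orthogonality $a(u - u_h, v_h) = 0$ makes the cross term vanish (the expansion itself uses the symmetry of $a(\cdot, \cdot)$), leaving exactly the Pythagorean-like identity.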

From Theory to Reality: Finding and Guaranteeing the Best

Now for the practical questions. Can we always find a best approximation? Is it unique? The ​​Projection Theorem​​ provides the theoretical bedrock. In a complete inner product space (a Hilbert space), for any point and any closed subspace, a unique best approximation is guaranteed to exist. Fortunately, the finite-dimensional subspaces we use in most computational work, like the space of polynomials up to a certain degree, are always closed.

But finding the approximation is another story. The beauty of the $L_2$ norm is that the normal equations give us a straightforward system of linear algebra to solve. For a polynomial of degree $n$, this is typically an $(n+1) \times (n+1)$ system. We can solve it directly in a number of steps that scales with $n^3$.

Finding the $L_\infty$ (minimax) approximation is far trickier. There is no simple system of equations to solve. Instead, algorithms like the Remez algorithm must iteratively "hunt" for the solution, adjusting an estimate until the error equilibrates at its maximum value at a specific number of points. This process is more complex, and its cost depends not only on the degree $n$ but also on the desired accuracy. The choice of norm has profound consequences for computational cost.
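A full Remez implementation is beyond a quick sketch, but interpolation at Chebyshev nodes is a standard, cheap stand-in that comes provably close to the true minimax polynomial. A rough illustration (the `cheb_fit` helper is our own, not a library routine):

```python
import numpy as np

def cheb_fit(f, deg, a=-1.0, b=1.0):
    """Interpolate f at Chebyshev nodes: a cheap near-minimax approximation."""
    k = np.arange(deg + 1)
    nodes = np.cos((2 * k + 1) * np.pi / (2 * (deg + 1)))   # roots of T_{deg+1} on [-1, 1]
    x = 0.5 * (b - a) * nodes + 0.5 * (b + a)               # map nodes to [a, b]
    return np.polyfit(x, f(x), deg)                         # exact interpolation here

coeffs = cheb_fit(np.exp, 3)
grid = np.linspace(-1.0, 1.0, 10001)
err = np.max(np.abs(np.exp(grid) - np.polyval(coeffs, grid)))
print(err)   # small uniform error, within a modest factor of the true minimax error
```

Unlike the true Remez iteration, this is a single direct computation, which is why Chebyshev nodes are the usual practical compromise.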

Finally, what happens when the problem is so enormous, like analyzing a matrix with billions of entries from an internet company's user data, that even the "direct" $L_2$ calculation is impossibly slow? The Eckart-Young-Mirsky theorem tells us that the absolute best rank-$k$ approximation of a matrix is found by computing its Singular Value Decomposition (SVD) and truncating it. This is the theoretical gold standard. But computing the full SVD is one of the most expensive tasks in numerical computing.
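The theorem is easy to check numerically: in the spectral norm, the error of the truncated SVD equals the first discarded singular value, $\sigma_{k+1}$. A small sketch with NumPy (the random test matrix is our own invention):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((60, 40))          # an arbitrary dense matrix

U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 5
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]   # best rank-k approximation

# Eckart-Young-Mirsky: spectral-norm error equals the first discarded
# singular value, sigma_{k+1} (index k, since Python counts from zero).
err = np.linalg.norm(A - A_k, 2)
print(err, s[k])   # the two numbers agree to machine precision
```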

This is where the modern spirit of approximation shines. ​​Randomized algorithms for SVD​​ (rSVD) take a clever shortcut. Instead of processing the entire monstrous matrix, they use randomness to take a small, intelligent "sketch" of it. This sketch is a much smaller matrix that, with high probability, captures the essential action of the original. We can then perform the expensive SVD on this tiny sketch to get a low-rank approximation. Is it the optimal one? No. But the goal of the theory is to prove that it is provably close to the optimal one, with a high degree of probability, while being orders of magnitude faster to compute. Here we see the ultimate engineering trade-off: sacrificing a tiny, controllable amount of optimality for a colossal gain in speed. This is the art and science of approximation in the 21st century—finding an answer that is not just the best in theory, but the best in practice.
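A bare-bones version of this sketch-then-factor idea fits in a few lines. This is a simplified illustration of the randomized range finder (function name and parameters are our own; production implementations typically add power iterations and error estimation):

```python
import numpy as np

def rsvd(A, k, oversample=10, seed=0):
    """Randomized SVD sketch: capture range(A) with a random projection,
    then run the exact (cheap) SVD on the small sketch."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    Omega = rng.standard_normal((n, k + oversample))  # random test matrix
    Q, _ = np.linalg.qr(A @ Omega)                    # orthonormal basis for the sketch
    B = Q.T @ A                                       # small (k+oversample) x n matrix
    Ub, s, Vt = np.linalg.svd(B, full_matrices=False)
    return (Q @ Ub)[:, :k], s[:k], Vt[:k, :]

rng = np.random.default_rng(1)
A = rng.standard_normal((500, 20)) @ rng.standard_normal((20, 300))  # exactly rank 20

U, s, Vt = rsvd(A, k=20)
approx = U @ np.diag(s) @ Vt
print(np.linalg.norm(A - approx) / np.linalg.norm(A))  # tiny relative error
```

On this exactly low-rank test matrix the sketch recovers the matrix almost perfectly; on real data the guarantee is probabilistic, nearly optimal rather than optimal.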

Applications and Interdisciplinary Connections

We have spent some time exploring the elegant geometry of best approximation—the idea of finding the "shadow" of a vector on a subspace, where the line connecting the vector to its shadow is perfectly perpendicular to the subspace itself. This principle of orthogonality, as we have seen, guarantees that this shadow is the single closest point in the subspace. This might seem like a neat mathematical trick, a clever piece of geometry. But the astonishing truth is that this one simple idea echoes through nearly every branch of science and engineering. It is a master key that unlocks problems of astonishing variety, from the pixelated world of digital images to the fuzzy, probabilistic realm of quantum mechanics. Let us now embark on a journey to see just how far this concept of a "best guess" can take us.

The Digital Canvas: Compressing Reality

In our modern world, we are surrounded by data. An image on your screen, a piece of music, or a video stream is, at its core, a gigantic matrix of numbers. A high-resolution color photograph, for instance, can easily contain millions of values representing the brightness of red, green, and blue at each pixel. Transmitting or storing this full matrix is often impractical. We need a way to capture its essence without keeping every single number. How can we create a "good enough" version of the image that is much smaller in size?

This is a problem of best approximation. The Singular Value Decomposition (SVD) provides a breathtakingly effective tool. It tells us that any matrix—any image—can be written as a sum of simple, "rank-one" matrices. You can think of these rank-one matrices as the fundamental brushstrokes of the image, each contributing a basic pattern. What's more, the SVD arranges these brushstrokes in order of importance, a hierarchy dictated by numbers called singular values. The first term, associated with the largest singular value, captures the most dominant feature of the image; the second term adds the next most important feature, and so on.

The best rank-$k$ approximation of our image is then found by simply keeping the first $k$ most important brushstrokes and discarding the rest. The magic is that for most natural images, the singular values decrease very rapidly. A handful of terms often suffice to create an image that is visually indistinguishable from the original, even though it may require only a fraction of the data to store. This is the fundamental principle behind many data compression techniques. We are, quite literally, projecting the high-dimensional, complex original image onto a much lower-dimensional subspace spanned by its most important features. And the theory doesn't just tell us this works; it provides a precise measure of the cost. The error of our approximation, how much "fuzziness" we've introduced, is directly related to the magnitude of the singular values we chose to throw away.

The Experimentalist's Dilemma: Finding Signals in the Noise

Science rarely presents us with the clean, perfect data of a textbook. When an electrical engineer measures the voltage across a resistor for a given current, the points never fall perfectly on a line as Ohm's law ($V = IR$) would suggest. Measurement devices have noise, experiments have fluctuations. We are left with a cloud of data points that seem to suggest a trend. What, then, is the "true" resistance $R$?

The celebrated method of "least squares" is the answer, and it is nothing other than our principle of best approximation in disguise. Imagine each of our $n$ voltage measurements forms a component of a vector $\mathbf{v}$ in an $n$-dimensional space. Our model, $V = IR$, constrains the possible "perfect" outcomes to lie along a single line in this space, the line defined by the vector of currents $\mathbf{i}$. The noisy data vector $\mathbf{v}$ will not be on this line. The best estimate for the resistance $R$ is the one that defines a point on the line ($R\mathbf{i}$) that is closest to our data vector $\mathbf{v}$. This is precisely the orthogonal projection of the data vector onto the line defined by the model. The simple formula that pops out of this geometric picture is the same one used by scientists and data analysts every day to fit lines to noisy data.
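In this one-parameter case the projection formula is just $\hat{R} = \langle \mathbf{v}, \mathbf{i} \rangle / \langle \mathbf{i}, \mathbf{i} \rangle$. A toy sketch with made-up numbers (the "true" resistance and noise level are invented purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(42)
R_true = 4.7                                     # hypothetical resistance, in ohms
i = np.linspace(0.1, 1.0, 20)                    # applied currents
v = R_true * i + rng.normal(0.0, 0.05, i.size)   # noisy voltage readings

# Orthogonal projection of the data vector v onto the line spanned by i:
# R_hat = <v, i> / <i, i>, the one-parameter least-squares estimate.
R_hat = (v @ i) / (i @ i)
print(R_hat)   # close to the true value of 4.7
```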

This idea extends far beyond simple lines. In materials science, one might model the stiffness of a reinforced composite material with a matrix. Again, we can use the principle of best approximation to find a simpler, rank-one matrix that captures the single most important directional stiffness of the material, providing a computationally cheap and insightful physical model. In all these cases, we are projecting the complex, messy reality of our measurements onto a simplified, idealized subspace defined by our scientific theory. The projection gives us the best possible version of our theory in light of the data.

The Infinite Realm: From Vectors to Functions

Now, let's take a leap of imagination. What if our "vector" had not just three, or a million, but infinitely many components? This is the world of functions. A function like $f(x) = \exp(x)$ can be thought of as a point in an infinite-dimensional space, where its "coordinates" are its values at every single point $x$. Can we still speak of projections and best approximations?

Absolutely. The inner product, which we used to define orthogonality for vectors, can be generalized to functions via an integral. The best approximation of a complex function within a subspace of simpler functions (say, the space of all linear polynomials $p(x) = a + bx$) is found by the very same principle: project the complex function onto that subspace. This forms the foundation of approximation theory, where we approximate bewildering functions with manageable ones like polynomials or trigonometric series (like Fourier series).

Sometimes, this projection reveals a surprising and beautiful structure. Consider approximating the function $f(t) = \exp(t)$ on the interval $[-1, 1]$ using only even functions (functions where $g(-t) = g(t)$). The space of all even functions is a subspace of the larger space of all functions. What is the projection of $\exp(t)$ onto this subspace? The answer is profoundly simple. Any function can be uniquely split into an even part and an odd part: $f(t) = f_{\text{even}}(t) + f_{\text{odd}}(t)$. It turns out that the even and odd parts are orthogonal to each other! Therefore, the projection of $f(t)$ onto the even subspace is simply its even part, which for $\exp(t)$ is the hyperbolic cosine, $\cosh(t) = (\exp(t) + \exp(-t))/2$. The intimidating machinery of projection theory elegantly reduces to the simple act of symmetrization.
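The claim is easy to test numerically: symmetrizing $\exp$ yields $\cosh$, and the leftover odd part integrates to zero against any even function. A short sketch (the `even_part` helper is our own):

```python
import numpy as np
from scipy.integrate import quad

def even_part(f):
    """Symmetrization: the projection of f onto the even functions."""
    return lambda t: 0.5 * (f(t) + f(-t))

proj = even_part(np.exp)

# The projection of exp onto the even subspace should be cosh(t)
t = np.linspace(-1.0, 1.0, 7)
print(np.max(np.abs(proj(t) - np.cosh(t))))   # ~0

# The leftover odd part (here sinh) is orthogonal to any even function, e.g. t^2
odd = lambda t: np.exp(t) - proj(t)
val, _ = quad(lambda t: odd(t) * t**2, -1.0, 1.0)
print(val)   # ~0: odd times even integrates to zero on [-1, 1]
```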

A Common Thread: Unifying Fields

This single geometric concept serves as a powerful unifying thread, weaving together disparate fields of thought.

​​Quantum Mechanics:​​ In the quantum world, the state of a particle is described by a "wavefunction," which is a vector in an infinite-dimensional Hilbert space. The foundational Schrödinger equation is often impossible to solve exactly for real-world atoms and molecules. A cornerstone of quantum chemistry is to approximate the true, complex wavefunction as a combination of simpler, known solutions, such as the wavefunctions of a simple "particle-in-a-box." Finding the best approximation amounts to projecting the true state onto the subspace spanned by these simpler basis functions. The coefficients of this combination, which give the best possible energy estimate for the ground state, are found by computing these projections.

​​Probability and Finance:​​ What is the best prediction for tomorrow's stock price, given all the information we have today? This question of prediction and filtering is central to fields from econometrics to signal processing. In the language of probability theory, the "best guess" for a future random outcome, given present knowledge, is called the conditional expectation. It minimizes the mean squared error of the prediction. Astonishingly, in the Hilbert space of random variables, conditional expectation is precisely an orthogonal projection. Estimating the future is projecting a future random variable onto the subspace representing all of today's available information.
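In a discrete toy model this projection view is easy to see: the conditional expectation is just the groupwise mean, and no other function of the observed variable achieves a smaller mean squared error. A hedged sketch (the data-generating model below is invented purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.integers(0, 3, 100000)                # "today's information": a discrete signal
y = 2.0 * x + rng.normal(0.0, 1.0, x.size)    # tomorrow's outcome: depends on x plus noise

# Conditional expectation E[Y | X]: the groupwise mean, i.e. the projection
# of y onto the subspace of (square-integrable) functions of x.
cond_mean = np.array([y[x == v].mean() for v in range(3)])
pred = cond_mean[x]

# Any other function of x, e.g. this deliberately biased predictor,
# has a strictly larger mean squared error.
other = 2.0 * x + 0.5
print(np.mean((y - pred) ** 2), np.mean((y - other) ** 2))
```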

​​Control Theory:​​ An engineer designing a flight controller for an aircraft might have an "ideal" mathematical model for how the system should respond. This ideal response, however, might be noncausal—it might require reacting to events before they happen, a physical impossibility. The engineer's task is to design a real-world, causal controller that mimics this ideal behavior as closely as possible. This is framed as finding the best approximation of the ideal (but unrealizable) system within the space of all stable, physically realizable systems. The theory not only provides the optimal design but also tells us the minimum possible error. This value represents a fundamental performance limit, a hard boundary set by the laws of causality that no amount of clever engineering can overcome.

From the mundane act of taking a digital photo to the esoteric challenge of predicting a quantum state, the principle of best approximation stands as a testament to the power of geometric intuition. It teaches us that in a world of overwhelming complexity, the best we can often do—and a remarkably good "best" at that—is to find the shadow of truth within a simpler, more manageable reality.