The 1-Norm: From Taxicab Geometry to Machine Learning

SciencePedia
Key Takeaways
  • The 1-norm, or Manhattan distance, measures distance by summing the absolute values of a vector's components, modeling movement constrained to a grid.
  • Unlike the round geometry of the Euclidean norm, the 1-norm's world is "pointy," with a diamond-shaped unit ball that leads to its unique properties.
  • The 1-norm's most significant application is promoting sparsity, which allows algorithms to automatically select important features and create simpler, more interpretable models.
  • Despite its differences from the Euclidean norm, it is "equivalent" in finite dimensions, ensuring that concepts like convergence often hold regardless of which norm is used.

Introduction

How we measure distance is a fundamental concept that shapes our understanding of the world. We are intuitively familiar with the straight-line "as the crow flies" distance, known in mathematics as the Euclidean or $L_2$-norm. But what happens when our world isn't an open field, but a city grid where movement is restricted to perpendicular streets? This simple, practical constraint gives rise to a different and powerful way of measuring distance: the 1-norm, or Manhattan distance. This article delves into this fascinating concept, revealing how a seemingly simple change in perspective unlocks a completely different geometry with profound consequences.

This exploration is divided into two main parts. In the first chapter, "Principles and Mechanisms," we will lay the groundwork by defining the 1-norm, examining its unique geometric properties, and contrasting it with its Euclidean counterpart. We will discover why its "unit circle" is a diamond and how this shape is the key to its most celebrated ability: promoting sparsity. Following this, the chapter on "Applications and Interdisciplinary Connections" will showcase how this humble metric has become an indispensable tool across a vast scientific landscape, from data science and machine learning to solid-state physics and systems biology, enabling everything from automated scientific discovery to the creation of simpler, more robust predictive models.

Principles and Mechanisms

How do we measure distance? The question seems so simple, so elementary, that we barely think to ask it. If you want to know the distance between two trees in a field, you take out a measuring tape and find the straight-line path between them. This is the world as described by Euclid, the geometry we all learn in school. It’s intuitive, it’s elegant, and it’s governed by the famous Pythagorean theorem: $a^2 + b^2 = c^2$. This familiar "as the crow flies" distance is what mathematicians call the Euclidean norm, or the $L_2$-norm.

But what if you're not a crow? What if you are a taxi driver in Manhattan, a city laid out on a strict grid? You can’t drive through buildings. To get from one point to another, you must travel along the north-south avenues and the east-west streets. Your path is a series of right-angled turns. How do you measure your distance then? You don't care about the straight-line path; you care about the total number of blocks you have to drive.

This simple, practical constraint gives birth to a completely different way of measuring distance, a different kind of geometry. We call this the Manhattan distance, taxicab distance, or more formally, the 1-norm (or $L_1$-norm).

Life on the Grid: Defining the 1-Norm

Let's get precise. If you have a vector, say $\vec{v} = (x, y)$, which represents a displacement from an origin, its familiar Euclidean length, or $L_2$-norm, is $||\vec{v}||_2 = \sqrt{x^2 + y^2}$.

The 1-norm, by contrast, simply sums the absolute values of the components. For our vector $\vec{v} = (x, y)$, the 1-norm is $||\vec{v}||_1 = |x| + |y|$. If you're moving from a point $A = (x_A, y_A)$ to a point $B = (x_B, y_B)$, the Manhattan distance is simply $|x_B - x_A| + |y_B - y_A|$. This extends perfectly to any number of dimensions. For a vector in $n$-dimensional space, $\vec{v} = (v_1, v_2, \dots, v_n)$, the 1-norm is:

$$||\vec{v}||_1 = \sum_{i=1}^n |v_i|$$
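
In code, the definition is a one-liner. Here is a minimal, dependency-free sketch; the function names `norm_1` and `manhattan_distance` are illustrative, not taken from any library:

```python
def norm_1(v):
    """1-norm: the sum of the absolute values of the components."""
    return sum(abs(x) for x in v)

def manhattan_distance(a, b):
    """Manhattan distance between points a and b: the 1-norm of their difference."""
    return sum(abs(bi - ai) for ai, bi in zip(a, b))

print(norm_1((3, -4)))                     # 7, versus a Euclidean length of 5
print(manhattan_distance((1, 2), (4, 6)))  # |4 - 1| + |6 - 2| = 7
```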

This isn't just a quirky mathematical game. It's a practical model for many real-world scenarios. Imagine a robotic arm moving parts in a warehouse on a system of perpendicular tracks. To calculate the total distance traveled from a starting point $A$ through an intermediate point $B$ to a final destination $C$, you simply sum the Manhattan distances for each leg of the journey, just as a taxi meter would sum the fares for different legs of a trip. Or consider a bio-inspired robot whose leg movements are restricted to motions parallel to the coordinate axes; its energy consumption might be better estimated by the 1-norm than the Euclidean norm.

A Tale of Two Geometries

The difference between these two norms is not just a matter of calculation; it leads to two profoundly different geometric worlds. The most striking way to see this is to ask a simple question: What does a "circle" look like? A circle, by definition, is the set of all points that are an equal distance from a central point.

In our familiar Euclidean world, setting $||\vec{v}||_2 = 1$ gives us the equation $\sqrt{x^2 + y^2} = 1$, or $x^2 + y^2 = 1$. This is the unit circle we all know and love.

But what happens in the taxicab world? What is the set of all points $(x, y)$ that are a distance of 1 from the origin? We set $||\vec{v}||_1 = 1$, which gives the equation $|x| + |y| = 1$. What does this shape look like? Let's trace it out.

  • In the first quadrant ($x \ge 0$, $y \ge 0$), it's $x + y = 1$, a line segment connecting $(1,0)$ and $(0,1)$.
  • In the second quadrant ($x \lt 0$, $y \ge 0$), it's $-x + y = 1$, a line segment connecting $(0,1)$ and $(-1,0)$.
  • And so on for the other two quadrants.

When you put these four line segments together, you don't get a smooth, round circle. You get a diamond—a square tilted by 45 degrees, with its vertices sitting squarely on the coordinate axes at $(1,0)$, $(0,1)$, $(-1,0)$, and $(0,-1)$. This "unit ball" of the 1-norm is sharp, pointy, and angular.

This fundamental difference in shape tells us something deep. When are these two ways of measuring distance ever the same? When does $|x| + |y| = \sqrt{x^2 + y^2}$? If we square both sides, we get $(|x| + |y|)^2 = x^2 + y^2$, which expands to $x^2 + 2|x||y| + y^2 = x^2 + y^2$. This simplifies to $2|x||y| = 0$, which can only be true if either $x = 0$ or $y = 0$.

This is a remarkable result! The only non-zero vectors for which the Manhattan distance and the Euclidean distance agree are those that lie exactly on one of the coordinate axes. For any other vector, the path "as the crow flies" is always shorter than the path "as the taxi drives." The 1-norm is always greater than or equal to the 2-norm, and they are only equal in these very specific directions.
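
A quick numerical spot-check of this claim (a throwaway sketch; `norm_1` and `norm_2` are illustrative helper names, not library functions):

```python
import math

def norm_1(v):
    return sum(abs(x) for x in v)

def norm_2(v):
    return math.sqrt(sum(x * x for x in v))

# On a coordinate axis the two norms agree...
print(norm_1((5, 0)), norm_2((5, 0)))  # 5 5.0
# ...but for any off-axis vector, the taxi path is strictly longer.
print(norm_1((3, 4)), norm_2((3, 4)))  # 7 5.0
```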

The Missing Parallelogram

Why is the Euclidean world so smooth and the taxicab world so sharp? The answer lies in a beautiful geometric property called the parallelogram law. In Euclidean space, if you take any two vectors, $\vec{u}$ and $\vec{v}$, and form a parallelogram with them, the sum of the squares of the lengths of the two diagonals is equal to twice the sum of the squares of the lengths of the two sides. Formally:

$$||\vec{u}+\vec{v}||_2^2 + ||\vec{u}-\vec{v}||_2^2 = 2(||\vec{u}||_2^2 + ||\vec{v}||_2^2)$$

This law is intimately connected to the existence of an inner product (or dot product), which allows us to talk about angles and orthogonality. The Euclidean norm is born from an inner product: $||\vec{v}||_2 = \sqrt{\vec{v} \cdot \vec{v}}$.

Does the 1-norm obey this law? Let's test it with the simplest possible vectors: the standard basis vectors $\vec{u} = (1,0)$ and $\vec{v} = (0,1)$.

  • $||\vec{u}||_1 = |1| + |0| = 1$.
  • $||\vec{v}||_1 = |0| + |1| = 1$.
  • $\vec{u}+\vec{v} = (1,1)$, so $||\vec{u}+\vec{v}||_1 = |1| + |1| = 2$.
  • $\vec{u}-\vec{v} = (1,-1)$, so $||\vec{u}-\vec{v}||_1 = |1| + |-1| = 2$.

Now let's plug these into the parallelogram law equation:

  • Left side: $||\vec{u}+\vec{v}||_1^2 + ||\vec{u}-\vec{v}||_1^2 = 2^2 + 2^2 = 8$.
  • Right side: $2(||\vec{u}||_1^2 + ||\vec{v}||_1^2) = 2(1^2 + 1^2) = 4$.

Clearly, $8 \ne 4$. The parallelogram law fails spectacularly. This tells us that the 1-norm does not come from an inner product. There is no consistent notion of "angle" in taxicab geometry in the way we're used to. This is the deep, structural reason for its "sharpness."

Bound but Not Broken: The Equivalence of Norms

Despite their differences, the 1-norm and 2-norm are not complete strangers. In a finite-dimensional space, they are tethered to each other. You can't make one arbitrarily large while keeping the other small. This relationship is captured by the idea of norm equivalence. For any vector $\vec{v}$ in an $n$-dimensional space, it can be shown that:

$$||\vec{v}||_2 \le ||\vec{v}||_1 \le \sqrt{n}\, ||\vec{v}||_2$$

The second part of this inequality, $||\vec{v}||_1 \le \sqrt{n}\, ||\vec{v}||_2$, can be elegantly proven using the Cauchy-Schwarz inequality. It tells us that the 1-norm of a vector is at most $\sqrt{n}$ times its 2-norm.
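
Both bounds are easy to spot-check numerically. A sketch over random vectors (the seed, dimension, and sample count are chosen arbitrarily):

```python
import math
import random

def norm_1(v):
    return sum(abs(x) for x in v)

def norm_2(v):
    return math.sqrt(sum(x * x for x in v))

random.seed(0)
n = 10
for _ in range(1000):
    v = [random.uniform(-1.0, 1.0) for _ in range(n)]
    lo, mid, hi = norm_2(v), norm_1(v), math.sqrt(n) * norm_2(v)
    assert lo <= mid + 1e-12 and mid <= hi + 1e-12
print("both bounds held for 1000 random vectors")
```

The lower bound is tight for vectors on a coordinate axis; the upper bound is tight for vectors with all components equal, which is exactly what Cauchy-Schwarz predicts.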

This might seem like an abstract mathematical curiosity, but it has enormous practical consequences. Consider an optimization algorithm in machine learning that iteratively refines a set of parameters. If we can prove that the sequence of parameter vectors gets progressively closer together (forming a Cauchy sequence) using the familiar Euclidean norm, this inequality guarantees that the sequence is also getting closer together when measured by the Manhattan norm. This equivalence ensures that for many theoretical purposes, like proving convergence, the choice of norm doesn't change the final answer, giving mathematicians and engineers immense flexibility.

The Superpower of the 1-Norm: A Love for Zeros

We now arrive at the most exciting part of our story. In the last few decades, the 1-norm has become a superstar in fields like machine learning, statistics, and signal processing. Why? Because it has a peculiar and incredibly useful bias: it promotes sparsity. Sparsity means that many components of a solution vector are exactly zero.

Imagine you are trying to build a predictive model with thousands of potential features (e.g., trying to predict house prices using everything from square footage to the color of the front door). A sparse model is one that concludes most of these features are irrelevant and sets their corresponding weights to zero. Such a model is simpler, more interpretable, and often more robust.

The 1-norm provides a magical way to find such sparse solutions. This is often done through a process called $L_1$ regularization. Geometrically, we can picture it this way: imagine minimizing some function, whose level curves are smooth ovals, but with the constraint that our solution must lie on the $L_1$ "unit ball" (our diamond). The solution will be the first point on the diamond that a shrinking level curve touches. Because the diamond has sharp corners that lie on the axes, it's highly probable that the contact point will be one of these corners. And what is special about a corner like $(0,1)$? One of its components is exactly zero! Contrast this with an $L_2$ constraint (a circle). The smooth boundary of the circle means the contact point can be anywhere, and it's highly unlikely to be exactly on an axis.
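
This corner-seeking behavior can be observed directly with a brute-force search. The sketch below minimizes a made-up quadratic (circular level curves centered at $(2, 0.5)$) over a grid of points inside the $L_1$ unit diamond; the objective and grid resolution are illustrative only:

```python
def f(x, y):
    """A smooth bowl whose unconstrained minimum lies outside the diamond."""
    return (x - 2) ** 2 + (y - 0.5) ** 2

best_xy, best_val = None, float("inf")
N = 400  # grid resolution
for i in range(-N, N + 1):
    for j in range(-N, N + 1):
        x, y = i / N, j / N
        if abs(x) + abs(y) <= 1.0:  # stay inside the L1 unit ball
            val = f(x, y)
            if val < best_val:
                best_val, best_xy = val, (x, y)

print(best_xy)  # (1.0, 0.0): the constrained minimum lands on a corner, with y exactly 0
```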

We can also understand this superpower through the lens of calculus. The derivative of the absolute value function $|x|$ is $-1$ for $x < 0$ and $+1$ for $x > 0$. But at $x = 0$, the function has a sharp corner, and the derivative is undefined. Instead of a single value for the slope, we have a range of possibilities, from $-1$ to $1$. This set of possible slopes is called the subdifferential.

When we use an optimization algorithm like the subgradient method to minimize the 1-norm, we have a choice to make whenever a component hits zero. Consider a point where one component is zero, like $(v_1, 0, v_3)$ with $v_1 > 0$ and $v_3 < 0$. The subgradient (the "slope" vector) will be $(1, g_2, -1)$, where $g_2$ can be any value between $-1$ and $1$.

  • If we choose a non-zero $g_2$ (say, $g_2 = 1$), the update step will push the second component away from zero.
  • But if we choose $g_2 = 0$, the update step for that component is zero! The component that is zero stays zero.
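
Here is that choice in miniature. A sketch of a single subgradient step; the step size and starting point are invented, and the sign pattern matches the example above:

```python
def subgradient_step(v, g, lr=0.1):
    """One step of v <- v - lr * g, where g is a subgradient of ||.||_1 at v."""
    return tuple(vi - lr * gi for vi, gi in zip(v, g))

v = (2.0, 0.0, -3.0)
# Valid subgradients at v: sign(v_i) where v_i != 0, anything in [-1, 1] at 0.
pushed = subgradient_step(v, (1.0, 1.0, -1.0))  # choosing g2 = 1 ...
stuck  = subgradient_step(v, (1.0, 0.0, -1.0))  # ... versus choosing g2 = 0

print(pushed)  # (1.9, -0.1, -2.9): the zero component gets pushed off zero
print(stuck)   # (1.9, 0.0, -2.9): with g2 = 0, the zero component stays at zero
```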

This ability to "stick" at zero is the core mechanism by which $L_1$ optimization actively seeks out and preserves sparse solutions. It is a beautiful confluence of geometry (the corners of the diamond) and analysis (the nature of the subdifferential at zero). This very property is what enables technologies like compressed sensing, which allows us to reconstruct high-resolution images from remarkably few measurements, and it is a cornerstone of modern data science. It also finds use in many other areas, such as in analyzing linear systems, where the corresponding operator 1-norm (the maximum absolute column sum of a matrix) provides a simple way to check if an iterative process will converge.
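
That last quantity is simple to compute. A sketch of the operator 1-norm for a matrix stored as a list of rows (the matrix values are arbitrary):

```python
def operator_norm_1(A):
    """Operator 1-norm of a matrix: the maximum absolute column sum."""
    n_cols = len(A[0])
    return max(sum(abs(row[j]) for row in A) for j in range(n_cols))

A = [[0.5, -0.2],
     [0.1,  0.3]]
# Column sums are 0.6 and 0.5; both below 1, so an iteration x <- A x + b contracts.
print(operator_norm_1(A))  # 0.6
```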

So, the humble taxicab distance, born from the simple idea of navigating a grid, turns out to possess a deep and powerful structure. It gives us a different geometry, a different world with sharp corners and broken laws, but one whose unique properties are precisely what we need to solve some of the most challenging problems of the information age.

Applications and Interdisciplinary Connections

You might be tempted to think that our choice of how to measure distance is a settled matter. After all, we live in a world where Pythagoras's theorem is king; the shortest path between two points is a straight line, its length given by the familiar Euclidean norm. We learn it in school, we use it to navigate, and it feels deeply, physically true. But what if we were to walk a different path? What if we decided to measure distance not as the crow flies, but as a taxi drives through the grid of a city? This simple change of perspective, from the Euclidean ($L_2$) norm to the "taxicab" or $L_1$ norm, doesn't just give us a new number for distance. It unlocks a whole new universe of geometry, with profound and often surprising applications across the entire landscape of science.

Our journey begins not with a grand theory, but with a simple, moving particle. Imagine a firefly flitting about in the first quadrant of a garden at night. We can track its straight-line distance from us (at the origin), but we could also ask: how fast is its "taxicab distance," $d_1 = |x| + |y|$, changing? If the firefly has a velocity $\vec{v}$ at some angle $\theta$, a little bit of calculus shows that the rate of change of its taxicab distance is not simply its speed $v$, but rather $v(\cos\theta + \sin\theta)$. Right away, we see something strange. The rate of change depends on the direction of travel in a way that is utterly foreign to Euclidean distance. Moving diagonally ($\theta = \pi/4$) maximizes this rate, while moving parallel to an axis does not. This is our first clue: the $L_1$ world has preferred directions. It's not the uniform, isotropic space we're used to.
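
Under the stated assumptions (first quadrant, constant speed), this rate is easy to tabulate:

```python
import math

def taxicab_rate(speed, theta):
    """Rate of change of d1 = x + y in the first quadrant: v * (cos θ + sin θ)."""
    return speed * (math.cos(theta) + math.sin(theta))

for theta in (0.0, math.pi / 4, math.pi / 2):
    print(round(taxicab_rate(1.0, theta), 3))
# 1.0 along either axis, but about 1.414 (= sqrt(2)) on the diagonal
```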

This anisotropy, this preference for the axes, has stunning consequences when we use it to define "territory." In solid-state physics, the Wigner-Seitz cell of an atom in a crystal lattice is the region of space closer to that atom than to any other. It's the atom's personal turf. When "closer" means Euclidean distance, we get beautifully symmetric polygons. But what if the interactions that define proximity in a crystal followed the $L_1$ norm instead? The resulting Wigner-Seitz cell transforms completely. For a centered rectangular lattice, what might have been a simple rectangle or hexagon warps into a new shape, a hexagon whose proportions are dictated by the axis-aligned distances to its neighbors. The fundamental domain of physical space itself is altered by our choice of metric. This isn't just a game; it's a profound demonstration that the mathematical rules we choose can redefine our picture of physical reality.

This idea of a feature space with preferred directions is not an esoteric quirk of physics; it's a central challenge in data science. When we have a dataset where features have vastly different scales—say, one feature is income in dollars and another is years of experience—the Manhattan distance will be utterly dominated by the income feature. The distance calculation becomes almost blind to the other axes. This "axis-aligned sensitivity" means that techniques like clustering can give nonsensical results unless the data is first normalized to give each feature a comparable voice. The $L_1$ norm forces us to think carefully about what our axes mean and whether they are on an equal footing. This sensitivity extends to the design of algorithms themselves. A classic algorithm like the one for finding the closest pair of points must be re-evaluated and its geometric proofs re-derived when we switch from the smooth, round world of $L_2$ to the sharp, diamond-like world of $L_1$. The same logic even applies in highly specialized domains like materials chemistry, where scientists engineer custom coordinate systems for representing ternary compositions on a triangular diagram. Calculating the distance between two material compositions requires a careful transformation into a 2D plane, and the choice of the $L_1$ norm provides a specific, axis-sensitive way to quantify how "different" two materials are. The lessons learned in one field—like the non-rotationally invariant nature of the $L_1$ norm—can have deep implications in another, such as in the study of stochastic processes, where a random field whose correlations are defined by the $L_1$ norm will be stationary (the statistics don't change as you move around) but not isotropic (the statistics do change as you rotate).
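
The scaling problem can be made concrete with two data points. A sketch with invented income and experience numbers, and an equally invented min-max scaling range:

```python
def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

# (income in dollars, years of experience) -- made-up values
alice = (90_000, 2)
bob   = (91_000, 20)
print(manhattan(alice, bob))  # 1018: the distance is almost entirely the income gap

def min_max_scale(p, lo, hi):
    """Rescale each feature to [0, 1] given assumed per-feature ranges."""
    return tuple((x - l) / (h - l) for x, l, h in zip(p, lo, hi))

lo, hi = (50_000, 0), (150_000, 40)
scaled = manhattan(min_max_scale(alice, lo, hi), min_max_scale(bob, lo, hi))
print(round(scaled, 2))  # 0.46: after scaling, the experience gap dominates instead
```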

Perhaps the most revolutionary application of the $L_1$ norm, however, comes from embracing its pointy geometry. In machine learning and statistics, we often face a problem: we have far more possible explanations (features) than we have data to support them. Think of trying to predict a disease from thousands of genes, or the stock market from endless indicators. How do we find the few factors that truly matter? This is a search for a sparse solution—a model with most of its coefficients set to zero. This is the principle of Occam's Razor: the simplest explanation is often the best.

And here is where the $L_1$ norm performs its magic. When we try to find a solution to a problem, like fitting a line to data, we often add a penalty term to prevent the model from getting too complex. If we use an $L_2$ penalty (Tikhonov or Ridge regression), we encourage the coefficients to be small, but they rarely become exactly zero. The solution is "dense." But if we use an $L_1$ penalty (LASSO, for Least Absolute Shrinkage and Selection Operator), something remarkable happens. The optimization process is naturally drawn to solutions where many coefficients are exactly zero. Geometrically, the "ball" of constant $L_1$ norm in two dimensions is a diamond. When you try to find where a data-fitting plane first touches this diamond, it will almost always be at one of the sharp corners. And at the corners, one of the coordinates is zero. The $L_1$ norm is a machine for producing simplicity.
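
In one dimension, both penalized problems have closed-form minimizers, which makes the "exactly zero" claim easy to verify. A hedged sketch; the data value $z$ and penalty strength are invented for illustration:

```python
def l1_minimizer(z, lam):
    """argmin over w of (w - z)^2 / 2 + lam * |w|: the soft-thresholding rule."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0  # small coefficients are set to exactly zero

def l2_minimizer(z, lam):
    """argmin over w of (w - z)^2 / 2 + lam * w^2 / 2: shrinks, but never zeroes."""
    return z / (1.0 + lam)

z, lam = 0.3, 0.5
print(l1_minimizer(z, lam))  # 0.0: the LASSO-style penalty kills the small coefficient
print(l2_minimizer(z, lam))  # ~0.2: the ridge-style penalty merely shrinks it
```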

This principle is so powerful that it is now being used to automate scientific discovery itself. Imagine you have data from a complex physical system, like a turbulent fluid or a chemical reaction, and you want to discover the underlying partial differential equation (PDE) that governs it. The traditional approach requires human genius and intuition. A modern approach, however, casts the problem as a sparse regression. You create a huge library of all possible mathematical terms that could appear in the equation ($u$, $u^2$, $u_x$, $u_{xx}$, etc.) and then use $L_1$ regularization to find the sparsest combination of these terms that fits the data. The algorithm automatically "discovers" that the dynamics are governed by, say, a diffusion term and a simple reaction term, while discarding hundreds of other irrelevant possibilities. This is the scientific method, automated and powered by the humble sum of absolute values.

Finally, the $L_1$ norm can transcend its geometric origins entirely. In systems biology, if we have a vector representing the changes in concentration of various metabolites after a drug is administered, what does its norm mean? The $L_2$ norm, the vector's geometric length, tells us the magnitude of the overall displacement of the cell's metabolic state. But the $L_1$ norm tells us something different and arguably more intuitive: it is the total magnitude of metabolic turnover, the sum of the absolute changes of every individual metabolite. It's not about the net displacement, but the total "effort" or "activity" within the system.
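
The distinction is easy to see with numbers. A sketch with invented concentration changes:

```python
import math

# Invented concentration changes for four metabolites after a perturbation.
delta = [0.5, -0.3, 0.1, -0.1]

total_turnover = sum(abs(d) for d in delta)              # 1-norm: total activity
net_displacement = math.sqrt(sum(d * d for d in delta))  # 2-norm: overall shift

print(round(total_turnover, 3))    # 1.0: every individual change counts as "effort"
print(round(net_displacement, 3))  # 0.6: the geometric length of the shift is smaller
```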

From city blocks to crystal structures, from clustering data to discovering the laws of nature, the $L_1$ norm reveals itself not as a mere mathematical curiosity, but as a fundamental tool with a distinct philosophy. It is the metric of the axes, the engine of sparsity, and the measure of total effort. Its inherent beauty lies in this very diversity—how a single, simple idea, the sum of absolute values, can provide such a rich and powerful lens through which to view, interpret, and shape our world.