Popular Science
Symbolic Regression

Key Takeaways
  • Symbolic regression automates the discovery of both the structure and parameters of an equation from data, going beyond traditional curve-fitting.
  • Its core mechanism, often genetic programming, evolves equations by optimizing a fitness function that balances accuracy with simplicity to avoid overfitting (Occam's Razor).
  • Modern approaches use reinforcement learning to "write" equations or Bayesian methods that inherently penalize complexity to find the most probable model.
  • Symbolic regression is a powerful hypothesis generator, but its outputs must be vetted by human experts to ensure physical plausibility and distinguish true laws from data artifacts.

Introduction

For centuries, the discovery of the fundamental mathematical laws that govern our universe has been the hallmark of human genius. From Kepler's planetary motions to Maxwell's electromagnetism, these elegant equations represent the pinnacle of scientific understanding. But what if we could build a tool to assist in this process, an AI that could sift through mountains of data and propose the underlying formula? This is the ambitious goal of symbolic regression, a method that seeks not just to fit data to a known equation, but to discover the equation itself. It addresses the fundamental limitation of standard data analysis, where the form of the model must be assumed beforehand, by creating a framework where the mathematical structure itself is a variable to be found.

This article will guide you through the exciting world of automated scientific discovery. First, in "Principles and Mechanisms," we will look under the hood to understand how symbolic regression works, exploring the evolutionary ideas of genetic programming, the probabilistic rigor of Bayesian methods, and the intelligent agents of reinforcement learning. Then, in "Applications and Interdisciplinary Connections," we will see these principles in action, witnessing how algorithms can rediscover the laws of celestial mechanics, uncover fundamental conservation laws, and decode the complex dance of chemical reactions, all while highlighting the indispensable partnership between the machine and the scientist.

Principles and Mechanisms

So, we have this marvelous idea of teaching a computer to be the next Newton or Maxwell—to look at a pile of data and pluck from it the elegant, simple law that governs the whole affair. But how, precisely, does one go about this? It’s one thing to recognize a beautiful equation when you see one, but it’s quite another to build a machine that can create it from scratch. This is where we get our hands dirty and look under the hood at the principles and mechanisms of symbolic regression.

The Quest for a Formula: Beyond Mere Curve-Fitting

Let's begin with a simple scenario. Imagine you're a materials scientist studying how a gas, say nitrogen, dissolves in a new kind of polymer. You’ve run experiments, varied the pressure P and the temperature T, and dutifully recorded the resulting solubility S. You have a mountain of data points.

Now, a standard approach, what we might call "curve-fitting," would be to assume a fixed form for the equation. For example, you might guess that the relationship is a simple plane, S = aP + bT + c, and your job is merely to find the best values for the coefficients a, b, and c. This is useful, but it's not discovery. You've already hard-coded the structure of the law.

Symbolic regression is far more ambitious. It doesn’t just want to find the coefficients; it wants to find the very symbols in the equation. Suppose your physicist's intuition tells you that solubility should be proportional to pressure, but the temperature dependence is a mystery. Your model is S(P, T) = C · P · g(T). The challenge now is to discover the function g(T). Is it g(T) = 1/T? Maybe g(T) = exp(−B/T)? Or perhaps something else entirely?

To make progress, you could try a few plausible candidates for g(T), and for each one, find the best-fitting constant C that minimizes the overall error—say, the sum of the squared differences between your model's predictions and your experimental data. By comparing the minimum error for each candidate function, you can declare a winner. This is precisely the kind of manual model selection exercise that forms the conceptual core of symbolic regression: systematically testing different mathematical structures to find the one that best explains the evidence.
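This manual search is easy to sketch in code. In the Python sketch below, the data are synthetic, generated from an assumed "true" law S = C · P · exp(−B/T), and the four candidate forms are a hand-picked illustration; for a model of the form S = C · P · g(T), the best-fitting C has a closed form, so each candidate can be scored instantly.

```python
import numpy as np

# Trying a handful of candidate forms for g(T) by hand.  The data are
# synthetic, generated from an assumed "true" law S = C * P * exp(-B/T);
# the constants and the candidate list are illustrative, not real
# polymer measurements.
C_TRUE, B_TRUE = 2.0, 300.0
rng = np.random.default_rng(0)
P = rng.uniform(1, 10, 50)          # pressures
T = rng.uniform(250, 400, 50)       # temperatures (K)
S = C_TRUE * P * np.exp(-B_TRUE / T)

candidates = {
    '1/T':         lambda T: 1.0 / T,
    'exp(-300/T)': lambda T: np.exp(-300.0 / T),
    'T':           lambda T: T,
    'log(T)':      lambda T: np.log(T),
}

def best_fit_mse(g):
    """For S = C * P * g(T), the least-squares C has a closed form:
    C = <S, P*g(T)> / <P*g(T), P*g(T)>."""
    x = P * g(T)
    C = np.dot(S, x) / np.dot(x, x)
    return float(np.mean((S - C * x) ** 2))

errors = {name: best_fit_mse(g) for name, g in candidates.items()}
winner = min(errors, key=errors.get)
print(winner)   # → exp(-300/T): the form that generated the data
```

The winner is, of course, the candidate whose structure matches the generating law; on real data the errors would all be non-zero and the comparison would be less clear-cut.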

The trouble, of course, is that the space of possible functions is not a handful of candidates, but a sprawling, infinite jungle of mathematical expressions. We can’t test them one by one. We need a clever guide, an automatic explorer to navigate this wilderness for us.

The Art of Exploration: Evolving an Equation

One of the most intuitive and powerful ways to search for equations is to borrow a brilliant idea from nature: evolution. This is the heart of a technique called Genetic Programming (GP). Instead of evolving creatures, we evolve a population of mathematical expressions.

It works something like this: we start with a population of completely random, often nonsensical equations. Then, we let them "compete." The "fittest" equations get to survive and "reproduce," creating the next generation of equations, which are hopefully a little bit better. This cycle repeats, and over many generations, complex and accurate equations can emerge from primitive chaos.

But this immediately raises the crucial question: what does it mean for an equation to be "fit"? This is defined by a fitness function, a scoring rule that guides the entire evolutionary process. Designing a good fitness function is an art, and it typically involves balancing three competing demands.

First, the equation must be accurate. It has to actually match the data. We measure this with an error metric. If we assume the noise in our measurements is random and bell-shaped (a Gaussian distribution), minimizing the Mean Squared Error (MSE) is equivalent to maximizing the likelihood of the data. But if we suspect our data might have some wild outliers, a better choice is the Mean Absolute Error (MAE), which is less sensitive to them. This choice is not arbitrary; it's rooted in statistics: assuming the noise follows a Laplace distribution leads you directly to MAE.

Second, the equation must be simple. This is a quantitative embodiment of Occam's Razor: among competing hypotheses, the one with the fewest assumptions should be selected. A monstrously complex equation that fits the data perfectly is often a sign of overfitting. It has learned the noise and quirks of your specific dataset, but it won't generalize to new data. It's not a law of nature; it's a tailor-made suit that only fits one person. So, we penalize complexity. We might measure this by simply counting the number of nodes or operations in the expression's tree structure. The fitness score is thus a trade-off: a combination of the error term and a complexity penalty.

Finally, the equation must be robust. In the wild jungle of mathematical expressions, it's easy to create things that "break"—like dividing by zero, or taking the logarithm of a negative number. A program that produces non-finite outputs for our inputs is useless. So, our fitness function must also heavily penalize such ill-behaved expressions, ensuring that the survivors are not only accurate and simple but also mathematically sound.

Once we have our fitness function, the "reproduction" happens through operations like crossover (where two parent equations swap sub-expressions to create children) and mutation (where a small part of an equation is randomly changed). It's a beautiful, chaotic, and often surprisingly effective process for discovering hidden mathematical gems.
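A toy version of this evolutionary loop fits in a few dozen lines. In the sketch below, the operator set, population size, complexity weight, and target data (generated from y = x^2 + x) are all illustrative assumptions, not a production GP system:

```python
import math
import random

# A toy genetic-programming loop, sketched from the description above.
# Expression trees are nested tuples ('+', left, right), the leaf 'x',
# or a numeric constant.
OPS = {'+': lambda a, b: a + b,
       '-': lambda a, b: a - b,
       '*': lambda a, b: a * b}

def evaluate(tree, x):
    if tree == 'x':
        return x
    if not isinstance(tree, tuple):
        return tree
    op, left, right = tree
    return OPS[op](evaluate(left, x), evaluate(right, x))

def size(tree):
    if not isinstance(tree, tuple):
        return 1
    return 1 + size(tree[1]) + size(tree[2])

def random_tree(depth=3):
    if depth == 0 or random.random() < 0.3:
        return random.choice(['x', float(random.randint(-2, 2))])
    return (random.choice(list(OPS)), random_tree(depth - 1), random_tree(depth - 1))

XS = [i / 4 for i in range(-8, 9)]
TARGET = [x * x + x for x in XS]           # the "law" we hope to evolve

def fitness(tree):
    """Accuracy (MSE) plus an Occam penalty on size; lower is better.
    Non-finite outputs (robustness failures) get an infinite score."""
    err = sum((evaluate(tree, x) - t) ** 2 for x, t in zip(XS, TARGET)) / len(XS)
    if not math.isfinite(err):
        return float('inf')
    return err + 0.01 * size(tree)

def mutate(tree):
    if not isinstance(tree, tuple) or random.random() < 0.3:
        return random_tree(2)              # replace a random sub-expression
    return (tree[0], mutate(tree[1]), tree[2])

def crossover(a, b):
    if isinstance(a, tuple) and random.random() < 0.7:
        return (a[0], crossover(a[1], b), a[2])
    return b                               # graft a piece of parent b into a

random.seed(1)
pop = [random_tree() for _ in range(200)]
initial_best = min(fitness(t) for t in pop)
for generation in range(30):
    pop.sort(key=fitness)
    survivors = pop[:50]                   # selection: the fittest reproduce
    pop = survivors + [mutate(crossover(random.choice(survivors),
                                        random.choice(survivors)))
                       for _ in range(150)]
final_best = min(fitness(t) for t in pop)
print(final_best <= initial_best)          # → True: elitism means fitness
                                           #   never worsens across generations
```

Because the best survivors are carried over unchanged each generation, the best fitness is monotonically non-increasing; whether the loop actually finds the exact target expression depends on luck and the hyperparameters.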

Principled Paths to Discovery

While GP is a powerful workhorse, it’s not the only path up the mountain. More modern approaches frame the search in different, equally compelling ways.

One path uses the machinery of Reinforcement Learning (RL). Imagine an "agent"—in this case, a sophisticated program like a Recurrent Neural Network (RNN)—that learns to "write" an equation, one token at a time. It might first select a variable x, then an operator +, then a constant 2, and so on. After it completes an expression, it receives a reward, which is calculated much like the fitness in GP: a high reward for an accurate and simple formula, a low reward for a poor one. The agent's goal is to learn a policy—a strategy for choosing tokens—that maximizes its expected future rewards. Through trial and error, guided by an algorithm like REINFORCE, the agent refines its policy, effectively learning the "art" of crafting good equations.
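The flavor of this can be captured in a deliberately tiny sketch. In place of an RNN, the toy policy below is just an independent softmax distribution over tokens for each of three slots (operator, argument, argument), trained with the plain REINFORCE update and a running-average baseline; the data come from y = x · x, and every design choice here is a simplifying assumption:

```python
import math
import random

# A deliberately tiny REINFORCE sketch.  The agent "writes" a
# three-token expression  op(arg1, arg2)  and is rewarded for matching
# data from y = x * x.  This is an illustrative simplification, not a
# real symbolic-regression system.
OPS = ['+', '*']
ARGS = ['x', '1', '2']
SLOTS = [OPS, ARGS, ARGS]
XS = [-2.0, -1.0, 0.0, 1.0, 2.0, 3.0]

def value(token, x):
    return x if token == 'x' else float(token)

def reward(tokens):
    """High reward (near 1) for expressions that match the data."""
    op, a, b = tokens
    def f(x):
        va, vb = value(a, x), value(b, x)
        return va + vb if op == '+' else va * vb
    mse = sum((f(x) - x * x) ** 2 for x in XS) / len(XS)
    return 1.0 / (1.0 + mse)

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

random.seed(0)
logits = [[0.0] * len(slot) for slot in SLOTS]   # the learnable policy
baseline, lr = 0.0, 0.5
for step in range(2000):
    probs = [softmax(l) for l in logits]
    choice = [random.choices(range(len(p)), weights=p)[0] for p in probs]
    tokens = [slot[c] for slot, c in zip(SLOTS, choice)]
    R = reward(tokens)
    baseline += 0.05 * (R - baseline)            # variance-reducing baseline
    for slot, (c, p) in enumerate(zip(choice, probs)):
        for k in range(len(p)):                  # REINFORCE update:
            indicator = 1.0 if k == c else 0.0   # d log pi / d logit_k
            logits[slot][k] += lr * (R - baseline) * (indicator - p[k])

greedy = [slot[l.index(max(l))] for slot, l in zip(SLOTS, logits)]
print(greedy)   # with luck, the policy converges toward ['*', 'x', 'x']
```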

Another, more stately, path is the Bayesian Approach. Instead of a frantic, heuristic search, this method is about a careful, probabilistic weighing of evidence. Here, we restrict our search to a predefined "dictionary" of possible building blocks (e.g., x, y, x^2, sin(y), etc.). We then consider every possible model that can be built from subsets of this dictionary. For each model, we ask: "What is the probability of observing our data, given this particular model?" This quantity, the marginal likelihood or model evidence, is the heart of the Bayesian method.

What's magical about the marginal likelihood is that it automatically enacts a form of Bayesian Occam's Razor. A simple model that fits the data poorly will have low evidence. A very complex model can fit the data well, but it could also have generated many other possible datasets. The fact that it generated our specific dataset isn't as impressive; its predictive power is diluted, and its evidence is consequently penalized. The best model is often one that is "just right"—complex enough to capture the pattern, but simple enough to make the observed data a strong and specific consequence of its structure. The task then becomes a systematic calculation: compute the posterior probability for every model in our space and pick the one with the highest value (the maximum a posteriori, or MAP, model). This provides a rigorous, first-principles way to perform model discovery.
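In practice the marginal likelihood is often approximated rather than computed exactly; one standard large-sample surrogate is the Bayesian Information Criterion (BIC), where lower BIC corresponds to higher evidence. The sketch below enumerates every subset of a small, invented dictionary and scores each with BIC on synthetic data:

```python
import itertools
import numpy as np

# Scoring every model built from a small dictionary with the BIC, a
# large-sample approximation to the log marginal likelihood.  The
# dictionary and the true law y = 2x + 3 sin(x) (plus noise) are
# illustrative assumptions.
rng = np.random.default_rng(42)
x = rng.uniform(-3, 3, 200)
y = 2 * x + 3 * np.sin(x) + rng.normal(0, 0.1, x.size)

dictionary = {'x': x, 'x^2': x**2, 'sin(x)': np.sin(x), 'cos(x)': np.cos(x)}

def bic(names):
    """Least-squares fit on the chosen features; lower BIC ~ more evidence."""
    A = np.column_stack([dictionary[nm] for nm in names])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    rss = float(np.sum((y - A @ coef) ** 2))
    n, k = y.size, len(names)
    return n * np.log(rss / n) + k * np.log(n)   # fit term + complexity term

models = [combo for r in range(1, len(dictionary) + 1)
          for combo in itertools.combinations(dictionary, r)]
best = min(models, key=bic)
print(best)   # the two true features should win: ('x', 'sin(x)')
```

Note how the extra cos(x) and x^2 features are rejected: they improve the fit slightly, but not enough to pay their complexity penalty, which is the Bayesian Occam's Razor in miniature.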

Taming the Infinite: A Pragmatic Strategy

The Bayesian method is elegant, but it requires us to enumerate every possible model, which is only feasible for a small dictionary of features. What if the true law involves a complex combination of dozens of primary physical variables (like atomic number, electronegativity, covalent radius, etc.)? Recursively applying operators like +, *, exp, and sqrt can cause the number of candidate features to explode into the billions or trillions. This is a common challenge in fields like materials science.

To tackle this, a powerful and practical framework called Sure Independence Screening and Sparsifying Operator (SISSO) uses a clever two-step strategy.

First, Generate and Screen. Let the machine go wild and generate a colossal feature space. Don't worry about quality yet; just create a huge library of possibilities. Then, apply a fast, cheap filter to this library. This "screening" step typically involves checking the correlation of each candidate feature with the target property. It’s like a casting call: any feature that has at least some relevance gets to audition for the main role. This drastically cuts down the number of candidates from billions to perhaps a few thousand.

Second, Sparsify. Now, with this much more manageable set of promising features, we can bring in the heavy machinery. We use a method designed to find a sparse solution—that is, a linear model that uses the fewest possible features to explain the data. This is often done by solving a regression problem with an ∥w∥_0 constraint, which explicitly seeks a model with a specific number of non-zero coefficients. This final step is like the director making the final casting decision, selecting a small ensemble of actors who work together perfectly to tell the story. This generate-then-sift approach is a remarkably effective way to find a needle of truth in a haystack of complexity.
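A miniature version of this pipeline can be sketched as follows. The primary variables, the single round of operator application, the screening size, and the target law y = p1·p2 + 5·sqrt(p3) are all invented for illustration; a real SISSO run handles billions of features with far more careful machinery:

```python
import itertools
import numpy as np

# A miniature generate-screen-sparsify pipeline in the spirit of SISSO.
# The primary variables, operator set, and target law are illustrative.
rng = np.random.default_rng(7)
n = 300
prim = {'p1': rng.uniform(0.5, 2, n),
        'p2': rng.uniform(0.5, 2, n),
        'p3': rng.uniform(0.5, 2, n)}
y = prim['p1'] * prim['p2'] + 5 * np.sqrt(prim['p3'])

# Generate: one round of operator application grows the feature space.
feats = dict(prim)
for (na, fa), (nb, fb) in itertools.combinations(prim.items(), 2):
    feats[f'{na}*{nb}'] = fa * fb
for na, fa in prim.items():
    feats[f'sqrt({na})'] = np.sqrt(fa)

# Screen: a cheap correlation filter keeps only the promising candidates.
def score(name):
    return abs(np.corrcoef(feats[name], y)[0, 1])
top = sorted(feats, key=score, reverse=True)[:5]

# Sparsify: exhaustive L0 search for the best two-feature linear model.
def rss(names):
    A = np.column_stack([feats[nm] for nm in names])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return float(np.sum((y - A @ coef) ** 2))

best = min(itertools.combinations(top, 2), key=rss)
print(sorted(best))   # should recover ['p1*p2', 'sqrt(p3)']
```

The exhaustive pairwise search in the last step is only feasible because screening already shrank the library; that division of labor is the whole point of the two-step strategy.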

A Word of Caution: On Data and Reality

With these powerful tools at our disposal, it can be tempting to see symbolic regression as an infallible oracle of truth. You feed it data, and it gives you a Law of Nature. But we must end with a crucial word of caution, a lesson about the gap between data and reality.

Imagine a physicist simulating a simple wave that moves at a constant speed, governed by the advection equation u_t + c u_x = 0. To keep the simulation from blowing up, they add a tiny bit of blurring, or smoothing, at each time step. Now, a student, unaware of this numerical trick, takes the simulation data and feeds it into a symbolic regression tool. The tool dutifully analyzes the data and proudly announces its discovery: the data is perfectly described by an advection-diffusion equation, u_t + c u_x = D u_xx.

The tool is not wrong. The diffusion term D u_xx is the mathematical signature of the smoothing filter the physicist used. The algorithm has successfully discovered a law, but it's a law governing the simulation data, not the underlying physical reality. It has found an artifact of the data generation process.
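This identity is easy to verify directly: one pass of a three-point smoothing filter is algebraically the same as one explicit Euler step of the diffusion equation. In the sketch below the grid spacing, time step, and smoothing strength are arbitrary choices:

```python
import numpy as np

# One pass of the 3-point smoothing filter
#     u_i <- (1 - 2*eps) * u_i + eps * (u_{i-1} + u_{i+1})
# equals one explicit Euler step of u_t = D u_xx with D = eps * dx^2 / dt.
rng = np.random.default_rng(3)
u = rng.normal(size=128)                   # any periodic profile
eps, dx, dt = 0.1, 0.01, 1.0
D = eps * dx**2 / dt                       # the "discovered" diffusivity

smoothed = (1 - 2 * eps) * u + eps * (np.roll(u, 1) + np.roll(u, -1))
diffused = u + D * dt / dx**2 * (np.roll(u, 1) - 2 * u + np.roll(u, -1))

print(np.allclose(smoothed, diffused))     # → True
```

The student's "discovered" diffusion coefficient is just the smoothing strength in disguise.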

This is perhaps the most important principle of all. Symbolic regression finds patterns in data. It is a mirror that reflects the information we provide it, including any biases, artifacts, or hidden processes embedded within that information. It does not possess a physicist's intuition or a chemist's judgment. The beautiful equations it uncovers are hypotheses, not gospels. The role of the scientist is more critical than ever: to design clean experiments, to understand the provenance of their data, and to interrogate the discovered laws, testing whether they reflect a deep truth about the world or simply a quirk of our measurements. The quest for the formula is, and will always be, a partnership between the tireless exploration of the machine and the indispensable wisdom of the human mind.

Applications and Interdisciplinary Connections

Now that we have some idea of the gears and levers inside a symbolic regression machine, the most exciting question arises: What can we do with it? If the previous chapter was about the engine, this one is about the journey. We find that this tool is not merely a specialized gadget for one corner of science. Instead, it offers a new perspective on the very act of discovery, a common thread weaving through seemingly disparate fields. It acts as a kind of universal translator, taking the raw, messy language of experimental data and finding its underlying mathematical poetry.

Let us see how this plays out, from the majestic dance of planets in the cosmos to the frantic microscopic ballet of molecules in a flask.

Rediscovering the Harmony of the Spheres

Imagine Johannes Kepler in the early 17th century. He sits surrounded by mountains of astronomical tables, the life's work of Tycho Brahe. For years, he toils, driven by a deep conviction that a simple, elegant mathematical harmony must govern the motion of the planets. He tries circles, ovals, and all manner of relationships, scribbling down calculations, failing, and trying again. Finally, after immense labor, he uncovers his three laws of planetary motion. His third law, in particular, is a gem: the square of a planet's orbital period, P, is proportional to the cube of its semi-major axis, a. Or, as we would write it, P^2 ∝ a^3.

What if we could give Brahe's data to a modern-day Kepler—a symbolic regression algorithm? We would feed it the paired lists of periods and axes, and ask, "What is the law connecting these numbers?" The machine, knowing nothing of gravity or celestial mechanics, would begin its search. It might try adding them, dividing them, taking logarithms, sines, and cosines. In a simplified case, it might notice that if you plot the logarithm of the period against the logarithm of the semi-major axis, the points fall on a near-perfect straight line. And the slope of that line? It would be almost exactly 1.5. A simple translation back from the logarithmic world reveals the law: ln(P) = 1.5 ln(a) + C, which is another way of writing P ∝ a^1.5, or P^2 ∝ a^3.
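We can replay this discovery as a least-squares fit in log space, using the familiar orbital elements of the six classical planets (semi-major axis in astronomical units, period in years):

```python
import numpy as np

# Kepler's third law recovered as a slope in log space, using well-known
# orbital elements for Mercury through Saturn.
a = np.array([0.387, 0.723, 1.000, 1.524, 5.203, 9.537])    # AU
P = np.array([0.241, 0.615, 1.000, 1.881, 11.862, 29.457])  # years

slope, intercept = np.polyfit(np.log(a), np.log(P), 1)
print(round(float(slope), 3))   # → 1.5: ln(P) = 1.5 ln(a) + C, i.e. P^2 ∝ a^3
```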

The true marvel here is not just that the machine can fit a curve to the data. Standard regression can do that. The marvel is that it can discover the form of the equation itself. It tells us that the relationship is a power law, and it finds the exponent. It does in minutes what took a human genius years of painstaking effort. It rediscovers a piece of the universe's source code.

This ability to find hidden relationships extends to even more fundamental concepts. Consider one of the cornerstones of physics: conservation laws. These laws tell us what stays the same even when everything else is changing. In a chaotic collision of billiard balls, the positions and velocities are a mess, but the total momentum—the sum of each ball's mass times its velocity—remains stubbornly constant before and after the collision.

Could an algorithm discover this? Let's try. We can set up an experiment, perhaps a computer simulation, of particles crashing into each other. We don't tell the machine about momentum or energy. We just give it the "before" and "after" velocities and masses for a series of different collisions. We then build a list of candidate quantities for each particle: its momentum m_i v_i, its kinetic energy (1/2) m_i v_i^2, and perhaps other, more complex expressions.

The task for our symbolic regression algorithm is now slightly different. It's not looking for an equation that predicts one variable from another. It's looking for a combination of quantities that does not change from "before" to "after". It searches for a weighted sum of our candidate quantities, Σ c_i φ_i, whose total value is conserved. And what does it find? Given data from collisions where momentum is conserved but kinetic energy is not (like "inelastic" collisions where things stick together), the algorithm will unerringly return a simple rule: the quantity you get by adding up 1×(m1 v1) + 1×(m2 v2) + … is the same before and after. It has discovered, from scratch, the law of conservation of linear momentum. It has found a deep symmetry of nature, not by deriving it from abstract principles, but by observing its consequences in raw data.
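One simple way to automate this is to tabulate how each candidate quantity changes across many collisions and then look for a weight vector in the numerical null space of that table. The sketch below does this for simulated, perfectly inelastic 1-D collisions with just two candidate quantities; the setup is illustrative:

```python
import numpy as np

# Searching for conserved combinations in simulated perfectly inelastic
# 1-D collisions.  For each collision we record how each candidate
# quantity (total momentum, total kinetic energy) changes; a conserved
# combination  c1*phi1 + c2*phi2  is a null vector of that table.
rng = np.random.default_rng(11)
rows = []
for _ in range(50):
    m1, m2 = rng.uniform(1, 5, 2)
    v1, v2 = rng.uniform(-3, 3, 2)
    v_after = (m1 * v1 + m2 * v2) / (m1 + m2)    # bodies stick together
    before = np.array([m1 * v1 + m2 * v2,
                       0.5 * m1 * v1**2 + 0.5 * m2 * v2**2])
    after = np.array([(m1 + m2) * v_after,
                      0.5 * (m1 + m2) * v_after**2])
    rows.append(after - before)

A = np.array(rows)                     # 50 collisions x 2 candidates
_, _, Vt = np.linalg.svd(A)
c = Vt[-1]                             # direction of smallest change
c = c / c[np.argmax(np.abs(c))]        # scale the dominant weight to 1
print(np.round(c, 6))                  # ≈ [1, 0]: momentum is conserved,
                                       #   kinetic energy is not
```

The smallest singular vector of the change matrix picks out the combination that stays (numerically) constant across every collision, which is exactly the discovered conservation law.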

Decoding the Dance of Molecules

Let's now shrink our perspective, from planets and particles down to the world of chemistry. Inside a beaker, a complex network of reactions is taking place. The concentrations of different chemical species, let's call them X, Y, and Z, might oscillate in time, sometimes in a predictable, periodic way, and sometimes in a wild, chaotic fashion. A chemist wants to understand this system, to write down the set of differential equations—the "rate laws"—that govern how fast each species is created or destroyed. This is a formidable task.

Here again, symbolic regression can act as a powerful assistant. We can measure the concentrations [X], [Y], and [Z] over time and then ask the algorithm: "Find an equation for the rate of change of Y, d[Y]/dt, as a function of the concentrations [X], [Y], and [Z]."

The machine whirs away and, after analyzing the data, might propose several candidate models that all seem to fit the observations quite well. This is where the story gets interesting, as it reveals the crucial partnership between the algorithm and the scientist. The machine, guided only by mathematics, might present an equation like d[Y]/dt = k1 [X] − k2 [Y]/[Z]. This equation might fit the data beautifully. But a chemist would raise an eyebrow. Elementary chemical reactions happen when molecules collide. The rate of a reaction, according to the principle of mass-action kinetics, should be proportional to the probability of the necessary reactants meeting. This means the terms in our rate laws should look like products of concentrations, like k [X][Y], corresponding to a collision of an X and a Y molecule. A term like [Y]/[Z] doesn't correspond to a simple physical collision. It's chemically implausible.

The algorithm might also propose another model: d[Y]/dt = k1 [X]^2 − k2 [X][Y]. Now the chemist smiles. This is a language she understands. The term +k1 [X]^2 could represent a reaction where two molecules of X collide to produce a molecule of Y (2X → Y + …). The term −k2 [X][Y] could mean that a molecule of X and a molecule of Y collide and consume Y (X + Y → …). Both terms represent physically plausible bimolecular reactions.
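A stripped-down version of this rate-law search is linear regression over a library of candidate terms, in the spirit of sparse-regression methods such as SINDy. The rate constants, concentration samples, and candidate library below are all synthetic assumptions:

```python
import numpy as np

# Fitting a rate law from a library of candidate terms.  Synthetic data
# are generated from  d[Y]/dt = k1 [X]^2 - k2 [X][Y]  with illustrative
# rate constants; the regression should put weight on exactly those two
# terms and (near) zero on the rest, including the implausible [Y]/[Z].
k1, k2 = 0.7, 0.3
rng = np.random.default_rng(5)
X = rng.uniform(0.1, 2, 100)
Y = rng.uniform(0.1, 2, 100)
Z = rng.uniform(0.1, 2, 100)
dYdt = k1 * X**2 - k2 * X * Y

library = {'[X]': X, '[Y]': Y, '[X]^2': X**2,
           '[X][Y]': X * Y, '[Y][Z]': Y * Z, '[Y]/[Z]': Y / Z}
A = np.column_stack(list(library.values()))
coef, *_ = np.linalg.lstsq(A, dYdt, rcond=None)

fitted = {name: round(float(c), 6) for name, c in zip(library, coef)}
print(fitted)   # weight lands on '[X]^2' (≈ k1) and '[X][Y]' (≈ -k2)
```

With noisy real data the spurious terms would pick up small non-zero weights, and it is precisely the chemist's mass-action reasoning that decides which small terms to discard.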

This example teaches us a profound lesson. Symbolic regression is not an oracle that simply hands us truths. It is a hypothesis generator of extraordinary power. It can explore a vast space of mathematical possibilities, many of which a human would never have the time or imagination to investigate. But its output must be filtered through the lens of human knowledge and physical intuition. The algorithm sees patterns; the scientist sees meaning. The process is a dialogue, a synergy between the unbiased brute-force search of the machine and the deep, principled understanding of the expert.

From the cosmos to the test tube, we see a unifying theme. The universe is governed by mathematical laws, some simple, some complex. For centuries, uncovering these laws has been the exclusive domain of the human mind. Now, we have a new kind of partner in this grand quest. By giving machines the ability to speak the language of mathematics and search for its grammar in the book of nature, we haven't replaced the scientist—we have empowered them. We have built a tool that can help us find the hidden harmonies in the data, leaving us to do what we do best: wonder what they mean.