
In scientific modeling, a fundamental tension exists between the completeness of our physical laws and the computational limits of our machines. While equations like the Navier-Stokes equations can describe fluid motion perfectly, simulating every molecule in the ocean or atmosphere is impossible. This gap forces us to simplify, leading to the "closure problem": how do we account for the effects of small-scale processes that our models cannot see? This article explores a powerful solution: data-driven parameterization, an approach that combines established physical principles with modern machine learning. In the following chapters, we will first dissect the "Principles and Mechanisms," exploring why this technique is necessary and how it can be implemented with physical consistency. Subsequently, the "Applications and Interdisciplinary Connections" chapter will showcase its transformative impact across a vast scientific landscape, from climate prediction to materials design.
To build a data-driven parameterization, we must first understand what a parameterization is and why we need it. At its heart, the problem is one of scale. The laws of physics, like the Navier-Stokes equations for fluid motion, are universal. They describe the grand swirl of a hurricane and the fleeting turbulence of a tiny dust devil with equal impartiality. But in a computer model of the Earth's climate, we can't possibly simulate every dust devil, every wisp of a cloud, or every ripple on the ocean. Our computational grid is a coarse net thrown over reality; a grid cell in a climate model might be 50 kilometers on a side. Anything smaller is "subgrid"—unresolved and unseen.
Imagine you are watching an old movie of a horse-drawn carriage. As the carriage picks up speed, the wagon wheels, with their distinct spokes, begin to do something strange. They seem to slow down, stop, and even spin backward. You know the wheel is spinning forward furiously, but your eyes—or rather, the movie camera's limited frame rate—can't keep up. This illusion is called aliasing. The high-frequency rotation of the spokes is being misinterpreted, or "aliased," as a low-frequency motion.
Our climate models face the very same problem. The "camera" is our model's grid, which only takes snapshots of the atmosphere at discrete points in space and time. The fast, small-scale physical processes—like small turbulent eddies or the rapid updrafts inside a thunderstorm—are the furiously spinning spokes. Because we don't resolve them, they don't simply vanish. Instead, their energy and influence alias onto the large-scale motions we can see. A swarm of small, fast eddies might be misinterpreted by the model as a slow, large-scale drift in the entirely wrong direction.
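The wagon-wheel effect can be reproduced in a few lines of NumPy: a 9 Hz sine sampled at 10 Hz is, sample for sample, indistinguishable from a 1 Hz sine rotating the other way. (The specific frequencies are arbitrary illustrations, chosen so the alias frequency 9 - 10 = -1 Hz is easy to see.)

```python
import numpy as np

fs = 10.0                      # sampling rate of our "camera" (Hz)
t = np.arange(0, 2, 1 / fs)    # two seconds of discrete snapshots

fast = np.sin(2 * np.pi * 9.0 * t)   # a 9 Hz "spinning spoke"
slow = -np.sin(2 * np.pi * 1.0 * t)  # a 1 Hz motion, spinning the other way

# At these sample times, the 9 Hz signal is indistinguishable from the 1 Hz one:
# the high frequency has aliased to 9 - 10 = -1 Hz.
print(np.allclose(fast, slow))
```

The same arithmetic is at work in a model grid: anything oscillating faster than the grid can resolve reappears, falsely, at a resolvable frequency.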
This is the closure problem: the filtered equations that govern the resolved scales contain terms that depend on the unresolved scales. For our model to be accurate, we must find a way to represent the net effect of all this unseen, subgrid activity on the resolved flow. This representation is called a parameterization. It is a "closure" that completes our equations.
How, then, do we build these parameterizations? Historically, scientists have followed two main philosophies, a split that reflects a broader duality in scientific modeling itself.
The first path is mechanistic, or physically based, parameterization. Here, we try to derive a simplified physical theory for the unresolved process. We can't model every single cloud droplet, but perhaps we can write down tractable equations for the bulk properties of the cloud, like the total mass of liquid water and the average number of droplets, and how they evolve due to processes like condensation and collision. This approach has been the bedrock of climate modeling for decades. Its strength is its grounding in first principles, but its weakness is that we must be clever enough to invent a simplified theory that is both accurate and efficient, a task that for some processes, like turbulence, has proven monumentally difficult.
The second path is empirical or data-driven parameterization. The idea here is simple and powerful: if we can't derive the subgrid effect from theory, let's learn it from data. Suppose we run an incredibly expensive, high-resolution simulation that does resolve the tiny eddies. We can use the output of this simulation as our "ground truth". We can then train a machine learning model, such as a neural network, to find the mapping: given the state of the large-scale flow (which our coarse model can see), what is the corresponding net effect from the small-scale flow (which our coarse model cannot)? The neural network becomes a learned surrogate for the complex, unresolved physics.
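As a toy sketch of this workflow, with a made-up one-variable "subgrid tendency" playing the role of the coarse-grained high-resolution truth, and ordinary least squares on nonlinear features standing in for the neural network:

```python
import numpy as np

rng = np.random.default_rng(0)

# Pretend these pairs came from coarse-graining a high-resolution simulation:
# x = resolved large-scale state, y = "true" subgrid tendency (hypothetical closure).
x = rng.uniform(-1, 1, size=500)
y = np.tanh(3 * x) - x + 0.05 * rng.standard_normal(500)

# A tiny stand-in for a neural network: linear regression on fixed nonlinear features.
features = np.column_stack([x, x**3, np.tanh(3 * x)])
coeffs, *_ = np.linalg.lstsq(features, y, rcond=None)

def subgrid_tendency(x_new):
    """Learned surrogate: coarse state in, estimated subgrid tendency out."""
    f = np.column_stack([x_new, x_new**3, np.tanh(3 * x_new)])
    return f @ coeffs

rmse = np.sqrt(np.mean((subgrid_tendency(x) - y) ** 2))
print(f"offline RMSE: {rmse:.3f}")
```

The fitted error sits near the noise floor of the synthetic data; in practice the inputs and outputs would be full model columns and the regressor a deep network, but the mapping being learned has exactly this shape.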
What exactly must this machine learning model learn? Is it just a simple, instantaneous function, like a lookup table that says "if the large-scale temperature is T and the humidity is q, the subgrid heating must be Q"? The reality is far more subtle and beautiful.
A profound theoretical framework known as the Mori-Zwanzig formalism gives us a glimpse into the true nature of the problem. It tells us that when we average over or "project out" the fast, unresolved variables of a system, their influence on the slow, resolved variables doesn't just disappear. It is transmuted into three distinct forms:
A Markovian Term: This is the instantaneous part, the component of the subgrid effect that depends only on the current state of the resolved variables. This is the simple relationship a basic feed-forward neural network might learn.
A Memory Term: The subgrid world has a memory. A turbulent eddy that is generated now might take some time to decay and transfer its energy to the large-scale flow. The effect is not instantaneous: the subgrid forcing felt by the resolved flow depends on the history of the resolved state, not just its present value. To capture this, a data-driven parameterization might need an architecture that can process sequences and remember past states, like a Recurrent Neural Network (RNN).
A Noise Term: Even if the underlying laws of physics are perfectly deterministic, the unresolved scales are chaotic. From the perspective of the resolved flow, their influence appears as a series of random "kicks". This isn't external noise or measurement error; it's an intrinsic, unavoidable consequence of model reduction. An ideal parameterization might therefore need to be stochastic, providing not just a single best-guess tendency but a distribution of possible tendencies.
So, the task for our data-driven model is not merely to fit a static curve, but to learn a complex, dynamic process that may possess both memory and a stochastic character, embodying the ghosts of the departed scales.
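A minimal caricature of such a process, with assumed coefficients, can combine all three Mori-Zwanzig ingredients in a few lines: a Markovian term depending on the current resolved state, an autoregressive memory, and random kicks.

```python
import numpy as np

rng = np.random.default_rng(1)

def markovian(x):
    # Markovian term: depends only on the current resolved state x.
    return -0.5 * x

# Hypothetical reduced model: the subgrid tendency s relaxes toward the
# Markovian term (memory, via the AR(1) coefficient a) with stochastic kicks.
a, sigma = 0.8, 0.1          # memory coefficient and noise amplitude (assumed)
x, s = 1.0, 0.0
for _ in range(1000):
    s = a * s + (1 - a) * markovian(x) + sigma * rng.standard_normal()
    x = x + 0.01 * (-x + s)  # resolved dynamics forced by the subgrid term
print(f"x = {x:.3f}")
```

Even this toy version shows the key behavior: the resolved variable settles into a statistically steady state that fluctuates around equilibrium, driven by the stochastic, history-dependent subgrid forcing rather than following a deterministic curve.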
A neural network trained on data alone is a powerful pattern-matcher, but it is also a blissful idiot when it comes to the fundamental laws of nature. It has no innate concept of the conservation of mass or energy. Left to its own devices, a learned parameterization could introduce spurious sources of heat or create water from nothing, causing a long-term climate simulation to drift into utter nonsense.
This is where the most exciting frontier in scientific machine learning lies: building physics directly into the architecture of our models. This is the idea of hybrid physics-machine learning modeling.
A wonderfully elegant way to enforce a conservation law is to design the neural network's output not as the final quantity of interest, but as an intermediate object from which the final quantity can be derived in a physically consistent way. Consider the conservation of mass in a fluid. The total mass changes only due to fluxes across the boundary. A data-driven correction term, S, that adds or removes mass arbitrarily within the domain will violate this principle. However, instead of learning the source term S directly, we can train the network to learn a corrective flux, F. We then define our source term as the divergence of this flux: S = ∇ · F. Now, the magic happens. A fundamental theorem of calculus, the Divergence Theorem, states that the integral of a divergence over a volume is equal to the flux across its boundary. If we enforce the condition that the corrective flux is zero at the domain's boundaries, the total mass is guaranteed to be conserved, regardless of what the neural network learns for the flux inside.
This same philosophy can be applied to other laws, such as ensuring the model respects the second law of thermodynamics by enforcing detailed balance. By encoding these principles in the structure of the model, we are not just asking the network to learn physics; we are forcing it to obey.
Having designed a clever, physics-informed neural network, how do we "plug it in" to the larger host model? The network's output must conform to the model's internal bookkeeping, which distinguishes between two types of variables.
Diagnostic variables are computed "on the fly" at each time step. For example, a radiation parameterization typically calculates the radiative heating rate based on the instantaneous state of the atmosphere. The heating rate has no memory of its own from one moment to the next. An ML model could learn to emulate this instantaneous calculation.
Prognostic variables have a memory. They are state variables that are integrated forward in time. A turbulence scheme, for instance, might predict the evolution of the turbulent kinetic energy, k. The value of k at the next time step depends on its value at the current time step. A learned parameterization could be embedded within such a scheme, predicting, for example, the production or dissipation terms in the k equation.
Furthermore, the mapping we need to learn might be more complex than a simple vector-to-vector function. Often, we need to learn an operator—a mapping from a function to another function. For example, the radiative heating at a certain altitude depends on the entire vertical profile of temperature and water vapor above and below it. We need a parameterization that takes a whole function (the temperature profile) as input and produces another whole function (the heating rate profile) as output. Architectures like the Deep Operator Network (DeepONet) are explicitly designed for this task. They use a "branch" network to "read" the input function and a "trunk" network to "write" the output function, elegantly capturing these nonlocal interactions.
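The branch/trunk structure can be sketched with untrained random weights, purely to show the shapes involved. (The sensor count, basis size, and tanh activations below are illustrative choices, not the exact DeepONet configuration.)

```python
import numpy as np

rng = np.random.default_rng(3)
m, p = 32, 8   # number of input-function sensors, latent basis size (assumed)

# Random (untrained) weights, just to show the structure of the operator map.
Wb = rng.standard_normal((p, m)) / np.sqrt(m)   # branch: reads the input profile
Wt = rng.standard_normal((p, 1))                # trunk: reads the output location

def deeponet(u_sensors, y_points):
    """Map a whole input profile to output values at arbitrary query points."""
    b = np.tanh(Wb @ u_sensors)            # (p,)  branch coefficients
    t = np.tanh(Wt @ y_points[None, :])    # (p,n) trunk basis evaluated at y
    return b @ t                           # output function sampled at y_points

z = np.linspace(0, 1, m)
temperature_profile = np.sin(np.pi * z)    # an input *function*, sampled at sensors
heating = deeponet(temperature_profile, np.linspace(0, 1, 50))
print(heating.shape)
```

The essential point is visible in the last line: the input is an entire profile, and the output can be queried at any number of altitudes, because the network represents a function-to-function map rather than a fixed vector-to-vector one.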
We have a clever model, informed by physics, that has learned from vast amounts of data. But how do we know it actually works? How do we trust it not to make our climate simulation explode? The answer is a grueling, multi-stage validation process that we can think of as a gauntlet.
The Offline Test: First, we test the learned parameterization in isolation. Using a held-out test dataset (data the model has never seen during training), we check if it can accurately predict the subgrid effects. This is like checking a student's homework. It's a necessary first step, but passing it is no guarantee of success in the real world.
The Partially Coupled Test: Next, we embed the parameterization into a simplified, controlled version of the full model. For example, we might test a new cloud parameterization in a model of a single atmospheric column, where the large-scale winds are prescribed. This is like a quiz or a lab experiment. It allows us to see how the parameterization behaves when it starts to interact with other physical components, and to diagnose instabilities before they get masked by the complexity of the full system.
The Online Test: Finally, the ultimate trial. The parameterization is placed inside the complete, chaotic, fully coupled Earth system model, and we hit "run". We let the simulation evolve for years or decades of model time. Does the simulation remain stable? Does it conserve energy and mass over the long haul? Most importantly, does the model's "climate"—its average state, its variability, its extreme events—look like the real world? This is the final exam.
Many promising data-driven parameterizations that look perfect offline fail spectacularly online. The path from a low offline error to a stable, realistic online climate simulation is a treacherous one. Only a model that can run the entire gauntlet and emerge intact can be considered robust and trustworthy, ready to help us tackle some of the most challenging scientific questions of our time.
Now that we have explored the principles of data-driven parameterization, let us embark on a journey to see this powerful idea in action. We will find that it is not some esoteric technique confined to a single laboratory, but a universal tool of the modern scientist. Its fingerprints are everywhere: from the quantum dance of electrons in a metal to the grand, swirling storms of our planet; from the design of life-saving medicines to the control of complex, autonomous machines. This journey will reveal a remarkable unity, showing how a single, elegant concept provides a new lens through which to understand and engineer the world around us. It is a way to build bridges between the physics we can describe perfectly on paper and the complex reality we can only observe.
Perhaps no challenge facing science today is as vast as understanding and predicting Earth's climate. The fundamental laws governing the atmosphere and oceans—the conservation of mass, momentum, and energy—are well-known. We can write them down as beautiful, compact partial differential equations. The trouble is, they are impossibly complex to solve in their entirety. A computer model cannot track every single water droplet in a developing thunderhead or every tiny eddy in an ocean current.
Instead, we must simplify. We lay a coarse grid over the planet and solve for the average properties of the air and water within each large grid cell. But this act of averaging, of filtering, creates a new problem. The small-scale phenomena we've averaged away—the turbulent gusts, the convective plumes, the formation of clouds—still have a huge effect on the large-scale climate. These effects appear as new terms in our averaged equations, the "unresolved tendencies," for which we have no perfect laws. They represent the uncertain, the messy, the "subgrid-scale" physics.
This is where data-driven parameterization makes a dramatic entrance. The guiding philosophy is what we might call a "hybrid model": don't throw away the physics you know for certain! The parts of the equations that represent the fundamental conservation laws, the resolved dynamics we can trust, are kept. They form the rigid, reliable backbone of the model. The uncertain parts—the parameterizations for cloud formation, turbulence, and radiation—become the target for learning. Using data from ultra-high-resolution simulations that can resolve these processes in a small box, or from satellite observations, scientists train machine learning models to map the coarse-grained state of the atmosphere to the correct subgrid-scale tendency.
This is a beautiful marriage of principle and pragmatism. We use our knowledge of physics to structure the problem, and we use data to fill in the details where our knowledge is incomplete. The learned component isn't a wild guess; it is often constrained to obey physical principles, like ensuring that energy and water are conserved over the entire column of the atmosphere.
The sophistication doesn't stop there. Data-driven methods can also be used to accelerate the known parts of the physics. A technique called a Reduced-Order Model (ROM) can take the complex equations for the resolved fluid dynamics and project them onto a small number of dominant patterns, or modes, which are themselves learned from data. A well-constructed ROM, for instance, can be designed to perfectly preserve invariants like kinetic energy, just as the original equations do. So we see a two-pronged attack: Statistical Emulators learn the unknown physics of parameterizations, while Reduced-Order Models provide a fast, physically-consistent approximation for the known dynamics. Both are essential tools for building the next generation of faster and more accurate climate models.
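One structural reason a ROM can inherit such invariants: Galerkin projection onto an orthonormal basis preserves skew-symmetry, the algebraic property that makes kinetic energy an invariant of the full linear model. A minimal check, with a random system and a random orthonormal basis standing in for real snapshots and POD modes:

```python
import numpy as np

rng = np.random.default_rng(4)
n, r = 50, 5

# Full model: dx/dt = A x with A skew-symmetric, so the "energy" |x|^2 is conserved.
M = rng.standard_normal((n, n))
A = M - M.T

# Stand-in "POD" basis: orthonormal columns from a QR factorization of snapshots.
snapshots = rng.standard_normal((n, 20))
Phi, _ = np.linalg.qr(snapshots)
Phi = Phi[:, :r]

# Galerkin projection inherits the skew-symmetry, hence the energy invariant:
Ar = Phi.T @ A @ Phi
print(np.max(np.abs(Ar + Ar.T)))   # ~0: the reduced model conserves energy too
```

Since (Φᵀ A Φ)ᵀ = Φᵀ Aᵀ Φ = -Φᵀ A Φ, the reduced operator is skew-symmetric whenever the full one is; for real ROMs the same reasoning is applied to the quadratic terms of the fluid equations.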
Let's shrink our view from the scale of a planet to the scale of an atom. Here, too, we find that our most fundamental theories become computationally intractable when applied to the real world. According to quantum mechanics, the properties of a material emerge from the collective behavior of its electrons. But solving the Schrödinger equation for the trillions of interacting electrons in even a tiny sliver of metal is simply impossible.
Once again, physicists turn to data-driven parameterization. One of the cornerstones of modern computational physics is Density Functional Theory (DFT), which reformulates the problem in a clever way. All the impossibly complex quantum interactions are bundled into a single term, the "exchange-correlation energy." If we knew the exact formula for this energy, we could calculate the properties of any material perfectly. But we don't. So, what do we do? We generate data. Using extremely expensive, high-accuracy methods like Quantum Monte Carlo on a very simple system (a uniform gas of electrons), we can compute this energy for different densities. This data then becomes the ground truth for parameterizing a simpler, analytical function. A famous example is the PZ81 functional, which is a mathematical formula carefully fitted to this numerical data, while also being forced to obey known theoretical constraints at very high and very low densities. This parameterized function, born from data, is now used in countless simulations to predict the properties of new materials.
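For the unpolarized uniform electron gas, the PZ81-style fit can be sketched as follows; the coefficient values are the commonly quoted ones and are included for illustration rather than as an authoritative reference.

```python
import numpy as np

def eps_c(rs):
    """PZ81-style correlation energy per electron (hartree) for the unpolarized
    uniform electron gas, as a function of the density parameter r_s.
    Coefficients are the commonly quoted PZ81 values; treat as illustrative."""
    if rs >= 1.0:
        # Low density: analytic form fitted to Quantum Monte Carlo data.
        return -0.1423 / (1.0 + 1.0529 * np.sqrt(rs) + 0.3334 * rs)
    # High density: form matching the known analytic expansion.
    return 0.0311 * np.log(rs) - 0.048 + 0.0020 * rs * np.log(rs) - 0.0116 * rs

# The two branches were fitted so they join smoothly near r_s = 1:
print(eps_c(0.999), eps_c(1.001))
```

Note the division of labor that mirrors the whole field: the functional forms come from theory (known limits and constraints), while the coefficients come from data.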
Now let's zoom out to the scale we can see and touch—the world of continuum mechanics. Imagine designing a new lightweight composite material for an airplane wing. Its overall strength and stiffness emerge from the intricate arrangement of its microscopic fibers and matrix. To simulate how the wing bends, we could try to model every single fiber, but this would be like trying to model every water molecule in a hurricane. It's computationally hopeless.
The solution is multiscale modeling, powered by data-driven surrogates. Instead of modeling the whole wing at the micro-level, we model just one tiny, "Representative Volume Element" (RVE). We subject this RVE to a variety of stretches and shears in a computer simulation and record the resulting stress. This process generates a dataset of stress-strain pairs. We then train a data-driven model—a surrogate—that learns this relationship. This learned model becomes the effective "constitutive law" for the material, which can then be used in the large-scale simulation of the entire wing.
But, and this is a crucial point, we cannot use just any black-box regression. The laws of physics impose strict rules on any material's constitutive law. For an elastic material, the stiffness tensor that relates stress and strain has beautiful internal symmetries. Furthermore, for the material to be stable, the tensor must be "positive-definite," a property that ensures the material stores energy when deformed. A successful data-driven model must have these physical laws built into its very structure, for example, by using a specific mathematical parameterization that automatically guarantees symmetry and positive-definiteness.
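A compact illustration of both steps, with a hypothetical 3x3 stiffness playing the role of the RVE ground truth: fit the tensor from synthetic stress-strain pairs, then store it through its Cholesky factor, so that the reconstructed stiffness is symmetric positive-semidefinite by construction no matter what values the factor takes.

```python
import numpy as np

rng = np.random.default_rng(5)

# Step 1: "virtual experiments" on an RVE. A hypothetical 3x3 stiffness
# stands in for the expensive micro-simulation's ground truth.
L_true = np.tril(rng.uniform(0.5, 2.0, (3, 3)))
C_true = L_true @ L_true.T
strains = rng.standard_normal((200, 3))
stresses = strains @ C_true.T + 0.01 * rng.standard_normal((200, 3))

# Step 2: fit a stiffness tensor from the stress-strain pairs (least squares),
# then read off its Cholesky factor -- the quantity we would actually store.
C_fit, *_ = np.linalg.lstsq(strains, stresses, rcond=None)
C_fit = 0.5 * (C_fit + C_fit.T)          # symmetrize
L_fit = np.linalg.cholesky(C_fit)        # succeeds only if positive-definite

# Any stiffness rebuilt as L L^T is symmetric positive-semidefinite by construction:
C_model = L_fit @ L_fit.T
print(np.min(np.linalg.eigvalsh(C_model)) > 0)
```

The design choice is the point: by learning the factor L rather than C itself, symmetry and stability are guaranteed properties of the model family, not hoped-for outcomes of training.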
The frontier of this field is even more ambitious. Rather than just fitting a simple function, researchers are now building the constitutive law itself as a neural network. It might seem that this would discard physics entirely, but the opposite is true. By carefully designing the network's architecture and constraining its parameters (for instance, by requiring its weights to be positive), one can construct a model that is guaranteed to obey the Second Law of Thermodynamics, ensuring that it never predicts a material will do something physically impossible, like spontaneously generating energy. This is the ultimate goal: not just models that fit data, but learning machines that think like physicists.
The thread of data-driven parameterization runs just as deeply through chemistry and biology. Consider the challenge of designing a new drug. A drug molecule works by fitting into a specific pocket on a target protein, like a key into a lock. A central tool in this process is "molecular docking," a simulation that predicts the best "pose," or orientation, of the drug in the protein's pocket and estimates the strength of their binding. The heart of this simulation is the "scoring function," which calculates the energy of any given pose.
What is fascinating here is that the field has evolved a whole spectrum of data-driven philosophies for creating these scoring functions:
Physics-based functions sum explicit interaction energies, such as electrostatics and van der Waals terms, derived from molecular force fields.
Empirical functions keep a small set of physically motivated terms but regress their weights against measured binding affinities.
Knowledge-based functions convert the statistics of atom-atom contacts observed in databases of solved protein-ligand structures into effective potentials.
Machine-learned functions go further still, training flexible models directly on large structural and affinity datasets.
This single application provides a beautiful microcosm of the entire field, showcasing the trade-offs between physical rigor, data dependency, and the ability to generalize to new, unseen molecules.
This same logic applies to challenges in environmental science. To predict the fate of a heavy metal pollutant like zinc in the soil, we need to know how it partitions between the water and the soil particles. We can go into the lab and measure this, generating data points that relate the concentration in the water, C, to the amount adsorbed on the soil, q. We can then fit a simple, empirical law, like the Freundlich isotherm q = K_F C^(1/n), to this data. Once the parameters K_F and n are learned, this simple model becomes a predictive tool, allowing us to estimate how much pollutant will be immobilized under different conditions, guiding efforts in remediation and environmental protection.
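A sketch of the fitting step: the Freundlich isotherm q = K_F · C^(1/n) is linear in log-log space, so a straight-line fit recovers both parameters. The data here are synthetic, generated from a known isotherm so the recovery can be verified.

```python
import numpy as np

# Hypothetical batch-experiment data: aqueous concentration C versus amount
# adsorbed q, generated from a known Freundlich law for demonstration.
C = np.array([0.5, 1.0, 2.0, 5.0, 10.0, 20.0])
K_true, n_true = 12.0, 2.5
q = K_true * C ** (1.0 / n_true)

# log q = log K_F + (1/n) log C  =>  a straight-line fit in log-log space.
slope, intercept = np.polyfit(np.log(C), np.log(q), 1)
K_fit, n_fit = np.exp(intercept), 1.0 / slope
print(f"K_F = {K_fit:.2f}, n = {n_fit:.2f}")
```

With real laboratory data the points scatter and the fit becomes a genuine regression, but the procedure is identical.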
So far, we have seen how data-driven parameterization helps us build faster and more accurate models of complex systems. But in fields like engineering and control theory, we often need more than just a good prediction—we need a guarantee. Can we trust a control system built on a learned model? Can we prove it will be stable?
Even in seemingly simple problems, data-driven thinking provides new insights. When modeling a real fluid for an engineering application, we might have two ways to parameterize an equation of state. One method forces the model to be perfectly accurate at the fluid's critical point, while another uses data from a wide range of temperatures to get a better overall fit, even if it's slightly off at the critical point. The data-driven approach often reveals that sacrificing perfection at a single point to achieve better global performance yields a more robust and useful model in practice. This is a profound lesson about the nature of modeling itself.
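The trade-off is easy to demonstrate on a toy curve: a quadratic fitted globally by least squares can never have a larger global error than one constrained to pass exactly through a chosen point. (The "critical point" below is an arbitrary stand-in, and the exponential curve a stand-in for measured fluid properties.)

```python
import numpy as np

x = np.linspace(0.0, 2.0, 21)
y = np.exp(x)                     # stand-in for measured fluid-property data
xc, yc = 1.0, np.exp(1.0)         # the "critical point" we might choose to pin

V = np.vander(x, 3)               # quadratic basis: columns [x^2, x, 1]

# (a) unconstrained global least-squares fit
c_free, *_ = np.linalg.lstsq(V, y, rcond=None)

# (b) constrained fit: eliminate one coefficient so the curve passes through
#     (xc, yc) exactly:  y - yc = a (x^2 - xc^2) + b (x - xc)
B = np.column_stack([x**2 - xc**2, x - xc])
ab, *_ = np.linalg.lstsq(B, y - yc, rcond=None)
c_pin = np.array([ab[0], ab[1], yc - ab[0] * xc**2 - ab[1] * xc])

rmse = lambda c: np.sqrt(np.mean((V @ c - y) ** 2))
print(rmse(c_free) <= rmse(c_pin))   # the global fit never loses on global error
```

The constrained model is exact at one point but pays for it everywhere else; whether that is the right trade depends on how the model will be used, which is exactly the judgment the text describes.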
The truly futuristic application, however, lies in using data-driven methods not just to model a system, but to prove things about it. Imagine we have an unknown dynamical system—perhaps a robot or a chemical process—and we've collected data on how its state evolves over time. We can first use this data to learn a polynomial model of its dynamics. Then comes the brilliant part: we can formulate a search for a proof of stability as an optimization problem. In control theory, a Lyapunov function is a mathematical object whose existence guarantees a system is stable. Using a powerful technique called Sum-of-Squares (SOS) programming, we can search for the parameters of a polynomial that satisfies the conditions of a Lyapunov function for our data-driven model.
Think about what this means. We are using data-driven optimization not just to parameterize the system's behavior, but to parameterize a certificate of its stability. This creates a powerful, formal bridge between machine learning and control theory, between empirical observation and rigorous proof. It shows that the journey that began with fitting simple curves can lead us to the very frontier of building machines we can provably trust.
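In the linear special case, the SOS search collapses to solving a Lyapunov matrix equation, which makes for a compact runnable sketch. (The system, data, and noise level below are invented for illustration; real SOS programming handles polynomial models and semidefinite constraints.)

```python
import numpy as np
from scipy.linalg import solve_continuous_lyapunov

rng = np.random.default_rng(6)

# Unknown stable system (hypothetical): we only observe state/derivative samples.
A_true = np.array([[-1.0, 2.0], [-2.0, -1.0]])
X = rng.standard_normal((200, 2))
dX = X @ A_true.T + 0.01 * rng.standard_normal((200, 2))

# Step 1: learn a (here linear) model of the dynamics from data.
A_fit, *_ = np.linalg.lstsq(X, dX, rcond=None)
A_fit = A_fit.T

# Step 2: search for a quadratic Lyapunov function V(x) = x' P x for the
# learned model by solving A' P + P A = -I, the linear analogue of SOS.
P = solve_continuous_lyapunov(A_fit.T, -np.eye(2))

# P positive-definite => V decreases along trajectories: a stability certificate.
print(np.all(np.linalg.eigvalsh(P) > 0))
```

The output matrix P is itself a parameterized object learned from the data-driven model, which is the point of the passage: the optimization produces not a prediction but a proof object.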
From the heart of the atom to the edge of the climate, from designing drugs to verifying robots, the principle remains the same. Data-driven parameterization is not the act of discarding physics for black boxes. It is the art and science of a new, powerful synergy—using data as a lantern to illuminate the parts of our theories that are dark, incomplete, or too complex to navigate on our own, allowing us to build models that are at once grounded in physical law and sharpened by empirical truth.