Popular Science

Compositional Reasoning
Key Takeaways
  • The constant-sum constraint inherent in compositional data (parts of a whole) creates spurious correlations and misrepresents changes in individual components.
  • Pioneered by John Aitchison, log-ratio analysis provides the correct mathematical framework by focusing on the ratios between components, which are invariant to scale.
  • The Centered Log-Ratio (CLR) transform is a powerful technique that standardizes data by comparing each component to the geometric mean of the entire composition.
  • Compositional reasoning is a fundamental principle for both analysis and design, with critical applications in diverse fields from microbiome analysis to synthetic biology.

Introduction

Data representing parts of a whole—from the proportion of genes expressed in a cell to the allocation of a family's budget—are ubiquitous in science and everyday life. However, analyzing this "compositional data" presents a profound challenge. Because the parts must sum to a fixed total, a change in one component forces an artificial change in others, leading to spurious correlations and fundamentally flawed conclusions. This article confronts this statistical pitfall head-on by introducing the principles of compositional reasoning. In the first section, "Principles and Mechanisms," we will delve into the mathematical illusions created by the constant-sum constraint and explore the elegant log-ratio framework developed to overcome them. Subsequently, "Applications and Interdisciplinary Connections" will demonstrate how this powerful mode of thinking is not just a statistical correction but a foundational concept that provides clarity and enables progress across diverse fields, from microbiology and materials science to the engineering of complex biological and cyber-physical systems.

Principles and Mechanisms

A Tale of Two Budgets: The Tyranny of the Whole

Imagine you are an economist studying a family's spending habits. You don't have their bank statements, so you can't see the absolute dollar amounts. All you have is a pie chart of their monthly expenditures: say, 30% on housing, 20% on food, 10% on entertainment, and so on. Now, suppose the next year you get a new pie chart. Housing is now 25%, food is 15%, but entertainment has ballooned to 20%. What would you conclude? A naive interpretation would be that the family has started to value food and housing less, and entertainment more. You might even say their interest in food has "decreased."

But what if I told you that during that year, the family's total income doubled? They might have kept their housing and food spending exactly the same, or even increased it slightly. But with all the new disposable income, they decided to take lavish vacations, causing their entertainment spending to increase tenfold. In absolute terms, their spending on everything we care about either stayed the same or went up. But because the whole budget grew, and one part grew disproportionately, the proportions, or relative shares, of the other parts inevitably shrank. Your conclusion that they lost interest in food was not just wrong; it was the opposite of the truth.

This is the central paradox of compositional data. These are data that represent parts of a whole, where the only information we have is the relative proportions. We find them everywhere in science. In microbiology, sequencing the DNA from a gut sample doesn't tell us the total number of bacteria, only the relative abundance of each species. In genomics, standard methods for measuring gene activity give us a vector of Transcripts Per Million (TPM), which, by definition, sums to the same fixed total for every cell or tissue sample we measure.

In all these cases, we are stuck with the pie chart, not the bank statement. The data are bound by a closure constraint or constant-sum constraint: all the parts must add up to 1 (or 100%, or $10^6$ TPM). This simple constraint is a tyrant. It forces the parts into a mathematical relationship that can be profoundly misleading, creating illusions that look like real science.

The Spurious Correlation Machine

The tyranny of the constant sum does more than just mislead our interpretation of "up" or "down." It actively manufactures false relationships out of thin air. It is a powerful machine for generating ​​spurious correlations​​.

Let's go back to our gene expression data. Imagine we are studying a group of cells. In these cells, the absolute expression of Gene A and Gene B is completely independent and constant. They have nothing to do with each other. However, there is a third gene, Gene C, which is highly active and its expression level varies wildly from cell to cell. When we process our data, we normalize it to TPM, forcing the total expression of all genes in each cell to sum to one million.

What happens? In a cell where Gene C is extremely active, it takes up a huge slice of the pie. To make the total sum to one million, the slices for Gene A and Gene B must become smaller. In a cell where Gene C is less active, the slices for A and B have more room and become larger. If we now plot the expression of Gene A against Gene B across all our cells, what will we see? We'll see a beautiful, positive correlation! When A is high, B is high; when A is low, B is low. We might be tempted to publish a paper on the exciting co-regulation of A and B. It would be a complete fiction, an artifact of Gene C's variability and the constant-sum constraint. At the same time, we would find a strong negative correlation between Gene A and Gene C, even if none exists biologically.

This isn't just a hand-wavy argument; it's a mathematical certainty. For data that arise from a sampling process like sequencing, the covariance between the counts of any two distinct components, $x_i$ and $x_j$, is intrinsically negative: $\mathrm{Cov}(x_i, x_j) = -L p_i p_j$, where $L$ is the total number of counts and $p_i$ and $p_j$ are the true proportions. The constraint of the whole forces every part to be, in a sense, in competition with every other part.
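This spurious-correlation machine is easy to run for yourself. The sketch below (with entirely made-up expression levels) simulates two independent, nearly constant genes alongside one wildly varying gene, then applies TPM-style closure:

```python
import numpy as np

rng = np.random.default_rng(0)
n_cells = 1000

# Absolute expression: A and B are independent and nearly constant,
# while C varies wildly from cell to cell (all values are invented).
a = rng.normal(100, 5, n_cells)
b = rng.normal(100, 5, n_cells)
c = rng.lognormal(mean=6, sigma=1, size=n_cells)

# Closure: normalize each cell so its total is one million (TPM-style).
total = a + b + c
a_tpm, b_tpm, c_tpm = (1e6 * g / total for g in (a, b, c))

r_abs = np.corrcoef(a, b)[0, 1]         # absolute A vs B: near zero
r_ab = np.corrcoef(a_tpm, b_tpm)[0, 1]  # closed A vs B: strongly positive
r_ac = np.corrcoef(a_tpm, c_tpm)[0, 1]  # closed A vs C: strongly negative
print(f"{r_abs:+.2f} {r_ab:+.2f} {r_ac:+.2f}")
```

Neither correlation among the closed values reflects any real biology; both are artifacts of Gene C's variability and the fixed total.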

The Freedom of Ratios: A New Geometry

If absolute values are lost to us and relative abundances are treacherous, are we doomed? For a long time, it seemed so. Scientists would try to get around the problem with ad-hoc corrections, but the core issue remained. The breakthrough came from an unlikely field: geology. In the 1980s, a Scottish mathematical geologist named John Aitchison realized that we were asking the wrong questions and, in fact, using the wrong algebra.

Aitchison's profound insight was this: in a composition, the fundamental, trustworthy information is not in the values of the components themselves, but in the ​​ratios​​ between them.

Let's return to the family budget. Instead of the percentage spent on food, what if we looked at the ratio of "dollars spent on housing" to "dollars spent on food"? This quantity tells us something about the family's priorities. If their income doubles and they spend twice as much on both housing and food, this ratio remains unchanged. It is invariant to a change in the overall scale, which is exactly the property we need.

The natural language for dealing with ratios is the logarithm, because it turns multiplication and division into addition and subtraction. The log of a ratio, $\ln(A/B)$, is simply $\ln(A) - \ln(B)$. This simple mathematical trick is the key to unlocking the geometry of compositions. By taking log-ratios, we can transport the data from the strange, constrained world of the simplex (the geometric space where proportions live) into the familiar, unconstrained Euclidean space of real numbers, where we can use all the powerful tools of standard statistics like correlation, regression, and PCA without fear.

This is not just a convenient trick; it is the only way to operate that is consistent with the nature of compositional data. The core axioms of a sensible analysis—that our conclusions shouldn't change if we measure in "reads per million" versus "reads per billion" (​​scale invariance​​) or if we decide to ignore one of the components (​​subcompositional coherence​​)—uniquely force us into a world of log-ratios.
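Both axioms can be checked numerically. In this sketch (with arbitrary simulated abundances), dropping a component changes the correlation between two raw proportions, while their log-ratio is untouched:

```python
import numpy as np

rng = np.random.default_rng(1)
# Absolute abundances of four independent components across 2000 samples.
x = rng.lognormal(mean=0.0, sigma=1.0, size=(2000, 4))

def close(m):
    """Closure: rescale each row to sum to 1."""
    return m / m.sum(axis=1, keepdims=True)

full = close(x)        # proportions over all four parts
sub = close(x[:, :3])  # proportions after ignoring part 4 (a subcomposition)

# Raw proportions are not subcompositionally coherent: the correlation
# between parts 1 and 2 changes when part 4 is dropped.
r_full = np.corrcoef(full[:, 0], full[:, 1])[0, 1]
r_sub = np.corrcoef(sub[:, 0], sub[:, 1])[0, 1]

# The log-ratio of parts 1 and 2 is identical in both representations.
lr_full = np.log(full[:, 0] / full[:, 1])
lr_sub = np.log(sub[:, 0] / sub[:, 1])
assert np.allclose(lr_full, lr_sub)
```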

Finding the Center: The Centered Log-Ratio

So, we must analyze ratios. But which ratios? For $D$ genes, there are $D(D-1)/2$ possible pairwise ratios, a bewildering number. A more elegant solution is to compare each component to a common reference. A natural choice for this reference is the "center" of the composition.

But what is the center? For familiar, real-numbered data, we would use the arithmetic mean (add everything up and divide by the count). But for compositional data, where the essential operations are multiplicative, the natural center is the geometric mean. The geometric mean, $g(\mathbf{x}) = \left(\prod_{i=1}^{D} x_i\right)^{1/D}$, is the quintessential multiplicative average.

This leads us to a beautiful and powerful transformation: the centered log-ratio (CLR). The CLR value for a component $x_i$ is simply the logarithm of its ratio to the geometric mean of the entire composition:

$$\mathrm{clr}(x_i) = \ln\!\left(\frac{x_i}{g(\mathbf{x})}\right) = \ln(x_i) - \frac{1}{D}\sum_{j=1}^{D} \ln(x_j)$$

What does this value represent? It tells us whether a component's abundance is high or low relative to the typical abundance of all components in that specific sample. It is an internally standardized measure. The most crucial property of the CLR transform is that it is scale-invariant. If you take a sample and multiply all the raw counts by some factor $c$ (say, by sequencing twice as deep), the geometric mean also gets multiplied by $c$, and this factor cancels out perfectly in the ratio. The CLR values remain unchanged. This gives us a stable basis for comparing across samples that may have been measured with different efficiencies.
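In code, the CLR transform is one line, since dividing by the geometric mean is the same as subtracting the mean of the logs. A minimal sketch (not any particular library's API), with illustrative counts:

```python
import numpy as np

def clr(x):
    """Centered log-ratio: log of each part relative to the geometric mean."""
    logs = np.log(np.asarray(x, dtype=float))
    return logs - logs.mean()

counts = np.array([120.0, 30.0, 600.0, 250.0])  # illustrative raw counts

# Scale invariance: multiplying every part by a constant c leaves the
# CLR values unchanged, because c cancels against the geometric mean.
assert np.allclose(clr(counts), clr(3.7 * counts))

# And the CLR values of any composition sum to zero by construction.
assert abs(clr(counts).sum()) < 1e-9
```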

The spurious correlations we worried about are now properly handled. The CLR transform does induce its own correlation structure—the CLR values for any given sample always sum to zero—but this structure is simple and well-behaved. For a composition made of many independent underlying parts, the correlation between any two CLR-transformed components is simply $-1/(D-1)$, where $D$ is the number of parts. As we measure more and more parts, this weak negative correlation melts away towards zero.

Navigating the Void: The Problem of Zeros

This elegant theory runs into a very practical and sharp-edged problem: in the real world, our data contain zeros. What is $\ln(0)$? It is undefined. A single zero in our composition makes the geometric mean zero, and the entire CLR transformation breaks down.

High-throughput biological data, in particular, is riddled with zeros. But not all zeros are the same. In a microbiome sample, a zero for a particular bacterial species might just mean it was too rare to be caught in our finite sequencing net; it's a "sampling zero." In single-cell RNA sequencing, a gene that is actively producing mRNA might still yield a zero count because of technical failures in capturing and amplifying that specific molecule; this is a "dropout" zero.

The most common approach to this problem is to add a tiny, non-zero number, often called a pseudocount, to all the values before taking logarithms. This feels a bit like cheating, and we must be extremely careful. The principle of scale invariance is our guiding light. If we add a fixed pseudocount (say, 1) to raw count data, we violate this principle. A pseudocount of 1 is huge for a sample with only 100 total reads, but negligible for a sample with 10 million reads. Such a procedure reintroduces the very kind of sample-specific bias we sought to eliminate. A principled approach requires either applying the pseudocount after normalizing to relative abundances, or using a more sophisticated scheme where the pseudocount itself is chosen based on the statistical properties of the data-generating process.
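The difference between a naive and a principled pseudocount can be demonstrated directly. In this sketch, the counts, the factor of 10 standing in for "deeper sequencing," and the pseudocount sizes are all illustrative:

```python
import numpy as np

def clr(x):
    """Centered log-ratio transform."""
    logs = np.log(np.asarray(x, dtype=float))
    return logs - logs.mean()

counts = np.array([0.0, 50.0, 450.0, 500.0])  # one sampling zero

# Naive: add a fixed pseudocount of 1 to the raw counts.
naive = clr(counts + 1)
naive_deep = clr(10 * counts + 1)  # the same sample, sequenced 10x deeper
# Scale invariance is violated: the result depends on sequencing depth.
assert not np.allclose(naive, naive_deep)

# Principled: close to relative abundances first, then add the pseudocount.
def clr_with_pseudocount(c, eps=1e-6):
    p = c / c.sum()
    return clr(p + eps)

assert np.allclose(clr_with_pseudocount(counts),
                   clr_with_pseudocount(10 * counts))
```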

Beyond Composition: When Ratios Aren't Enough

Is this log-ratio framework, then, the final word? Not quite. The magic of log-ratios works because it perfectly cancels out biases that are ​​multiplicative​​—that is, biases that affect all genes by a common scaling factor, like sequencing depth.

But what if a bias is more sinister? Imagine a technical artifact in our sequencing process that is sensitive to the GC-content (the proportion of G and C nucleotides) of a gene. Perhaps genes with very high or very low GC content are captured less efficiently. Worse, imagine this effect is non-linear: the efficiency loss is more severe for genes that are highly abundant to begin with. This is a ​​non-multiplicative distortion​​.

In this case, the log-ratio trick is no longer sufficient. The bias term will not be a simple constant that cancels out. It will be a complex function of the gene's own properties (its GC content) and its true abundance. Applying CLR directly to this data will fail to correct the distortion.

The lesson here is not that compositional analysis is wrong, but that it is one tool—albeit a very powerful one—in a larger hierarchy of reasoning. The solution in this case is to first build a specific model to correct the non-linear GC bias, estimating the "true" underlying counts from the distorted observed counts. Then, on this corrected, non-multiplicative-bias-free data, we can and should apply the principles of compositional analysis to handle the remaining multiplicative scaling issues like sequencing depth.

Compositional Reasoning in Action: A Clinical Puzzle

Let's conclude by seeing how these principles come together to solve a complex problem with life-or-death stakes. A cancer patient receives a new immunotherapy treatment. To see if it's working, we take a biopsy of the tumor before and after treatment and perform single-cell RNA sequencing.

The initial analysis is alarming: a set of genes related to the cell cycle appears to be strongly upregulated after treatment. The most direct interpretation is that the treatment is making the tumor cells divide faster—a catastrophic failure.

But a compositional thinker pauses. The tumor is not a uniform bag of cancer cells; it's a complex ecosystem of cancer cells, immune cells, blood vessel cells, and more. What if the treatment didn't change the cancer cells at all, but instead caused a massive influx of rapidly-dividing immune cells to attack the tumor? This would be a resounding success! An aggregate analysis, which averages gene expression across all cells in the biopsy, cannot distinguish between these two dramatically different scenarios. Both a change in the behavior of the components and a change in the composition of the components can lead to the same aggregate signal.

The solution is to deconstruct the problem using compositional reasoning.

  1. ​​Analyze the Composition of Cell Types:​​ First, we treat the cell types themselves as a composition. We ask: did the proportion of immune cells relative to cancer cells change after treatment? This is a compositional analysis problem that can be tackled with log-ratio methods.
  2. ​​Analyze Expression Within Each Component:​​ Next, we look within each cell type separately. Inside the population of cancer cells, did the expression of cell cycle genes change after treatment (while controlling for confounding factors)? And inside the T-cells? And the B-cells? This is a series of standard differential expression analyses, now unconfounded by changes in cell type proportions.

By applying this layered reasoning, we can distinguish a change in the tissue's cellular makeup from a change in the intrinsic behavior of its cellular constituents. We can tell the difference between the treatment failing and the treatment succeeding. This is the power of compositional reasoning: it gives us the clarity to dissect complexity, to see through illusion, and to arrive at a truer understanding of the systems we study.

Applications and Interdisciplinary Connections

Having journeyed through the abstract principles of compositional reasoning, we might feel like we've been climbing a mountain of pure mathematics. But now, as we reach the summit and look out, we see that this is no barren peak. Below us lies a vast and fertile landscape of science and engineering, and the principles we've just learned are the rivers that give it life. Compositional reasoning is not an isolated academic exercise; it is a lens, a tool, a way of thinking that unlocks profound insights across an astonishing range of disciplines. It is the common thread that connects the analysis of our inner microbial worlds to the design of spacecraft and the very creation of synthetic life. In this chapter, we will tour that landscape, seeing how the same fundamental ideas manifest in biology, chemistry, materials science, and engineering.

The Composition of Life: From Gut Microbes to Gene Editors

Perhaps nowhere has the compositional revolution been more impactful than in modern biology. For decades, biologists have been grappling with a deluge of data from high-throughput sequencing, which allows us to read the genetic "parts list" of complex biological systems. The challenge, however, has been to interpret this list.

Consider the bustling ecosystem of microbes in the human gut. We can sequence their DNA and get a table of relative abundances: 30% Bacteroides, 20% Prevotella, and so on. A naive approach would be to treat these percentages as simple measurements. But as we now know, this is a trap. An increase in Bacteroides must be accompanied by a decrease in the percentages of other microbes, even if their absolute numbers haven't changed. This is the tyranny of the constant-sum constraint. Comparing two microbial communities using standard tools like Euclidean distance is like comparing two pies by the raw size of their slices without considering the size of the pies themselves; it's fundamentally misleading.

Compositional data analysis provides the correct geometric "spectacles" to see this world clearly. By using log-ratio transformations, we move from the constrained world of the simplex to the familiar, unconstrained Euclidean space where distances and changes have real meaning. The Aitchison distance, for instance, becomes the true measure of difference between two microbial ecosystems.
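Concretely, the Aitchison distance is ordinary Euclidean distance computed after a CLR transform. A minimal sketch with invented relative abundances for two communities:

```python
import numpy as np

def clr(x):
    """Centered log-ratio transform."""
    logs = np.log(np.asarray(x, dtype=float))
    return logs - logs.mean()

def aitchison_distance(x, y):
    """Distance between two compositions: Euclidean distance in CLR space."""
    return np.linalg.norm(clr(x) - clr(y))

community_a = [30.0, 20.0, 50.0]  # hypothetical relative abundances
community_b = [10.0, 60.0, 30.0]

d = aitchison_distance(community_a, community_b)
# Scale-invariant: rescaling either sample (e.g. sequencing community A
# ten times deeper) leaves the distance unchanged.
assert np.isclose(d, aitchison_distance([300.0, 200.0, 500.0], community_b))
```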

This is not just a statistical nicety; it has profound medical implications. When searching for microbial signatures of disease, we are rarely interested in the absolute change of a single taxon. More often, disease is a story of imbalance. In a study of the gut-brain axis, for instance, we might be interested in the balance between beneficial, anti-inflammatory bacteria (like SCFA producers) and detrimental, pro-inflammatory ones. Compositional reasoning gives us the perfect tool for this: the log-ratio of the geometric means of the two groups. This single number captures the push-and-pull between entire functional guilds. Tracking how this "balance" shifts after an intervention, like a new diet, provides a direct, meaningful measure of its effect on the ecosystem's function.
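Such a balance is easy to compute. The sketch below uses hypothetical taxon groupings and the simple, unnormalized form (formal isometric log-ratio balances carry an extra $\sqrt{rs/(r+s)}$ factor):

```python
import numpy as np

def balance(x, group1, group2):
    """Log-ratio of the geometric means of two groups of parts."""
    x = np.asarray(x, dtype=float)
    g1 = np.exp(np.mean(np.log(x[group1])))
    g2 = np.exp(np.mean(np.log(x[group2])))
    return np.log(g1 / g2)

# Hypothetical abundances for five taxa: indices 0-1 are beneficial
# SCFA producers, indices 3-4 are pro-inflammatory taxa.
sample = [0.30, 0.25, 0.20, 0.15, 0.10]
b = balance(sample, [0, 1], [3, 4])  # positive: the beneficial guild dominates

# Like any log-ratio, the balance is invariant to the overall scale.
assert np.isclose(b, balance(100 * np.asarray(sample), [0, 1], [3, 4]))
```

Tracking this single number across an intervention gives the "push-and-pull between functional guilds" described above.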

This principled approach forms the foundation of a whole suite of modern bioinformatics tools—like ANCOM and ALDEx2—that are designed to robustly identify which microbes are truly changing between healthy and diseased states. A complete analytical pipeline, from raw counts to a list of statistically significant microbial shifts, is a direct implementation of these ideas: regularize the data (to handle zeros), apply a log-ratio transform (like the centered log-ratio, or CLR), and then use standard statistical tests in this valid new space. Furthermore, this is not limited to explanation; it is crucial for prediction. A machine learning model trained to diagnose a disease from a microbiome sample will only be reliable if it is built upon a compositionally sound foundation. Classifiers trained on raw proportions are building on sand; those trained on log-ratio transformed data are built on the solid rock of Aitchison geometry.

The power of this thinking extends beyond ecology. Consider a cutting-edge genome editing experiment using CRISPR base editors. After the experiment, sequencing reveals a mixture of outcomes: the desired edit, an undesired edit, an indel, or no change at all. These outcomes form a composition—their proportions must sum to 100%. To compare the efficiency of the editing process under two different conditions (e.g., with and without a key supplement), we cannot simply compare the percentage of desired edits. A change in one outcome forces a change in the others. The rigorous way to quantify the effect is to use log-ratios—for instance, the log-ratio of "desired edits" to "no edit." This correctly isolates the relative change of interest from the constant-sum constraint, providing a true measure of improved efficiency. This shows that "composition" is a deep concept, applying not just to a list of ingredients, but to the distribution of outcomes of any process.
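With invented outcome proportions, the comparison looks like this: the supplement's effect is the difference between the two conditions' log-ratios of "desired edit" to "no edit":

```python
import numpy as np

# Hypothetical outcome compositions (desired edit, undesired edit,
# indel, no edit), each summing to 1.
control    = np.array([0.20, 0.05, 0.05, 0.70])
supplement = np.array([0.40, 0.05, 0.05, 0.50])

def log_ratio(p, num, den):
    """Log-ratio of two outcome proportions."""
    return np.log(p[num] / p[den])

# Effect of the supplement, free of the constant-sum constraint:
effect = log_ratio(supplement, 0, 3) - log_ratio(control, 0, 3)
print(f"log-ratio shift: {effect:+.3f}")  # positive means editing improved
```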

The Analyst's Dilemma: The Art of Seeing What's Really There

In biology, we often have count data and our main task is to analyze it correctly. In the physical sciences, the challenge is often a step earlier: can we trust that our measurement devices are giving us a signal that is truly proportional to the composition in the first place? Here, compositional reasoning is not just about data analysis, but about experimental design.

Imagine a chemist with a state-of-the-art Nuclear Magnetic Resonance (NMR) spectrometer, trying to determine the composition of a crude reaction mixture. The resulting spectrum shows beautiful, sharp peaks, one for each type of carbon atom. It seems simple: the area under each peak should correspond to the amount of that carbon, right? Wrong. This is a classic trap. The intensity of a carbon's signal is affected by its local context. For one, different carbon atoms "relax" back to their equilibrium state at vastly different rates (a property called $T_1$), and a short delay between experimental pulses can dramatically suppress the signal from slow-relaxing carbons. For another, during the standard experiment, a phenomenon called the Nuclear Overhauser Effect (NOE) boosts the signal of carbons that have protons nearby, but not those that don't. The result is that the measured "composition" of peak integrals is a distorted funhouse-mirror reflection of the true chemical composition. The measurement itself is not compositionally sound. The solution is to change the experiment: by using a technique called inverse-gated decoupling to suppress the NOE and waiting a very long time between pulses (at least 5 times the longest $T_1$), the chemist can force the machine to treat all carbons equally. This is compositional reasoning in action: designing an experiment to break the context-dependencies and yield a truly proportional measurement.

A strikingly similar story unfolds in materials science. An analyst using an electron microscope wants to measure the precise composition of a nickel-based superalloy. They use Energy Dispersive X-ray Spectroscopy (EDS), which detects characteristic X-rays emitted by each element when bombarded by electrons. Again, the signal from an aluminum atom, for example, is not a pure measure of the amount of aluminum. It is affected by the entire "matrix" of surrounding atoms—mostly nickel—which absorb some of the aluminum X-rays before they can reach the detector. This is the dreaded "ZAF" matrix effect. To correct for it, one could use complex physical models, but they have their own uncertainties. A more elegant solution, born of compositional thinking, is to use a "matrix-matched" certified reference material. This is a standard whose composition is already known to be very close to the unknown sample. By calibrating against a standard with nearly the same context (matrix), the complex ZAF correction factor becomes very close to 1, and a simple, direct comparison of intensities becomes valid. This is a beautiful physical analogy to the mathematical transforms used in biology. In both NMR and EDS, the goal is the same: to create the conditions, either through experimental design or the choice of reference, for a compositionally valid comparison.

The Engineer's Blueprint: Building Worlds That Work

So far, we have used compositional reasoning to analyze systems that nature or chance has given us. But its deepest power may lie in its ability to help us build new things. It is the fundamental design principle for any complex system.

The field of synthetic biology, which aims to engineer biological organisms with new functions, is a perfect example. The dream is to assemble genetic "parts"—promoters, ribosome binding sites (RBS), and coding sequences—into "devices" and "systems," much like an electrical engineer assembles circuits from resistors and capacitors. This requires a clear distinction between two crucial concepts: modularity and composability.

  • ​​Modularity​​ means a part behaves predictably across different contexts. A modular promoter, for instance, should initiate transcription at the same rate no matter what gene you place downstream.
  • ​​Composability​​ means that when we connect modular parts, the behavior of the resulting system is predictable from the properties of its components.

These are not the same thing. You can have a collection of perfectly modular parts that fail to compose. Imagine a modular promoter and a modular RBS. When you assemble them to drive a gene, the mRNA molecule that connects them—the interface—can form an unexpected hairpin loop that physically blocks the RBS, killing protein production. The parts were fine in isolation, but their composition failed because of an unintended interaction at their interface. The central challenge of synthetic biology, and indeed all engineering, is mastering these interfaces to achieve true composability.

This line of thought reaches its most abstract and powerful form in the design of cyber-physical systems—complex integrations of software and physical components like satellites, power grids, or autonomous vehicles. To manage this complexity, engineers use Architecture Description Languages (ADLs). An ADL is a formal language that forces the designer to separate concerns into orthogonal "views": the system's structure (what are the components and how are they connected?), its behavior (what do the components do?), its timing (how fast do they do it?), and its allocation (what resources do they consume?).

The ADL then provides a set of mathematical rules for composing components within each view. The composition of behaviors is governed by the logic of interacting state machines; the composition of timing constraints is governed by temporal logic. The language is designed to have a wonderful property, known in mathematics as a homomorphism: the analysis of the composed system is guaranteed to be the same as the composition of the individual analyses. This allows engineers to verify properties of a massive, complex system by analyzing its small pieces and their interactions in a rigorous, stepwise manner. It is the ultimate expression of compositional reasoning. The same core idea that lets us understand a shift in our gut flora is what allows us to formally prove that a flight control system is safe.

From the microscopic to the macroscopic, from analysis to design, compositional reasoning is a golden thread. It is the discipline of managing context. It teaches us that while the whole is often more than the sum of its parts, it is not a mystery. It is a predictable, knowable function of its parts and their interfaces. Whether we are wielding log-ratios, designing clever experiments, or defining formal languages, we are all on the same quest: to master the art and science of composition.