
In many scientific fields, from genomics to geology, data often comes not as absolute measurements but as proportions—parts of a whole. This type of data, known as compositional data, is ubiquitous in modern research, describing everything from the microbial makeup of our gut to the chemical composition of a rock sample. However, the seemingly simple nature of percentages and relative abundances hides a significant statistical trap. Due to a property called the constant-sum constraint, where all parts must sum to a fixed total, applying standard statistical methods can create illusions, leading to spurious correlations and fundamentally flawed conclusions.
This article tackles this challenge head-on, serving as a guide to understanding and correctly analyzing compositional data. The first chapter, Principles and Mechanisms, delves into the theoretical pitfalls of working with raw proportions and introduces the groundbreaking solution developed by John Aitchison, which focuses on the stable information contained in ratios. We will explore the unique geometry of compositional data and the log-ratio transformations that allow us to see through the mathematical distortions. Following this, the chapter on Applications and Interdisciplinary Connections demonstrates how these principles are put into practice, showcasing their transformative impact in fields like microbiome research, RNA-sequencing, and ecology, and providing a framework for robust, reproducible science.
Imagine you're at a party with a single pizza cut into slices. The pizza represents the total output of some system—perhaps the total messenger RNA molecules in a cell, or the entire community of microbes in a gut sample. Each slice represents a component: one type of RNA, one species of bacteria. Now, if one greedy guest takes an enormous slice of pepperoni, what happens? By necessity, there is less pizza available for the mushroom, olive, and plain cheese slices. This isn't because the pepperoni slice "competed" with the others in some biological sense; it's a simple, mathematical consequence of the whole pizza being a fixed size.
This simple analogy captures the central challenge of compositional data. In many modern biological measurements, from genomics to metabolomics, we don't measure absolute quantities. Instead, we measure proportions, percentages, or relative abundances. The data are compositional: the components are parts of a whole, and they are forced to sum to a constant (like 1, or 100%). This seemingly innocent constraint, which statisticians call closure, has profound and often counterintuitive consequences. It creates a mathematical "tyranny of the whole" that can deceive us if we are not careful. Naively applying standard statistical methods to these proportions is not just incorrect; it can lead us to entirely false conclusions.
Let's step out of the pizzeria and into the laboratory to see the real-world danger. Consider a simplified gene expression experiment using RNA-sequencing. We have two conditions, A and B. In our cell, there's one special gene, Gene X, and 990 other genes. In condition A, everything is at a baseline level. In condition B, a cellular signal causes Gene X to become massively over-expressed, while the other 990 genes remain completely unchanged in their absolute molecular output.
We sequence the RNA from both conditions. The sequencing machine, like our pizza, has a fixed capacity—it can only generate a certain total number of reads. These reads are distributed among the genes in proportion to their molecular abundance. In condition B, Gene X is now hogging a much larger fraction of the total molecules. Consequently, it also hogs a much larger fraction of the sequencing reads. Since the total number of reads is fixed, every other gene must receive a smaller fraction of the reads, even though their absolute biological activity hasn't changed one bit.
When we analyze the data using standard normalization (which works with these relative proportions), we see a shocking result. Gene X is, correctly, found to be up-regulated. But all 990 other genes appear to be down-regulated. An unsuspecting researcher might conclude that Gene X's activation triggers a massive, system-wide suppression of other genes. It's a compelling story, but it's a complete illusion—an artifact created by the constant-sum constraint. This induced negative association is a classic example of spurious correlation.
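A few lines of Python make the illusion concrete. The numbers below are hypothetical: 990 genes held at an arbitrary baseline of 100 molecules each, while Gene X alone increases fifty-fold.

```python
import numpy as np

# Hypothetical absolute expression: Gene X (index 0) plus 990 unchanged genes.
baseline = np.full(991, 100.0)           # condition A: everything at baseline
perturbed = baseline.copy()
perturbed[0] *= 50                       # condition B: only Gene X explodes

# Sequencing reports proportions, not absolute counts (closure).
prop_a = baseline / baseline.sum()
prop_b = perturbed / perturbed.sum()

# Gene X correctly looks up-regulated...
print(prop_b[0] / prop_a[0])             # ratio well above 1

# ...but every one of the 990 untouched genes now looks "down-regulated".
fold_changes = prop_b[1:] / prop_a[1:]
print((fold_changes < 1).all())          # True
```

Nothing biological changed for those 990 genes; the apparent suppression is entirely an artifact of dividing by a larger total.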
This phenomenon is pervasive. In microbiome studies, two bacterial species that have absolutely no interaction and occupy entirely different niches can appear to be fierce competitors (a negative correlation) simply because a third, unrelated species bloomed and took up more of the "compositional space". Interpreting such an edge in a network as "competition" would be a fiction created by the mathematics of proportions, not the biology of the microbes.
The problems with raw proportions run even deeper. A fundamental principle of sound scientific reasoning is that our conclusions should be consistent. If we compare two things, the result shouldn't flip-flop depending on what other, unrelated things are in the background. Yet, this is exactly what can happen with compositional data.
Let's examine two samples of brine, S and T, each composed of NaCl, KCl, a second salt, and water. The raw mass fractions (illustrative figures) are:

Sample S: NaCl 0.10, KCl 0.05, other salt 0.05, water 0.80
Sample T: NaCl 0.20, KCl 0.20, other salt 0.10, water 0.50

Looking at the full composition, Sample T has a higher mass fraction of NaCl than Sample S (0.20 vs. 0.10). Simple enough. But what if we are only interested in the salts, and specifically the relationship between NaCl and KCl? A reasonable-sounding step would be to look at the subcomposition containing only these two salts, and re-normalize their fractions to sum to 1. For Sample S this gives an NaCl share of 0.10/0.15 ≈ 0.67; for Sample T, 0.20/0.40 = 0.50.
Suddenly, the conclusion is inverted! In the context of just NaCl and KCl, Sample S is now relatively richer in NaCl than Sample T (0.67 vs. 0.50). Our answer depends on whether we consider the other components or not. This bizarre property is called subcompositional incoherence. It's as if asking "who is taller, Alice or Bob?" gets a different answer depending on whether their friend Carol is standing in the room. This tells us that raw proportions are built on a foundation of statistical quicksand.
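The flip is easy to reproduce in code. The mass fractions below are illustrative stand-ins chosen to exhibit the effect, not measurements:

```python
import numpy as np

# Illustrative (hypothetical) mass fractions: NaCl, KCl, other salt, water.
S = np.array([0.10, 0.05, 0.05, 0.80])
T = np.array([0.20, 0.20, 0.10, 0.50])

# Full composition: T looks richer in NaCl than S.
print(T[0] > S[0])                       # True

# Subcomposition of just NaCl and KCl, re-closed to sum to 1.
sub_S = S[:2] / S[:2].sum()
sub_T = T[:2] / T[:2].sum()
print(sub_S[0] > sub_T[0])               # True -- the verdict flips!

# The NaCl/KCl ratio, by contrast, ignores what else is present.
print(S[0] / S[1], T[0] / T[1])          # 2.0 and 1.0 in either view
```

The last line previews the resolution: ratios between parts give the same answer no matter which other components are included.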
So, if the parts themselves are treacherous, where can we find solid ground? The brilliant insight, developed by the late Scottish mathematician John Aitchison, is that in a composition, the only information that is truly meaningful and stable is contained in the ratios between the parts.
Let's return to our brines. Instead of looking at the fractions of NaCl, let's look at the ratio of NaCl to KCl within each sample. For Sample S that ratio is 2; for Sample T it is 1.
This tells a clear and consistent story: relative to its potassium-based counterpart, sodium chloride is twice as abundant in Sample S as it is in Sample T. This conclusion remains true whether we consider the other salts and the water or not. The ratios are immune to the paradox of subcompositional incoherence. They are the bedrock upon which a valid statistical framework can be built.
The discovery that ratios are key is a monumental step. But working with a jumble of all possible pairwise ratios can be cumbersome. We need a more elegant and unified mathematical language. Aitchison realized that the problem was fundamentally geometric. Proportions that sum to 1 don't live in the familiar, infinite Euclidean space of high school geometry. They live on a constrained surface called a simplex. For three parts, this is a triangle; for four parts, a tetrahedron. Trying to apply standard statistics (which assume Euclidean space) to data on the simplex is like trying to use a flat map to navigate the globe—distances and angles get distorted.
Aitchison's solution was to invent a new geometry, now called Aitchison geometry, specifically for compositions. The central idea is to define a transformation that can take our data from the curved, constrained simplex and project it onto a standard, flat Euclidean space, where all our familiar tools—calculating means, variances, correlations, and performing regressions—work correctly. The magic that powers this projection is the logarithm, because it turns ratios (multiplication/division) into differences (subtraction/addition), which is the language of Euclidean geometry.
This leads us to the powerful toolkit of log-ratio transformations.
These transformations are the practical workhorses of compositional data analysis. They allow us to peer through the distorting mirror of raw proportions and see the true relationships within our data.
The first and most intuitive of these is the centered log-ratio (CLR) transformation. For each component in a composition, we take its logarithm and then center it by subtracting the average of all the log-components in that sample. This "average" is not the typical arithmetic mean, but the geometric mean, which is the natural way to find the center of multiplicative data like ratios. For a composition x = (x_1, ..., x_D), the i-th CLR component is:

clr(x)_i = ln( x_i / g(x) ),   where g(x) = (x_1 · x_2 ⋯ x_D)^(1/D) is the geometric mean.
Each value tells us how enriched or depleted that component is relative to the overall "baseline" of the sample. A positive value means it's above the sample's geometric mean; a negative value means it's below. This is perfect for exploratory visualization.
The CLR transform has a peculiar and important property: the transformed components for any given sample always sum to exactly zero. This means that for a D-part composition, the transformed data lie on a (D-1)-dimensional flat plane within the D-dimensional space. This constraint means the covariance matrix of CLR data is singular, which poses a problem for some statistical models. However, this space is where Aitchison geometry truly comes alive. The "true" distance between two compositions, the Aitchison distance, is simply the standard Euclidean distance between their CLR-transformed vectors. A calculation shows this distance can be dramatically different from the naive distance calculated on the raw proportions, revealing relationships that were previously hidden.
To overcome the singularity of the CLR transform and prepare our data for demanding statistical models like regression, we need the isometric log-ratio (ILR) transformation. The ILR transform is a clever way to take a D-part composition and map it to exactly D-1 new coordinates that are fully independent and live in a standard, unconstrained space.
One way to think about ILR coordinates is as a series of balances. Imagine taking your D components and splitting them into two groups. The first ILR coordinate, or balance, is essentially the log-ratio of the geometric means of these two groups. You then take one of the groups and split it again, forming the second balance, and so on, until you have the full set of D-1 such balances. The resulting coordinates form a full-rank, orthonormal set, ready for any standard multivariate method. While the interpretation of individual ILR coordinates depends on how you define the splits, the geometric relationships between the samples are perfectly preserved, making ILR the gold standard for modeling compositional data.
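A possible sketch of ILR balances built from a user-supplied sequential binary partition. The sign matrix `sbp` and the three-part composition here are illustrative choices, not a canonical basis:

```python
import numpy as np

def ilr_from_sbp(x, sbp):
    """ILR balances from a sequential binary partition (SBP).

    sbp: (D-1, D) matrix with +1 / -1 / 0 marking the two groups at each split.
    Each balance is a scaled log-ratio of the two groups' geometric means.
    """
    logx = np.log(np.asarray(x, dtype=float))
    coords = []
    for row in sbp:
        plus, minus = row == 1, row == -1
        r, s = plus.sum(), minus.sum()
        scale = np.sqrt(r * s / (r + s))
        # mean of logs = log of the geometric mean
        coords.append(scale * (logx[plus].mean() - logx[minus].mean()))
    return np.array(coords)

# Hypothetical 3-part composition; splits: {1} vs {2,3}, then {2} vs {3}.
x = np.array([0.5, 0.3, 0.2])
sbp = np.array([[ 1, -1, -1],
                [ 0,  1, -1]])
print(ilr_from_sbp(x, sbp))              # two unconstrained coordinates
```

Because the transform is an isometry, the squared balances sum to the squared norm of the CLR vector, whichever valid partition you choose.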
Of course, the real world is messy. The "log" in log-ratios means we have a problem when our data contains zeros, which is extremely common in sequencing data. We can't take the logarithm of zero. This requires a careful pre-processing step to replace the zeros with small, plausible values, most simply by adding a tiny pseudocount to all components before transformation.
It's also crucial to recognize flawed "solutions" that can seem appealing. A common practice in microbiology, for example, is rarefaction, where all samples are down-sampled to the same sequencing depth to "equalize" them. But as we've seen, sequencing depth is itself a source of information. If one group of patients has a systematically lower microbial load, this may result in lower sequencing depth. Rarefying throws away vast amounts of data from the high-depth samples and can systematically erase true biological signals, especially for rare species. It's a method that tries to solve the problem of unequal library sizes but fails to address the deeper, more fundamental problem of compositionality.
By understanding these principles, we can move beyond the deceptive simplicity of proportions. We can learn to use the right tools—the log-ratio transformations—to navigate the unique geometry of compositional data, ensuring that the stories we tell are a true reflection of the underlying biology, not just an artifact of the numbers themselves.
Now that we have grappled with the principles of compositional data, we might feel like we've just been handed a new pair of glasses. The world of percentages and proportions, once a blurry landscape of potential illusions, is starting to come into focus. But what can we do with this new clarity? Where does this journey of log-ratios take us? The answer, it turns out, is everywhere. From the inner workings of our cells to the grand tapestry of ecosystems, the principles of compositional analysis are not just an academic curiosity; they are an essential toolkit for modern discovery.
Let's imagine you're listening to an orchestra on a radio with an automatic volume control. If the brass section suddenly plays a fortissimo passage, the radio's compressor will turn the overall volume down to protect your speakers. As a result, the woodwinds, even if they haven't changed their dynamics at all, will sound quieter. If you didn't know about the automatic volume control, you might wrongly conclude that the woodwinds decided to play more softly. This is the compositional constraint in action. The fixed "total volume" of the radio forces a change in one section to affect the perceived volume of all others. Our log-ratio transformations are the tools that allow us to disable this automatic compressor and hear the true, independent dynamics of each section. Now, let’s venture into the concert hall of science and see what we can hear.
Perhaps nowhere has compositional thinking been more revolutionary than in the life sciences. The advent of high-throughput sequencing has inundated biology with data that is, by its very nature, compositional. We rarely count every single molecule or microbe; instead, we sample from a complex mixture and get a list of proportions.
Consider the analysis of gene expression using RNA sequencing (RNA-seq). This technology gives us a snapshot of which genes are active in a cell by counting the number of messenger RNA (mRNA) transcripts. A common goal is to compare a treatment group to a control group to see which genes are "up-regulated" or "down-regulated." Naively, one might just compare the proportion of reads for each gene. But this is a trap!
Imagine an experiment where a treatment causes a single, highly expressed gene to become ten times more active. This single gene now hogs a much larger fraction of the total mRNA pool in the cell. When we sequence this pool, that one gene will also hog a much larger fraction of our sequencing reads. Because the total proportion must sum to one, the relative abundance of every other gene must go down, even if their absolute number of mRNA molecules per cell remained identical. A naive analysis would report thousands of genes as being down-regulated, an illusion created by the single overachiever. This compositional bias is not a small error; it is a fundamental artifact of the measurement.
This is where our new tools show their power. By applying a log-ratio transformation, such as the centered log-ratio (CLR), we re-center the data. Instead of comparing gene i in the treatment group to gene i in the control group on a shifting scale, we compare it to the geometric mean of all genes within its own sample. When we then look at the difference between the transformed values in the treatment and control groups, the shared bias term, which arose from the change in the total mRNA pool, magically cancels out. We are left with the gene's true change relative to the average trend, a much more honest and interpretable measure.
Once the data is in this log-ratio space, the entire world of classical statistics opens up. We can use standard tests like the Welch's t-test to rigorously assess whether a change in the ratio of two groups of genes—say, naive versus memory T-cells in an aging immune system—is statistically significant. We can build complete, robust pipelines for differential abundance analysis: add a small pseudocount to handle zeros, apply a CLR or ILR transform, run a statistical test for each feature, and then apply corrections for testing thousands of hypotheses at once. This principled workflow is now the gold standard in the field.
If a single cell is an orchestra, the human gut microbiome is a bustling metropolis, teeming with thousands of species of bacteria, archaea, and fungi. When we study the microbiome by sequencing, we are again faced with compositional data of the highest order. We don't know the absolute number of bacteria, only the relative proportions of their DNA in our sample.
Suppose we want to find a link between a specific microbe and a disease. A common but flawed approach is to simply correlate the microbe's relative abundance with the disease. But this is doomed to fail. As we've seen, the constant-sum constraint forces the covariance structure of the data into a straitjacket. In any composition, the sum of a component's covariances with all other components (including itself) must be zero. This means that even if all microbes were living in complete independence, the closure operation would induce a web of spurious negative correlations.
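We can watch closure manufacture negative covariance out of nothing. The simulation below draws completely independent lognormal abundances and then closes them to proportions:

```python
import numpy as np

rng = np.random.default_rng(1)

# Independent positive abundances: no real interactions whatsoever.
absolute = rng.lognormal(mean=0.0, sigma=1.0, size=(500, 5))

# Closure: divide by each sample's total to get proportions.
closed = absolute / absolute.sum(axis=1, keepdims=True)

cov = np.cov(closed, rowvar=False)
print(cov.sum(axis=1))                   # each row sums to ~0 (the straitjacket)

off_diag = cov.copy()
np.fill_diagonal(off_diag, 0.0)
print((off_diag < 0).any())              # True: spurious negative covariances
```

Since each row of the covariance matrix must sum to zero and the diagonal is positive, negative off-diagonal entries are forced to appear, interaction or no interaction.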
To untangle this web, we must again turn to log-ratios. By transforming the data with an Isometric Log-Ratio (ILR) transform, we can create a set of new variables, or "balances," from our taxa. These balances are mathematically independent (orthogonal) and live in a standard, unconstrained Euclidean space. They are the perfect input for downstream statistical models, allowing us to properly test for associations between microbial features and host characteristics, like genetic variants in a GWAS-like study, without falling into the traps of collinearity or spurious correlation.
This framework extends to one of the most exciting goals in microbial ecology: inferring the "social network" of microbes. Who collaborates with whom? Who competes? Simply correlating abundances is misleading. The modern approach is to estimate the conditional dependence network. The question is not "Does Bacteroides go up when Faecalibacterium goes down?" but "Does Bacteroides go up when Faecalibacterium goes down, all other microbes being held constant?" This question is answered by the inverse covariance matrix, or precision matrix.
The pipeline is as elegant as it is powerful: transform the compositional data using CLR, estimate the covariance matrix (using some form of regularization to handle the high dimensionality and the inherent singularity), and then invert it to get the precision matrix. The non-zero entries in this matrix reveal the edges in our ecological network, the direct links of cooperation or competition. This approach is at the heart of state-of-the-art machine learning pipelines that can predict disease status from a person's microbiome, provided all steps, including the log-ratio transforms, are performed correctly within a cross-validation framework to prevent information leakage. And throughout these complex analyses, we must never forget the practical realities of large-scale experiments, such as batch effects. Fortunately, compositional thinking provides a clear path here as well: we perform batch correction on the log-ratio transformed data, carefully preserving the biological signal of interest while removing the technical noise.
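A toy version of that pipeline, with a ridge regularizer standing in for the graphical-lasso-style estimators used in real analyses. The simulated data, the ridge strength of 0.1, and the edge threshold are all arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical relative-abundance table: 100 samples x 8 taxa.
counts = rng.poisson(rng.lognormal(3.0, 0.5, size=(100, 8)))
props = (counts + 0.5) / (counts + 0.5).sum(axis=1, keepdims=True)

# Step 1: CLR transform.
logp = np.log(props)
clr = logp - logp.mean(axis=1, keepdims=True)

# Step 2: covariance with simple ridge regularization (the CLR covariance
# is singular, so some regularization is mandatory before inversion).
cov = np.cov(clr, rowvar=False)
cov_reg = cov + 0.1 * np.eye(cov.shape[0])

# Step 3: invert to get the precision matrix; large off-diagonal entries
# are candidate edges (direct, conditional associations).
precision = np.linalg.inv(cov_reg)
np.fill_diagonal(precision, 0.0)
edges = np.abs(precision) > 0.5          # arbitrary illustrative threshold
print(edges.sum(), "candidate edges")
```

In practice one would tune the regularization by cross-validation and keep the whole pipeline, transform included, inside the cross-validation loop to prevent information leakage.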
The utility of compositional analysis extends far beyond the microscopic world. Consider an ecologist studying a predator that eats three types of prey. The predator's diet is a composition. To understand if the predator "switches" to prefer more abundant prey—a key ecological behavior—one might try to relate the proportion of a prey in the diet to its proportion in the environment. Again, this is complicated by the compositional constraints. But by applying an ILR transform to both the diet and the environmental compositions, a beautifully simple, linear relationship emerges, and its slope directly reveals the predator's switching behavior.
This same logic applies to evolutionary biology. Imagine we are comparing the composition of fatty acids in the milk of different mammal species to understand its evolution. The milk composition is one trait, but the species themselves are not independent data points—they are related by a phylogeny. To perform a phylogenetic comparative analysis, we must account for both sources of non-independence. The solution is sequential: first, we use a log-ratio transform (like ILR) to convert the compositional milk data into a set of unconstrained, real-valued traits. Then, on these well-behaved new variables, we can apply standard phylogenetic tools like Phylogenetic Independent Contrasts (PICs) to study their evolution across the tree of life.
The power of compositional thinking is not limited to biology. It is, at its heart, a tool for understanding mixtures. Consider a problem in synthetic biology where we use CRISPR genome editing to modify a cell. The editing process can result in several different outcomes: the desired precision edit (HDR), or various types of errors (indels, deletions). The raw output from a sequencing experiment is a vector of counts for each of these outcomes—a composition.
We may have good models for the "pure" outcome distributions of different underlying DNA repair pathways (c-NHEJ, MMEJ, etc.). The question then becomes: what mixture of these pathways produced the final outcome composition we observed? This is a "deconvolution" or "unmixing" problem. By framing it as the estimation of mixture weights that, when applied to the pure pathway signatures, best explain the observed outcome composition, we can use maximum likelihood methods to infer the activity of the cell's fundamental repair machinery. This same principle can be applied to unmix geological sediment compositions, chemical reaction products, or any other phenomenon where the data we observe is a mixture of well-defined sources.
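A sketch of the unmixing idea, using non-negative least squares as a simple stand-in for the full maximum-likelihood estimation described above. The three pathway signatures and the true mixture weights are invented for illustration:

```python
import numpy as np
from scipy.optimize import nnls

# Hypothetical "pure" outcome signatures for three repair pathways
# (rows: pathways; columns: outcome classes, e.g. precise edit / indel / deletion).
signatures = np.array([
    [0.70, 0.20, 0.10],
    [0.10, 0.80, 0.10],
    [0.05, 0.15, 0.80],
])

# Observed outcome composition generated by a hidden mixture of pathways.
true_weights = np.array([0.5, 0.3, 0.2])
observed = true_weights @ signatures

# Non-negative least squares finds weights >= 0 whose mixture of the
# signatures best explains the observation; re-close them to sum to one.
w, _ = nnls(signatures.T, observed)
w /= w.sum()
print(w)                                 # recovers ~[0.5, 0.3, 0.2]
```

With noiseless data and linearly independent signatures the recovery is exact; real data would add count noise, which is where a proper likelihood model earns its keep.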
Finally, understanding compositional data analysis elevates us to a higher level of scientific practice. In any complex analysis, there are dozens of choices to make: which pseudocount to use, which log-ratio transform, which covariates to adjust for. This "garden of forking paths" can lead different researchers to different conclusions from the same data.
How can we be sure a finding is real and not just an artifact of one particular analytical pipeline? The most robust approach is to perform a "multiverse analysis." We must first pre-register our hypothesis and our criteria for success. Then, we deliberately re-analyze the data using a wide array of scientifically justifiable pipelines—different denoising methods, different compositional transformations, different statistical models. If the finding, say an association between a microbe and a disease, remains consistent in its direction and significance across the vast majority of these different analytical worlds, we can be far more confident that we have discovered a genuine piece of reality.
Compositional data analysis does not give us the single right answer. Instead, it gives us the set of valid questions we are allowed to ask of our data. It provides the principled framework that helps us distinguish between sound analytical choices and flawed ones. It is a philosophy of rigor, a tool for taming complexity, and a way of ensuring that the stories we tell about the world are as true as we can possibly make them. It is, in the end, a crucial chapter in the book of the scientific method itself.