Compositional Data Analysis: A Guide to Ratios and Log-Ratio Transformations

Key Takeaways
  • Analyzing compositional data (parts of a whole) with standard statistics is misleading due to the constant-sum constraint, which creates spurious correlations and analytical paradoxes.
  • The only stable and coherent information within compositional data is found in the ratios between its components, not the raw proportions themselves.
  • Log-ratio transformations, such as CLR and ILR, are essential for converting compositional data into a standard Euclidean space where statistical tools can be correctly applied.
  • Proper compositional analysis is crucial in fields like microbiology, genomics, and ecology to distinguish true biological signals from mathematical artifacts in relative abundance data.

Introduction

From the microbial species in our gut to the mineral content of a rock, many scientific datasets consist of proportions that represent parts of a whole. This is known as compositional data, and its defining feature—that all parts must sum to a constant—poses a profound challenge to conventional statistical analysis. Ignoring this constraint can lead to paradoxical results and spurious conclusions, a critical knowledge gap that can misdirect scientific inquiry. This article provides a comprehensive guide to understanding and correctly analyzing this unique data type. First, the "Principles and Mechanisms" chapter will deconstruct why standard methods fail, introducing the foundational concept of ratio-based analysis and the transformative power of log-ratio transformations. Following this, the "Applications and Interdisciplinary Connections" chapter will showcase how these methods are providing clearer, more accurate insights across diverse fields, from decoding microbiome interactions to understanding evolutionary processes.

Principles and Mechanisms

Imagine you're at a party with a single pizza cut into eight slices. If you take two slices instead of one, there's one less slice available for everyone else. Your gain is their loss. This might seem obvious, but this simple, inescapable constraint is the source of deep and fascinating challenges when we analyze data that represents parts of a whole. Whether we're looking at the proportions of different bacteria in your gut, the relative expression of genes in a cell, or the mass fractions of minerals in a rock, we are dealing with compositional data. The defining feature of such data is that the components are not independent; they must sum to a constant (like 100% or a total of 1). This is known as the closure constraint, and it forces our data to live in a constrained geometric space called a simplex, not the wide-open Euclidean space our usual statistical tools were designed for. Ignoring this fact can lead us to some very strange and misleading conclusions.

The Tyranny of the Whole

Let's dive into a real-world scenario from microbiology. Imagine a researcher studying a simple microbial community with four species. In a control environment, all four species are present in equal numbers. Then, the researcher adds a new drug. In the treated environment, two of the original species are wiped out, a third remains completely unaffected, and a fourth, new species that is resistant to the drug thrives, growing to an enormous population size.

Now, a standard procedure is to analyze the relative abundance of each species, since the total number of reads from a gene sequencer can vary for technical reasons. So, what happens to the species that was completely unaffected by the drug? Its absolute count remained identical in both the control and treatment flasks. Yet, when the researcher calculates its relative abundance, its proportion plummets in the treated sample. Why? Because the massive bloom of the drug-resistant species dramatically increased the total size of the community (the denominator), forcing the relative share of our stable species to shrink.

This isn't a biological effect; it's a mathematical artifact. Our stable species didn't decline; it was simply crowded out in the final tally. This is the tyranny of the whole: an aggressive change in one part of the composition forces an apparent change in other, unrelated parts. This effect is pervasive. In single-cell biology, a cell might ramp up the expression of a few specific genes. Due to closure, this can make thousands of other, independently-regulated genes appear to be suppressed, creating a web of spurious correlations where none exist biologically. The constant-sum constraint acts like a hidden force, coupling all the components together and inducing negative correlations that can fool us into seeing antagonism where there is only arithmetic.
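
This artifact is easy to reproduce. Here is a minimal Python sketch (numpy only, with invented abundances) that simulates four taxa fluctuating independently, lets one of them bloom, and shows closure alone manufacturing a strong negative correlation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Absolute abundances of 4 taxa in 50 samples, fluctuating independently:
# by construction there is no interaction between taxa.
abs_counts = rng.lognormal(np.log(100.0), 0.3, size=(50, 4))

# Taxon 0 blooms in half the samples (the drug-resistant species).
abs_counts[:25, 0] *= 20

# Closure: convert to relative abundances (each row sums to 1).
rel = abs_counts / abs_counts.sum(axis=1, keepdims=True)

# Taxa 0 and 1 are unrelated in absolute terms, but after closure the
# bloom of taxon 0 forces taxon 1's share down: spurious antagonism.
print(np.corrcoef(abs_counts[:, 0], abs_counts[:, 1])[0, 1])  # near 0
print(np.corrcoef(rel[:, 0], rel[:, 1])[0, 1])                # strongly negative
```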

The Paradox of the Missing Parts

The strange behavior of compositional data gets even more perplexing. Let's step away from biology for a moment and into a chemistry lab, analyzing the composition of two brine samples, $S$ and $T$. We measure the mass fractions of four components: $\mathrm{NaCl}$, $\mathrm{KCl}$, $\mathrm{MgCl_2}$, and water.

  • Sample $S$: $10\%$ $\mathrm{NaCl}$, $5\%$ $\mathrm{KCl}$, $10\%$ $\mathrm{MgCl_2}$, $75\%$ $\mathrm{H_2O}$.
  • Sample $T$: $12\%$ $\mathrm{NaCl}$, $12\%$ $\mathrm{KCl}$, $6\%$ $\mathrm{MgCl_2}$, $70\%$ $\mathrm{H_2O}$.

If we compare the two, it's clear: Sample $T$ ($12\%$) has a higher mass fraction of $\mathrm{NaCl}$ than Sample $S$ ($10\%$).

But what if we are only interested in the salt subsystem of $\mathrm{NaCl}$ and $\mathrm{KCl}$? A natural thing to do would be to ignore the other components and re-normalize the fractions of $\mathrm{NaCl}$ and $\mathrm{KCl}$ so that they sum to $100\%$ within that subsystem. Let's see what happens.

  • In Sample $S$, the $\mathrm{NaCl}$ to $\mathrm{KCl}$ part of the brine is $10:5$. Renormalized, the new fraction of $\mathrm{NaCl}$ is $\frac{0.10}{0.10 + 0.05} = \frac{2}{3}$.
  • In Sample $T$, the $\mathrm{NaCl}$ to $\mathrm{KCl}$ part is $12:12$. Renormalized, the new fraction of $\mathrm{NaCl}$ is $\frac{0.12}{0.12 + 0.12} = \frac{1}{2}$.

Suddenly, the conclusion is inverted! In the $\mathrm{NaCl}$-$\mathrm{KCl}$ subsystem, Sample $S$ ($\frac{2}{3}$) has a higher relative concentration of $\mathrm{NaCl}$ than Sample $T$ ($\frac{1}{2}$). This paradox, where our conclusions change depending on which components we include in our analysis, is known as a violation of subcompositional coherence. Our standard way of thinking about "more" or "less" has failed us.

The Freedom of Ratios

This is the kind of puzzle that signals we are missing a fundamental piece of the picture. The scientist who put the pieces together was the Scottish statistician John Aitchison. He realized that the problem lay in our focus on the absolute values of the proportions themselves. In a compositional world, he argued, the only information that is stable, coherent, and immune to these paradoxes is the ratio between the components.

Let's go back to our brines. In Sample $S$, the ratio of $\mathrm{NaCl}$ to $\mathrm{KCl}$ is $\frac{0.10}{0.05} = 2$. In Sample $T$, the ratio is $\frac{0.12}{0.12} = 1$. The statement "The $\mathrm{NaCl}/\mathrm{KCl}$ ratio is higher in Sample $S$ than in Sample $T$" is true whether we consider the full four-part composition or just the two-part subcomposition. The ratio is the bedrock of truth in a sea of shifting proportions.
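
A quick numeric check of the paradox and its cure, using the brine numbers above (a sketch, not a library routine):

```python
import numpy as np

# Full four-part compositions: [NaCl, KCl, MgCl2, H2O].
s = np.array([0.10, 0.05, 0.10, 0.75])
t = np.array([0.12, 0.12, 0.06, 0.70])

# Closing to the NaCl-KCl subcomposition inverts the comparison...
sub_s = s[:2] / s[:2].sum()        # NaCl share: 2/3
sub_t = t[:2] / t[:2].sum()        # NaCl share: 1/2
print(sub_s[0] > sub_t[0])         # True: S now "leads", unlike the full data

# ...but the NaCl/KCl ratio is the same in the full composition and in
# the subcomposition, so ratio statements remain coherent.
print(s[0] / s[1], sub_s[0] / sub_s[1])   # 2.0  2.0
print(t[0] / t[1], sub_t[0] / sub_t[1])   # 1.0  1.0
```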

This is a profound conceptual shift. We must abandon the analysis of individual component values and instead build a new mathematical framework—a new geometry—based entirely on ratios. This framework is now known as Aitchison geometry.

A New Space Through the Looking-Glass of Logarithms

If our world is one of ratios, how do we operate in it? Our standard statistical toolbox—linear regression, PCA, t-tests—is built on a world of addition, subtraction, and distances in Euclidean space. The bridge between the multiplicative world of ratios and the additive world of linear algebra is the logarithm. A log-ratio transformation is the key that unlocks the simplex.

The most intuitive of these is the Centered Log-Ratio (CLR) transformation. For each component in a composition, instead of looking at its value $x_i$, we look at the logarithm of its ratio to a common reference. What reference should we use? To treat all components equally, we use the geometric mean of the entire composition, $g(\mathbf{x}) = \left(\prod_{i=1}^{D} x_i\right)^{1/D}$. The transformation for each component $i$ is then:

$$\mathrm{clr}(x_i) = \ln\!\left(\frac{x_i}{g(\mathbf{x})}\right) = \ln(x_i) - \frac{1}{D}\sum_{j=1}^{D} \ln(x_j)$$

This simple act is revolutionary. By taking the ratio, we make the analysis independent of the original total amount (like sequencing depth), achieving scale invariance. By using the logarithm, we move from the simplex to a familiar Euclidean space where distances and correlations start to make sense again. Of course, we can't take the logarithm of zero, and zero counts are common in biological data. A pragmatic first step is to add a tiny, constant "pseudocount" to all components before we begin.
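
As a concrete sketch, here is one way to write the CLR in Python, with the pseudocount step included (the default of 0.5 is an illustrative choice; more principled zero-replacement methods exist):

```python
import numpy as np

def clr(x, pseudocount=0.5):
    """Centered log-ratio of a composition (or of each row of a matrix).

    Zeros are handled with a simple additive pseudocount first; CLR is
    scale-invariant, so raw counts work as well as proportions.
    """
    x = np.asarray(x, dtype=float) + pseudocount
    log_x = np.log(x)
    # Subtracting the mean log is division by the geometric mean.
    return log_x - log_x.mean(axis=-1, keepdims=True)

counts = np.array([120.0, 30.0, 0.0, 850.0])   # raw read counts, one zero
z = clr(counts)
print(z, z.sum())                              # coordinates sum to ~0
```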

Once our data is in this CLR space, we can calculate a meaningful distance between two samples, say, a healthy person and a patient. This distance, called the Aitchison distance, is simply the standard Euclidean distance between the two CLR-transformed vectors. It's a true measure of the difference in compositional structure, free from the artifacts we saw earlier.
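
Continuing with the clr helper from the sketch above, the Aitchison distance is just a Euclidean norm in CLR space (the two compositions here are made up and contain no zeros, so no pseudocount is needed):

```python
healthy = np.array([0.40, 0.30, 0.20, 0.10])
patient = np.array([0.70, 0.10, 0.10, 0.10])

# Aitchison distance: ordinary Euclidean distance between CLR vectors.
dist = np.linalg.norm(clr(healthy, pseudocount=0.0) -
                      clr(patient, pseudocount=0.0))
print(dist)
```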

Composing with Balances: An Elegant Construction

The CLR transform is a massive leap forward, but it has one last quirk. The transformed components always sum to zero, meaning they are not fully independent and are constrained to lie on a hyperplane. For some advanced statistical models, this can still be a bit awkward.

This brings us to the final and perhaps most elegant tool in the compositional toolbox: the Isometric Log-Ratio (ILR) transformation. The ILR transform takes the idea of ratios a step further. Instead of comparing each component to a generic community-wide mean, ILR allows us to define coordinates based on specific, hypothesis-driven comparisons between groups of components. These special coordinates are called balances.

Imagine we are studying the gut microbiome and we hypothesize that the key ecological dynamic is the competition between bacteria that ferment carbohydrates (saccharolytic) and those that break down proteins (proteolytic). We can define a balance that precisely captures this concept. This balance is essentially the scaled log-ratio of the geometric mean of the saccharolytic group to the geometric mean of the proteolytic group:

$$b_{S \mid P} = \sqrt{\frac{rs}{r+s}}\, \ln\!\left(\frac{g(\mathbf{x}_{\text{Saccharolytic}})}{g(\mathbf{x}_{\text{Proteolytic}})}\right)$$

where $r$ and $s$ are the number of species in each group. This single number, $b_{S \mid P}$, becomes a new variable. A positive value might mean the carbohydrate-eaters are dominant, while a negative value means the protein-eaters are on top. We can create a set of such balances that are fully independent (orthonormal) and span a proper, unconstrained $(D-1)$-dimensional Euclidean space.
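
A minimal sketch of a single balance in Python; the community and the group memberships are hypothetical:

```python
import numpy as np

def balance(x, num_idx, den_idx):
    """Scaled log-ratio of geometric means between two groups of parts."""
    x = np.asarray(x, dtype=float)
    r, s = len(num_idx), len(den_idx)
    log_gmean_num = np.log(x[num_idx]).mean()
    log_gmean_den = np.log(x[den_idx]).mean()
    return np.sqrt(r * s / (r + s)) * (log_gmean_num - log_gmean_den)

# Toy 5-species community: species 0-2 "saccharolytic", 3-4 "proteolytic".
community = np.array([0.30, 0.25, 0.15, 0.20, 0.10])
print(balance(community, [0, 1, 2], [3, 4]))   # > 0: carb-fermenters on top
```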

These ILR coordinates are perfect for use in any standard multivariate model, from regression to machine learning. This is the ultimate triumph of the compositional approach. We began with data that produced paradoxes and illusions. By rethinking the very nature of our measurements, we arrived at a framework that not only resolves the paradoxes but provides a powerful and elegant way to construct meaningful variables that directly test our scientific hypotheses. The apparent flaw in our data was, in fact, an invitation to a deeper and more beautiful understanding of its structure.

Applications and Interdisciplinary Connections

Now that we have grappled with the peculiar world of compositional data and the elegant key—the log-ratio transform—that unlocks its secrets, we can embark on a journey. This is where the real fun begins. We are like explorers who have just learned to read a new kind of map, and suddenly, entire continents of scientific inquiry open up to us. The principles we've discussed are not just abstract mathematical curiosities; they are essential tools for navigating some of the most exciting frontiers of modern science. From the bustling ecosystems within our own bodies to the grand tapestry of evolution, the ghost of the constant-sum constraint lurks, and our new tools are ready to exorcise it.

The World Within Us: Decoding the Microbiome

Perhaps no field has been more revolutionized by compositional data analysis than the study of the microbiome. Every one of us is a planet, teeming with trillions of microbial inhabitants. When we sequence a gut sample, we are essentially taking a census. But it’s a strange kind of census. The total number of sequencing reads we get is arbitrary—it depends on the machine, not the biology. All we can truly know are the relative proportions of different bacteria. The data is, by its very nature, compositional.

So, a simple but profound question arises: if we compare the gut microbes of a group of healthy people to a group with a disease, how can we tell which bacteria are truly more or less abundant? If we just look at the percentages, we fall into the trap we discussed. An increase in the percentage of Bacterium A must be accompanied by a decrease in the percentage of something else. Is that decrease a real biological effect, or just a mathematical necessity?

This is not a hypothetical puzzle; it is a central challenge in medical research. The solution is precisely the pipeline we have learned. By taking the raw proportions, handling the inevitable zeros, and applying a transformation like the centered log-ratio (CLR), we move the data from the constrained simplex into a proper Euclidean space. In this new space, the strange, forced dependencies vanish. We can then use standard, trusted statistical tools, like a t-test, to ask: is the average CLR-transformed value for Bacterium A genuinely different between the two groups? This procedure allows us to identify microbial signatures of disease with a statistical rigor that was previously impossible.
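
A stripped-down version of that workflow, with Dirichlet-simulated relative abundances standing in for real cohorts (the group sizes and parameters are arbitrary):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Simulated relative abundances (samples x taxa): taxon 0 is genuinely
# enriched in the disease group.
healthy = rng.dirichlet([10, 5, 5, 80], size=30)
disease = rng.dirichlet([20, 5, 5, 70], size=30)

def clr(x):
    log_x = np.log(x)
    return log_x - log_x.mean(axis=-1, keepdims=True)

# Compare taxon 0 between groups on the CLR scale, not raw percentages.
t, p = stats.ttest_ind(clr(healthy)[:, 0], clr(disease)[:, 0])
print(f"t = {t:.2f}, p = {p:.3g}")
```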

But we want to know more than just who is there. We want to know what they are doing. Are they cooperating? Competing? Forming a community? To answer this, biologists build "co-occurrence networks" to map the relationships between microbes. The naive approach is to calculate the correlation between the relative abundances of every pair of microbes. But we know this is a terrible idea. Because the total pie is fixed at 100%, if two very abundant and unrelated species dominate the community, their proportions will be forced into a negative correlation, suggesting they are fierce competitors when, in fact, they might be ignoring each other completely.

Once again, log-ratios save the day. A first step is to compute correlations not on the raw proportions, but on the CLR-transformed data. This simple change dissolves many of the most egregious spurious correlations. For a deeper look, we can move from simple correlation (a measure of marginal association) to partial correlation (a measure of conditional association), which asks if two microbes are associated after accounting for the influence of all other microbes. This is the domain of graphical models and sparse inverse covariance estimation. These advanced methods, when applied to log-ratio transformed data, allow us to build a much more truthful picture of the underlying microbial social network, revealing the subtle web of interactions that governs the health of our internal ecosystem.
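
The sketch below illustrates both ideas on simulated data: correlations computed on CLR-transformed values, then a rough partial-correlation estimate from a regularized inverse covariance. Real pipelines use sparse estimators such as the graphical lasso; the small ridge term here is only to cope with the CLR covariance being singular:

```python
import numpy as np

rng = np.random.default_rng(2)

# Independent absolute abundances, then closure to proportions.
counts = rng.lognormal(3.0, 0.5, size=(100, 6))
rel = counts / counts.sum(axis=1, keepdims=True)

log_rel = np.log(rel)
clr_data = log_rel - log_rel.mean(axis=1, keepdims=True)

# Marginal association on the CLR scale:
corr = np.corrcoef(clr_data, rowvar=False)
print(corr.round(2))

# Rough conditional association: partial correlations from the inverse
# covariance (precision) matrix.
cov = np.cov(clr_data, rowvar=False) + 1e-3 * np.eye(6)
prec = np.linalg.inv(cov)
d = np.sqrt(np.diag(prec))
partial = -prec / np.outer(d, d)
np.fill_diagonal(partial, 1.0)
print(partial.round(2))
```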

Connecting the Dots: From Genes to the Grandeur of -Omics

The microbiome does not exist in a vacuum. It is in constant dialogue with its host—us. Our own genetic makeup can influence which microbes thrive in our gut. This leads to another fascinating application: host-microbiome genome-wide association studies (GWAS). The goal is to scan the host genome and find genetic variants (SNPs) that are associated with the composition of the microbiome.

Here we have a classic regression problem: we want to model the microbiome composition as a function of a genetic variant. But what does it mean to use a "composition" as the response variable in a regression? We can't just regress on each proportion separately. The solution is to transform the composition into a set of unconstrained, real-valued numbers. The isometric log-ratio (ILR) transform is perfect for this. It converts the $D$-part composition into $D-1$ orthogonal coordinates, or "balances," which can each be used as the response variable in a standard linear model. This allows us to rigorously test for links between our DNA and the microbes we host, opening doors to personalized medicine and a deeper understanding of host-symbiont co-evolution.
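
Schematically, the analysis is one least-squares fit per ILR coordinate. Everything in this sketch is illustrative: the pivot-style ILR basis, the 0/1/2 genotype coding, and the Dirichlet-simulated microbiome:

```python
import numpy as np

def ilr(x):
    """Pivot-style ILR: z_j = sqrt(j/(j+1)) * ln(gmean(x_1..x_j) / x_{j+1})."""
    log_x = np.log(x)
    D = x.shape[-1]
    z = [np.sqrt(j / (j + 1)) * (log_x[..., :j].mean(axis=-1) - log_x[..., j])
         for j in range(1, D)]
    return np.stack(z, axis=-1)

rng = np.random.default_rng(3)
n = 200
snp = rng.integers(0, 3, size=n)                 # genotype: 0, 1, or 2 copies
micro = rng.dirichlet([8, 4, 2, 6], size=n)      # fake 4-taxon microbiome

y = ilr(micro)                                   # (n, D-1) response matrix
X = np.column_stack([np.ones(n), snp])           # intercept + genotype

# One ordinary least-squares fit per balance coordinate.
for j in range(y.shape[1]):
    coef, *_ = np.linalg.lstsq(X, y[:, j], rcond=None)
    print(f"balance {j}: slope per allele = {coef[1]:+.3f}")
```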

This idea of treating high-throughput biological data as compositional extends far beyond the microbiome. Think of transcriptomics, the study of all RNA molecules in a cell. When we use RNA-sequencing, we get counts of reads for thousands of genes. Often, these are normalized into "Transcripts Per Million" (TPM), which are, by definition, relative proportions. The total amount of RNA in a cell is a meaningful biological quantity, but our measurement of it is tangled up with sequencing depth. So, TPM data is also compositional. To compare gene expression profiles, to cluster cells, or to build predictive models, we are on much safer ground if we first apply a log-ratio transformation. This is especially true in the burgeoning field of single-cell RNA sequencing, where the compositional nature of the data is a well-recognized challenge that CLR and ILR transforms are helping to solve.

The Wider World: Ecology and Evolution

The beauty of a fundamental principle is its universality. The logic of compositional data is not confined to molecules and microbes; it applies just as powerfully to the macroscopic world of animals and plants.

Consider a predator that feeds on three different prey species. Its diet, the proportion of each prey it consumes, is a composition. An ecologist wants to understand its foraging strategy. A classic model of "prey switching" suggests that the predator preferentially targets whichever prey is most common, and that the proportion of prey $i$ in the diet, $p_i$, might be related to its availability in the environment, $N_i$, by a rule like $p_i \propto N_i^m$. The exponent $m$ tells us how strong this switching behavior is.

How do we estimate $m$? If we try to regress the proportions directly, we run into the usual compositional problems. But if we transform both the diet composition and the availability composition into ILR coordinates, the complex non-linear relationship untangles into a beautiful, simple linear one. Taking logs shows why: $p_i \propto N_i^m$ means $\ln p_i = m \ln N_i + \text{const}$, so every log-ratio coordinate of the diet is $m$ times the corresponding coordinate of the availability. The slope of the line relating the transformed diet to the transformed availability is exactly the switching exponent $m$ we were looking for. It's a striking example of how the right transformation can reveal the simple law hiding beneath a complex surface.
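
A simulation makes the recovery of $m$ concrete. The sketch below uses CLR coordinates rather than ILR; because ILR is an isometry of the same log-ratio space, the fitted slope is the same either way (all numbers are invented):

```python
import numpy as np

rng = np.random.default_rng(4)
m_true = 2.0                                     # exponent to recover

# Invented prey availabilities at 40 sites (3 prey types).
N = rng.dirichlet([5, 3, 2], size=40)

# Diet follows p_i proportional to N_i^m, with multiplicative noise.
diet = N ** m_true * rng.lognormal(0.0, 0.05, size=N.shape)
diet /= diet.sum(axis=1, keepdims=True)

def clr(x):
    log_x = np.log(x)
    return log_x - log_x.mean(axis=-1, keepdims=True)

# p_i ~ N_i^m implies clr(diet) = m * clr(N): a line through the origin.
x, y = clr(N).ravel(), clr(diet).ravel()
m_hat = (x @ y) / (x @ x)                        # least-squares slope
print(f"estimated m = {m_hat:.2f}")              # close to 2.0
```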

We can even watch compositions evolve through deep time. Imagine studying the evolution of milk. The relative proportions of different fatty acids in the milk of a mammal species are a key adaptive trait. Let's say we have this compositional data for cows, goats, and sheep, and we know their evolutionary tree. We want to ask: how fast did milk composition evolve along different branches of the tree?

This requires merging two sophisticated fields: compositional data analysis and phylogenetic comparative methods. First, we use an ILR transform to convert the fatty acid proportions for each species into a set of unconstrained coordinates. These coordinates can now be treated like any other evolving trait, such as body size or tooth length. We can then apply classic phylogenetic tools, like Felsenstein's "independent contrasts," to these ILR coordinates to properly account for the shared ancestry among the species and quantify the rate of evolutionary change. This powerful synthesis allows us to study the evolution of complex, multi-part traits in a way that is both statistically and biologically sound.
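
Here is the smallest possible version of that synthesis: three species, a made-up tree with assumed branch lengths, and Felsenstein's contrasts computed on ILR coordinates of invented fatty-acid compositions:

```python
import numpy as np

# Invented milk fatty-acid compositions (3 parts) for three species.
comp = {"cow":   np.array([0.50, 0.30, 0.20]),
        "goat":  np.array([0.45, 0.35, 0.20]),
        "sheep": np.array([0.35, 0.40, 0.25])}

def ilr(x):
    """Two orthonormal ILR coordinates of a 3-part composition."""
    lx = np.log(x)
    return np.array([np.sqrt(1 / 2) * (lx[0] - lx[1]),
                     np.sqrt(2 / 3) * ((lx[0] + lx[1]) / 2 - lx[2])])

z = {sp: ilr(c) for sp, c in comp.items()}

# Assumed tree: ((cow:1.0, goat:1.0):0.5, sheep:2.0).
v_cow, v_goat, v_anc, v_sheep = 1.0, 1.0, 0.5, 2.0

# Felsenstein's independent contrasts, applied to each ILR coordinate.
c1 = (z["cow"] - z["goat"]) / np.sqrt(v_cow + v_goat)

# Precision-weighted ancestral estimate for the (cow, goat) clade, plus
# the extra variance that estimating it adds to the ancestor's branch.
anc = (z["cow"] / v_cow + z["goat"] / v_goat) / (1 / v_cow + 1 / v_goat)
v_extra = v_cow * v_goat / (v_cow + v_goat)
c2 = (anc - z["sheep"]) / np.sqrt(v_anc + v_extra + v_sheep)

# Standardized contrast magnitudes quantify the rate of change.
print(np.round(np.vstack([c1, c2]), 3))
```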

The Pragmatics of Big Science

Finally, a return to the practical realities of research. Modern biology is often "big science," involving many labs, many machines, and many people. A common headache is the "batch effect," where technical variations between labs or experimental runs create systematic differences in the data that can completely obscure the real biological signal.

There are many algorithms to correct for batch effects, but they typically work by adjusting the mean and variance of each feature. If you apply these directly to compositional data like microbiome proportions, you will get nonsense. You might "correct" a proportion to be negative, or make the proportions no longer sum to one. The proper protocol is to first transform the data into a valid Euclidean space using a method like the CLR transformation. In this log-ratio space, you can safely apply your favorite batch correction algorithm. Afterwards, if needed, you can transform the corrected data back to the simplex.
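
The sketch below walks through that protocol with a simulated batch shift: CLR out of the simplex, a stand-in correction (per-batch mean-centering, where a real analysis might use something like ComBat), and a softmax back. All data, and the shift itself, are invented:

```python
import numpy as np

def clr(x):
    log_x = np.log(x)
    return log_x - log_x.mean(axis=-1, keepdims=True)

def clr_inv(z):
    """Softmax: maps CLR coordinates back onto the simplex."""
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(5)
props = rng.dirichlet([6, 3, 1], size=40)          # 40 samples, 3 parts
batch = np.repeat([0, 1], 20)

z = clr(props)
# Simulated batch shift; its entries sum to 0, so it stays in CLR space.
z[batch == 1] += np.array([0.4, -0.2, -0.2])

# Stand-in correction: remove each batch's mean in log-ratio space.
for b in (0, 1):
    z[batch == b] -= z[batch == b].mean(axis=0)

corrected = clr_inv(z)                             # valid compositions again
print(corrected.min(), corrected.sum(axis=1)[:3])  # positive, rows sum to 1
```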

This same "transform first" principle applies to the exciting world of machine learning. If you want to train a model to predict, say, a patient's disease status from their gut microbiome composition, you cannot simply feed the raw proportions into a standard algorithm like a random forest or a support vector machine. Doing so invites the model to learn from the spurious correlations induced by the constant-sum constraint. A robust, predictive, and interpretable model requires a principled pipeline: handle zeros, transform the data using ILR, and then train your classifier in the clean, unconstrained ILR space.

From medicine to ecology, from molecules to mammoths, the analysis of parts of a whole is a universal theme. For a long time, scientists in these disparate fields were unknowingly stumbling into the same statistical traps. The development of a rigorous framework for compositional data has provided a unified solution, a common language, and a powerful set of tools. It is a beautiful testament to how a deep insight into the structure of our data can illuminate the true structure of the world.