Core Genome Multilocus Sequence Typing (cgMLST)

SciencePedia

Key Takeaways

Core genome MLST (cgMLST) converts the DNA sequences of thousands of shared genes into a simple, standardized string of numbers, creating a portable "allelic profile" for each bacterium.
The "allelic distance," or the number of differing alleles between two profiles, provides a robust, quantitative measure of genetic relatedness, enabling precise identification of outbreak clusters.
By using a centralized and version-controlled schema, cgMLST creates a universal language for microbial genomics, allowing different laboratories worldwide to compare their results seamlessly.
cgMLST distinguishes between the stable core genome (used for tracking ancestry) and the flexible accessory genome (which often carries traits like antibiotic resistance), providing clarity in outbreak investigations.

Introduction

In the fight against infectious diseases, the ability to accurately identify and track the spread of bacterial pathogens is paramount. For decades, public health officials relied on methods that could offer only a silhouette of a bacterium's identity, often leading to ambiguity in outbreak investigations. The challenge has been to develop a tool with the precision of whole-genome sequencing but with the simplicity and comparability needed for global surveillance. This gap highlights the need for a standardized, high-resolution method that can definitively link cases and sources in near real-time.

This article explores core genome Multilocus Sequence Typing (cgMLST), a revolutionary approach that bridges this gap. By turning vast genomic data into a simple, universal barcode, cgMLST has transformed microbial epidemiology. The following sections will guide you through this powerful technique. First, "Principles and Mechanisms" will unpack the core concepts, explaining how cgMLST generates a stable allelic profile from a bacterium's core genome and uses it to measure genetic distance. Subsequently, "Applications and Interdisciplinary Connections" will demonstrate how this method is applied in real-world scenarios, from delineating local outbreaks to enabling global surveillance and integrating with fields like statistics and evolutionary biology to create a comprehensive picture of pathogen transmission.

Principles and Mechanisms

The Problem of Identity: Who is Related to Whom?

Imagine you are a public health detective, faced with a sudden surge of food poisoning cases across a city. Your paramount task is to find the source. Are the cases from Clinic A linked to those from Clinic B? Is the chicken from a certain processing plant the culprit? To answer these questions, you need a way to fingerprint the culprit bacterium with exquisite precision. You need to know, without a shadow of a doubt, who is related to whom.

For decades, scientists used methods that were clever but ultimately crude. One popular technique, called pulsed-field gel electrophoresis (PFGE), involved chopping up the bacterium's entire DNA chromosome with special enzymes and looking at the pattern of the resulting fragments. It was like identifying a person by their silhouette. You could easily tell a short person from a tall one, but you couldn't distinguish between identical twins. In the world of bacteria, which can be incredibly similar, this was a problem. It was not uncommon for two entirely separate outbreaks, caused by distinct bacterial lineages, to have the exact same PFGE "silhouette," leading investigators down the wrong path. To truly solve the puzzle, we needed to look deeper, to read the very blueprint of life itself: the DNA sequence.

A Universal Barcode for Bacteria

The revolution came when we gained the ability to rapidly and cheaply read a bacterium's entire genetic code, or genome. But this presented a new challenge. A bacterial genome contains millions of letters of DNA. Comparing these entire, massive texts for every single case is computationally difficult and, as we shall see, surprisingly tricky to do in a way that is consistent from one laboratory to the next. What was needed was a way to distill this vast amount of information into a simple, stable, and universally comparable "barcode."

An early and brilliant attempt was Multilocus Sequence Typing (MLST). The idea was to ignore most of the genome and focus on just a handful of genes—typically seven. These were not just any genes; they were housekeeping genes, responsible for basic, essential cellular functions. Because they are so critical, they are under strong pressure to remain unchanged and evolve very slowly. By sequencing these seven genes, scientists could assign a bacterial strain to a broad family line, or "sequence type." MLST was a phenomenal tool for understanding the global, long-term evolution and spread of bacterial lineages over many years. But for our city-wide outbreak, which unfolds over days or weeks, MLST is often too coarse. It can't tell the difference between siblings or cousins; it can only identify the family name.

Scaling Up: From Seven Genes to the Core Genome

This is where the next leap of intuition occurs. If using seven genes is good, why not use thousands? This is the central idea behind core genome Multilocus Sequence Typing (cgMLST). Instead of a tiny sample, we analyze a huge, standardized portion of the genome. But which portion?

Here lies a beautifully elegant concept: the core genome. Within any given bacterial species, there is a set of genes that are present in nearly every single member. This is the shared, essential genetic toolkit that defines the species—the genes for building cell walls, for metabolizing sugar, for replicating DNA. It is the conserved heart of the organism's identity.

The mechanism of cgMLST is a model of clarity and standardization:

First, we sequence an isolate's whole genome.
Next, our software looks for a predefined list of genes that make up the core genome for that species. This list, or schema, is standardized and can contain thousands of genes—for example, around 3,000 for Salmonella or 2,500 for Shigella.
For each of these core genes (or loci), the software determines its exact DNA sequence. Each unique sequence variant found for a given gene is called an allele.
Here's the clever part. Instead of working with long, cumbersome DNA sequences, we use a shared, public database. This database acts as a dictionary. It says, "For gene glmS, the sequence ATGCGT... shall be known as allele number 1. The sequence ATGCGA... shall be known as allele number 2," and so on.
The final result for our bacterium is no longer a massive DNA file, but a simple, compact string of integers—its allelic profile. For instance, an isolate might be defined by the profile (1, 5, 23, 4, ... , 112).
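
The steps above can be sketched in a few lines of Python. This is a toy illustration, not how any production allele caller works: the six-letter sequences, the second locus name `locusB`, and the `ALLELE_DB` and `call_profile` names are all hypothetical, and real tools must first locate each locus in the assembled genome.

```python
# Hypothetical allele dictionary: locus -> {exact sequence: allele number}.
# Real schemas hold thousands of loci and full-length gene sequences.
ALLELE_DB = {
    "glmS": {"ATGCGT": 1, "ATGCGA": 2},
    "locusB": {"TTGACC": 1, "TTGACA": 5},
}
SCHEMA = ["glmS", "locusB"]  # fixed locus order shared by every laboratory


def call_profile(isolate_seqs, schema=SCHEMA, db=ALLELE_DB):
    """Return the allelic profile as a tuple of integers.

    None marks a locus that is missing from the isolate or whose
    sequence is not yet in the dictionary.
    """
    profile = []
    for locus in schema:
        seq = isolate_seqs.get(locus)
        profile.append(db[locus].get(seq) if seq is not None else None)
    return tuple(profile)


isolate = {"glmS": "ATGCGA", "locusB": "TTGACA"}
print(call_profile(isolate))
```

The fixed locus order in `SCHEMA` is what makes position-by-position comparison of two profiles meaningful.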

The genius of this approach is its portability. Because every laboratory in the world can agree to use the same schema (the same list of core genes) and the same allele dictionary, their results become instantly comparable. A public health lab in Brazil can determine that their isolate has the profile (1, 5, 24, 4, ...) and know that it is different from an isolate in Japan with profile (1, 5, 23, 4, ...) at precisely one core gene, without ever having to exchange raw, complicated sequence data. It creates a universal language for microbial genomics.

Measuring Distance: A Tale of Two Genomes

With this universal barcode in hand, comparing two bacteria becomes astonishingly simple. We just line up their allelic profiles and count the number of positions at which the allele numbers differ. This count is the allelic distance. If Isolate A's profile is (1, 5, 23, 4) and Isolate B's is (1, 5, 24, 4), their allelic distance is 1. This simple count, known in mathematics as the Hamming distance, is a powerful measure of relatedness.
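
As a minimal sketch, the count described above takes one line of Python. The function name `allelic_distance` is an illustrative choice, and the sketch assumes both profiles were called against the same schema and contain no missing loci.

```python
def allelic_distance(profile_a, profile_b):
    """Hamming distance: number of positions where the allele numbers differ."""
    if len(profile_a) != len(profile_b):
        raise ValueError("profiles must cover the same loci in the same order")
    return sum(a != b for a, b in zip(profile_a, profile_b))


isolate_a = (1, 5, 23, 4)
isolate_b = (1, 5, 24, 4)
print(allelic_distance(isolate_a, isolate_b))  # prints 1
```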

This distance is not just an abstract number; it's a proxy for time. As bacteria replicate, their DNA polymerase occasionally makes mistakes, introducing mutations. Many of these mutations accumulate at a roughly steady rate, like the ticking of a molecular clock. A small allelic distance implies that two isolates shared a common ancestor very recently—so recently that there hasn't been much time for mutations to accumulate. This is strong evidence of a direct or near-direct transmission link. A large distance, conversely, means the isolates' lineages have been diverging for a long time.

Consider a real-world scenario from an investigation of Shigella dysentery. Isolates collected from patients at Clinic A were all within 0 to 3 allele differences of each other. Those from Clinic B were within 1 to 4 differences. This tells us we have two tight, internally consistent clusters. However, the distance between any isolate from Clinic A and any from Clinic B was a much larger 22 to 31 alleles. And the distance from any of these outbreak isolates to a background case from a traveler was over 180 alleles. The numbers paint a clear picture: we have two distinct outbreaks (Clusters A and B) caused by related but separate lineages, and the traveler's infection lies far outside both.

Drawing the Line: What is a Cluster?

This naturally leads to a critical, practical question: How small must a distance be for us to declare two cases part of the same transmission cluster? We need a distance threshold. For a Shigella scheme with 2,500 loci, a public health lab might set a threshold of, say, 5 or 10 alleles. Any pair of isolates with a distance at or below this threshold is considered part of a putative cluster.

This threshold isn't pulled from a hat. It's a carefully considered balance. If the threshold is too low, we might fail to link cases that are truly related but have accumulated a few mutations (a loss of sensitivity). If it's too high, we risk lumping unrelated, sporadic cases into our outbreak (a loss of specificity).

The choice of threshold can even be derived from first principles. We can model the accumulation of mutations as a statistical process (a Poisson process). Knowing the typical mutation rate ($\mu$) for a bacterium and the size of its core genome ($L$), we can calculate the expected number of new mutations that would arise between two diverging lineages over a given time interval, $\Delta t$. For example, for a bacterium like Acinetobacter baumannii, this might be about 6 new SNP differences per year. A threshold can then be set to capture the plausible range of diversity expected in a recent outbreak, while remaining far below the diversity seen between random background isolates that have been diverging for years.
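
One way to write this reasoning down, as a sketch rather than a calibrated model: assume each of the two lineages accumulates mutations independently at rate $\mu$ per site per year across $L$ sites, so that the expected number of differences between them after a time $\Delta t$ carries a factor of two. The factor of two and the Poisson form are modelling assumptions, not properties of any particular scheme.

$$\lambda = 2\,\mu\,L\,\Delta t, \qquad P(D = k) = \frac{\lambda^{k} e^{-\lambda}}{k!}$$

Here $D$ is the number of SNP differences between the two isolates. Under this model, several SNPs falling in one locus collapse into a single allele change, so the allelic distance is at most $D$, and a threshold in alleles might be chosen near an upper quantile of this distribution.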

We must also contend with the "fog" of technical error. No measurement is perfect. The processes of sequencing DNA and calling alleles can introduce a small number of spurious differences. Two technical replicates of the very same DNA might appear to differ by a handful of alleles due to this random noise. This is another reason we don't demand a distance of zero; a small distance is consistent with both recent transmission and the unavoidable noise floor of the technology.

The Finer Points and the Bigger Picture

The power of cgMLST becomes even clearer when we place it in context.

cgMLST vs. SNP Typing: An alternative high-resolution method is single nucleotide polymorphism (SNP) typing, which counts every single DNA letter difference across the core genome. SNP typing offers the ultimate resolution—if a single gene contains five mutations, the SNP distance is 5, whereas the cgMLST distance is just 1 (one allele changed to another). However, SNP typing suffers from a portability problem. The exact SNP distance calculated between two isolates can change depending on the choice of reference genome and the specific bioinformatics software used. cgMLST, by abstracting sequence differences into stable, centrally defined allele numbers, elegantly sidesteps this issue, prioritizing portability and reproducibility over a small gain in resolution.
Core vs. Whole Genome: Some schemes extend cgMLST to whole genome MLST (wgMLST) by including genes from the accessory genome—the flexible set of genes that are not present in all strains. These often include genes for antibiotic resistance or virulence, which are of great interest. While this provides more information, it introduces the complication of missing data (how do you compare a gene that one isolate has but the other lacks?) and can, perhaps counter-intuitively, make cluster definitions less stable over time. The beauty of the core genome is its stability and universality as a phylogenetic backbone.
The Chromosome vs. The Hitchhiker: A crucial distinction must be made. cgMLST tracks the evolution of the bacterium's chromosome, its stable inheritance. Many antibiotic resistance genes, however, live on small, mobile pieces of DNA called plasmids, which can be transferred between completely unrelated bacteria in a process called horizontal gene transfer. Therefore, finding two bacteria with the same resistance gene does not prove they are clonally related any more than finding two strangers with the same brand of phone proves they are family. To establish a transmission link, one must type the bacterium itself using a method like cgMLST.
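
The resolution trade-off between SNP typing and cgMLST described above can be made concrete with a toy comparison. The two-gene "genomes" below are invented for illustration, and the sketch assumes the gene sequences are already aligned and of equal length.

```python
def snp_distance(genes_a, genes_b):
    """Count every differing site across aligned, equal-length genes."""
    return sum(
        sum(x != y for x, y in zip(genes_a[locus], genes_b[locus]))
        for locus in genes_a
    )


def allele_distance(genes_a, genes_b):
    """Count loci whose sequences differ at all (one allele change each)."""
    return sum(genes_a[locus] != genes_b[locus] for locus in genes_a)


# Hypothetical isolates: gene1 carries five mutations, gene2 is identical.
iso_a = {"gene1": "AAAAAAAAAA", "gene2": "CCCCCCCC"}
iso_b = {"gene1": "TATATATATA", "gene2": "CCCCCCCC"}

print(snp_distance(iso_a, iso_b))     # prints 5
print(allele_distance(iso_a, iso_b))  # prints 1
```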

The Unseen Hand: A Stable Universe of Rules

Finally, we arrive at a point that is less about biology and more about the sociology of science, but is no less fundamental. This entire elegant system of portable, comparable barcodes depends on one thing: a stable and shared set of rules. The schema—the list of core genes—and the allele dictionary are the bedrock of cgMLST. If a central curator were to change these rules without warning—for example, by splitting a locus they discovered was problematic, or re-numbering alleles to be more compact—the entire system would fall into chaos. A laboratory using the new rules would get different results from the same data as a laboratory using the old rules, destroying the very portability the system was designed to create.

The solution is a robust system of governance and version control. Each version of a schema and its associated allele database must be given a unique, immutable identifier. Any analysis must report the precise version it used. Changes that break backward compatibility must be clearly flagged. This may seem like bureaucratic book-keeping, but it is the invisible infrastructure that allows a global network of scientists to speak the same language, to trust each other's data, and to work together to track and control infectious diseases. It is the framework that turns a beautiful idea into a powerful, world-changing tool.
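
A minimal sketch of what "report the precise version" can mean in software: each profile carries its schema identifier and version, and comparison is refused when they do not match. The class name `TypedProfile`, the field names, and the identifier strings are hypothetical and do not correspond to any existing database's format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)  # frozen: a stored profile cannot be altered in place
class TypedProfile:
    schema_id: str       # hypothetical identifier, e.g. "toy_cgmlst"
    schema_version: str  # hypothetical version label, e.g. "1.0"
    alleles: tuple


def compare(p, q):
    """Allelic distance, refusing profiles called under different rules."""
    if (p.schema_id, p.schema_version) != (q.schema_id, q.schema_version):
        raise ValueError("profiles were called against different schema versions")
    return sum(a != b for a, b in zip(p.alleles, q.alleles))


brazil = TypedProfile("toy_cgmlst", "1.0", (1, 5, 24, 4))
japan = TypedProfile("toy_cgmlst", "1.0", (1, 5, 23, 4))
print(compare(brazil, japan))  # prints 1
```

Failing loudly on a version mismatch is a design choice: a silent comparison across renumbered alleles would produce a distance that looks valid but means nothing.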

Applications and Interdisciplinary Connections

Having understood the principles that allow us to turn raw DNA sequences into a standardized, comparable language of alleles, we can now embark on a journey to see how this remarkable tool, core genome MLST (cgMLST), is put to work. It is in its application that the true beauty and power of the concept are revealed. cgMLST is not merely a method for cataloging bacteria; it is a lens through which we can witness the invisible pathways of infection, a storyteller that recounts the secret journeys of pathogens through our world. It transforms public health from a reactive discipline into a predictive, quantitative science, weaving together threads from microbiology, epidemiology, evolutionary biology, and statistics into a single, coherent tapestry.

The Detective's Magnifying Glass: Delineating Outbreaks

At its most fundamental level, cgMLST serves as a digital magnifying glass for the public health detective. Imagine an outbreak of foodborne illness. People are getting sick, and we suspect a common source. But which one? Is it the chicken, the salad, or something else entirely? By sequencing the genome of the bacterium from patients, from suspected food items, and from the environment of a processing plant, we can use cgMLST to compare their genetic fingerprints.

The logic is beautifully simple. If the isolates all originated from the same recent source, they are essentially members of the same clone. There hasn't been enough time for many mutations to accumulate in their core genomes. Consequently, they will have identical or nearly identical cgMLST profiles. Investigators establish a threshold—often a handful of allelic differences, say 7 or fewer—based on the known mutation rate of the organism. Any isolates falling within this threshold are considered part of the same genetic cluster. When we find that the isolates from two patients, a suspected food product, and an environmental swab from the factory all fall within this tight cluster, while an unrelated reference strain is dozens or hundreds of alleles different, we have found our culprit. The genetic evidence provides a powerful, objective link, confirming the epidemiological suspicion.

This process is not mere guesswork. It can be formalized into a precise algorithm. For a large number of isolates, we can compute a pairwise distance matrix, where each entry is the number of allelic differences between two isolates. The task of finding outbreak clusters then becomes equivalent to a classic problem in graph theory: finding the connected components of a network. Each isolate is a node, and an edge is drawn between any two nodes if their distance is at or below the threshold. The resulting connected components are the putative outbreak clusters. Because membership requires only one close neighbor, two isolates in the same cluster can differ from each other by more than the threshold when intermediate isolates link them. This brings a mathematical rigor to outbreak definition, allowing computers to sift through data from hundreds of isolates to reveal hidden transmission networks that would be impossible to discern otherwise.
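
The graph formulation above can be sketched with a pairwise comparison followed by a depth-first search for connected components. The isolate names and six-locus profiles are invented, and the sketch joins isolates whose distance is at or below the threshold, matching the convention used earlier for putative clusters.

```python
from itertools import combinations


def hamming(p, q):
    """Number of loci at which two allelic profiles differ."""
    return sum(a != b for a, b in zip(p, q))


def cluster(profiles, threshold):
    """Return connected components of the 'distance <= threshold' graph."""
    names = list(profiles)
    neighbours = {n: set() for n in names}
    for a, b in combinations(names, 2):
        if hamming(profiles[a], profiles[b]) <= threshold:
            neighbours[a].add(b)
            neighbours[b].add(a)

    seen, clusters = set(), []
    for start in names:
        if start in seen:
            continue
        seen.add(start)
        stack, component = [start], []
        while stack:  # depth-first walk over one component
            node = stack.pop()
            component.append(node)
            for nxt in neighbours[node] - seen:
                seen.add(nxt)
                stack.append(nxt)
        clusters.append(sorted(component))
    return clusters


# Hypothetical isolates typed at six loci.
toy = {
    "A1": (1, 1, 1, 1, 1, 1),
    "A2": (1, 1, 1, 1, 1, 2),
    "B1": (9, 9, 9, 9, 1, 1),
    "B2": (9, 9, 9, 8, 1, 1),
    "T": (7, 7, 7, 7, 7, 7),
}
print(cluster(toy, threshold=1))
```

The all-pairs loop is quadratic in the number of isolates, which is usually acceptable for hundreds of isolates but would need a smarter index for very large collections.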

A Tale of Two Genomes: The Core and the Accessory

A common puzzle in outbreak investigation arises when two isolates seem to be part of the same outbreak based on their cgMLST profiles, yet they behave differently. For instance, one might be susceptible to a common antibiotic, while the other is resistant. Does this mean they are not the same strain?

Here, cgMLST reveals a deep principle of bacterial genetics: the distinction between the core genome and the accessory genome. The core genome is the stable "chassis" of the bacterium, the set of genes shared by nearly every member of the species. cgMLST focuses exclusively on this chassis. The accessory genome, by contrast, is a collection of optional parts, many of them carried on mobile elements like plasmids, that can be gained, lost, or swapped with incredible speed. These elements often carry genes for traits like antibiotic resistance.

Consider a real-world scenario where two Salmonella isolates from patients in the same city are found to differ by only 4 alleles out of nearly 3,000 core genes—a distance entirely consistent with recent, clonal transmission. However, one is susceptible to ampicillin, while the other is resistant due to a beta-lactamase gene found on a plasmid. cgMLST tells us these are, for epidemiological purposes, the same strain. The resistant isolate is simply a member of the outbreak clone that happened to acquire a resistance plasmid. This distinction is critical. The core genome tells us about ancestry and transmission pathways, while the accessory genome tells us about the bacterium's recent adaptations. Confusing the two would be like concluding that two cars are from different manufacturers just because one has a new set of tires.

A Universal Language for a Global Village

Perhaps the most profound impact of cgMLST is its role in standardization. Before whole-genome sequencing, different laboratories used a patchwork of typing methods, making it difficult to compare results across regions or countries. cgMLST provides a stable, universal nomenclature. An allele is assigned a specific number in a curated database. This means an isolate's cgMLST profile—a simple string of numbers—is a portable and unambiguous identifier that can be shared and compared globally.

This feature leads to a strategic division of labor in genomic surveillance. For routine, multi-hospital or international surveillance, cgMLST is the ideal tool. Its standardized, portable nature ensures that a cluster identified in Paris can be immediately and reliably compared to a new case in Tokyo.

This is in contrast to another powerful technique, SNP-based analysis, which counts every single nucleotide difference across the core genome. SNP analysis offers higher resolution, which is perfect for a deep-dive investigation into a single, localized outbreak to reconstruct a precise person-to-person transmission chain. However, SNP counts are highly sensitive to methodological choices, like the reference genome used for comparison and the algorithms used to mask recombination events. Two labs analyzing the same data with different SNP pipelines can get different results. cgMLST, by collapsing the information within each gene into a single allele number, sacrifices some resolution but gains immense robustness and reproducibility—making it the lingua franca of global pathogen surveillance.

The Pathogen's Clock: From Distance to Time

The number of allelic differences is more than just a measure of similarity; it is also a rough measure of time. Because mutations accumulate in the core genome at a roughly constant rate (a phenomenon known as the molecular clock), the expected genetic distance between two isolates is approximately proportional to the time that has passed since they shared a common ancestor.

This allows us to move from simple clustering to quantitative hypothesis testing. Imagine we are investigating a Salmonella outbreak and have a clinical isolate from a patient and a matching isolate from a batch of food. The food was produced two months before the patient fell ill. Knowing the approximate mutation rate for Salmonella (e.g., $1.0 \times 10^{-6}$ substitutions per site per year) and the size of its core genome (e.g., $3 \times 10^6$ base pairs), we can calculate the expected number of SNP differences that should accumulate between the two isolates over that two-month period. The calculation might predict an average of just one SNP.
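
The arithmetic behind that estimate can be written out as follows. This is a sketch that assumes both lineages accumulate substitutions independently over the two-month interval, hence the factor of two; that assumption is one way to reproduce the "one SNP" figure from the illustrative rate and genome size quoted above.

$$\lambda = 2\,\mu\,L\,\Delta t = 2 \times (1.0 \times 10^{-6}) \times (3 \times 10^{6}) \times \tfrac{2}{12} = 1.0$$

Each lineage on its own contributes an expected $0.5$ substitutions over the two months.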

The observed distance, say 4 SNPs or 3 cgMLST allele differences, is a random variable drawn from a distribution (like a Poisson distribution) centered on this expectation. A value of 4 sits in the upper tail of that distribution but remains plausible once sequencing noise and uncertainty in the mutation rate are taken into account. However, if we compared the patient's isolate to another isolate and found a distance of 15 SNPs or 12 allele differences, this would be astronomically unlikely to occur in just two months. It is far more likely that this second isolate belongs to a different, more distantly related lineage. This turns cgMLST from a simple pattern-matching tool into a veritable stopwatch for tracking evolution on epidemiological timescales.
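
To put rough numbers on "plausible" and "astronomically unlikely", the upper-tail probability of a Poisson distribution with mean 1 can be computed directly. This sketch uses only the standard library; the function name `poisson_tail` and the truncation at 100 terms are illustrative choices that are adequate for small means, and the pure Poisson model ignores sequencing error.

```python
import math


def poisson_tail(k, lam, terms=100):
    """Approximate P(X >= k) for X ~ Poisson(lam) by summing `terms` terms."""
    term = math.exp(-lam)        # P(X = 0)
    for i in range(1, k + 1):
        term *= lam / i          # after the loop: P(X = k)
    total = 0.0
    for i in range(k, k + terms):
        total += term
        term *= lam / (i + 1)    # step from P(X = i) to P(X = i + 1)
    return total


print(poisson_tail(4, 1.0))   # roughly 0.019
print(poisson_tail(15, 1.0))  # on the order of 1e-13
```

Under these assumptions a distance of 4 would occur by chance in about 2% of truly linked pairs, whereas a distance of 15 is effectively excluded.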

Weaving a Web of Evidence: The Great Synthesis

In the most advanced applications, cgMLST is not a standalone solution but one powerful thread in a much larger web of evidence. Modern infectious disease epidemiology is a science of synthesis, integrating data from genomics, statistics, and field investigation into a single, powerful inferential framework.

One such synthesis connects a pathogen's core genome to its other characteristics. We can ask, "If I know two isolates are in the same cgMLST cluster, what is the probability they also share the same antimicrobial resistance (AMR) profile?" And we can ask the reverse: "If two isolates share an AMR profile, what is the probability they are in the same cgMLST cluster?" Using a statistical framework based on conditional probability, we can calculate these two values, known as directional concordance coefficients. Often, these probabilities are not symmetric. The fact that $P(\text{same AMR} \mid \text{same cgMLST})$ is often higher than $P(\text{same cgMLST} \mid \text{same AMR})$ is a beautiful, quantitative confirmation of the "core vs. accessory" genome model: clonal relatedness is a good predictor of shared traits, but shared traits (which can be due to horizontal gene transfer) are a weaker predictor of clonal relatedness.
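
One simple way to estimate these two conditional probabilities is to count over all pairs of isolates. The sketch below does exactly that on invented data; the function name, isolate labels, and the "R"/"S" AMR labels are hypothetical, and the code assumes at least one pair shares a cluster and at least one pair shares an AMR profile.

```python
from itertools import combinations


def directional_concordance(cluster_of, amr_of):
    """Return (P(same AMR | same cluster), P(same cluster | same AMR))."""
    same_cluster = same_amr = both = 0
    for a, b in combinations(sorted(cluster_of), 2):
        c = cluster_of[a] == cluster_of[b]
        r = amr_of[a] == amr_of[b]
        same_cluster += c
        same_amr += r
        both += c and r
    return both / same_cluster, both / same_amr


# Hypothetical data: cluster A is uniformly resistant, and one member
# of cluster B carries the same resistance profile.
cluster_of = {"i1": "A", "i2": "A", "i3": "A", "i4": "B", "i5": "B"}
amr_of = {"i1": "R", "i2": "R", "i3": "R", "i4": "R", "i5": "S"}

print(directional_concordance(cluster_of, amr_of))  # prints (0.75, 0.5)
```

In this toy data set, sharing a cluster predicts a shared AMR profile better than the reverse, which is the asymmetry described above.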

The grandest synthesis occurs in source attribution, where we aim to pinpoint the origin of an infection. Here, we can build a Bayesian framework that formally weighs all the evidence. Information about the prevalence of a pathogen in different environmental sources (like poultry, water, and produce) and the patient's exposure history (did they eat undercooked chicken?) allows us to form a prior belief about the likely source. The cgMLST distance between the patient's isolate and isolates from each potential source then serves as powerful new evidence. A small distance dramatically increases our belief in a source, while a large distance diminishes it. Using Bayes' theorem, we combine our prior belief with this new evidence to calculate a posterior probability for each source. This is the pinnacle of data integration, turning a complex puzzle with multiple, uncertain pieces of information into a final, quantified statement of probability.
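
A toy version of this Bayesian update is sketched below. Everything numerical here is an assumption made for illustration: the prior probabilities, the observed distances, and especially the likelihood model, which treats the distance to the true source as Poisson with a small mean (`lam_linked`) and the distance to every other source as Poisson with a large mean (`lam_unlinked`). Real attribution models are considerably richer.

```python
import math


def poisson_pmf(k, lam):
    """P(X = k) for X ~ Poisson(lam)."""
    return math.exp(-lam) * lam**k / math.factorial(k)


def source_posterior(priors, distances, lam_linked=1.0, lam_unlinked=30.0):
    """Posterior probability that each candidate is the true source."""
    unnormalised = {}
    for candidate, prior in priors.items():
        likelihood = 1.0
        for source, d in distances.items():
            # Under "candidate is the source", only its distance is small.
            lam = lam_linked if source == candidate else lam_unlinked
            likelihood *= poisson_pmf(d, lam)
        unnormalised[candidate] = prior * likelihood
    total = sum(unnormalised.values())
    return {s: v / total for s, v in unnormalised.items()}


# Hypothetical priors (exposure history) and observed allelic distances.
priors = {"poultry": 0.5, "water": 0.3, "produce": 0.2}
distances = {"poultry": 25, "water": 2, "produce": 40}

posterior = source_posterior(priors, distances)
print(max(posterior, key=posterior.get))  # prints water
```

Although poultry starts with the highest prior, the small distance to the water isolate dominates the update, which is the behaviour the paragraph above describes.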

This integrative power finds its ultimate expression in the "One Health" framework, which recognizes that human, animal, and environmental health are inextricably linked. Imagine tracking an antibiotic-resistant E. coli strain. With cgMLST and plasmid typing, we can demonstrate that an isolate from a hospital patient is clonally related (e.g., 7 allele differences) to an isolate found ten days earlier in a cattle farm's manure lagoon 8 kilometers away. We might also see that both share the same resistance plasmid. At the same time, an isolate from a nearby river might be genetically distant and carry a different plasmid. By combining genomic data with spatio-temporal information, we can build a compelling case for a transmission pathway from the farm to the patient, highlighting the agricultural reservoir of a clinical problem. cgMLST allows us to see these vast, interconnected networks of microbial life, revealing the hidden unity of the ecosystem we all share.
