CIGAR string

Key Takeaways
  • The CIGAR string uses a simple set of operators (M, I, D) to precisely describe how a sequencing read aligns against a reference genome, noting matches, insertions, and deletions.
  • Specialized operators like N (skipped region) and S (soft clip) are crucial for identifying complex biological phenomena such as RNA splicing and large-scale structural variants.
  • CIGAR strings are foundational to a wide range of bioinformatics applications, including calling genetic variants (SNPs, indels), analyzing gene expression, and detecting genomic rearrangements.

Introduction

In the era of high-throughput sequencing, genomics research relies on comparing billions of short DNA fragments, or reads, to a reference genome. This massive comparative task presents a fundamental challenge: how can we efficiently and accurately report the precise relationship between each read and the reference, accounting for everything from single-letter mutations to complex structural rearrangements? The answer lies in a simple yet powerful notation known as the CIGAR string, a universal language for describing sequence alignments. This article demystifies the CIGAR string, providing a guide to its structure and use. The first chapter, ​​Principles and Mechanisms​​, will dissect the grammar of the CIGAR string, explaining its core operators and how they capture events like insertions, deletions, and RNA splicing. Building on this foundation, the second chapter, ​​Applications and Interdisciplinary Connections​​, will explore how bioinformaticians use this language to uncover the biological stories hidden in sequencing data, from identifying disease-causing variants to mapping the architecture of entire genomes.

Principles and Mechanisms

Imagine you’re a cartographer tasked with documenting a new road. You have the official government map—the ​​reference genome​​—and a GPS log from a car that just drove the route—the ​​sequencing read​​. The problem is, the driver didn’t follow the map perfectly. They took a small, unmapped scenic detour, found a bridge on the map that was washed out and had to be skipped, and their GPS log even includes their trip pulling out of their driveway, which isn't on the main map at all. How do you concisely describe the car's actual journey relative to the official map?

This is precisely the challenge of sequence alignment, and the beautifully simple solution is a language called the CIGAR string. CIGAR, which stands for Compact Idiosyncratic Gapped Alignment Report, is a set of instructions, a "turn-by-turn" navigation log, that tells us exactly how the sequence of a read matches, mismatches, or deviates from the reference genome. It's the Rosetta Stone that translates a raw string of genetic letters into a story of biological events.

The Basic Alphabet of Alignment

At its heart, the CIGAR language is built on a simple duality. For every step in the alignment, we must ask two questions: did this step "use up" a base from our read? And did it "use up" a base from our reference map? The answers to these questions define the core operators. Let's look at the three most common ones.

  • M - Match or Mismatch: This is the workhorse of CIGAR. It means "walk one step forward, comparing a base from the read to a base on the reference." This alignment could be a perfect match, or it could be a mismatch: a single-letter difference that may reflect a true variant such as a Single Nucleotide Polymorphism (SNP) or simply a sequencing error. The M operator doesn't care; it simply states that the read and the reference are aligned at this position. A CIGAR string of 100M tells us that 100 consecutive bases of the read have been aligned against 100 consecutive bases of the reference. Even if one of those positions is a mismatch, the alignment itself is considered a continuous block, and the CIGAR string remains 100M. This operation consumes both the read and the reference.

  • ​​I - Insertion​​: This is the scenic detour. An insertion means the read contains extra bases that do not appear in the reference genome. Imagine a CIGAR string like 15M5I10M. This translates to: "Align 15 bases, then encounter 5 bases in the read that have no counterpart on the reference, then continue aligning for another 10 bases." The I operation consumes bases from the read, but not from the reference.

  • ​​D - Deletion​​: This is the washed-out bridge. A deletion means the reference genome has bases that are missing in the read. The CIGAR string 10M5D15M means: "Align 10 bases, then skip over 5 bases that exist on the reference map but weren't part of the journey, then resume aligning for another 15 bases." The D operation consumes bases from the reference, but not from the read.

With just these three letters, we can begin to quantify the "shape" of an alignment. By summing the lengths of all operations that consume the read (M and I, among the operators introduced so far), we can calculate the length of the sequencing read itself. For a CIGAR of 15M5I10M5D15M, the read length is $15\,(\text{M}) + 5\,(\text{I}) + 10\,(\text{M}) + 0\,(\text{D}) + 15\,(\text{M}) = 45$ bases. Conversely, by summing operations that consume the reference (M and D), we can find how much of the reference genome this alignment spans. For the same CIGAR, the reference span is $15\,(\text{M}) + 0\,(\text{I}) + 10\,(\text{M}) + 5\,(\text{D}) + 15\,(\text{M}) = 45$ bases. The two totals agree here only because the 5-base insertion and the 5-base deletion happen to cancel. In general the read length and the reference span are not equal: 15M5I10M describes a 30-base read covering 25 reference bases. This simple arithmetic reveals the presence of these small insertions and deletions, collectively known as indels.
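The arithmetic above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names are made up for this example, and the operator sets follow the usual SAM convention that M, I, S, = and X consume the read while M, D, N, = and X consume the reference.

```python
import re

# Which operators advance along the read and along the reference
# (SAM convention; H and P consume neither).
CONSUMES_READ = set("MIS=X")
CONSUMES_REF = set("MDN=X")

def parse_cigar(cigar):
    """Split a CIGAR string such as '15M5I10M' into (length, operator) pairs."""
    return [(int(n), op) for n, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar)]

def read_length(cigar):
    """Number of read bases described by the CIGAR."""
    return sum(n for n, op in parse_cigar(cigar) if op in CONSUMES_READ)

def reference_span(cigar):
    """Number of reference bases covered by the alignment."""
    return sum(n for n, op in parse_cigar(cigar) if op in CONSUMES_REF)

print(read_length("15M5I10M5D15M"), reference_span("15M5I10M5D15M"))  # 45 45
print(read_length("15M5I10M"), reference_span("15M5I10M"))            # 30 25
```

Comparing the two totals is a quick way to spot a net insertion or deletion, although equal totals do not rule out indels that cancel each other.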

Reconstructing the Journey

Now that we have the basic alphabet, let's read a more complex story. Suppose a bioinformatician hands you an alignment with the CIGAR string 12M1I6M2D5M. What does this actually look like?

We can reconstruct the alignment step-by-step:

  1. 12M: We align the first 12 bases of the read to the first 12 bases of the reference.
  2. 1I: We see the read has one extra base. We place this base in the read's sequence and leave a gap in the reference sequence.
  3. 6M: We continue aligning the next 6 bases of the read and reference.
  4. 2D: Now, we find the reference has two bases that are missing from the read. We place these two bases in the reference sequence and leave a two-base gap in the read's alignment.
  5. 5M: Finally, we align the last 5 bases of both.

Visually, it looks something like this:

Reference: AAAAAAAAAAAA-GGGGGGTTCCCCC
Read:      AAAAAAAAAAAACGGGGGG--CCCCC

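This reconstruction makes the abstract code tangible. We can now clearly see the single-base insertion and the two-base deletion. The CIGAR string doesn't just tell us that the read aligns; it tells us how it aligns, complete with all its imperfections and deviations. On its own, though, it does not give the full "edit distance" (the number of substitutions, insertions, and deletions needed to transform the reference segment into the read segment). The inserted and deleted bases can be counted directly from the I and D operations, but substitutions are hidden inside M unless the aligner writes the more specific = (match) and X (mismatch) operators or reports the total in the NM tag.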
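The step-by-step reconstruction can be expressed as a short Python sketch. It handles only the M, I and D operators, and it assumes the reference string passed in starts exactly at the alignment's first position; the function name and the example sequences are illustrative.

```python
import re

def gapped_alignment(cigar, read, ref):
    """Return (ref_row, read_row) with '-' marking gaps. Handles M, I, D only."""
    ref_row, read_row = [], []
    i = j = 0  # i indexes the reference, j indexes the read
    for n, op in re.findall(r"(\d+)([MID])", cigar):
        n = int(n)
        if op == "M":    # aligned column: take bases from both sequences
            ref_row.append(ref[i:i + n])
            read_row.append(read[j:j + n])
            i += n
            j += n
        elif op == "I":  # extra read bases: gap in the reference row
            ref_row.append("-" * n)
            read_row.append(read[j:j + n])
            j += n
        else:            # "D": reference bases missing from the read
            ref_row.append(ref[i:i + n])
            read_row.append("-" * n)
            i += n
    return "".join(ref_row), "".join(read_row)

ref = "A" * 12 + "G" * 6 + "TT" + "C" * 5   # 25 reference bases
read = "A" * 12 + "C" + "G" * 6 + "C" * 5   # 24 read bases
ref_row, read_row = gapped_alignment("12M1I6M2D5M", read, ref)
print("Reference:", ref_row)
print("Read:     ", read_row)
```

Both rows come out the same length (26 columns here), which is a handy sanity check that the CIGAR, the read, and the reference segment are consistent with one another.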

A Window into Life's Central Dogma: Splicing

The CIGAR language is powerful enough to describe more than just small mutations; it can reveal fundamental biological processes. Imagine you are sequencing messenger RNA (mRNA)—the "working copies" of genes—and aligning them back to the DNA genome. In organisms like humans, genes in the DNA are often fragmented into pieces called ​​exons​​, separated by non-coding regions called ​​introns​​. When a gene is expressed, the cell transcribes the whole thing into a pre-mRNA molecule and then "splices" out the introns, stitching the exons together to make the final, mature mRNA.

What happens when a sequencing read comes from one of these spliced mRNA molecules? Suppose a 100-base read aligns perfectly and contiguously to the genome. Its CIGAR would simply be 100M. But what if another read from the same gene spans an exon-exon junction? The first part of the read will align to the end of one exon, and the second part will align to the beginning of the next exon. On the DNA reference map, these two exons are separated by an intron, perhaps 50 bases long.

The aligner needs a way to say, "I aligned 25 bases, then leaped over a 50-base chasm on the reference map, and then landed and aligned another 25 bases." For this, we have the N operator. The resulting CIGAR string would be 25M50N25M. The N operation consumes the reference (it skips 50 bases) but consumes nothing from the read. Seeing a profusion of N operations in your data is a tell-tale sign that you are looking at the beautiful, dynamic process of gene splicing. The CIGAR string becomes a direct observation of the central dogma of biology in action.
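As a rough sketch of how a program might turn N operations into intron coordinates, the Python below walks the CIGAR while tracking the reference position. The function name is hypothetical, and the start coordinate is assumed to be 0-based (SAM text files store a 1-based position, so a real tool would subtract one first).

```python
import re

def splice_junctions(pos, cigar):
    """Return the skipped (N) regions as 0-based, half-open reference intervals.

    pos: 0-based leftmost reference coordinate of the alignment (an assumption
    of this sketch).
    """
    junctions = []
    ref_pos = pos
    for n, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar):
        n = int(n)
        if op == "N":
            junctions.append((ref_pos, ref_pos + n))
        if op in "MDN=X":  # only these operators advance along the reference
            ref_pos += n
    return junctions

print(splice_junctions(1000, "25M50N25M"))  # [(1025, 1075)]
```

Collecting these intervals across all reads in a sample and counting how often each one recurs gives a simple table of candidate exon-exon junctions.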

Handling the Messy Edges: Clipping and Structural Variants

So far, our journey has been mostly on the map. But what about the parts that aren't on the map at all? This happens in two common scenarios. First, a technical artifact: sometimes the sequencing machine reads past the actual DNA fragment and into the ​​adapter sequences​​ used in the lab, which aren't part of the organism's genome. Second, a dramatic biological event: the read might span a ​​structural variant (SV)​​, like a translocation where a huge chunk of chromosome 1 has been broken off and attached to chromosome 8.

A read spanning such a breakpoint might have its first 75 bases perfectly matching chromosome 1, while its last 75 bases match chromosome 8. When you try to align this read to its "home" on chromosome 1, the first 75 bases will align beautifully, but the last 75 will look like complete gibberish.

An aligner's solution is clipping. It identifies the part of the read that does align and "clips off" the part that doesn't. There are two ways to do this. Hard clipping (H) drops the clipped bases from the sequence stored in that alignment record, so they cannot be recovered from the record itself. Soft clipping (S) is more useful here: it keeps the unaligned bases in the data record but marks them as not being part of the alignment.

Why is this so brilliant? Because the clipped sequence is not garbage; it is a clue! For the read with an adapter, a CIGAR like 120M30S tells us that the last 30 bases didn't align. We can then collect all these soft-clipped sequences and see if they match known adapter sequences, giving us a powerful quality control metric.

Even more exciting is the case of the structural variant. The read spanning the translocation will get a primary alignment on chromosome 1 with a CIGAR of 75M75S. A smart SV detection program can then ask a simple question: "Where in the genome do all these 75-base soft-clipped ends belong?" When it finds that they all align perfectly to a spot on chromosome 8, it has found the smoking gun for a massive genomic rearrangement. By not throwing away the "off-map" part of the journey, soft clipping allows us to discover these large-scale changes that can be drivers of cancer and other diseases.
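Pulling the soft-clipped bases out of a record is straightforward once the CIGAR is parsed. The following is a minimal sketch with an invented function name; it assumes `seq` is the read sequence exactly as stored in the record, which includes soft-clipped bases but not hard-clipped ones.

```python
import re

def soft_clips(cigar, seq):
    """Return (left_clip, right_clip) sequences; empty strings if not clipped."""
    ops = [(int(n), op) for n, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar)]
    # Hard clips may sit outside the soft clips; their bases are absent from seq.
    core = [(n, op) for n, op in ops if op != "H"]
    left = seq[:core[0][0]] if core and core[0][1] == "S" else ""
    right = seq[len(seq) - core[-1][0]:] if len(core) > 1 and core[-1][1] == "S" else ""
    return left, right

# Toy read: 75 bases that align, followed by 75 bases that do not.
seq = "A" * 75 + "C" * 75
left, right = soft_clips("75M75S", seq)
print(len(left), len(right))  # 0 75
```

The clipped strings can then be compared against known adapter sequences or realigned elsewhere in the genome, as the text describes.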

From a single-letter change to the grand architecture of our chromosomes, the humble CIGAR string provides a single, unified language. It is a testament to the elegance that can be found in computational biology—a simple grammar that, when read correctly, tells the intricate and ever-evolving story of the genome.

Applications and Interdisciplinary Connections

Having mastered the grammar of the CIGAR string—its operators, its rules, its structure—we can now begin to appreciate the poetry it writes. This concise notation is far more than a dry, technical detail in a data file. It is a stenographer's shorthand, meticulously capturing the story of how a fragment of DNA or RNA, plucked from the intricate machinery of a cell, measures up against our reference map of the genome. Sometimes the story is one of perfect correspondence; more often, it is a tale of subtle and profound differences. It is in these differences that the most exciting biological discoveries are made, and the CIGAR string is our primary tool for reading them. Let us embark on a journey through the myriad applications of this elegant language, from the smallest genetic typo to the grand architectural rearrangements that shape evolution and disease.

The Genetic Detective: Decoding Variation One Read at a Time

At its most fundamental level, genomics is a comparative science. We want to know how the DNA of one individual differs from another, or how a patient's genome differs from the "standard" human reference. These differences, or variants, are the raw material of heredity, evolution, and disease. The most common of these are single nucleotide polymorphisms (SNPs)—single-letter changes in the DNA code—and short insertions or deletions (indels).

How does a CIGAR string help us find them? Imagine you are a detective examining a sentence copied from a master text. The CIGAR string is your first clue. An operation like 100M tells you that a 100-character segment of the copy aligns to the master text. But does it match perfectly? The M operator is agnostic; it allows for both matches and mismatches. To solve the case, we need a second piece of evidence: the MD tag. The MD tag is like an errata sheet, explicitly noting where mismatches occur and what the original character in the master text was.

By walking along an alignment, guided by the CIGAR string, and cross-referencing with the MD tag, we can build a complete picture. A CIGAR M operation tells us a position is aligned. The MD tag then tells us if it's a perfect match (by including it in a run of numbers) or a mismatch (by specifying the reference base). If it's a mismatch, we simply look at the read's sequence at that position to find the new character. An insertion (I) or deletion (D) in the CIGAR string gives us the exact location and length of an indel. This powerful, synchronized dance between the CIGAR, the MD tag, and the read sequence is the core logic inside every "variant caller" program, which systematically sifts through billions of reads to produce a comprehensive list of genetic variations for an individual.
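The synchronized walk over the CIGAR, the MD tag, and the read sequence might look like the sketch below. It is a simplified illustration rather than the logic of any particular variant caller: the helper names are invented, positions are treated as 0-based, and the MD string is assumed to be well formed (runs of digits for matches, a letter for a mismatched reference base, and `^` followed by letters for deleted reference bases).

```python
import re

def expand_md(md):
    """One entry per reference base covered by M or D operations:
    None for a match, the reference base for a mismatch, '^X' for a deleted base."""
    out = []
    for num, deletion, mismatch in re.findall(r"(\d+)|\^([A-Z]+)|([A-Z])", md):
        if num:
            out.extend([None] * int(num))
        elif deletion:
            out.extend("^" + base for base in deletion)
        else:
            out.append(mismatch)
    return out

def mismatches(pos, cigar, md, seq):
    """Return (ref_pos, ref_base, read_base) for every mismatch inside M blocks."""
    md_entries = iter(expand_md(md))
    ref_pos, read_pos, found = pos, 0, []
    for n, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar):
        n = int(n)
        if op in "M=X":
            for k in range(n):
                ref_base = next(md_entries)
                if ref_base is not None:
                    found.append((ref_pos + k, ref_base, seq[read_pos + k]))
            ref_pos += n
            read_pos += n
        elif op == "D":
            for _ in range(n):  # deleted bases appear in MD but not in the read
                next(md_entries)
            ref_pos += n
        elif op == "N":         # skipped regions are not recorded in MD
            ref_pos += n
        elif op in "IS":        # insertions and soft clips are invisible to MD
            read_pos += n
    return found

seq = "AATAA" + "G" + "CCCC" + "TTT"  # 13 read bases for 5M1I4M2D3M
print(mismatches(100, "5M1I4M2D3M", "2C6^GT3", seq))  # [(102, 'C', 'T')]
```

Note how the MD tag merges the matches on either side of the insertion into a single run of 6, because insertions are not represented in MD at all; the CIGAR is what keeps the two coordinate systems in step.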

The Transcriptome's Tale: From Splicing to Fusions and Editing

If the genome is the cell's master cookbook, the transcriptome—the collection of all RNA molecules—represents the recipes being actively used. Aligning RNA back to the DNA genome reveals a world of dynamic activity, and the CIGAR string is our indispensable guide.

The most common event in the life of a eukaryotic gene is splicing. When a gene is transcribed into RNA, non-coding regions called introns are cut out, and the remaining coding regions, or exons, are stitched together. When we align this spliced RNA back to the genome, the read will map contiguously to the exons but will have to jump over the introns. The CIGAR string elegantly captures this with the N operator, which signifies a large skipped region on the reference. A CIGAR like 50M2000N50M instantly tells us that we have a read mapping to two exons separated by a 2000-base-pair intron.

Sometimes, this cutting-and-pasting process goes awry, with dramatic consequences. In certain cancers, two entirely different genes can become fused together, creating a "chimeric" transcript that produces a rogue protein. A classic example is the BCR-ABL1 fusion gene, a hallmark of chronic myeloid leukemia. A sequencing read that spans this fusion junction presents a unique alignment puzzle: the first part of the read maps perfectly to the BCR gene on one chromosome, and the second part maps to the ABL1 gene on another. An aligner resolves this by creating two alignment records for the single read. The primary alignment might describe the first part of the read matching the BCR gene, with the rest of the read marked as unaligned using the soft-clip (S) operator (e.g., 60M40S). A special tag, the SA tag (written SA:Z: in a SAM file), then points to a "supplementary" alignment, which describes how the other part of the same read maps to the ABL1 gene (e.g., with a CIGAR of 60S40M). The CIGAR string's soft-clip operator is the crucial flag that signals a potential breakpoint, leading the investigator to discover the fusion event, a discovery with profound diagnostic and therapeutic implications.
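To follow the pointer from a primary alignment to its supplementary partner, a program has to unpack the SA tag. In the SAM format its value is a semicolon-separated list of entries, each holding the reference name, 1-based position, strand, CIGAR, mapping quality and edit distance. The Python below is a small sketch of that parsing; the class and function names are invented, and the coordinates in the example are placeholders rather than real gene positions.

```python
from typing import NamedTuple

class SuppAlignment(NamedTuple):
    rname: str   # reference (chromosome) name
    pos: int     # 1-based leftmost position, as written in the tag
    strand: str  # "+" or "-"
    cigar: str
    mapq: int
    nm: int      # edit distance reported by the aligner

def parse_sa(value):
    """Parse the value of an SA tag: 'rname,pos,strand,CIGAR,mapQ,NM;' repeated."""
    records = []
    for entry in value.split(";"):
        if not entry:  # skip the empty piece after the trailing ';'
            continue
        rname, pos, strand, cigar, mapq, nm = entry.split(",")
        records.append(SuppAlignment(rname, int(pos), strand, cigar, int(mapq), int(nm)))
    return records

# Hypothetical tag value for a read whose primary CIGAR is 60M40S.
for rec in parse_sa("chr9,1000,+,60S40M,60,0;"):
    print(rec.rname, rec.pos, rec.strand, rec.cigar)
```

A fusion detector would typically check that the soft-clipped part of the primary CIGAR and the aligned part of the supplementary CIGAR cover complementary portions of the read, as 60M40S and 60S40M do here.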

The transcriptome holds even more subtle secrets. Beyond the central dogma, cells can perform "RNA editing," chemically modifying RNA bases after they've been transcribed from DNA. The most common form, A-to-I editing, converts an adenosine (A) into inosine (I), which sequencing machines read as a guanosine (G). This appears in our data as an A-to-G mismatch between the reference genome and the RNA read. Using the same precise logic for finding SNPs, we can systematically scan our RNA alignments, using the CIGAR and MD tags to find positions where the reference is an 'A' and the read is a 'G', allowing us to map these functional modifications across the entire transcriptome.

The Architect's Blueprint: Mapping Large-Scale Structural Changes

The genome is not a static string of letters; it is a dynamic, physical structure. Large segments can be deleted, duplicated, inverted, or moved to new locations. These "structural variants" (SVs) can have major impacts on health and are a key driving force in evolution. Detecting them requires us to think like architects, looking for changes in the genome's large-scale blueprint. The CIGAR string, especially when combined with modern long-read sequencing, provides the crucial measurements.

A single long read, thousands of bases in length, can span an entire structural variant, giving us an immediate and unambiguous picture. Imagine a read that aligns to a reference, but its CIGAR string contains a 5000D operation. This is a direct observation that 5000 bases present in the reference are simply missing from the sequenced DNA molecule. Conversely, a 2300I operation signals that the sequenced molecule contains a 2.3-kilobase insertion of new sequence, such as a "jumping gene" or transposable element, that is absent from the reference.
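Scanning a long read's CIGAR for large I and D operations takes only a few lines. The sketch below is illustrative: the function name is invented, positions are treated as 0-based, and the 50-base minimum size is an assumption chosen for this example (a commonly used cutoff for calling an event "structural"), not something fixed by the CIGAR format.

```python
import re

def large_indels(pos, cigar, min_len=50):
    """Return (operator, ref_pos, length) for each I or D of at least min_len bases."""
    events = []
    ref_pos = pos
    for n, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar):
        n = int(n)
        if op in "ID" and n >= min_len:
            events.append((op, ref_pos, n))
        if op in "MDN=X":  # advance only on reference-consuming operators
            ref_pos += n
    return events

# A made-up long-read CIGAR containing one large deletion and one large insertion.
print(large_indels(0, "3000M5000D1200M2300I800M"))
# [('D', 3000, 5000), ('I', 9200, 2300)]
```

An insertion is reported at the reference position where the extra sequence sits between two reference bases, since it occupies no reference coordinates of its own.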

The real power comes from combining multiple lines of evidence, where CIGAR signatures play a central role. By analyzing a population of reads covering a genomic region, we can diagnose a whole suite of rearrangements:

  • Deletions: The region exhibits a stark drop in read coverage (to roughly half for a heterozygous deletion, approaching zero for a homozygous one), and reads that span the gap have large D operations in their CIGARs.
  • Tandem Duplications: The region shows an increase in read coverage (roughly 1.5-fold for a heterozygous duplication, 2-fold for a homozygous one). Furthermore, "split reads" appear, whose CIGARs show soft clipping (S) at the breakpoint, revealing a novel "head-to-tail" junction where the end of a segment has been connected back to its own beginning.
  • ​​Inversions​​: Read coverage remains normal, but split reads at the boundaries show a tell-tale switch in orientation. For example, one part of the read aligns to the forward strand, and the other part aligns to the reverse strand, signaling that the entire segment in between has been flipped.

In cancer genomics, finding tumor-specific SVs is paramount. The breakpoints of these rearrangements create a "pile-up" of soft-clipped reads right at the junction, as the aligner doesn't know how to map the part of the read that crosses into a rearranged context. By computationally scanning the genome for windows with a statistically significant excess of soft-clipped reads in a tumor sample compared to a matched normal sample, we can pinpoint the exact locations of these somatic rearrangements with high precision.
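A very crude version of this scan can be sketched as follows. It bins soft-clip breakpoints into fixed windows and reports windows that are clipped often in the tumor but not in the normal sample. Everything here is an illustrative assumption: the function names, the 100-base window, the count thresholds, the restriction to a single chromosome, and above all the use of raw counts in place of a proper statistical test.

```python
import re
from collections import Counter

def clip_breakpoints(pos, cigar):
    """Reference coordinates where a soft clip begins or ends (0-based pos assumed)."""
    ops = [(int(n), op) for n, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar)]
    ops = [(n, op) for n, op in ops if op != "H"]
    points = []
    if ops and ops[0][1] == "S":
        points.append(pos)  # clipped on the left: breakpoint at alignment start
    if len(ops) > 1 and ops[-1][1] == "S":
        # clipped on the right: breakpoint at alignment end
        points.append(pos + sum(n for n, op in ops if op in "MDN=X"))
    return points

def clip_counts(alignments, window=100):
    """Count soft-clip breakpoints per window for a list of (pos, cigar) pairs."""
    counts = Counter()
    for pos, cigar in alignments:
        for bp in clip_breakpoints(pos, cigar):
            counts[bp // window] += 1
    return counts

def candidate_windows(tumor, normal, window=100, min_tumor=3, max_normal=0):
    """Window start coordinates with many tumor clips and few normal clips."""
    t, n = clip_counts(tumor, window), clip_counts(normal, window)
    return sorted(w * window for w, c in t.items() if c >= min_tumor and n[w] <= max_normal)

tumor = [(1000, "75M75S"), (1010, "65M85S"), (1075, "50S100M"), (500, "150M")]
normal = [(1000, "150M")]
print(candidate_windows(tumor, normal))  # [1000]
```

In the toy data all three clipped tumor reads break at coordinate 1075, so the window starting at 1000 is flagged. A real caller would additionally model coverage and background clipping rates before declaring a somatic breakpoint.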

Of course, not all soft clips point to biology. They are also excellent diagnostics for technical artifacts. If a DNA fragment being sequenced is shorter than the read length, the sequencer will read through it and into the synthetic "adapter" sequence ligated to its end. The aligner, unable to match the adapter to the genome, will simply represent it as a soft-clipped (S) region. By examining the CIGAR and the alignment's strand, we can determine if an 18-base soft clip is at the beginning or end of the original molecule, helping us clean our data before biological analysis.

The Linguist's Challenge: Extending the Language of Alignment

The CIGAR string is such a successful language for nucleotide alignments that we are inspired to ask: can we adapt it for other problems? This is where we move from being users of the language to being its designers.

Consider the challenge of aligning a protein sequence (in amino acids) back to the genome (in nucleotides). A single amino acid corresponds to a three-base codon. The standard CIGAR M operator, which implies a one-to-one consumption of query and reference units, is no longer valid. Furthermore, introns can split a codon, interrupting it after the first or second base (intron phases 1 or 2). A simple N operator for the intron doesn't tell us where to resume the reading frame on the other side. To create a self-contained, CIGAR-like string for this problem, we would need to invent new operators: a "codon-match" operator that consumes 1 amino acid and 3 nucleotides per unit, and three distinct, phase-aware intron operators to ensure the correct reading frame is maintained across junctions. This thought experiment reveals the deep design principles of the CIGAR format and the challenges of extending it.

This evolution is not just a theoretical exercise. As genomics moves from a single linear reference to complex "pangenome" graphs that represent the genetic diversity of an entire species, we face a new linguistic challenge. How do we describe a read's path through a graph? The alignment is no longer just a sequence of matches and indels but also a choice of which nodes to traverse. An elegant extension might involve annotating the standard CIGAR operations with the graph node they occurred on, creating a string like v1[3=1I]>v3[1=]>v4[3=]. This format explicitly states the path taken (v1 to v3 to v4) and, by stripping the graph annotations, gracefully degrades back to the familiar linear CIGAR (3=1I1=3=), ensuring backward compatibility. Designing such formats is an active area of research, pushing the boundaries of how we map and interpret genomic information.
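The "graceful degradation" described above is easy to demonstrate. The sketch below parses the illustrative graph-annotated notation used in this article, which is a thought-experiment format rather than an established standard, and recovers both the node path and the plain linear CIGAR. The function name is made up for the example.

```python
import re

def graph_to_linear(graph_cigar):
    """Split 'v1[3=1I]>v3[1=]>v4[3=]' into (node path, linear CIGAR)."""
    path = re.findall(r"(\w+)\[", graph_cigar)                       # node names before each '['
    linear = "".join(re.findall(r"\[([^\]]*)\]", graph_cigar))      # operations inside brackets
    return path, linear

path, linear = graph_to_linear("v1[3=1I]>v3[1=]>v4[3=]")
print(path)    # ['v1', 'v3', 'v4']
print(linear)  # 3=1I1=3=
```

The sketch simply concatenates the per-node operations, so adjacent runs of the same operator (1= followed by 3=) are left unmerged, matching the linear form given in the text.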

From the smallest SNP to the architecture of entire genomes, from the static DNA code to the dynamic world of RNA, the CIGAR string provides the essential narrative. It is a testament to the power of a well-designed notation—a simple language that, in the right hands, can be used to read the intricate and ever-evolving story of life itself.

Alignment described by the CIGAR string 12M1I6M2D5M (gaps shown as "-"):

Reference: AAAAAAAAAAAA-GGGGGGTTCCCCC
Read:      AAAAAAAAAAAACGGGGGG--CCCCC