McMillan Theorem

Key Takeaways
  • The Kraft-McMillan theorem states that a set of codeword lengths can form a uniquely decodable code if, and only if, they satisfy the Kraft inequality, $\sum_i D^{-l_i} \le 1$.
  • The theorem proves that for the purpose of finding viable codeword lengths, the simpler class of prefix codes is equivalent to the more general class of all uniquely decodable codes.
  • The inequality acts as a "budget" for code design, allowing engineers to validate a code's feasibility, plan for future expansion, and optimize codeword selection.
  • It provides the theoretical foundation for optimal compression algorithms like Huffman coding, linking practical code design to the fundamental limits of entropy.

Introduction

In the quest to represent information efficiently and without ambiguity, a fundamental question arises: given a set of desired codeword lengths, can a functional code even be constructed? Simply choosing shorter words for common symbols and longer words for rare ones seems intuitive, but an arbitrary choice can lead to impossible designs or confusing messages. This article delves into the McMillan Theorem, a cornerstone of information theory that provides a definitive answer to this question, offering a simple yet powerful mathematical test for what is possible in the world of coding.

This article will guide you through the elegant principles and practical power of this theorem. First, in "Principles and Mechanisms," we will dissect the mathematics behind the theorem—the Kraft inequality—exploring it as a "budget of bits" that governs the existence of all uniquely decodable codes. We will uncover the surprising result that simple, instantaneously decodable prefix codes are just as powerful as more complex ones in this regard. Following this, the "Applications and Interdisciplinary Connections" section will demonstrate the theorem's profound practical impact, showing how it serves as a vital sanity check for engineers, a basis for optimization algorithms, and a crucial link to the ultimate limits of data compression defined by Huffman codes and Shannon's entropy.

Principles and Mechanisms

Imagine you're inventing a new language. Not for speaking, but for machines. You have a set of ideas you want to represent—let's say, commands for a drone like 'ASCEND' or 'ROTATE'. Your new language uses a simple alphabet, perhaps just two symbols, '0' and '1'. The task is to assign a unique "word" (a codeword) to each command.

You'd probably want to make the words for common commands short and the words for rare commands longer. This is the essence of efficient communication, the heart of data compression. But here's a tricky question: if you just dream up a list of desired lengths for your words, say, one word of length 2, two of length 3, and one of length 4, can you actually create a set of words that a machine can understand without confusion?

It turns out you can't just pick any lengths you like. There's a fundamental constraint, a beautiful and simple rule that governs the very possibility of creating a sensible code. This rule is the key to understanding the architecture of information itself.

The Budget of Bits

Let's think about this visually. Imagine a tree, where the root is the starting point of any message. In a binary code, every time you add a '0' or a '1', you move down the tree, taking a left or a right branch. A codeword is simply a path from the root to a specific node. For your code to be instantaneously decodable (also called a prefix code), you have a simple rule: once you claim a node for a codeword, you cannot use any of the nodes further down that branch. If '10' is a codeword, then '100', '101', '1011', etc., are all forbidden. No codeword can be a prefix of another.

This immediately suggests a trade-off. If you choose a very short codeword, like '0', you've just pruned off half of the entire tree! All possible codewords that start with '0' are now gone. A short codeword is powerful, but it comes at a great cost. A longer codeword, like '1101', uses up a much smaller, more remote twig on the tree.

The Kraft-McMillan theorem gives us a precise way to quantify this trade-off. It presents an inequality that acts like a budget. For a code alphabet with $D$ symbols (for binary, $D = 2$), if you want to have a set of codewords with lengths $l_1, l_2, l_3, \dots, l_M$, they must satisfy:

$$\sum_{i=1}^{M} D^{-l_i} \le 1$$

Think of '1' as your total budget. Each codeword you want to create "costs" $D^{-l_i}$. For a binary code ($D = 2$), a codeword of length 1 costs $2^{-1} = \frac{1}{2}$ of your budget. A codeword of length 3 costs $2^{-3} = \frac{1}{8}$. This inequality simply says: the sum of the costs of all your desired codewords cannot exceed your budget.

Let's see this in action. Suppose a team of engineers wants to encode four symbols. They consider a few options for the lengths of the four codewords.

  • How about {1, 3, 3, 3}? The cost is $2^{-1} + 2^{-3} + 2^{-3} + 2^{-3} = \frac{1}{2} + \frac{3}{8} = \frac{7}{8}$. This is less than 1, so our budget is fine. The theorem guarantees we can construct such a code.
  • What about {1, 2, 2, 3}? The cost is $2^{-1} + 2^{-2} + 2^{-2} + 2^{-3} = \frac{1}{2} + \frac{1}{4} + \frac{1}{4} + \frac{1}{8} = \frac{9}{8}$. This is greater than 1. We've overspent our budget! The theorem tells us this is impossible. You simply cannot find four binary prefix codewords with these lengths.

An excited new team member might propose a code for six microprocessor instructions with three words of length 2 and three of length 3. Before they spend weeks trying to build it, we can do a quick check: $3 \times 2^{-2} + 3 \times 2^{-3} = \frac{3}{4} + \frac{3}{8} = \frac{9}{8}$. Again, this is over budget. The proposal is theoretically impossible, and we've saved ourselves a lot of wasted effort, thanks to this simple inequality.
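These back-of-the-envelope checks are easy to automate. The following minimal sketch (the function name is illustrative) computes the Kraft sum with exact rational arithmetic and reproduces the verdicts above:

```python
from fractions import Fraction

def kraft_sum(lengths, D=2):
    """Exact Kraft sum of a proposed set of codeword lengths."""
    return sum(Fraction(1, D ** l) for l in lengths)

# {1, 3, 3, 3}: within budget, so a prefix code with these lengths exists.
assert kraft_sum([1, 3, 3, 3]) == Fraction(7, 8)

# {1, 2, 2, 3}: over budget -- no uniquely decodable code is possible.
assert kraft_sum([1, 2, 2, 3]) == Fraction(9, 8)

# Three words of length 2 plus three of length 3: also over budget.
assert kraft_sum([2, 2, 2, 3, 3, 3]) == Fraction(9, 8)
```

Using `Fraction` avoids any floating-point rounding near the critical boundary of exactly 1.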

The Surprising Equivalence of Codes

So far, we've only talked about prefix codes, where decoding is instantaneous. But what if we relax this? A code is uniquely decodable if any long string of concatenated codewords can be broken down into its original sequence in only one way, even if you have to look ahead. For example, with the codewords C = {0, 01, 10}, the string 010 could be parsed as (0)(10) or as (01)(0)—this code is not uniquely decodable. But what about a set like {1, 10, 100}? The string 100101 can only be parsed as (100)(10)(1). This is uniquely decodable, but it's not a prefix code, since '1' is a prefix of '10'.
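Unique decodability can be checked mechanically with the classic Sardinas–Patterson test, which repeatedly tracks the "dangling suffixes" left over when one codeword is a prefix of another; the code is ambiguous exactly when a dangling suffix turns out to be a codeword itself. A minimal sketch (function names are illustrative):

```python
def is_uniquely_decodable(codewords):
    """Sardinas-Patterson test: True iff the code is uniquely decodable."""
    code = set(codewords)

    def dangling(a, b):
        # If b = a + w with w nonempty, return the dangling suffix w.
        return b[len(a):] if b.startswith(a) and len(b) > len(a) else None

    # Seed: suffixes left when one codeword is a proper prefix of another.
    current = set()
    for a in code:
        for b in code:
            if a != b:
                w = dangling(a, b)
                if w is not None:
                    current.add(w)

    seen = set()
    while current:
        if current & code:          # a dangling suffix is itself a codeword,
            return False            # so some string has two different parsings
        nxt = set()
        for w in current:
            for c in code:
                for x in (dangling(w, c), dangling(c, w)):
                    if x is not None:
                        nxt.add(x)
        seen |= current
        current = nxt - seen        # stop once no new suffixes appear
    return True

assert not is_uniquely_decodable(["0", "01", "10"])   # "010" parses two ways
assert is_uniquely_decodable(["1", "10", "100"])      # UD, though not prefix-free
```

Note that `["1", "10", "100"]` passes the test even though it violates the prefix condition, matching the distinction drawn above.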

You might think that by allowing these more complex, non-prefix codes, you could find a set of lengths that was previously "over budget". Perhaps the set {1, 2, 2, 3} that cost $\frac{9}{8}$ could work if we were clever enough to find a non-prefix but uniquely decodable arrangement?

Here lies the genius of Brockway McMillan. He proved that if a set of codeword lengths can be used for any uniquely decodable code, then it must satisfy the same Kraft inequality: $\sum_i D^{-l_i} \le 1$.

This is a stunning result. It means that in the quest for possible codeword lengths, you gain absolutely nothing by considering the broader class of uniquely decodable codes over the simpler prefix codes. If your chosen lengths have a Kraft sum greater than 1, it's not just that a prefix code is impossible—no uniquely decodable code of any kind can be constructed. The two classes of codes are, for this purpose, equivalent. This simplifies our life enormously: to see if a set of lengths is viable for a uniquely decodable code, we only need to check the Kraft inequality, and if it holds, we know we can build a simple prefix code with those lengths.
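The constructive half of this equivalence can be demonstrated directly: given lengths whose Kraft sum is at most 1, the standard canonical construction assigns codewords in order of increasing length and never runs out of room. A binary ($D = 2$) sketch, with an illustrative function name:

```python
def build_prefix_code(lengths):
    """Construct a binary prefix code with the given lengths (canonical construction).

    Assumes sum(2**-l) <= 1; under that condition the assignment below always
    succeeds, which is exactly what the Kraft-McMillan theorem promises.
    """
    assert sum(2.0 ** -l for l in lengths) <= 1, "lengths violate the Kraft inequality"
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])
    codewords = [None] * len(lengths)
    value, prev = 0, 0
    for i in order:
        value <<= lengths[i] - prev      # descend to the next required depth
        codewords[i] = format(value, f"0{lengths[i]}b")
        value += 1                       # step to the next free node at this depth
        prev = lengths[i]
    return codewords

code = build_prefix_code([1, 3, 3, 3])   # the "within budget" example from above
assert code == ["0", "100", "101", "110"]
# No codeword is a prefix of another:
assert all(not b.startswith(a) for a in code for b in code if a != b)
```

The counter-and-shift trick works because each claimed node "prunes" exactly the subtree the budget accounting says it should.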

Full, Half-Full, or Overflowing?

The value of the Kraft sum, $K = \sum_i D^{-l_i}$, tells us a rich story about the nature of our code.

  • $K > 1$ (Overflowing): As we've seen, this is the "impossible" zone. The desired lengths are too short, claiming more space on the code tree than exists.

  • $K < 1$ (Half-Full): This means you haven't used up your entire budget. Your code is valid, but it's not "full". There is leftover capacity. This implies that you can add more codewords to your set without violating the prefix condition. For instance, if your drone's commands have lengths {2, 2, 3, 4}, the Kraft sum is $2^{-2} + 2^{-2} + 2^{-3} + 2^{-4} = \frac{11}{16}$. The leftover budget is $1 - \frac{11}{16} = \frac{5}{16}$. How many new commands of length 4 could you add? Each one costs $2^{-4} = \frac{1}{16}$. So, you can add exactly 5 new commands of length 4, at which point your budget will be perfectly used up ($\frac{11}{16} + 5 \times \frac{1}{16} = 1$).

  • $K = 1$ (Full): This is the special case of a complete code. Every bit of your budget has been spent. The code tree is perfectly filled; there are no available nodes left to assign to a new codeword of any finite length. This is a property of highly efficient codes. For example, if an engineer needs to design a complete code for ten symbols, and has already assigned eight lengths whose costs sum to $\frac{23}{32}$, they must choose the final two lengths, $l_9$ and $l_{10}$, such that their "cost" is exactly the remaining budget: $2^{-l_9} + 2^{-l_{10}} = 1 - \frac{23}{32} = \frac{9}{32}$. A quick check reveals that lengths of 2 and 5 satisfy this perfectly ($2^{-2} + 2^{-5} = \frac{1}{4} + \frac{1}{32} = \frac{9}{32}$). Sometimes, the constraints are so tight that only one combination of integer lengths is possible, as in a design requiring two codewords of length $l_A$ and three of length $l_B$ to form a complete code; the only solution is $l_A = 3$ and $l_B = 2$.
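The "leftover budget" arithmetic from the drone example generalizes to a one-liner. A hedged sketch (illustrative name), using exact fractions so the boundary case comes out right:

```python
import math
from fractions import Fraction

def room_for(lengths, k, D=2):
    """Max number of new length-k codewords that still fit in the Kraft budget."""
    leftover = 1 - sum(Fraction(1, D ** l) for l in lengths)
    return math.floor(leftover * D ** k)

# Drone commands with lengths {2, 2, 3, 4}: exactly 5 more length-4 commands fit.
assert room_for([2, 2, 3, 4], 4) == 5

# A complete code ({1, 1} in binary spends the whole budget) has no room at all.
assert room_for([1, 1], 3) == 0
```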

Beyond the Binary World

So far, we've lived in the black-and-white world of binary codes. But we can use any number of symbols in our alphabet. What if we used a ternary alphabet ($D = 3$, symbols {0, 1, 2}) or a DNA-based alphabet ($D = 4$, symbols {A, C, G, T})? The principle of the McMillan theorem holds just as strongly; we simply use the appropriate value for $D$.

This reveals a fascinating relationship between the size of the alphabet and the feasibility of a code. Consider a proposed set of five codeword lengths: {1, 2, 2, 2, 2}.

  • For a binary code ($D = 2$), the Kraft sum is $2^{-1} + 4 \times 2^{-2} = \frac{1}{2} + 1 = \frac{3}{2}$. This is greater than 1, so it's impossible.
  • But for a ternary code ($D = 3$), the sum is $3^{-1} + 4 \times 3^{-2} = \frac{1}{3} + \frac{4}{9} = \frac{7}{9}$. This is $\le 1$, so a ternary prefix code with these lengths can be built!

Having a larger alphabet gives you more "branches" at each step of the code tree, making it easier to fit in short codewords. If an engineer needs to encode six packets with four words of length 2 and two of length 3, a binary alphabet won't work. But by checking the inequality $4D^{-2} + 2D^{-3} \le 1$, we find that an alphabet of size $D = 3$ is the smallest that makes this possible. Similarly, for any alphabet of size $D \ge 2$, it is always possible to construct a code with one codeword of length 1 and $D$ codewords of length 2, because the Kraft sum is just $D^{-1} + D \times D^{-2} = \frac{2}{D}$, which is always less than or equal to 1.
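Finding the smallest workable alphabet is just a matter of incrementing $D$ until the sum dips under budget. A small sketch (illustrative name):

```python
from fractions import Fraction

def smallest_alphabet(lengths):
    """Smallest alphabet size D >= 2 for which the Kraft inequality holds."""
    D = 2
    while sum(Fraction(1, D ** l) for l in lengths) > 1:
        D += 1
    return D

# Five lengths {1, 2, 2, 2, 2}: impossible in binary, fine in ternary.
assert smallest_alphabet([1, 2, 2, 2, 2]) == 3

# Six packets: four words of length 2 and two of length 3 also need D = 3.
assert smallest_alphabet([2, 2, 2, 2, 3, 3]) == 3
```

The loop always terminates, since the Kraft sum for any fixed set of lengths shrinks toward 0 as $D$ grows.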

To Infinity and Beyond

The power of this theorem doesn't even stop with a finite number of symbols. What if you had a source that could produce a countably infinite number of different messages? Can you still design a uniquely decodable code? The answer is yes, as long as the infinite sum of the costs converges to a value less than or equal to 1. For instance, if the length of the codeword for the $i$-th symbol grows sufficiently fast, like $l_i = \lceil \log_2(i^2+1) \rceil$, the individual "costs" $2^{-l_i}$ shrink so rapidly that their sum remains finite and less than 1, guaranteeing that a code can exist even for an infinite vocabulary.
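We can sanity-check this numerically: the partial Kraft sums for $l_i = \lceil \log_2(i^2+1) \rceil$ climb quickly at first and then flatten out well below 1.

```python
import math

# Partial Kraft sums for the infinite code with l_i = ceil(log2(i^2 + 1)).
total = 0.0
for i in range(1, 100_001):
    total += 2.0 ** -math.ceil(math.log2(i * i + 1))

# The series converges to roughly 0.87 -- comfortably inside the budget.
assert 0.85 < total < 1.0
```

Every partial sum is below the (finite) infinite total, so the whole infinite code respects the budget.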

And perhaps most beautifully, the core idea can be generalized even further. What if the symbols in our code alphabet aren't all equal? Imagine a system where sending a '0' costs 1 unit of energy, but sending a '1' or '2' costs 2 units. The "length" of a codeword is no longer a simple count of symbols, but a total "cost". The McMillan theorem can be adapted to this scenario. We simply replace our sum with $\sum_i r^{-L_i} \le 1$, where the $L_i$ are the total costs of the codewords, and $r$ is a special number, a characteristic value of the alphabet's cost structure. This shows that the theorem isn't just about length; it's about a fundamental resource, a "budget" that can be measured in bits, energy, or time. It is a universal law for the construction of unambiguous information.

Applications and Interdisciplinary Connections

Having journeyed through the elegant machinery of the Kraft-McMillan theorem, we might be tempted to file it away as a beautiful but abstract piece of mathematics. To do so, however, would be like admiring the blueprint of an engine without ever hearing it roar to life. The true wonder of this theorem lies not in its abstract proof, but in its profound and pervasive influence on the world around us. It is not merely a constraint; it is a fundamental design principle that governs everything from the bits flowing through your internet connection to the very structure of optimal algorithms. It gives us a language to talk about the limits and possibilities of information itself.

The Universal Sanity Check: A Code Designer's First Tool

Imagine you are an engineer tasked with designing a new communication protocol. Before you spend countless hours building hardware or writing complex software, you face a fundamental question: can the code I've imagined even exist? This is not a trivial matter. A poorly designed code can lead to catastrophic ambiguity, where a received message like 10 could mean one thing, or it could mean two entirely different things concatenated together.

The Kraft-McMillan theorem provides a wonderfully simple, yet ruthlessly effective, "sanity check." Before you build anything, you can take your proposed list of integer codeword lengths, $\{l_1, l_2, \dots, l_N\}$, and simply calculate the Kraft sum, $\sum_{i=1}^{N} 2^{-l_i}$. If this sum is greater than 1, you can stop immediately. The theorem tells you with absolute certainty that no prefix-free code can be built with those lengths, saving you from a doomed enterprise. For instance, one cannot possibly construct a prefix code with lengths $\{1, 2, 3, 3, 4\}$, because the sum $2^{-1} + 2^{-2} + 2^{-3} + 2^{-3} + 2^{-4} = \frac{17}{16}$ exceeds 1. The theorem acts as a universal law of conservation for code space; you simply cannot use more than you have.

This principle also elegantly explains why some intuitive but flawed designs are impossible. For example, one might try to create a code by assigning symbols not just to the leaves of a binary tree, but to its internal nodes as well. This leads to a situation where one symbol's codeword is a prefix of another's—a recipe for ambiguity. The Kraft-McMillan inequality flags such a design as invalid: the leaves of a full binary tree already spend the entire budget (their costs sum to exactly 1), so claiming internal nodes as additional codewords necessarily pushes the sum of $2^{-l_i}$ above 1, confirming the problem mathematically.

Beyond Pass/Fail: The "Kraft Budget" and Planning for the Future

The true power of a great scientific principle often lies in its ability to be used not just as a "yes/no" gate, but as a quantitative tool for design and optimization. The Kraft-McMillan inequality is a perfect example. The value '1' in the inequality $\sum_i 2^{-l_i} \le 1$ can be thought of as a total "coding budget." Every codeword you add "spends" a portion of this budget, with shorter words being far more "expensive" than longer ones.

What happens if, after designing a code for a set of symbols, your sum is less than 1? You haven't spent your whole budget! This leftover amount, let's call it $\epsilon = 1 - \sum_i 2^{-l_i}$, is not wasted space. It is a measurable, usable resource. It represents the "room" left in your code space for future expansion.

Suppose you need to add a new set of signals to your system, and for hardware reasons, they must all have the same length, $k$. How many can you add? The answer is a beautiful application of this budget concept. Each new codeword costs $2^{-k}$ of your budget. Therefore, the number of new codewords you can add, $N_{\text{add}}$, must satisfy $N_{\text{add}} \times 2^{-k} \le \epsilon$. This means you can confidently add up to $\lfloor 2^k \epsilon \rfloor$ new codewords, guaranteeing that the expanded code remains uniquely decodable. This transforms the theorem from a static check into a dynamic tool for engineering design, allowing systems to be built with future growth explicitly planned for.

The Art of Packing: From Information Theory to Algorithm Design

This "budget" analogy opens a door to another fascinating domain: optimization. Imagine you have a large wish list of potential codewords with varying lengths, but you know that implementing all of them would violate the Kraft inequality. Your task is to select the largest possible subset of these codewords that can form a valid code.

This problem is strikingly similar to the classic "knapsack problem" in computer science. You want to pack as many items as possible (maximize the number of codewords) into a knapsack with a fixed capacity (the Kraft budget of 1). Each item has a value (in this case, each codeword is equally valued at "1," since we want to maximize the count) and a weight (the "cost" of $2^{-l}$). To maximize the number of items, the best strategy is a greedy one: always pack the lightest items first. In our coding world, the "lightest" items are the codewords with the longest lengths, as they have the smallest cost, $2^{-l}$.

So, an efficient algorithm emerges directly from the theorem: start by adding all the longest-length codewords to your set, then the next-longest, and so on, all while keeping a running tally of your Kraft sum. The moment adding the next group of codewords would push your sum over 1, you stop. This simple, greedy approach guarantees you find the largest possible valid subset of your original wish list. The abstract inequality has now become the core of a practical optimization algorithm.
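The greedy strategy described above amounts to a few lines of code. A sketch (illustrative name), again with exact arithmetic so no budget is lost to rounding:

```python
from fractions import Fraction

def largest_valid_subset(wishlist):
    """Greedy: admit the cheapest (longest) lengths first while the Kraft sum fits.

    Because every codeword counts equally, taking the smallest costs first
    maximizes how many codewords survive -- the lightest-items-first answer
    to this flavor of knapsack problem.
    """
    chosen, budget = [], Fraction(1)
    for l in sorted(wishlist, reverse=True):   # longest length = smallest cost 2**-l
        cost = Fraction(1, 2 ** l)
        if cost <= budget:
            chosen.append(l)
            budget -= cost
    return sorted(chosen)

# Wish list {1, 1, 2, 2, 3, 3} is over budget; the greedy pick keeps four lengths.
assert largest_valid_subset([1, 1, 2, 2, 3, 3]) == [2, 2, 3, 3]
```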

Connecting to the Masterpiece: Huffman Codes and the Limits of Compression

The Kraft-McMillan theorem tells us what is possible, but it doesn't tell us what is optimal. For that, we turn to the celebrated Huffman coding algorithm, which generates a prefix code with the minimum possible average length for a given source. The connection between the two is profound.

First, Shannon's source coding theorem tells us that for any source with entropy $H$, the average length $G$ of any uniquely decodable code must be greater than or equal to the entropy ($G \ge H$). Since the Kraft-McMillan theorem guarantees that any set of lengths satisfying its inequality can be realized as a prefix (and thus uniquely decodable) code, it provides the crucial link: the existence of such a code means its performance is automatically bounded by the fundamental limit of entropy.

But the relationship is deeper still. The Huffman algorithm doesn't just produce a code that satisfies the inequality; it produces a code with a very specific structure. A key property of any Huffman code is that the two least probable symbols are assigned the two longest codewords, and these two codewords are "siblings" in the code tree—they have the same length, share the same prefix, and differ only in their final bit. Moreover, a Huffman code is always complete: its Kraft sum equals exactly 1. This structural signature means that not every valid prefix code can be a Huffman code. For example, the code $\{0, 10, 111\}$ is a perfectly valid prefix code with lengths $\{1, 2, 3\}$ satisfying the Kraft inequality ($2^{-1} + 2^{-2} + 2^{-3} = \frac{7}{8} \le 1$). However, its longest codeword, 111, has no sibling of the same length (the branch 110 is simply left unused), and its Kraft sum falls short of 1. Therefore, we can say with certainty that this code, while uniquely decodable, cannot be the result of the Huffman algorithm for any probability distribution.

This connection also helps us debunk common misconceptions. One might look at a set of codeword lengths like $\{1, 3, 3, 3, 3\}$ and think it looks too "unbalanced" to be optimal. Yet, not only does this set satisfy the Kraft equality ($2^{-1} + 4 \times 2^{-3} = 1$), but there exists a probability distribution—for example, $(\frac{1}{2}, \frac{1}{8}, \frac{1}{8}, \frac{1}{8}, \frac{1}{8})$—for which the Huffman algorithm produces precisely this set of lengths. This teaches us a valuable lesson: our intuition can be misleading, and the mathematical framework of the theorem and its algorithmic consequences provide the ultimate ground truth.
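This claim is easy to verify by running the Huffman algorithm itself. A compact sketch using Python's heap, tracking only the codeword lengths: for the distribution $(\frac{1}{2}, \frac{1}{8}, \frac{1}{8}, \frac{1}{8}, \frac{1}{8})$ it produces exactly the lengths $\{1, 3, 3, 3, 3\}$, whose Kraft sum is exactly 1.

```python
import heapq
from fractions import Fraction

def huffman_lengths(probs):
    """Codeword lengths produced by binary Huffman coding (lengths only)."""
    heap = [(p, i, [i]) for i, p in enumerate(probs)]   # (prob, tiebreak, symbols)
    heapq.heapify(heap)
    lengths = [0] * len(probs)
    tiebreak = len(probs)
    while len(heap) > 1:
        p1, _, s1 = heapq.heappop(heap)                 # two least probable subtrees
        p2, _, s2 = heapq.heappop(heap)
        for s in s1 + s2:            # every symbol in the merged subtree sinks one level
            lengths[s] += 1
        heapq.heappush(heap, (p1 + p2, tiebreak, s1 + s2))
        tiebreak += 1
    return lengths

probs = [Fraction(1, 2)] + [Fraction(1, 8)] * 4
lengths = huffman_lengths(probs)
assert sorted(lengths) == [1, 3, 3, 3, 3]                 # "unbalanced", yet optimal
assert sum(Fraction(1, 2 ** l) for l in lengths) == 1     # Kraft equality holds
```

The unique integer tiebreaker keeps the heap comparisons well defined when probabilities are equal.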

Elegant Engineering and Unbounded Horizons

The beauty of a deep theorem is that its applications keep unfolding in unexpected and elegant ways. Consider the challenge of data compression. When you compress a file using a Huffman code, you must also include the "codebook"—the mapping of symbols to codewords—so the decompressor knows how to interpret the data. This codebook adds overhead to the file size.

Here, the Kraft-McMillan theorem offers a clever hack. If we are using a full Huffman code (where $\sum_i 2^{-l_i} = 1$) and both sender and receiver agree on a canonical ordering of the symbols, the sender doesn't need to transmit all $N$ codeword lengths. They only need to send $N - 1$ of them! The receiver, armed with the knowledge that the sum must equal 1, can mathematically derive the final, missing length using the equation $l_N = -\log_2\left(1 - \sum_{i=1}^{N-1} 2^{-l_i}\right)$. This subtle trick, born directly from the theorem, allows for a more compact representation of the codebook itself, saving precious bits in the final compressed file.
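A sketch of the receiver's side of this trick (the function name is illustrative): given the first $N-1$ lengths of a complete code, the missing one falls straight out of the budget equation.

```python
import math

def recover_last_length(sent_lengths):
    """Recover the omitted codeword length of a complete (Kraft sum = 1) code."""
    remaining = 1.0 - sum(2.0 ** -l for l in sent_lengths)
    return round(-math.log2(remaining))   # remaining budget is exactly 2**-l_N

# Complete code with lengths {1, 3, 3, 3, 3}: send {1, 3, 3, 3}, recover the last 3.
assert recover_last_length([1, 3, 3, 3]) == 3

# Complete code {00, 01, 10, 11}: send {2, 2, 2}, recover the final 2.
assert recover_last_length([2, 2, 2]) == 2
```

The `round` guards against tiny floating-point error; exact rational arithmetic would work equally well.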

Furthermore, the theorem's power is not confined to finite alphabets. Consider an infinite code like $\{0, 10, 110, 1110, \ldots\}$, where a codeword is formed by any number of '1's followed by a '0'. This type of code is useful for encoding things like run-lengths, where you don't know in advance what the longest run will be. Is this code safe to use? A quick check reveals it is a prefix code, and thus uniquely decodable. If we calculate its Kraft sum, we get the infinite geometric series $\sum_{k=1}^{\infty} 2^{-k}$, which beautifully converges to exactly 1. The theorem holds, providing a solid theoretical foundation for codes that handle sources with a countably infinite number of possible symbols.

From a simple check on a set of integers to the design of expandable communication protocols, from optimization algorithms to the very structure of optimal codes and elegant engineering hacks, the Kraft-McMillan theorem reveals itself not as a mere classifier of codes, but as a deep, unifying principle. It is a testament to how a simple mathematical idea can provide the vocabulary and the tools to understand, design, and perfect the way we communicate in a world built on information.