  • Classical Information Theory: Data Compression and Channel Capacity
  • Hands-on Practice
  • Problem 1
  • Problem 2
  • Problem 3
  • What to Learn Next

Classical Information Theory: Data Compression and Channel Capacity

Definition

Classical information theory, in its treatment of data compression and channel capacity, quantifies information through entropy and establishes the fundamental limits of data transmission. Its Source Coding Theorem determines the unbreakable limit for lossless data compression, based on the measure of information as surprise. Its Channel Coding Theorem defines channel capacity, the maximum rate for reliable, error-free communication over noisy media, achieved with optimal strategies such as water-filling and superposition coding.

Key Takeaways
  • Entropy quantifies information as "surprise" and establishes the fundamental, unbreakable limit for lossless data compression via the Source Coding Theorem.
  • Channel Capacity defines the maximum rate for reliable, error-free communication over a noisy medium, a hard limit described by the Channel Coding Theorem.
  • Information-theoretic principles like "water-filling" and "superposition coding" provide mathematically optimal strategies for resource allocation in complex communication systems.
  • The concepts of information theory extend far beyond communications, providing fundamental limits and insights into diverse fields like control theory, statistical inference, and finance.

Introduction

In an age defined by data, from the messages on our phones to the signals from distant stars, a fundamental question arises: what are the basic laws governing information itself? Before Claude Shannon, communication was an art of engineering and clever tricks. There was no underlying science to define the absolute limits of data compression or the ultimate speed limit for error-free transmission. This article bridges that gap, introducing the elegant and powerful framework of classical information theory. We will first delve into the Principles and Mechanisms, uncovering the core concepts of entropy, mutual information, and the celebrated theorems that set the unbreakable rules for data compression and channel capacity. Next, we will explore the theory's astonishing reach in Applications and Interdisciplinary Connections, seeing how these principles shape everything from modern wireless networks and data security to control theory and statistical inference. Finally, the Hands-on Practice section will allow you to solidify your understanding by tackling practical problems, translating these profound ideas into tangible skills.

Principles and Mechanisms

Now that we have been introduced to the grand stage of information theory, let's pull back the curtain and examine the machinery that runs the show. Like a master watchmaker, Claude Shannon didn't just tell us what was possible; he gave us the gears and springs, the principles and mechanisms, that govern the flow of information through our universe. Our journey will start with the very atom of information theory—a single idea called entropy—and from there, we will build worlds.

The Measure of Surprise: What is Entropy?

Before we can compress data or send it across a noisy channel, we must first answer a seemingly philosophical question: What is "information"? If I tell you the sun will rise tomorrow, I have given you very little information. But if I tell you the winning lottery numbers, I have given you a great deal. The difference lies in surprise​, or equivalently, the reduction of uncertainty​. Information is what resolves our uncertainty.

Shannon gave this intuitive idea a precise mathematical form called entropy, denoted by the symbol H. For a process with several possible outcomes, the entropy is the average amount of surprise we experience when we learn the actual outcome. It's the average number of "yes/no" questions you'd need to ask to pinpoint the result. A fair coin flip (H = 1 bit) has more surprise than a biased coin that lands on heads 99% of the time (H ≈ 0.08 bits). With the biased coin, you're almost certain of the outcome before it even happens, so learning the result is hardly surprising.
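
These numbers are easy to reproduce. A minimal sketch in Python (the function name and example values are ours, for illustration only):

```python
import math

def entropy(probs):
    """Shannon entropy in bits: H = -sum(p * log2(p)), treating 0*log(0) as 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))    # fair coin: 1.0 bit
print(entropy([0.99, 0.01]))  # heavily biased coin: about 0.08 bits
```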

This concept extends to any random process. Imagine you are waiting for a machine to complete a task that has a probability p of succeeding at each attempt. The number of trials you must wait for the first success follows a geometric distribution. How unpredictable is this waiting time? Information theory gives us a precise answer through its entropy. This isn't just an academic exercise; it's the fundamental measure of randomness in processes all around us, from radioactive decay to the number of calls arriving at a switchboard.
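
For the geometric waiting time the answer has a tidy closed form: the entropy works out to H(p)/p bits, where H(p) is the binary entropy of a single attempt. A quick numerical cross-check, truncating the infinite sum (values illustrative):

```python
import math

def geometric_entropy(p, terms=1000):
    """Entropy (bits) of the trial number of the first success,
    P(N = k) = (1 - p)**(k - 1) * p, by truncating the infinite sum."""
    h = 0.0
    for k in range(1, terms + 1):
        pk = (1 - p) ** (k - 1) * p
        if pk <= 0.0:   # underflow guard for very large k
            break
        h -= pk * math.log2(pk)
    return h

p = 0.3
h_binary = -(p * math.log2(p) + (1 - p) * math.log2(1 - p))
print(geometric_entropy(p))   # direct truncated sum
print(h_binary / p)           # closed form H(p)/p; the two agree
```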

Of course, things in the real world are often related. The weather today gives us a clue about the weather tomorrow. Shannon's theory handles this with grace. We have joint entropy H(X,Y), the total uncertainty in a pair of variables, and conditional entropy H(Y|X), the remaining uncertainty in Y after you already know X. These are beautifully linked by a simple chain rule: H(X,Y) = H(X) + H(Y|X). This is just common sense in mathematical clothing: the total uncertainty of two events is the uncertainty of the first, plus the uncertainty of the second, given that the first has already happened. This simple rule is one of the foundational building blocks we will use again and again.
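
The chain rule can be verified on any small joint distribution. A sketch with a made-up weather table (all numbers illustrative):

```python
import math

def H(dist):
    """Entropy in bits of a probability table (iterable of probabilities)."""
    return -sum(p * math.log2(p) for p in dist if p > 0)

# Joint distribution p(x, y): today's weather X, tomorrow's weather Y.
joint = {('sun', 'sun'): 0.4, ('sun', 'rain'): 0.1,
         ('rain', 'sun'): 0.2, ('rain', 'rain'): 0.3}

p_x = {'sun': 0.5, 'rain': 0.5}   # marginal distribution of X
H_xy = H(joint.values())          # H(X,Y)
H_x = H(p_x.values())             # H(X)
# H(Y|X) = sum over x of p(x) * H(Y | X = x)
H_y_given_x = sum(p_x[x] * H([joint[(x, y)] / p_x[x] for y in ('sun', 'rain')])
                  for x in ('sun', 'rain'))
print(H_xy, H_x + H_y_given_x)    # chain rule: the two numbers match
```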

Squeezing the Essence: The Magic of Data Compression

Entropy is not just a philosophical measure; it's a deeply practical one. It sets the absolute, unbreakable limit for data compression. Shannon's Source Coding Theorem states that a source generating data with entropy H bits per symbol can be compressed down to an average of H bits per symbol, but no further. How is this magic possible?

The secret lies in a profound idea called the Asymptotic Equipartition Property (AEP). Imagine a monkey typing randomly on a keyboard. A very long sequence like "abababab..." is possible, but extremely unlikely. A sequence like "xqzjw..." is also unlikely. Most long sequences the monkey types will have a character composition that looks "typical" of random English letters—with lots of 'e's and 't's, and few 'z's. The AEP tells us that for any random source, almost all long sequences are "typical." These typical sequences all have roughly the same probability of occurring, and, most importantly, the number of these sequences is much, much smaller than the total number of possible sequences. For a block of n symbols, this "typical set" contains only about 2^(nH) sequences. This is the miracle of compression! We don't need to create unique descriptions for every conceivable sequence, only for the ones that are typical. All other "atypical" sequences are so rare we can effectively ignore them.
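
A back-of-the-envelope check of the AEP is easy for a coin-flip source. The sketch below groups Bernoulli(0.2) sequences by their number of ones (the 3-standard-deviation window for "typical" is an arbitrary illustrative choice):

```python
import math

n, p = 1000, 0.2
H = -(p * math.log2(p) + (1 - p) * math.log2(1 - p))   # entropy per symbol

# Typical sequences have close to n*p ones; take a 3-sigma window around it.
sigma = math.sqrt(n * p * (1 - p))
lo, hi = int(n * p - 3 * sigma), int(n * p + 3 * sigma)

prob_typical = sum(math.comb(n, k) * p**k * (1 - p)**(n - k)
                   for k in range(lo, hi + 1))
log2_count = math.log2(sum(math.comb(n, k) for k in range(lo, hi + 1)))

print(prob_typical)        # close to 1: typical sequences carry nearly all mass
print(log2_count / n, H)   # set size ~ 2^(nH): exponents within a few
                           # hundredths here, and the gap shrinks as n grows
```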

So, how do we design a practical code? The key is to use shorter descriptions for more probable symbols and longer descriptions for less probable ones—think of Morse code. To avoid ambiguity (does "..." mean "S" or "EEE"?), we use prefix codes, where no codeword is the beginning of another. This idea imposes a strict mathematical budget, formalized by the Kraft Inequality​. It tells us that the "costs" of our codewords (shorter codes are more expensive) cannot exceed a total budget of 1. You can't just assign short codewords to everything; if you choose a short one, you've used up a significant portion of your unique "coding space."
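
The budget is easy to audit. A sketch (the example length sets are illustrative):

```python
def kraft_sum(lengths):
    """Kraft sum for binary codeword lengths: sum of 2**(-l)."""
    return sum(2.0 ** -l for l in lengths)

# A valid prefix code: {0, 10, 110, 111} -> lengths 1, 2, 3, 3
print(kraft_sum([1, 2, 3, 3]))   # 1.0: the budget is exactly spent

# No binary prefix code can have lengths 1, 1, 2: the budget is overdrawn
print(kraft_sum([1, 1, 2]))      # 1.25 > 1, so such a code cannot exist
```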

A beautifully simple algorithm called Huffman coding gives us a way to automatically construct an optimal prefix code for any set of symbol probabilities. It follows a simple, greedy procedure: take the two least likely symbols, pair them up as a new parent symbol, and repeat. By working from the bottom up, it builds a coding tree that achieves the lowest possible average code length.
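
The greedy procedure is short enough to write out. A sketch that tracks only the codeword lengths (the probabilities are illustrative):

```python
import heapq

def huffman_lengths(probs):
    """Codeword length per symbol via Huffman's greedy pairing.
    Returns lengths in the same order as `probs`."""
    # Each heap item: (probability, tie-breaker, indices of symbols in subtree)
    heap = [(p, i, [i]) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    lengths = [0] * len(probs)
    while len(heap) > 1:
        p1, _, ids1 = heapq.heappop(heap)   # two least likely subtrees
        p2, _, ids2 = heapq.heappop(heap)
        for i in ids1 + ids2:               # every symbol in the merge
            lengths[i] += 1                 # sinks one level deeper
        heapq.heappush(heap, (p1 + p2, min(ids1 + ids2), ids1 + ids2))
    return lengths

probs = [0.4, 0.2, 0.2, 0.1, 0.1]
lens = huffman_lengths(probs)
avg = sum(p * l for p, l in zip(probs, lens))
print(lens, avg)   # average length is at least the entropy, about 2.12 bits
```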

But what if we don't know the probabilities of the source? This is the situation for most real-world compression programs like ZIP or PNG. The field of universal source coding provides the answer. We can design codes that "learn" the statistics of the source as they go. This adaptability comes at a price, a slight "redundancy" compared to a code that knew the statistics from the start. This quantifiable cost of ignorance, however, is remarkably small, and we can even calculate it precisely for simple cases.

Taming the Noise: The Art of Reliable Communication

Now we turn from compressing information to transmitting it across a noisy channel. Noise is the villain of our story; it corrupts messages and creates uncertainty. How much information can survive the journey?

To answer this, we need a new concept: mutual information, I(X;Y). This quantity measures the information that the output Y provides about the input X. It can be expressed as I(X;Y) = H(X) − H(X|Y), which means "the initial uncertainty about the input, minus the uncertainty that remains after seeing the output." Or, thought of another way, I(X;Y) = H(Y) − H(Y|X): "the total variety in the output, minus the variety caused by the channel's noise alone." Both views lead to the same number—a measure of the pure correlation between input and output.
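
For the binary symmetric channel, which flips each bit with probability ε, the second formula collapses nicely: with a uniform input, H(Y) = 1 and H(Y|X) is the binary entropy of ε, so I(X;Y) = 1 − H(ε). A sketch (the ε values are illustrative):

```python
import math

def h2(p):
    """Binary entropy in bits."""
    return 0.0 if p in (0, 1) else -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

# Binary symmetric channel with uniform input: I(X;Y) = 1 - h2(eps).
for eps in (0.0, 0.11, 0.5):
    print(eps, 1 - h2(eps))   # eps = 0: 1 bit survives; eps = 0.5: nothing does
```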

Every channel has a speed limit, a maximum mutual information it can support. This limit is its capacity, C. It's found by choosing the cleverest way to send signals (the best input distribution) to maximize the information that gets through. Shannon's Channel Coding Theorem, his most celebrated result, states that as long as your transmission rate R is less than the capacity C, you can communicate with an arbitrarily small probability of error. If R > C, you can't. Capacity is the ultimate speed limit for reliable communication.

This holds for all kinds of channels. For a channel whose "mood" changes over time—say, alternating between a "good" state and a "bad" state—the overall capacity is simply the average of the capacities of the states it visits. For continuous signals, like radio waves, the logic is even more beautiful. In a channel with background noise, the optimal strategy is called water-filling​. Imagine the noise level across different frequencies as an uneven riverbed. To send your signal, you pour a limited amount of "power" (the water) into this riverbed. The water naturally fills the deepest spots (the quietest frequencies) first. This is exactly how modern systems like DSL and 4G/5G work: they allocate more power to the frequency bands where the signal-to-noise ratio is best, an elegant solution found by information theory decades earlier.
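
The riverbed picture translates directly into a tiny algorithm. Here is a sketch for discrete subchannels, finding the water level by bisection (the noise numbers are made up):

```python
def water_fill(noise, total_power, iters=60):
    """Power per subchannel: P_i = max(level - noise_i, 0), with the water
    level found by bisection so the powers sum to the budget."""
    lo, hi = min(noise), max(noise) + total_power
    for _ in range(iters):
        level = (lo + hi) / 2
        used = sum(max(level - n, 0.0) for n in noise)
        if used > total_power:
            hi = level
        else:
            lo = level
    return [max(level - n, 0.0) for n in noise]

noise = [1.0, 2.0, 5.0]          # quiet, medium, and very noisy subchannels
powers = water_fill(noise, 3.0)  # pour 3 units of power into the "riverbed"
print(powers)                     # most power goes to the quietest channel;
                                  # the noisiest one stays dry
```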

The Unbreakable Laws and Necessary Compromises

Shannon's theorems are not just engineering guidelines; they are fundamental laws of nature. They set hard limits and reveal the necessary trade-offs in any system that processes information.

One of the most profound is the Data Processing Inequality. It states that for any chain of events X → Y → Z, where the output of one stage becomes the input to the next, you can never increase the information about the original source. Mathematically, I(X;Z) ≤ I(X;Y). Any computation, filtering, or transmission step can, at best, preserve information, but it usually discards some. You can't create information from nothing.

This leads to a crucial converse principle. If noise leaves some ambiguity about the input, errors in decoding are inevitable. Fano's Inequality gives this idea teeth, providing a hard lower bound on the probability of error in terms of the remaining conditional entropy H(X|Y). If H(X|Y) > 0, then your error rate must also be greater than zero.

And what if you ignore the capacity speed limit and try to transmit at a rate R > C? The result is catastrophic failure. The Strong Converse theorem shows that the probability of successful decoding doesn't just fall; it plummets to zero exponentially fast as the message length increases. Capacity is not a friendly suggestion; it's a cliff edge.

These theorems assume we can use infinitely long codes. In the real world, we use finite blocks of data, and this incurs a penalty: the maximum achievable rate is slightly less than capacity. The effect is captured by the normal approximation, which tells us we must "back off" from C by an amount roughly proportional to √(V/n), where V is the channel's dispersion (a variance-like quantity) and n is the blocklength; the penalty shrinks as the blocklength grows.

So far, we've focused on perfect, lossless reproduction. But for images, audio, and video, we don't need perfection; we just need "good enough." This is the realm of lossy compression and Rate-Distortion Theory. Here, we face a fundamental trade-off: how much compression (rate, R) can we get for a given level of imperfection (distortion, D)? The rate-distortion function R(D) gives the answer. For a Gaussian source (like many natural signals), this relationship is a simple and elegant formula: R(D) = (1/2) log₂(σ²/D). This tells us that each bit we add to our rate allows us to reduce the distortion power by a factor of four. Just as "water-filling" tells us how to optimally transmit over a noisy channel, a "reverse water-filling" principle tells us how to optimally introduce distortion to compress a source with memory: you should quantize more coarsely (introduce more error) in the frequency bands where the original signal is weakest.
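
The factor-of-four claim is a one-line consequence of the formula. A quick check (the variance value is illustrative):

```python
import math

def rate_distortion_gaussian(variance, D):
    """R(D) = 0.5 * log2(variance / D) bits per sample, valid for D <= variance."""
    return 0.5 * math.log2(variance / D)

sigma2 = 4.0
for D in (4.0, 1.0, 0.25):
    print(D, rate_distortion_gaussian(sigma2, D))
# Each extra bit of rate (0 -> 1 -> 2) cuts the allowed distortion by 4x.
```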

Beyond One Sender, One Receiver

The principles of information theory scale magnificently to entire networks.

  • Multiple-Access Channel (The "Cocktail Party Problem"): When many speakers talk to one listener, they must share the medium. Instead of a single capacity, we have a capacity region, a set of achievable rate combinations. For two users, it defines all pairs (R₁, R₂) that can be reliably decoded simultaneously.

  • Broadcast Channel (The "Town Crier Problem"): When one sender transmits to many listeners, some of whom may have better reception than others, a wonderfully clever strategy called superposition coding emerges. The sender transmits a "base layer" of information that everyone can decode, and then superimposes a "refinement layer" that only the listeners with clearer signals can unscramble. This is the conceptual basis for scalable video streaming.

  • The Power of Knowledge: What if a transmitter has some side information about the channel's state? For instance, what if it knows the pattern of noise that will be added? In a stunning demonstration of a "mind over matter" principle, the transmitter can use this knowledge to pre-cancel the effects of the noise, effectively creating a clean channel for itself and dramatically boosting capacity. Information about the channel is just as valuable as signal power.

These principles reveal a deep and unified structure. This unity even extends to other fields of science. A remarkable discovery, the I-MMSE relationship​, forges a direct link between the mutual information of a channel and the minimum possible error in statistical estimation. It shows that the increase in information from a slight bump in signal quality is directly proportional to the estimation error. This tells us that the problems of communication and inference are two sides of the same coin. It is in these profound connections, this underlying unity and beautiful simplicity, that the true genius of information theory lies.

Applications and Interdisciplinary Connections

Now that we have grappled with the fundamental principles of information, a new landscape of possibilities unfolds. The real adventure begins when we take these abstract tools—entropy, mutual information, channel capacity—out of the blackboard world and see them at work in the wild. You might suspect that a theory born from the practical problem of sending messages over telegraph wires would be confined to the domain of engineers. But what we are about to discover is something far more profound. We will find that these simple, elegant ideas form a kind of universal grammar, spoken not just by our communication devices, but by financial markets, by the very fabric of physical stability, by the dance of chaotic systems, and even by the limits of knowledge itself. The principles are few, but their echoes are everywhere, and the fun lies in learning to hear them.

The Symphony of Communication

Let's start where the theory began: in communication. Shannon didn't just tell us the limit to communication; his theory provides a blueprint for how to achieve that limit. Imagine you have several communication channels running in parallel, but some are clearer (less noisy) than others. You have a fixed total amount of power to distribute among them. How do you allocate it? Intuitively, you should give more power to the clearer channels. But how much more?

Information theory gives a wonderfully elegant answer with the "water-filling" algorithm. Picture a vessel whose bottom has a profile matching the noise levels of your channels—the noisier the channel, the higher the bottom at that point. Now, pour a fixed amount of "water," representing your total power, into this vessel. The water will naturally settle, filling the deeper parts (the low-noise channels) more, and perhaps leaving the highest parts (the noisiest channels) completely dry. The final depth of the water in each channel's section tells you the optimal power to allocate. This isn't just a pretty metaphor; it is a mathematically precise strategy that maximizes the total data rate across all channels. It is a perfect example of theory guiding practice.

The theory's reach extends far beyond simple point-to-point links. Consider the vast, interconnected web of the internet. We used to think of it as a postal system: discrete packets of information are routed from node to node, like letters in envelopes. But information theory inspired a radical alternative: network coding. At relay nodes, instead of just forwarding packets, we can mix them—performing mathematical operations on the bits. To our intuition, which is used to physical objects, mixing information sounds like a recipe for disaster. But bits are not apples and oranges. Network coding shows that by intelligently combining information flows, a network can achieve its absolute maximum throughput, a limit defined by the famous "max-flow min-cut" theorem. It allows a source to multicast a message to multiple destinations at a rate determined only by the narrowest bottleneck a single destination faces, a feat impossible with simple routing.
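
The classic "butterfly" example makes this concrete: two bits must reach two sinks through a shared one-bit bottleneck, and XOR-ing at the relay serves both at once. A sketch (the bit values are illustrative):

```python
# Butterfly network: a source multicasts bits b1 and b2 to two sinks.
# Sink 1 receives b1 directly plus whatever crosses the shared middle link;
# sink 2 receives b2 directly plus the same middle-link bit. The middle link
# carries only ONE bit per use -- plain routing must pick b1 or b2, starving
# one sink. Network coding forwards a mix instead of a copy.
b1, b2 = 1, 0

middle = b1 ^ b2           # the relay sends the XOR of the two flows

sink1 = (b1, middle ^ b1)  # sink 1: has b1, recovers b2 = (b1^b2)^b1
sink2 = (middle ^ b2, b2)  # sink 2: has b2, recovers b1 = (b1^b2)^b2

print(sink1, sink2)        # both sinks obtain (b1, b2) in one channel use
```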

This brings us to a recurring, central theme: the power of side information. What you can learn depends critically on what you already know. Imagine a server broadcasting information to several clients, each of whom wants a different message but, through some prior interaction, already knows a few of the other messages. This is the "index coding" problem. A naive server might just send all the messages. A slightly cleverer one might send personalized packets to each client. But the truly elegant solution, guided by information theory, is to broadcast a handful of ingeniously crafted coded messages—linear combinations of the original ones. These coded messages act like a master key. Each client, using their unique side information as a tumbler, can unlock precisely the message they desire and no other. The reduction in the amount of information that needs to be broadcast can be astonishing.
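
A two-client toy version of index coding fits in a few lines (the message values are illustrative):

```python
# Index coding with side information: client 1 wants m1 but already knows m2;
# client 2 wants m2 but already knows m1. A naive server broadcasts both
# messages; one XOR-coded broadcast of the same length as a single message
# serves both clients at once.
m1 = 0b10110100   # message destined for client 1
m2 = 0b01101001   # message destined for client 2

broadcast = m1 ^ m2              # the single coded transmission

recovered_by_1 = broadcast ^ m2  # client 1 unlocks m1 with its side info
recovered_by_2 = broadcast ^ m1  # client 2 unlocks m2 with its side info
print(recovered_by_1 == m1, recovered_by_2 == m2)
```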

This principle finds an even more sophisticated application in modern wireless systems like cognitive radio. A "secondary" user wants to use a frequency band without interfering with the "primary" user. What if the secondary user has side information in its most potent form: it knows the primary user's message before it is sent? It can then perform a remarkable trick. It transmits its own signal with a carefully tailored pre-distortion, an "anti-signal," that cancels the primary user's transmission at the location of the secondary receiver. The result? The secondary user achieves clear communication over its own link, as though the primary user's interference were not there at all, and without disturbing the primary transmission. This is the art of turning knowledge into a finely tuned scalpel for managing interference. From the informed helper in a relay channel to the clever broadcaster, side information is not just an add-on; it is a force multiplier.

The Logic of Secrecy and Security

The idea of manipulating information flows leads us naturally into the domain of secrecy. For centuries, security was a matter of computational complexity—making a code so hard to break that it would take an adversary an unreasonable amount of time. But Claude Shannon, and later Aaron Wyner, asked a different question: can we achieve perfect security, guaranteed not by computational limits, but by the fundamental laws of information?

The answer is yes, through the model of the "wiretap channel". Imagine Alice sending a message to Bob over a primary channel, while an eavesdropper, Eve, listens in on a secondary, "wiretapper's" channel. If Eve's channel is physically noisier than Bob's, Alice can design a code that includes two parts: one part is a message for Bob, and the other is carefully designed "confusion" data. The code is constructed so that Bob, with his clearer channel, can decode the message and discard the confusion. Eve, however, with her noisier view, cannot separate the message from the confusion. For her, the entire transmission looks like random noise. The rate at which secret information can be sent is, beautifully, the capacity of Bob's channel minus the capacity of Eve's channel. We can even design for uncertainty, ensuring security against the strongest possible eavesdropper in a set of potential adversaries.
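
For Gaussian channels this recipe is a one-liner: the secrecy rate is Bob's capacity minus Eve's, and zero if Eve hears at least as well as Bob. A sketch (the SNR values are illustrative):

```python
import math

def awgn_capacity(snr):
    """Capacity of a real AWGN channel, in bits per channel use."""
    return 0.5 * math.log2(1 + snr)

def secrecy_capacity(snr_bob, snr_eve):
    """Degraded Gaussian wiretap channel: Bob's capacity minus Eve's,
    clipped at zero (no secret bits if Eve's channel is as good as Bob's)."""
    return max(awgn_capacity(snr_bob) - awgn_capacity(snr_eve), 0.0)

print(secrecy_capacity(15.0, 3.0))   # Bob much clearer than Eve: positive rate
print(secrecy_capacity(3.0, 15.0))   # Eve clearer: no secret bits possible
```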

Information theory also provides a way to forge secrecy out of thin air, or more accurately, out of shared noise. Suppose Alice and Bob observe correlated, but not identical, random sequences—perhaps they are two scientists measuring the same noisy astronomical signal from different locations. They want to agree on a secret key, but they can only communicate over a public channel that Eve can hear perfectly. Their observations are their private resource. The correlations they share are patterns that Eve, with her different noisy observation, cannot fully see. Alice can send carefully chosen hints over the public channel that help Bob reconcile his observations with hers, but are structured in such a way that they reveal almost nothing to Eve about the final secret pattern they distill. There is a delicate trade-off: more public communication makes it easier for Alice and Bob to agree, but it also leaks more information to Eve. Information theory allows us to find the optimal balance point, maximizing the rate at which secrets can be "distilled" from the noisy environment. This deep principle underpins a startling array of scenarios, from multi-user broadcasts with confidential messages to modern quantum key distribution.

The Pulse of a Connected World: Freshness and Control

In our hyper-connected world of real-time data streams, from the Internet of Things to autonomous vehicle networks, a new dimension of information has become critical: its timeliness. It is often not enough to receive a large amount of data; that data must be fresh​. The "Age of Information" (AoI) is a metric that captures this, measuring the time elapsed since the generation of the most recently received update.

Consider a sensor monitoring a remote process and sending updates through a communication channel modeled as a queue. If the sensor sends updates too frequently, it will congest the queue; packets will wait a long time to be served, and by the time they arrive, they will be stale. If it sends updates too infrequently, the information is already old before it is even sent. Common sense suggests there must be a sweet spot, and information and queueing theory prove it. There exists an optimal update rate that minimizes the long-term average age, balancing the staleness from waiting in the queue against the staleness from infrequent sampling. This principle, which applies equally to transmissions over unreliable channels requiring retransmissions, is fundamental to designing responsive and efficient real-time systems.
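
For the simplest such model, a sampler feeding an M/M/1 first-come-first-served queue with service rate μ, the long-run average age has a commonly cited closed form, Δ = (1/μ)(1 + 1/ρ + ρ²/(1 − ρ)) with utilization ρ = λ/μ. Assuming that formula, a short sweep locates the sweet spot:

```python
def average_age(rho, mu=1.0):
    """Average Age of Information for an M/M/1 FCFS queue (known closed form),
    as a function of utilization rho = lambda/mu, for 0 < rho < 1."""
    return (1.0 / mu) * (1.0 + 1.0 / rho + rho ** 2 / (1.0 - rho))

# Sweep the update rate: too low -> stale samples, too high -> queueing delay.
best = min((average_age(r / 1000.0), r / 1000.0) for r in range(1, 1000))
print(best)   # the optimum sits near rho ~ 0.53, not at either extreme
```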

The connection between information and the physical world becomes even more dramatic when we consider control theory. Imagine trying to balance a long pole on your fingertip. The pole is an unstable system; any small deviation will grow exponentially. Your eyes measure the state of the pole, and your brain commands your hand to correct its position. Now, suppose your vision is blurry or the nerve signals from your brain to your hand are slow. There is a point at which the system becomes uncontrollable.

Information theory makes this intuition precise with the "data-rate theorem". An unstable system with a parameter |a| > 1 that governs how quickly errors grow (e.g., x_{k+1} = a x_k + …) generates uncertainty at a rate of log₂|a| bits per time step. To stabilize this system, the feedback loop, from sensor to controller to actuator, must be a communication channel with a capacity of at least R_min = log₂|a| bits per time step. If the channel capacity is less than this, the system's uncertainty will grow faster than information can quell it, and stability is impossible. This is a profound, hard limit. It dictates that to control a physical system, you must be able to "know" it faster than it becomes unknowable.
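
The cliff at log₂|a| can be seen with a two-line recursion. In the standard "zooming quantizer" argument, each step the dynamics stretch the controller's uncertainty interval by |a|, while an R-bit measurement shrinks it by 2^R. A sketch (the value of a is illustrative):

```python
import math

def uncertainty_after(steps, a, rate_bits):
    """Width of the controller's uncertainty interval about the state:
    each step multiplies by |a| (dynamics) and divides by 2**R (measurement)."""
    width = 1.0
    for _ in range(steps):
        width = abs(a) * width / 2 ** rate_bits
    return width

a = 2.5                      # unstable: uncertainty grows 2.5x per step
r_min = math.log2(abs(a))    # the data-rate theorem's threshold, ~1.32 bits
print(r_min)
print(uncertainty_after(50, a, 1))  # 1 bit/step, below threshold: blows up
print(uncertainty_after(50, a, 2))  # 2 bits/step, above threshold: shrinks
```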

Information, Inference, and the Laws of Nature

The reach of information theory extends into the most fundamental questions of science: How do we learn? What can we know? And what is the nature of randomness?

In statistics and machine learning, we build models of the world from finite data. A central question is: how good can any model be? Information-theoretic tools like Fano's inequality and the Ziv-Zakai bound provide a universal answer. They establish a fundamental lower bound on the error of any statistical estimator. The core idea is that you cannot estimate a parameter with more precision than the amount of information your data actually contains about that parameter. If the mutual information between your data and the true parameter is low, then no algorithm, no matter how clever, can achieve high accuracy. This is a kind of "uncertainty principle" for statistical inference. It also explains why data processing can be dangerous: if you compress your data before feeding it to a learning algorithm, you might discard relevant information forever. This imposes a hard floor on the performance of your model, a direct consequence of the data processing inequality, which states that no amount of clever processing can create information that isn't there to begin with.

Perhaps one of the most sublime applications is in the study of chaos. Consider the logistic map, a simple equation x_{n+1} = 4 x_n (1 − x_n) that, for a starting value x₀, generates a sequence of numbers. Though the equation is perfectly deterministic, the sequence it produces appears utterly random. What's going on? The system is an information generator. At each step, the mapping stretches and folds the space of possibilities in such a way that tiny, unknowable differences in the initial state are amplified exponentially. For the parameter value r = 4, the rate of this information generation, known as the metric entropy rate, can be calculated precisely. The answer is h = ln 2. This means the system, at every single step, is effectively generating one bit of new information. It is behaving like a perfect, unbiased coin flip. A simple deterministic rule gives rise to pure randomness, and information theory provides the tool to quantify it.
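
That number can be reproduced without taking the theory on faith. For this map, the entropy rate equals the average stretching rate E[ln|f′(x)|] = E[ln|4 − 8x|] taken over the known invariant density 1/(π√(x(1−x))); the sketch below evaluates that average by simple quadrature (the substitution x = sin²t flattens the density, and the grid size is an arbitrary choice):

```python
import math

# Average of ln|4 - 8x| over the invariant density of the r = 4 logistic map.
# With x = sin(t)^2, the weighted average becomes a plain average of the
# integrand over t uniform in (0, pi/2), done here with the midpoint rule.
N = 200_000
acc = 0.0
for i in range(N):
    t = (i + 0.5) * (math.pi / 2) / N
    x = math.sin(t) ** 2
    acc += math.log(abs(4 - 8 * x))
print(acc / N, math.log(2))   # the two agree: ln 2 nats = 1 bit per step
```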

Finally, let us come full circle, from sending messages to making money. Imagine you are betting on horse races, but you have an informational edge: you know the true probabilities of each horse winning, while the bookmaker's odds reflect the public's less-informed bets. How should you allocate your capital to maximize its long-term growth rate? This is the problem solved by the Kelly criterion. The answer is startlingly simple and information-theoretic: the optimal fraction of your wealth to bet on an outcome is proportional to your informational advantage. And the maximum possible exponential growth rate of your wealth is equal to the Kullback-Leibler divergence—the relative entropy—between your true probability distribution and the one implied by the market odds. Your wealth grows at the rate at which you acquire information.
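
The claim is easy to check numerically. A sketch with an illustrative three-horse race, assuming fair odds with respect to the bookmaker's distribution q (payout 1/qᵢ per unit bet) and proportional Kelly betting as described:

```python
import math, random

p = [0.5, 0.3, 0.2]          # true win probabilities (our informational edge)
q = [1 / 3, 1 / 3, 1 / 3]    # probabilities implied by the bookmaker's odds
odds = [1 / qi for qi in q]  # fair odds with respect to q

# Kelly: bet fraction p_i of wealth on horse i. Expected log-growth per race:
growth = sum(pi * math.log2(pi * oi) for pi, oi in zip(p, odds))
kl = sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q))
print(growth, kl)            # identical: wealth grows at the relative entropy

# Monte Carlo sanity check of the growth rate
random.seed(0)
n = 100_000
log_wealth = 0.0
for _ in range(n):
    winner = random.choices(range(3), weights=p)[0]
    log_wealth += math.log2(p[winner] * odds[winner])
print(log_wealth / n)        # close to the analytic rate above
```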

From the hum of a network switch to the flutter of a chaotic system, from the tradecraft of a spy to the strategy of an investor, the concepts of information theory provide a unifying lens. By asking a simple question about communication, Claude Shannon unlocked a language that describes the flow, the value, and the very generation of knowledge across the scientific landscape. And the journey of discovery is far from over.

Hands-on Practice

Problem 1

Beyond theoretical limits, practical data compression relies on clever algorithms. The Lempel-Ziv-Welch (LZW) algorithm is a cornerstone of modern compression, building a dictionary of phrases on-the-fly to encode data efficiently. This exercise provides a hands-on walk-through of the LZW process, revealing how an adaptive dictionary grows and how variable-length codes are used to represent an input stream, giving you a practical feel for this powerful technique.

Problem​: The Lempel-Ziv-Welch (LZW) algorithm is a universal lossless data compression algorithm. Its operation relies on building a dictionary of strings encountered during compression.

The LZW algorithm can be summarized as follows:

  1. Initialize a dictionary with a set of single-character strings.
  2. Set W (the working string) to the first character of the input stream.
  3. Loop over the input stream:
     a. Read the next character K.
     b. If the string W + K exists in the dictionary, set W = W + K.
     c. If W + K is not in the dictionary, then:
        i. Output the dictionary code corresponding to W.
        ii. Add the new string W + K to the dictionary with a new code.
        iii. Set W = K.
  4. After the input stream is exhausted, output the code for the final working string W.

Consider a specific implementation of LZW with the following specifications:

  • The initial dictionary contains 256 entries for the 8-bit ASCII characters. The codes for these characters are their corresponding ASCII values (0-255). The next available code for new dictionary entries is 256.
  • The number of bits used to represent an output code is determined by the dictionary size at the moment of output. If the dictionary contains N entries, the output code is represented using ⌈log₂(N)⌉ bits.
  • The input character string to be compressed is BANANA_BANDANA.

Your task is to calculate the total number of bits in the final compressed output sequence.
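
If you want to verify your hand trace afterwards, the steps and the bit-accounting rule above transcribe directly into Python; run it on the given input to check your final count:

```python
import math

def lzw_output_bits(data: str) -> int:
    """Total bits emitted by the LZW variant specified above: each code is
    written with ceil(log2(N)) bits, N = dictionary size at output time."""
    dictionary = {chr(i): i for i in range(256)}   # 8-bit ASCII seed
    next_code = 256
    total_bits = 0
    w = data[0]
    for k in data[1:]:
        if w + k in dictionary:
            w = w + k
        else:
            total_bits += math.ceil(math.log2(len(dictionary)))
            dictionary[w + k] = next_code          # grow the dictionary
            next_code += 1
            w = k
    # flush the final working string
    return total_bits + math.ceil(math.log2(len(dictionary)))

print(lzw_output_bits("BANANA_BANDANA"))   # compare with your hand trace
```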

Problem 2

While lossless compression is ideal, many applications require trading some fidelity for a much lower data rate. Rate-distortion theory provides the fundamental limits for this trade-off, defining the minimum rate R(D) required for a given distortion D. This exercise explores this concept for a correlated, two-dimensional Gaussian source, demonstrating the "reverse water-filling" principle. You will see how to optimally allocate distortion between the source's components, a key insight in compressing multidimensional data.

Problem: Consider a memoryless bivariate Gaussian source, where each sample is an independent draw of a two-dimensional random vector X = (X₁, X₂)ᵀ. The vector X follows a multivariate normal distribution N(0, Σ) with zero mean and a covariance matrix given by

Σ = σ² [[1, ρ], [ρ, 1]]

where σ² > 0 is the variance of each component and ρ is the correlation coefficient, satisfying 1/2 < ρ < 1.

We wish to compress this source and reconstruct it as X̂ = (X̂₁, X̂₂)ᵀ. The quality of the reconstruction is measured by the total mean-squared error (MSE) distortion, defined as d(X, X̂) = E[‖X − X̂‖²] = E[(X₁ − X̂₁)² + (X₂ − X̂₂)²].

The rate-distortion function, R(D), for this source specifies the minimum achievable rate (in bits per 2D vector sample) for a given maximum allowed average distortion D.

Calculate the value of the rate-distortion function R(D) for a total average distortion of exactly D = σ².

Problem 3

The capacity of a communication channel defines the ultimate speed limit for error-free data transmission. For channels where noise is not uniform across all frequencies, simply broadcasting with uniform power is inefficient. This problem introduces the elegant "water-filling" solution for optimally allocating signal power across a frequency band to maximize the total capacity. By working through this example with a non-uniform noise spectrum, you will gain practical understanding of one of the most important resource allocation principles in communication theory.

Problem: Consider a continuous-time communication channel affected by additive colored Gaussian noise. The channel output Y(t) is given by Y(t) = X(t) + N(t), where X(t) is the input signal and N(t) is the noise.

The input signal X(t) is constrained in its total average power to P. The transmission is band-limited. The noise process N(t) is a zero-mean stationary Gaussian process with a one-sided power spectral density (PSD) given by:

S_N(f) = αf for f ∈ [W₁, W₂], and S_N(f) = ∞ otherwise.

Here, f is the frequency, and α, W₁, and W₂ are positive constants with W₂ > W₁ > 0.

The capacity C of such a channel for a real-valued signal is found by maximizing the integral of the spectral information rate over all valid signal power allocations S_X(f) that satisfy the power constraint ∫ S_X(f) df = P, the integral taken over [W₁, W₂]. The capacity formula is:

C = max over S_X(f) of ∫ log₂(1 + S_X(f)/S_N(f)) df, the integral taken over [W₁, W₂].

Determine the capacity C of this channel. You should operate under the assumption that the total signal power P is sufficiently small, such that the optimal power allocation strategy does not utilize the entire available frequency band [W₁, W₂]. Express your answer in terms of P, α, and W₁.

What to Learn Next
Quantum Information and Quantum Computation
Computational Complexity: Classes P and NP
Shannon Entropy and Its Properties