
In our daily lives, "information" is a fluid concept. But in science and engineering, we require precision. How can we objectively measure the amount of information gained from observing an event? The answer lies in a simple yet profound insight: information is a measure of surprise. An event that is certain to happen is not surprising and conveys no new information upon its occurrence. Conversely, a rare, unexpected event is highly surprising and thus, highly informative. This intuitive link provides the foundation for a rigorous, mathematical framework.
This article addresses the fundamental challenge of formalizing this notion of "informational surprise." It builds a quantitative ruler to measure information, moving beyond simple scenarios to a general principle applicable to any event with a known probability. Across the following chapters, you will discover how this single idea revolutionizes our understanding of the world. In "Principles and Mechanisms," we will derive the formula for self-information from first principles, explore its properties, and see how it leads directly to the pivotal concept of entropy. Subsequently, in "Applications and Interdisciplinary Connections," we will journey through diverse scientific fields—from the reliability of space probes and the certainty of genomic data to the very foundations of thermodynamics—to witness how this elegant theory provides a universal language for quantifying structure, order, and knowledge.
What is information? We use the word all the time, but in science, we need a more precise idea. We need a way to measure it. Imagine you are about to witness an event. It could be the flip of a coin, the roll of a die, or the result of a complex physics experiment. If you already know the outcome with absolute certainty, learning the result when it happens gives you... nothing. No new information at all. But if the outcome is highly uncertain, if it is something truly astonishing, then learning the result is a big deal. You've gained a lot of information.
This gives us a wonderful clue. Information seems to be deeply connected to the element of surprise. A certain event is not surprising at all (zero information), while a very rare event is extremely surprising (a lot of information). Our first task, then, is to build a mathematical ruler to measure this "surprise."
Let's play a game. I have picked one specific, pre-determined configuration on a medical imaging device, and there are 2,500 possible configurations to choose from. Your job is to guess which one I picked by asking yes-or-no questions. What's the most efficient strategy? You'd probably play a game of "20 Questions." You could ask, "Is the configuration number in the first half, from 1 to 1250?" With that single "yes" or "no," you've cut your uncertainty in half. You keep doing this, halving the remaining possibilities with each question, until you've cornered the unique answer.
How many questions would it take, on average? If there were 2 possibilities, it would take 1 question. If there were 4 possibilities, it would take 2 questions. For 8 possibilities, 3 questions. You see the pattern: the number of questions is the power to which you must raise 2 to get the total number of options. This is the logarithm base 2. For a set of $N$ equally likely possibilities, the amount of information needed to specify one of them is $\log_2 N$ bits.
For our medical device with 2,500 configurations, the information content would be $\log_2 2500$, which is about $11.29$ bits. This means you would need, on average, around 11 to 12 yes/no questions to pinpoint the exact configuration. Similarly, to identify a specific card drawn from a standard 52-card deck, the amount of information gained is $\log_2 52$, or about $5.70$ bits.
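The halving strategy is easy to check numerically. Here is a minimal Python sketch (standard library only; the function name is ours) reproducing the two examples above:

```python
import math

def bits_to_identify(n_options: int) -> float:
    """Bits (ideal yes/no questions) needed to single out one of
    n_options equally likely possibilities: log2(n_options)."""
    return math.log2(n_options)

print(bits_to_identify(2500))  # ~11.29 -> 11 to 12 questions
print(bits_to_identify(52))    # ~5.70  -> one card from a standard deck
```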
This quantity, the measure of the surprise of an outcome, is called self-information. The unit we get from using the base-2 logarithm is the fundamental unit of information theory: the bit. One bit is the amount of information you get from a fair coin toss—the information needed to resolve an uncertainty with two equally likely outcomes.
The rule is simple and elegant, but it rests on a big assumption: that all outcomes are equally likely. Reality is rarely so neat. In the English language, the letter 'E' appears far more often than 'Z'. A rainstorm in a desert is far less probable than a sunny day. How do we measure the surprise of events that have different probabilities?
We need to generalize our rule. Instead of depending on the number of outcomes $N$, the information should depend on the probability $p$ of the specific outcome we observe. Let's think about what properties this function $I(p)$ must have. First, it should decrease as $p$ increases: the more probable an event, the less surprising it is. Second, a certain event should carry no information: $I(1) = 0$. Third, for two independent events, the surprises should add up: $I(p \cdot q) = I(p) + I(q)$.
The logarithm is the only (continuous) function that satisfies these properties! This leads us to the master formula for self-information, first proposed by Claude Shannon, the father of information theory:

$$I(x) = -\log_2 p(x) = \log_2 \frac{1}{p(x)}$$
Let's check it. If $p = 1$, then $I = -\log_2 1 = 0$. It works. As $p$ gets smaller and approaches 0, $I(p)$ grows towards infinity. This also makes sense: an incredibly rare event is incredibly surprising. And what about our old rule for equally likely outcomes? In that case, the probability of any single outcome is $p = 1/N$. Plugging this in gives $I = -\log_2(1/N) = \log_2 N$. Our new rule contains the old one as a special case. This is the mark of a powerful scientific definition.
Consider the probability of observing the letter 'Z' in a typical English text, which is very low, around $0.00074$. The self-information of seeing a 'Z' is $-\log_2 0.00074 \approx 10.4$ bits. In contrast, for a common letter like 'E' with a probability of about $0.127$, the self-information is only $-\log_2 0.127 \approx 3.0$ bits. This difference is the entire basis for data compression algorithms like Huffman coding or the zip format on your computer. They cleverly use shorter codes for low-information (high-probability) symbols and longer codes for high-information (low-probability) ones.
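As a quick illustration, here is a hedged Python sketch of the self-information formula applied to the two letters above (the frequencies are the approximate values quoted in the text, not exact corpus statistics):

```python
import math

def self_information(p: float) -> float:
    """Self-information (surprisal) in bits of an event with probability p."""
    return -math.log2(p)

print(self_information(0.00074))  # 'Z': ~10.4 bits -- rare, very surprising
print(self_information(0.127))    # 'E': ~3.0 bits -- common, barely surprising
```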
The formula is defined for probabilities in the range $0 < p \le 1$. What would happen if, through some calculation error, we ended up with a probability greater than 1, say $p = 2$? If we naively plug this into our formula, we get $I = -\log_2 2 = -1$ bit, which is a negative number.
What would negative information even mean? It would suggest that by observing an event, you end up more uncertain than you were before. It's like answering a question and becoming more confused. This is conceptually absurd. The process of gaining information is, by definition, the reduction of uncertainty. Therefore, self-information must always be non-negative.
The fact that an invalid probability leads to a nonsensical result (negative information) is not a flaw in the theory—it's a feature! It acts as a consistency check. If you ever calculate negative self-information, you can be sure that you've made a mistake in your probability model. The universe of valid probabilities, $0 < p \le 1$, maps perfectly onto the universe of sensible information values, $0 \le I < \infty$.
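In code, this consistency check becomes a simple guard. A sketch extending the function above, where an invalid probability is rejected before it can produce nonsensical negative information:

```python
import math

def self_information(p: float) -> float:
    # Valid probabilities live in (0, 1]; anything else signals a
    # mistake in the probability model, exactly as argued above.
    if not 0.0 < p <= 1.0:
        raise ValueError(f"invalid probability {p}: check your model")
    return -math.log2(p)
```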
Why did we choose the logarithm base 2? It arose naturally from our game of yes-or-no questions. This choice defines the unit of information as the bit. But this choice is a convention, much like choosing to measure length in meters instead of feet.
We could have used any other logarithmic base. If we use the natural logarithm (base $e$), the unit of information is called the nat. This unit is often preferred in theoretical physics and machine learning because of the elegant properties of the number $e$.
If we were to use the base-10 logarithm, the unit is called the hartley, named after another pioneer of information theory, Ralph Hartley. For instance, if you are guessing the answer to a five-option multiple-choice question, the information you gain upon learning the correct answer is $\log_2 5 \approx 2.32$ bits, or $\ln 5 \approx 1.61$ nats, or $\log_{10} 5 \approx 0.70$ hartleys. The underlying amount of "surprise" is the same; only the scale on our ruler has changed.
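The three units are just rescalings of one another, as a short sketch makes explicit (using the five-option quiz example from the text):

```python
import math

p = 1 / 5  # probability of guessing a five-option question correctly

print(-math.log2(p))   # ~2.32 bits
print(-math.log(p))    # ~1.61 nats
print(-math.log10(p))  # ~0.70 hartleys
# Same surprise, different rulers: 1 nat = 1/ln(2) ~ 1.44 bits.
```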
So far, we have focused on the information of a single, given outcome. But most of the time, we are interested in a random process—a source that emits a stream of symbols, a system that fluctuates between states. Let's say we have a source that emits characters from a four-letter alphabet $\{A, B, C, D\}$ with different probabilities.
When a character is emitted, we observe it and calculate its self-information, or surprisal. If we see the most common character, say $A$ (with $p = 1/2$), the surprisal is low: $-\log_2(1/2) = 1$ bit. If we see a rarer character like $D$ (with $p = 1/8$), the surprisal is higher: $-\log_2(1/8) = 3$ bits.
Notice something interesting: because the outcome $X$ is a random variable, the surprisal we receive, $I(X)$, is also a random variable. It doesn't have a fixed value; its value depends on the outcome of the random process. And because it's a random variable, we can analyze it just like any other. We can ask about its average value, its variance, its entire distribution.
For example, we can calculate the variance of the surprisal, which tells us how much the information content tends to fluctuate around its average value. For some distributions, this calculation is remarkably straightforward. For a process described by a geometric distribution, the variance of the surprisal turns out to be directly proportional to the variance of the original random process itself. This ability to treat information itself as a statistical quantity is a profound leap.
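To see that this is more than wordplay, here is an illustrative Monte Carlo sketch for the geometric case (the parameter value is ours): the surprisal of an outcome $k$ is a linear function of $k$, so its variance is a constant multiple of the variance of the process itself.

```python
import math, random

random.seed(0)
p = 0.3  # per-trial success probability (an assumed illustrative value)

def geometric() -> int:
    """Number of trials up to and including the first success."""
    k = 1
    while random.random() >= p:
        k += 1
    return k

samples = [geometric() for _ in range(100_000)]
# Surprisal of outcome k under P(X = k) = (1-p)^(k-1) * p.
surprisals = [-math.log2((1 - p) ** (k - 1) * p) for k in samples]

def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

print(variance(surprisals))                       # empirical Var(I)
print(math.log2(1 - p) ** 2 * variance(samples))  # (log2(1-p))^2 * Var(X): matches
```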
This brings us to the most important question of all. If a source is randomly generating symbols, what is the average surprisal we should expect to receive per symbol?
This is simply the expected value of our surprisal random variable, $E[I(X)]$. To calculate this, we take each possible surprisal value, multiply it by the probability of it occurring, and sum them all up.
This quantity, the average self-information of a random variable, is one of the most fundamental concepts in all of science. It is called the Shannon entropy, and it is typically denoted by $H$:

$$H(X) = E[I(X)] = -\sum_{x} p(x) \log_2 p(x)$$
For a simple bistable switch that is 'ON' with probability $p$ and 'OFF' with probability $1 - p$, its entropy is $H(p) = -p \log_2 p - (1 - p) \log_2 (1 - p)$. This famous inverted-U-shaped curve is 0 when $p = 0$ or $p = 1$ (no uncertainty) and reaches its maximum of 1 bit when $p = 1/2$ (maximum uncertainty). Entropy, therefore, is a measure of the average uncertainty of a random variable. It tells us, on average, how many bits of information we gain with each new observation.
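The curve is easy to reproduce; here is a minimal sketch of the binary entropy function (the function name is ours):

```python
import math

def binary_entropy(p: float) -> float:
    """Entropy in bits of a switch that is ON with probability p."""
    if p in (0.0, 1.0):
        return 0.0  # no uncertainty at the endpoints
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

print(binary_entropy(0.5))  # 1.0 bit: maximum uncertainty
print(binary_entropy(0.9))  # ~0.47 bits: a biased, more predictable switch
```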
Is this "average surprise" just a mathematical abstraction? Or does it correspond to something we can actually measure? Herein lies the beauty and power of the concept.
Imagine we build a "surprisal logger" that watches a source emit a long sequence of symbols $x_1, x_2, \ldots, x_n$. For each symbol, it computes the surprisal, $I(x_i)$, and after $n$ symbols, it calculates the average of all the values it has seen. What value will this device read as $n$ becomes very, very large?
The Strong Law of Large Numbers from probability theory gives a stunning answer: as the sequence gets longer, the measured sample average of the surprisal is guaranteed to converge to the theoretical expected value, the entropy $H(X)$.
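A toy version of the surprisal logger, written as an assumed Python sketch, shows the convergence in action (the source is the four-character alphabet from earlier, with the remaining probabilities filled in as an assumption):

```python
import math, random

random.seed(1)
probs = {"A": 0.5, "B": 0.25, "C": 0.125, "D": 0.125}
entropy = -sum(p * math.log2(p) for p in probs.values())  # 1.75 bits

symbols, weights = list(probs), list(probs.values())
total = 0.0
for n in range(1, 100_001):
    x = random.choices(symbols, weights)[0]
    total += -math.log2(probs[x])      # log the surprisal of this symbol
    if n in (10, 1_000, 100_000):
        print(n, round(total / n, 4))  # running average drifts toward 1.75

print("entropy:", entropy)
```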
This is the ultimate bridge between abstract theory and the physical world. Entropy is not just a formula; it is the stable, long-term average information produced by a source. It is the fundamental limit to how much you can compress data from that source. It tells us that the elegant mathematical framework we've built, starting from the simple idea of "surprise," describes a real, measurable property of our universe. And this is just the beginning. By extending these ideas, we can quantify the information shared between two variables—the mutual information—and lay the foundations for understanding everything from telecommunication and computer science to the very workings of life itself.
After our journey through the principles of self-information, you might be left with a feeling similar to the one you get after learning a new, fundamental law of physics. It's elegant, it's powerful, but the natural question arises: "What is it good for?" It's a fair question. The beauty of a concept like self-information isn't just in its mathematical tidiness; it's in its astonishing universality. The measure of "surprise" we've defined, $I(x) = -\log_2 p(x)$, turns out to be a kind of universal language that helps us understand and quantify phenomena in fields that, on the surface, seem to have nothing to do with each other. It's a tool for thinking, and in this chapter, we will see it in action, journeying from the guts of our computers to the fabric of life itself, and finally to the very heart of thermodynamics.
Let's start with something familiar: reliability. We build machines—from a simple light switch to a deep-space probe—to be reliable. We want them to work as expected. The "expected" outcome (the switch turns on the light, the probe's sensor stays quiet) happens with a very high probability, close to 1. The self-information of this everyday event is therefore very low; it's not surprising. But what about when things go wrong?
Imagine a highly sensitive sensor on a space probe designed to detect rare particles, or a quantum bit (qubit) in a future computer designed to hold its state faithfully. The probability of a malfunction—a false positive signal from the sensor or a spontaneous flip of the qubit—might be incredibly small, say, $p = 10^{-6}$. The event itself is a nuisance, an error. But the message that this error has occurred is packed with information. Its self-information is $-\log_2 10^{-6} \approx 19.9$ bits. This is a substantial amount of information! It tells an engineer on Earth or a quantum error-correction algorithm precisely where to focus its attention. In a sea of "everything is fine" signals (each carrying almost zero bits of information), a single high-information "error" flag is a beacon that guides diagnosis and repair. The rarity of an event is what makes its occurrence so informative. This principle underpins everything from industrial quality control to debugging complex software.
This same idea extends beautifully to the scientific process itself. How do we decide if a new drug works or if a new particle has been discovered? Scientists often use a tool called the "p-value". In simple terms, the p-value is the probability of seeing our experimental results (or something even more extreme) if our new idea (the "alternative hypothesis") is wrong and nothing special is going on (the "null hypothesis").
If a clinical trial yields a p-value of, say, $0.015$, it means there was only a 1.5% chance of observing such a positive outcome for the drug if it were just a sugar pill. From an information theory perspective, this rare result is a "surprising" event. We can even calculate its information content, though scientists often use natural logarithms and the unit "nats" for this. The surprisal would be $-\ln 0.015 \approx 4.2$ nats. A smaller p-value corresponds to a more surprising result and, therefore, a higher information content. A scientific discovery, in this sense, is an experiment that delivers a high-information message from nature, a message so surprising that it forces us to update our understanding of the world.
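In code, converting a p-value to surprisal is a one-liner; a sketch using the trial result quoted above:

```python
import math

p_value = 0.015             # the illustrative clinical-trial p-value
print(-math.log(p_value))   # ~4.2 nats of surprise
print(-math.log2(p_value))  # ~6.1 bits, if you prefer the base-2 ruler
```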
Perhaps nowhere is the impact of information theory more revolutionary than in biology. Life, after all, is an information-processing system. From DNA to the brain, nature trades in information.
Reading the Code of Life
When scientists sequence a genome, they use machines that read the letters A, C, G, and T. But this reading process is not perfect. How can we trust the data? Modern genomics has a brilliant solution, and it's pure information theory. Each base that is "called" by a sequencer comes with a quality score, known as a Phred score, $Q$. This score is nothing but a re-expression of the self-information of an error! The score is defined as $Q = -10 \log_{10} p_{\text{error}}$, where $p_{\text{error}}$ is the probability that the base call is wrong. A high Q-score, like $Q = 30$, corresponds to a low error probability ($p_{\text{error}} = 10^{-3}$), meaning the event of an error would be very surprising, carrying $-\log_2 10^{-3} \approx 10$ bits of information. So, built into the very fabric of modern genomics is a practical, everyday application of self-information to quantify the certainty of our most fundamental biological data.
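The conversion between Phred scores, error probabilities, and bits is mechanical, as this sketch shows (function names are ours; the Q = 30 example follows the text):

```python
import math

def phred_to_error_prob(q: float) -> float:
    """Invert the Phred definition Q = -10 * log10(p_error)."""
    return 10.0 ** (-q / 10.0)

def error_surprisal_bits(q: float) -> float:
    """Self-information, in bits, of an actual base-calling error at quality Q."""
    return -math.log2(phred_to_error_prob(q))

print(phred_to_error_prob(30))   # 0.001: one expected error per 1000 calls
print(error_surprisal_bits(30))  # ~9.97 bits
```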
The Specificity of Molecular Machines
The cell is a crowded place. How does a protein, like a transcription factor, find its specific target docking site on a DNA strand that is millions or billions of bases long? It does so by recognizing a specific sequence of letters. The more specific the protein, the longer and more unique its target sequence must be.
We can quantify this using self-information. Let's assume the four DNA bases occur with equal probability (1/4). The information needed to specify one base is $-\log_2(1/4) = 2$ bits. A protein that recognizes a short sequence of, say, 9 base pairs is "reading" $9 \times 2 = 18$ bits of information. A more sophisticated gene-editing tool that recognizes a longer, 18-base-pair sequence is reading $36$ bits of information. It's this doubling of information that makes the second tool vastly more specific, less likely to bind to the wrong place in the genome. Furthermore, this information content is directly linked to the thermodynamics of binding. The binding free energy, $\Delta G$, which a physical chemist would measure, can be directly translated into the information content of the binding site. A "tighter" bond (more negative $\Delta G$) corresponds to a more specific, higher-information interaction.
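Under the equal-frequency assumption stated above, the information read by a recognition site scales linearly with its length; a minimal sketch:

```python
import math

def site_information_bits(n_bases: int, p_per_base: float = 0.25) -> float:
    """Bits specified by an n-base recognition site when each base
    occurs with probability p_per_base (1/4 by default)."""
    return n_bases * -math.log2(p_per_base)

print(site_information_bits(9))   # 18 bits: a short recognition sequence
print(site_information_bits(18))  # 36 bits: a gene-editing-scale target
```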
Diving deeper, we find even more subtle layers of information. The genetic code is famously "degenerate"—multiple three-letter codons can specify the same amino acid. For example, both TTA and CTG code for Leucine. One might think the choice between them is random, but it often isn't. Some organisms prefer certain synonymous codons over others. This "codon bias" is another channel of information. By analyzing the probabilities of these choices, we can calculate the information stored in the synonymous parts of a gene, separate from the information encoding the protein itself. This hidden information can influence the speed and efficiency of protein production, adding another layer of regulation to the intricate machinery of life.
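As a toy illustration (the codon frequencies here are invented for the example), the information carried by a synonymous choice is just the entropy of the usage distribution:

```python
import math

# Hypothetical usage frequencies for two synonymous Leucine codons.
codon_usage = {"TTA": 0.15, "CTG": 0.85}

h = -sum(p * math.log2(p) for p in codon_usage.values())
print(h)  # ~0.61 bits of "hidden" information per synonymous choice
```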
The Brain's Noisy Channel
From the single molecule, let's zoom out to the entire nervous system. Your thoughts, feelings, and perceptions are all encoded in the firing of neurons. A neuron fires when an action potential arrives at its terminal, causing ion channels to open and trigger the release of neurotransmitters. But these ion channels, being tiny molecular machines, don't open with perfect certainty. Their gating is a stochastic, probabilistic process.
This inherent randomness, or noise, places a fundamental limit on how much information a synapse can transmit per second. If a neuron tries to encode a "high" signal by opening, on average, 50 calcium channels, the actual number in any given event might be 45, or 53, due to random fluctuations. This variability blurs the lines between different signal levels. Using the principles of information theory, we can build a model of this noisy channel and estimate its capacity. We find that the bit rate is limited by the signal strength relative to the noise. The brain is not a perfect digital computer; it's a noisy, analog, statistical machine, and its remarkable power must operate within the fundamental physical constraints described by information theory.
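One crude way to see the limit is a back-of-envelope sketch, not the full channel model: if the number of open channels spans some dynamic range and fluctuates with a roughly Poisson spread, the count of distinguishable signal levels is about the range divided by the noise width, and the information per use is its logarithm (all values here are assumed for illustration):

```python
import math

max_channels = 100     # assumed dynamic range of open channels
noise = math.sqrt(50)  # ~Poisson spread around a mean of 50 channels

levels = max_channels / noise  # roughly distinguishable signal levels
print(math.log2(levels))       # ~3.8 bits per synaptic event, very roughly
```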
We end our tour by returning to physics, where the concept of information finds its deepest and most surprising connections.
In chemistry, when molecules react, the resulting product molecules can be formed in various rotational or vibrational states. A "statistical" reaction would populate all available energy states according to their degeneracies, like randomly throwing balls into bins. However, many reactions are not statistical. They show a preference for certain states. We can analyze this deviation using "surprisal analysis". The surprisal of a state is, once again, the negative logarithm of the ratio of its observed probability to its expected statistical probability. For many reactions, this surprisal turns out to be a simple linear function of the state's energy. Astonishingly, the parameter describing this linear relationship behaves exactly like an inverse temperature ($1/k_B T$). This gives us the concept of an "effective temperature" for the reaction products, which is a direct measure of how non-statistical, or "surprising," the outcome is.
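A small numerical sketch (with invented, Boltzmann-like state populations) shows what surprisal analysis looks like in practice: the surprisal of each product state is linear in its energy, and the slope plays the role of an inverse temperature:

```python
import math

energies = [0.0, 1.0, 2.0, 3.0]          # arbitrary energy units
prior    = [0.25, 0.25, 0.25, 0.25]      # the "statistical" expectation
observed = [0.413, 0.277, 0.186, 0.124]  # hypothetical measured populations

for e, p_obs, p0 in zip(energies, observed, prior):
    surprisal = -math.log(p_obs / p0)
    print(e, round(surprisal, 3))
# The output is linear in energy with slope ~0.4: an "effective" inverse temperature.
```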
The most profound connection, however, comes when we look at the foundations of statistical mechanics. The entropy of a system, a cornerstone of thermodynamics, is precisely the average self-information over all possible microscopic states of that system. But what about the variance of the self-information? Does the fluctuation in the "surprisal" of the microstates mean anything physical?
The answer is a resounding yes. In a beautiful and deep result, one can prove that the variance of the thermodynamic surprisal is directly proportional to the system's heat capacity at constant volume, $C_V$. Let that sink in. Heat capacity is a macroscopic, measurable property that tells you how much energy you need to add to a system to raise its temperature. Water has a high heat capacity; it can soak up a lot of heat. This derivation shows that this same property implies that the microscopic states of water exhibit a wide variation in their probability, and thus in their "surprisal." A system that is good at storing thermal energy is also one with large fluctuations in its microstate information content. This result, $\mathrm{Var}(I) = C_V / k_B$, is a stunning bridge, connecting a bulk thermodynamic property ($C_V$) directly to the statistical fluctuations of information ($\mathrm{Var}(I)$).
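The identity can be checked numerically in a few lines. A sketch for a two-level system (in units where $k_B = 1$; the parameter values are ours):

```python
import math

eps, beta = 1.0, 0.7  # level spacing and inverse temperature (assumed values)

z = 1.0 + math.exp(-beta * eps)               # partition function
probs = [1.0 / z, math.exp(-beta * eps) / z]  # Boltzmann probabilities
energies = [0.0, eps]

# Variance of the surprisal -ln(p_i) over the microstates.
mean_i = sum(p * -math.log(p) for p in probs)  # this average is the entropy
var_i = sum(p * (-math.log(p) - mean_i) ** 2 for p in probs)

# Heat capacity C_V = beta^2 * Var(E) when k_B = 1.
mean_e = sum(p * e for p, e in zip(probs, energies))
c_v = beta**2 * sum(p * (e - mean_e) ** 2 for p, e in zip(probs, energies))

print(var_i, c_v)  # the two numbers agree, as the text claims
```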
From engineering to biology to the core of physics, the simple act of quantifying surprise has given us a lens of unparalleled clarity. It reveals the hidden unities in the world and shows us that information is not just an abstract idea, but a physical quantity as real and as fundamental as energy and entropy.