
The Units of Information: From Bits to Biology

Key Takeaways
  • The 'bit' is the fundamental unit of information, representing the resolution of uncertainty between two equally likely outcomes.
  • Entropy, defined as the average surprise or inherent uncertainty of a source, sets the absolute, unbreakable limit for data compression.
  • Information is a physical quantity, with its mathematical formulation (Shannon entropy) directly mirroring thermodynamic entropy in physics.
  • The principles of information theory provide a unified framework for understanding diverse fields, from chaotic systems and quantum measurement to DNA evolution and animal communication.

Introduction

What is information? While we use the term daily, its scientific meaning is far more precise and powerful. It's not about the semantic content of a message, but about the reduction of uncertainty it provides. This article addresses the fundamental question of how we can quantify information, moving it from an abstract idea to a measurable physical quantity. We will begin by exploring the core principles and mechanisms, defining the fundamental 'atom' of information—the bit—and developing the crucial concept of entropy. From there, we will journey across the scientific landscape to witness the surprising and profound applications of these ideas, revealing how information theory provides a common language for fields as diverse as physics, biology, and computer science. By understanding the units of information, we unlock a new lens to view the universe itself.

Principles and Mechanisms

Imagine you are a spy in an old movie. A secret message arrives. It could be one of sixteen possible plans. Before you read it, your world is a landscape of sixteen possibilities, sixteen futures shrouded in fog. You open the envelope, and the message reads, "Plan G". Suddenly, the fog clears. Fifteen futures vanish. One reality crystallizes. How much clarity did you just gain? How much uncertainty was resolved? This is the central question of information theory. It's not about the meaning of "Plan G", but about the sheer reduction in possibilities.

The Atom of Information: The Bit

Let's simplify. Forget sixteen plans. Imagine a single coin flip, hidden from view. Heads or tails? Two possibilities. The moment you see the result, you have resolved one fundamental ambiguity. This resolution of a single yes/no question, a choice between two equally likely outcomes, is the fundamental atom of information. We call it the 'bit'.

Now, let's return to our spy scenario. With sixteen equally likely plans, how many bits of information does the message "Plan G" contain? You could think of it as a game of "20 Questions." Your first question might be, "Is it one of plans A through H?" Answering this halves the possibilities, giving you one bit of information. If the answer is yes, you ask, "Is it one of plans A through D?" Another bit. You need to ask four such questions to pinpoint "Plan G" exactly. So, receiving that message resolved 4 bits of uncertainty.

Notice the mathematical pattern: 16 = 2^4. The number of bits is the power to which you raise 2 to get the number of options. This is the nature of logarithms. The amount of information, I, gained from identifying one outcome out of N equally likely possibilities is:

I = \log_2(N)

This logarithmic scale is key. It means that information adds up in a beautifully simple way. Resolving one coin flip gives you \log_2(2) = 1 bit. Resolving two independent coin flips gives you \log_2(4) = 2 bits. The information just adds.
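The logarithm's additivity is easy to check numerically. A minimal Python sketch (the function name is ours, not from any library):

```python
import math

def information_bits(n_outcomes: int) -> float:
    """Bits gained by identifying one of n equally likely outcomes: log2(n)."""
    return math.log2(n_outcomes)

# One coin flip resolves 1 bit; the sixteen-plan spy message resolves 4 bits.
print(information_bits(2))   # → 1.0
print(information_bits(16))  # → 4.0

# Additivity: two independent coin flips carry log2(4) = 1 + 1 = 2 bits.
print(information_bits(4) == information_bits(2) + information_bits(2))  # → True
```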

Surprise, Surprise!

But what if the possibilities are not equally likely? Imagine you are reading an English novel. If the next letter you see is 'E', you are not very surprised. It's the most common letter. But if the next letter is 'Z', you might pause. It's a rare event. Common events are predictable and thus carry little information. Rare, surprising events carry a great deal.

Claude Shannon, the father of information theory, captured this intuition perfectly. The information content, or surprisal, of a specific outcome x with probability p(x) is defined as:

I(x) = -\log_2(p(x))

Notice that for a small probability p(x), its logarithm is a large negative number, so -\log_2(p(x)) is a large positive number. A very improbable event carries a lot of bits. For instance, the letter 'Z' appears in English with a probability of roughly 0.00074. The information content of observing a 'Z' is therefore -\log_2(0.00074) \approx 10.4 bits. That single letter resolves as much uncertainty as more than ten consecutive coin flips!
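As a quick numerical check, here is a short sketch; the letter probabilities are the approximate English frequencies quoted above, assumed for illustration:

```python
import math

def surprisal_bits(p: float) -> float:
    """Information content, in bits, of an outcome with probability p."""
    return -math.log2(p)

# A common letter like 'E' (p ≈ 0.127) is unsurprising and carries few bits;
# a rare 'Z' (p ≈ 0.00074) carries more than ten coin flips' worth.
print(round(surprisal_bits(0.127), 2))    # → 2.98
print(round(surprisal_bits(0.00074), 1))  # → 10.4
```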

The Measure of Uncertainty: Entropy

Now we can measure the surprise of any single event. But what if we want to characterize the information source itself? A source that only ever produces the letter 'A' is perfectly predictable and produces zero information on average. A source that spits out random letters with equal probability is highly unpredictable. We need a way to measure the average surprise, or the inherent uncertainty, of a source. This measure is called entropy, denoted by H.

We calculate it by taking the information of each possible outcome, I(x_i), weighting it by the probability of that outcome, p(x_i), and summing them all up:

H = \sum_i p(x_i) I(x_i) = -\sum_i p(x_i) \log_2(p(x_i))

Entropy is one of the most profound and useful concepts in all of science. For a communications engineer, the entropy of a data source represents a fundamental limit. Shannon's source coding theorem proves that you cannot, on average, compress data from a source into fewer bits per symbol than the source's entropy. It is the absolute, unbreakable speed limit for data compression. Entropy tells you the "true" amount of information being produced.
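Entropy is just the probability-weighted average of those surprisals, and takes only a few lines to compute. A minimal sketch (the helper name is ours):

```python
import math

def entropy_bits(probs) -> float:
    """Shannon entropy H = -sum_i p_i log2(p_i), in bits per symbol."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# A fair coin is maximally uncertain: exactly 1 bit per flip.
print(entropy_bits([0.5, 0.5]))            # → 1.0

# A biased coin is more predictable, so by the source coding theorem its
# output can be compressed below 1 bit per flip on average.
print(round(entropy_bits([0.9, 0.1]), 3))  # → 0.469

# Four equally likely symbols: log2(4) = 2 bits, matching the uniform case.
print(entropy_bits([0.25] * 4))            # → 2.0
```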

And here, we stumble upon one of the most beautiful unifications in science. In the 19th century, physicists like Ludwig Boltzmann and J. Willard Gibbs developed a formula for the entropy of a physical system, like a container of gas. The Gibbs entropy is:

S = -k_B \sum_i p_i \ln(p_i)

Look familiar? It is exactly Shannon's formula, but with a natural logarithm instead of base 2, and multiplied by a physical constant, the Boltzmann constant k_B. This is no coincidence. Both formulas are measuring the same fundamental thing: the observer's uncertainty about the state of a system. Whether it's the microstate of gas molecules or the next symbol in a message, entropy quantifies our ignorance. Information is physical.

A Tale of Three Logarithms: Bits, Nats, and Hartleys

The appearance of different logarithm bases (\log_2 versus \ln) brings us to the question of units. Just as we can measure distance in meters or feet, we can measure information in different units. The choice of base for the logarithm simply sets the unit:

  • Base 2: The unit is the bit. This is the natural language of computers and binary logic.
  • Base e: The unit is the nat (for "natural unit"). This is the natural language of calculus and theoretical physics, where the number e appears everywhere.
  • Base 10: The unit is the Hartley (or ban, or decimal digit). This corresponds to the information in a single decimal digit.

The amount of uncertainty in a system is fixed; the numerical value of its entropy just depends on the units we use to measure it. Converting between them is as simple as converting between any other units, using the change-of-base formula for logarithms: H_\text{bits} = H_\text{nats} / \ln(2).

This isn't just an academic exercise. In the field of bioinformatics, scientists search for similarities between DNA or protein sequences using tools like BLAST. The raw score for an alignment is often calculated using formulas that are most natural in base e, giving a score in nats. To make the result easy to interpret, this score is converted to bits by dividing by \ln(2). The resulting bit score has a wonderfully intuitive meaning: each 1-point increase in the bit score roughly halves the number of equally good alignments expected by chance. This simple unit conversion transforms a raw mathematical output into a powerful tool for discovery.
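The conversion itself is one division. A sketch, assuming a raw score already expressed in nats (the real BLAST bit score also involves statistical parameters λ and K, which we ignore here):

```python
import math

def nats_to_bits(h_nats: float) -> float:
    """Change of base: H_bits = H_nats / ln(2)."""
    return h_nats / math.log(2)

# One nat is about 1.443 bits.
print(round(nats_to_bits(1.0), 3))   # → 1.443

# A hypothetical raw alignment score of 50 nats, expressed in bits:
print(round(nats_to_bits(50.0), 1))  # → 72.1
```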

The Laws of Information

Information, like energy, is not just a bookkeeping device; it obeys fundamental laws.

First, you cannot create information from nothing. Imagine you have a rich dataset of patient health records, Y, which reflects each patient's underlying condition, X. You process it to create a smaller, anonymized dataset, Z, perhaps for sharing with researchers. The Data Processing Inequality states that the amount of information the anonymized set Z contains about the original condition X can never be greater than the information the original dataset Y contained. Processing can only preserve or lose information; it can never invent it. In a chain of processing, X \to Y \to Z, it must always be true that I(X;Y) \ge I(X;Z). This is a fundamental law of information flow.
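The inequality is easy to watch in action. The sketch below builds a toy Markov chain X → Y → Z (Y a noisy copy of X, Z a noisy copy of Y, with flip probabilities we made up) and estimates both mutual informations empirically:

```python
import math
import random
from collections import Counter

def mutual_information_bits(pairs):
    """Empirical mutual information I(A;B), in bits, from (a, b) samples."""
    n = len(pairs)
    p_ab = Counter(pairs)
    p_a = Counter(a for a, _ in pairs)
    p_b = Counter(b for _, b in pairs)
    return sum((c / n) * math.log2((c / n) / ((p_a[a] / n) * (p_b[b] / n)))
               for (a, b), c in p_ab.items())

rng = random.Random(0)
xs = [rng.randint(0, 1) for _ in range(100_000)]
ys = [x if rng.random() < 0.9 else 1 - x for x in xs]  # 10% bit flips
zs = [y if rng.random() < 0.9 else 1 - y for y in ys]  # 10% more flips

i_xy = mutual_information_bits(list(zip(xs, ys)))
i_xz = mutual_information_bits(list(zip(xs, zs)))
print(i_xy >= i_xz)  # → True: processing Y into Z cannot add information about X
```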

Second, information transmission has a speed limit. When we send data through a noisy channel, like a deep-space probe transmitting through solar plasma, some bits might get lost. A simple model for this is the Binary Erasure Channel, where each bit either arrives perfectly or is erased with some probability \epsilon. What is the maximum rate at which you can reliably send data? Shannon's channel coding theorem gives the answer: the channel capacity, C. For this channel, the capacity is beautifully and intuitively simple: C = 1 - \epsilon. The maximum reliable rate is simply the fraction of bits that successfully get through. You can't do better.
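A quick simulation makes the capacity tangible. In this sketch each transmitted bit is independently erased with probability ε = 0.2 (our assumed value):

```python
import random

def erasure_channel(bits, epsilon, rng):
    """Each bit arrives intact, or is erased (None) with probability epsilon."""
    return [None if rng.random() < epsilon else b for b in bits]

rng = random.Random(42)
epsilon = 0.2
sent = [rng.randint(0, 1) for _ in range(100_000)]
received = erasure_channel(sent, epsilon, rng)

fraction_through = sum(r is not None for r in received) / len(sent)
print(round(fraction_through, 2))  # ≈ C = 1 - epsilon = 0.8
```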

Finally, we close the loop and see again how deeply information is woven into the fabric of the physical world. Consider a modern engineering problem involving estimating a signal X from a noisy measurement Y. A deep result called the I-MMSE identity relates the mutual information I(X;Y) to the error in the best possible estimate. At first glance, the equation seems to violate the rules of dimensional analysis: a derivative of dimensionless information equals a quantity with units of volts squared! The paradox dissolves only when you realize that for the equation to hold true, the abstract "signal-to-noise ratio" parameter must itself carry physical units that make everything consistent. Information, when measured in a physical system, is not a disembodied ghost. It is subject to the same rigorous laws as energy, momentum, and charge. It is a true physical quantity, and its principles govern everything from the whisper of a secret message to the roar of the cosmos.

Applications and Interdisciplinary Connections

We have seen that information has units, like bits and nats, just as length has meters and mass has kilograms. At first, this might seem like a convenient accounting trick for computer scientists and communication engineers. But the truth is far more profound and beautiful. The concept of information, quantified in these units, is not an artificial construct; it is a fundamental part of the physical world. It provides a new and powerful lens through which we can understand an astonishing variety of phenomena, from the laws of thermodynamics to the evolution of life itself. Let us now take a tour of these connections, and you will see how the humble bit is mightier than you ever imagined.

Information as a Physical Substance

If information is physical, can we treat it like other physical quantities? Can we speak of its density, or the way it flows from one place to another? The answer is a resounding yes. Imagine information stored on a magnetic hard drive platter. We can quantify this as a certain number of bits per square meter. Why stop there? Physicists can model information as a quantity that fills a volume, giving it a volumetric information density, with units of bits per cubic meter. Once you have a density, you can talk about a flow. Just as electric charge density moving with a certain velocity creates an electric current, a volume containing information that moves creates an information flux—a flow of bits per square meter per second. This is not just a fanciful analogy. This framework is essential in fields like cosmology, where one might track the flow of information in the evolving universe, or in neuroscience, where one could model the propagation of information through neural tissue.

This physical nature of information has its deepest roots in one of the pillars of classical physics: thermodynamics. You learned in school that entropy is a measure of "disorder." A better, more precise definition is that entropy is a measure of missing information. The Gibbs or von Neumann entropy formula, S = -k_B \sum_i p_i \ln(p_i), is, apart from the Boltzmann constant k_B, identical to the Shannon information formula. They are the same concept.

Consider the famous Helmholtz free energy, F = E - TS, which tells us the maximum amount of work a system can do at a constant temperature T. What is the meaning of the TS term? It is the amount of the system's total internal energy E that is "locked up" and unavailable for work because of our ignorance about the system's exact microscopic state. It is, in a very real sense, the energy cost of missing information. For any given system, like a crystal made of atoms that can exist in several quantum energy levels, we can explicitly calculate this "informational energy" by first finding the probability of each microstate and then applying the entropy formula. This stunning connection, discovered in the mid-20th century, revealed that the mysterious entropy of the 1800s was really about the bits of information needed to specify a system's state.
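To make the TS term concrete, here is a sketch for the simplest such system: a single two-level atom at room temperature, with an energy gap we chose arbitrarily for illustration:

```python
import math

k_B = 1.380649e-23   # Boltzmann constant, J/K
T = 300.0            # temperature in kelvin (assumed)
delta_E = 1e-21      # gap between the two energy levels, in joules (assumed)

# Boltzmann probabilities of the ground and excited states.
z = 1.0 + math.exp(-delta_E / (k_B * T))            # partition function
p = [1.0 / z, math.exp(-delta_E / (k_B * T)) / z]

# Gibbs entropy S = -k_B * sum_i p_i ln(p_i); T*S is the part of the internal
# energy made unavailable for work by our missing information about the state.
S = -k_B * sum(pi * math.log(pi) for pi in p)
print(0 < S / k_B < math.log(2))  # → True: below the ln(2) maximum for two states
```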

The Dynamics of Information: Creation, Loss, and Measurement

So, information can be stored. But it also has a dynamic life. It can be created, and it can be lost. Nowhere is this more dramatic than in the study of chaos. A simple, predictable system like a pendulum swinging (with no friction) is information-preserving; if you know its state now, you know its state forever. But a chaotic system, like the atmosphere, is a veritable factory of information. Its hallmark is an extreme sensitivity to initial conditions—the famous "butterfly effect."

This sensitivity is not just a qualitative feature; it can be quantified. The rate at which two initially close trajectories in a chaotic system diverge is measured by the largest positive Lyapunov exponent, \lambda_1. According to a profound result known as Pesin's Identity, this exponent is precisely equal to the rate at which the system creates information, known as the Kolmogorov-Sinai entropy. If the Lyapunov exponent is measured using the natural logarithm, this information rate is in nats per second. If we want to know the time it takes for our initial knowledge of the system to be eroded by one bit, we can calculate it directly from the dynamics. For a system like the classic Lorenz weather model, this "information horizon" is simply \tau = \ln(2) / \lambda_1. The reason we cannot predict the weather weeks in advance is not just a failure of our computers; it is a fundamental property of the atmosphere itself, which is actively generating new information (and thus, unpredictability) at a measurable rate of bits per hour.
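Putting in numbers: the largest Lyapunov exponent of the Lorenz system with its classic parameters is commonly quoted as λ₁ ≈ 0.906 nats per unit of model time (an assumed literature value), which gives:

```python
import math

lambda_1 = 0.906  # largest Lyapunov exponent of the classic Lorenz system
                  # (commonly quoted value, in nats per unit of model time)

# Information horizon: time for the system to erase one bit of our knowledge.
tau = math.log(2) / lambda_1
print(round(tau, 2))  # → 0.77 time units per bit lost

# Equivalently, the rate at which the system generates new information:
print(round(lambda_1 / math.log(2), 2))  # → 1.31 bits per unit time
```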

If chaotic systems destroy information about the past, how do we gain information about the present? Through measurement. But what is a measurement? From an information-theoretic viewpoint, it is an act of data compression. When a physicist measures the temperature of a gas, their thermometer does not query every single one of the 10^{23} molecules. It observes some macroscopic consequence of their collective motion and produces a single number. This is a lossy compression process. We can ask two crucial questions. First, how much information does our measurement device actually extract from the full system? This is the mutual information between the microstate and the measurement outcome, I(X;Z). Second, how much of that information is actually useful for the macroscopic variable we care about? This is the mutual information between the measurement and the variable of interest, I(Z;Y). This "information bottleneck" perspective is essential for designing efficient experiments and understanding the fundamental limits of what we can know.

These limits are made even more precise by the concept of Fisher information. Imagine you are trying to estimate some hidden parameter of a system by observing its output. Perhaps you are a quantum physicist counting photons from a decaying atom to estimate its excitation rate, or a systems biologist observing how a bacterium responds to a chemical to estimate the nutrient concentration in its environment. In all such cases, each data point you collect—each photon, each twitch of the bacterium—provides some amount of information about the parameter you seek. The Fisher information quantifies the maximum possible amount of information per observation. It sets the ultimate physical limit, known as the Cramér-Rao bound, on the precision of any estimation. It is a universal currency that converts raw data into knowledge, and it shows that the same fundamental laws govern the process of learning, whether for an atom or a bacterium.
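A toy Monte Carlo experiment shows the Cramér-Rao bound at work. Suppose the photons arrive as Poisson counts with unknown rate λ; the Fisher information per observation is 1/λ, so no unbiased estimator built from n counts can have variance below λ/n. The sample mean happens to attain that bound (all numbers below are assumed for illustration):

```python
import math
import random

def poisson_sample(lam, rng):
    """Draw one Poisson(lam) count via Knuth's multiplication algorithm."""
    threshold, k, prod = math.exp(-lam), 0, 1.0
    while True:
        prod *= rng.random()
        if prod <= threshold:
            return k
        k += 1

rng = random.Random(1)
lam, n, trials = 4.0, 100, 2000
cramer_rao_bound = lam / n  # = 1 / (n * Fisher information per sample)

estimates = []
for _ in range(trials):
    counts = [poisson_sample(lam, rng) for _ in range(n)]
    estimates.append(sum(counts) / n)  # sample mean as the rate estimate

mean_est = sum(estimates) / trials
var_est = sum((e - mean_est) ** 2 for e in estimates) / trials
print(abs(var_est - cramer_rao_bound) < 0.01)  # → True: variance sits at the bound
```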

Information as the Language of Life

If information theory provides a powerful language for physics, it is the native tongue of biology. Life, after all, is the processing of information.

This starts at the most basic level: our DNA. When biologists compare the genetic sequence of a human protein to that of a fruit fly to infer their evolutionary relationship, they rely on scoring systems like the PAM or BLOSUM matrices. The numbers in these matrices are not arbitrary. They are carefully calculated log-odds scores, quantifying the logarithm of the probability that a particular amino acid substitution occurred through evolution versus by random chance. The scaling factor used to create these scores is chosen specifically to express this information in convenient units, such as bits or half-bits. So when you see an alignment score in bioinformatics, you are literally looking at a measure of evidence, in bits, for a shared evolutionary history.

This logic extends from molecules to societies. Consider a social animal that has found a rare and valuable food source. It can emit a costly call to attract its kin. Is it worth it? Evolution, as a relentless accountant, weighs the metabolic cost of the call against its fitness benefit. Part of that benefit is, of course, the energy from the food itself. But there is a more subtle component: the value of the information. The evolutionary advantage of a signal is related to the amount of uncertainty it resolves for the group. In information theory, this is called "surprisal," given by -\log(p). A call that signals a very rare event (low probability p) provides many bits of information, and thus can be evolutionarily justified even if it is very costly.

Finally, let us take the grandest view of all. The entire history of life on Earth can be understood as a series of "major evolutionary transitions," and these transitions are, at their core, revolutions in information management. The emergence of chromosomes, which bundled genes together; the origin of eukaryotic cells, which compartmentalized information processing; the invention of multicellularity, where cells subordinate themselves to a common genetic blueprint; the evolution of societies with complex communication. Each of these steps represents the emergence of a new "Darwinian individual." What defines such an individual? It is an entity that can replicate and pass on its heritable information with high fidelity. A major transition occurs precisely when a new system for packaging, transmitting, and safeguarding information arises, allowing selection to act on a new, higher level of organization. For a multicellular organism like a human to exist, for example, the trillions of component cells had to subordinate their own reproductive potential. This is achieved through a new information protocol: the segregation of a germline and passing all information needed to build the next generation through a single-cell bottleneck—the zygote.

From the thermodynamics of a steam engine to the evolution of consciousness, the concept of information and its fundamental units provide an astonishingly unified framework. It is a language that connects the inanimate world of physics to the vibrant, evolving world of biology, revealing that the universe, in some deep sense, runs on bits.