
Expected Counts

Key Takeaways
  • The concept of expected counts, often calculated using the principle of linearity of expectation, provides the theoretical baseline for a null hypothesis.
  • The chi-square goodness-of-fit test is a statistical tool that quantifies the discrepancy between observed reality and expected counts to evaluate a hypothesis.
  • The validity of the chi-square test is an approximation based on the Central Limit Theorem and requires sufficiently large expected counts to be reliable.
  • Comparing observed to expected counts is a versatile scientific method used across diverse fields like population genetics, fraud detection, and climate model validation.

Introduction

In science, we constantly build models to describe the world, but how do we know if these models are correct? A casual observation might differ from a theoretical prediction, but is that difference meaningful, or is it merely due to random chance? This gap between a theoretical story and observed reality requires a rigorous method of comparison. This article delves into the foundational concept of expected counts, the quantitative predictions made by a specific hypothesis. We will explore how this simple idea serves as the bedrock of statistical testing. The first section, "Principles and Mechanisms," will unpack the mathematical tools used to calculate expected counts, such as linearity of expectation, and introduce the powerful chi-square test used to measure the gap between expectation and observation. The second section, "Applications and Interdisciplinary Connections," will demonstrate the remarkable versatility of this approach, showcasing its use as a universal detective tool in fields as diverse as population genetics, fraud auditing, and climate science.

Principles and Mechanisms

The Magic of Linearity: Deconstructing Complexity

What does it mean when we say we "expect" something to happen? In everyday life, it's a fuzzy guess. In science and mathematics, it's a concept of stunning precision and power. The expected value isn't the single outcome we are guaranteed to see; rather, it's the long-term average if we could repeat an experiment over and over again. It's the center of gravity for a world of possibilities.

Imagine you're scanning a long sequence of random bits, say, a string of a million 0s and 1s, each chosen by a fair coin flip. You decide to look for a specific pattern, '101'. How many times would you expect this pattern to appear? It seems like a tangled mess. What if one '101' overlaps with another, like in '10101'? Does that complicate the counting?

Here, we encounter one of the most elegant and, frankly, magical tools in probability: linearity of expectation. It tells us that the expectation of a sum of random variables is simply the sum of their individual expectations. Crucially, this holds true even if the variables are not independent. The overlapping patterns don't matter for the expectation!

Let's see this magic in action. A '101' pattern can start at position 1, position 2, and so on, up to position $n-2$ in a sequence of length $n$. Let's define an "indicator" for each possible starting position $i$. We'll call it $X_i$. This indicator is a simple creature: it equals 1 if the pattern '101' starts at position $i$, and 0 otherwise. The total number of '101's we find, call it $X$, is just the sum of all these indicators: $X = X_1 + X_2 + \dots + X_{n-2}$.

Linearity of expectation lets us write:

$$\mathbb{E}[X] = \mathbb{E}[X_1] + \mathbb{E}[X_2] + \dots + \mathbb{E}[X_{n-2}]$$

The expectation of an indicator variable is just the probability that it equals 1. For any position $i$, the probability of seeing '101' is the probability of a 1, then a 0, then a 1. Since each bit is independent with probability $\frac{1}{2}$, this is just $\frac{1}{2} \times \frac{1}{2} \times \frac{1}{2} = \frac{1}{8}$.

So, the total expected number of occurrences is simply the sum of this probability over all possible starting positions:

$$\mathbb{E}[X] = \sum_{i=1}^{n-2} \frac{1}{8} = \frac{n-2}{8}$$

Just like that, a seemingly complex problem is solved. This isn't just a trick for coin flips. Imagine you're a bioinformatician studying a strand of DNA. The four bases A, C, G, T might not appear with equal frequency. Perhaps in a certain organism, 'C' and 'G' are common ($0.4$ probability each), while 'A' and 'T' are rare ($0.1$ probability each). What is the expected number of times you'll find the specific codon 'CAT' in a sequence of length $N$? The logic is identical. We calculate the probability of that specific pattern occurring, $P(C) \times P(A) \times P(T) = 0.4 \times 0.1 \times 0.1 = 0.004$, and multiply it by the number of possible starting positions, $N-2$. The expected count is simply $(N-2) \times 0.004$.
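This recipe is mechanical enough to automate. The sketch below (the function and variable names are my own, for illustration) computes the expected count of any pattern by multiplying the per-position probability by the number of starting positions, then sanity-checks the coin-flip case with a small simulation:

```python
import random

def expected_pattern_count(n, pattern, probs):
    """Expected occurrences of `pattern` in a random sequence of length n,
    where each symbol is drawn independently with the given probabilities.
    By linearity of expectation: (number of start positions) x P(pattern)."""
    p = 1.0
    for symbol in pattern:
        p *= probs[symbol]
    return (n - len(pattern) + 1) * p

# Fair coin flips: expected '101' count in a million bits is (n - 2) / 8.
n = 1_000_000
print(expected_pattern_count(n, "101", {"0": 0.5, "1": 0.5}))  # 124999.75

# The DNA example: P(CAT) = 0.4 * 0.1 * 0.1 = 0.004 per start position.
print(expected_pattern_count(10_000, "CAT", {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1}))

# Sanity check by simulation; overlapping occurrences count, as in the derivation.
random.seed(0)
n_sim, trials = 10_000, 100
counts = []
for _ in range(trials):
    bits = "".join(random.choice("01") for _ in range(n_sim))
    counts.append(sum(bits.startswith("101", i) for i in range(n_sim - 2)))
print(sum(counts) / trials, (n_sim - 2) / 8)  # simulated mean vs. theory
```

Note that the simulation counts overlapping matches (as in '10101'), exactly the situation linearity of expectation handles for free.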

This powerful idea of breaking down a large, complicated expectation into a sum of simple, small expectations is a foundational principle. It allows us to calculate the "average" state of a system, which is the first step toward asking a much deeper question: does the world I observe match the world I expect?

Worlds of "What If": Expected Counts and the Null Hypothesis

Science is a game of "what if." What if genes for flower color and seed shape are inherited independently? What if this population is mating randomly? What if this new pesticide has no effect? Each of these questions describes a hypothetical world, a specific model of reality. In statistics, this model is called the null hypothesis ($H_0$). It's a precise, falsifiable statement about the world that generates a set of predictions, or expected counts.

Imagine you are Gregor Mendel's modern successor, studying two genes in a plant. You perform a testcross and get four types of offspring: $AB$, $Ab$, $aB$, and $ab$. Your null hypothesis is Mendel's Law of Independent Assortment, which states the genes are unlinked. This hypothesis makes a crisp prediction: all four offspring types should be produced in equal numbers. If you have 400 offspring in total, your expected counts are straightforward: you expect 100 of each type. Your "world of what if" is a world of a perfect 1:1:1:1 ratio.

Now you can compare the plants you actually counted, your observed counts, to this idealized expectation. If you observed $160, 50, 40, 150$, that seems quite far from the $100, 100, 100, 100$ you expected. You have a numerical basis to start questioning your null hypothesis.

Often, however, the world of "what if" isn't specified by a pre-existing law. We have to build it from the data itself. This brings us to one of the most important ideas in population genetics: the Hardy-Weinberg Equilibrium (HWE). The HWE principle describes a "null" world for evolution: a large population with random mating, no mutation, no migration, and no natural selection. In this world, genotype frequencies are a simple function of allele frequencies. For a gene with two alleles, $A$ (with frequency $p$) and $a$ (with frequency $q$), the expected genotype frequencies are $p^2$ for $AA$, $2pq$ for $Aa$, and $q^2$ for $aa$.

Suppose we sample 400 individuals from a population and observe 132 of genotype $AA$, 210 of $Aa$, and 58 of $aa$. We don't know the true $p$ and $q$ in the population. What are our expected counts? We must first estimate the allele frequencies from our sample. Each $AA$ individual carries two $A$ alleles, and each $Aa$ carries one. So, the frequency of allele $A$ in our sample is:

$$\hat{p} = \frac{2 \times n_{AA} + n_{Aa}}{2 \times N} = \frac{2 \times 132 + 210}{2 \times 400} = 0.5925$$

Now we use this estimate to build our null world. If the population were in HWE with this allele frequency, the expected counts would be:

  • $E_{AA} = N \times \hat{p}^2 = 400 \times (0.5925)^2 \approx 140.4$
  • $E_{Aa} = N \times 2\hat{p}\hat{q} = 400 \times 2 \times 0.5925 \times (1 - 0.5925) \approx 193.2$
  • $E_{aa} = N \times \hat{q}^2 = 400 \times (1 - 0.5925)^2 \approx 66.4$

We now have two sets of numbers: the observed ($132, 210, 58$) and the expected ($140.4, 193.2, 66.4$). They are different, but are they significantly different? Is the discrepancy small enough to be due to random chance in sampling, or is it large enough to suggest that one of the HWE assumptions, like random mating or no selection, is being violated? To answer this, we need a tool to quantify the "distance" between observation and expectation.
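The two-step recipe, estimate the allele frequency, then convert it into expected genotype counts, can be sketched in a few lines of Python (the function name is mine, not from any standard library):

```python
def hwe_expected_counts(n_AA, n_Aa, n_aa):
    """Expected genotype counts under Hardy-Weinberg equilibrium, using the
    allele frequency estimated from the sample itself."""
    N = n_AA + n_Aa + n_aa
    p_hat = (2 * n_AA + n_Aa) / (2 * N)  # each AA carries two A's, each Aa one
    q_hat = 1 - p_hat
    return N * p_hat**2, N * 2 * p_hat * q_hat, N * q_hat**2

# The sample from the text: 132 AA, 210 Aa, 58 aa.
e_AA, e_Aa, e_aa = hwe_expected_counts(132, 210, 58)
print(e_AA, e_Aa, e_aa)  # approximately 140.4, 193.2, 66.4
```

A useful property to check: the three expected counts always sum back to $N$, because $\hat{p}^2 + 2\hat{p}\hat{q} + \hat{q}^2 = (\hat{p} + \hat{q})^2 = 1$.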

The Chi-Square Test: Measuring the Gap Between Reality and Expectation

In the early 20th century, Karl Pearson provided us with an exceptionally elegant tool for this job: Pearson's chi-square ($\chi^2$) goodness-of-fit test. The test gives us a single number that summarizes the total discrepancy between the observed ($O$) and expected ($E$) counts across all categories. The formula is:

$$\chi^2 = \sum \frac{(O - E)^2}{E}$$

Let's appreciate its simple beauty. For each category, we take the difference between what we saw and what we expected, $(O - E)$. We square it, so deviations in either direction (more or less than expected) contribute positively to the total. Then, we divide by the expected count, $E$. This final step is crucial: it puts the squared difference into perspective. A difference of 10 is very surprising if you only expected 2, but completely unremarkable if you expected 10,000. The $\chi^2$ statistic is a sum of these relative squared differences.

Let's apply this to a real scenario. A population of insects is being monitored for resistance to a pesticide. Before its use, we knew the allele frequencies for susceptibility were $p = 0.8$ and $q = 0.2$. The null hypothesis is that the pesticide had no effect and the population is still in HWE with these original frequencies. In a new sample of 200 insects, we observe 135 susceptible homozygotes ($AA$), 50 heterozygotes ($Aa$), and 15 resistant homozygotes ($aa$). Do these numbers challenge our null hypothesis?

First, we calculate the expected counts based on the original frequencies:

  • $E_{AA} = 200 \times p^2 = 200 \times (0.8)^2 = 128$
  • $E_{Aa} = 200 \times 2pq = 200 \times 2 \times 0.8 \times 0.2 = 64$
  • $E_{aa} = 200 \times q^2 = 200 \times (0.2)^2 = 8$

Now, we calculate the $\chi^2$ statistic:

$$\chi^2 = \frac{(135 - 128)^2}{128} + \frac{(50 - 64)^2}{64} + \frac{(15 - 8)^2}{8} = \frac{49}{128} + \frac{196}{64} + \frac{49}{8} \approx 0.38 + 3.06 + 6.13 = 9.57$$

This number, $9.57$, quantifies the mismatch. The larger it is, the worse the fit. But how large is "too large"? This is where the theoretical beauty of the $\chi^2$ statistic comes in. Pearson showed that, if the null hypothesis is true, the distribution of this statistic follows a known mathematical form: the chi-square distribution. By comparing our calculated value to this theoretical distribution, we can determine the probability of getting a discrepancy this large or larger just by random chance.
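Here is the same calculation as a short Python sketch. One subtlety worth making explicit: in this example $p$ and $q$ were specified in advance rather than estimated from the sample, so no parameter is estimated and the degrees of freedom are $3 - 1 = 2$, whose 5% critical value is about 5.99:

```python
def chi_square_stat(observed, expected):
    """Pearson's chi-square: sum of (O - E)^2 / E over all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Pesticide example: pre-treatment frequencies p = 0.8, q = 0.2 are known.
N, p, q = 200, 0.8, 0.2
expected = [N * p**2, N * 2 * p * q, N * q**2]   # 128, 64, 8
observed = [135, 50, 15]
stat = chi_square_stat(observed, expected)
print(stat)  # about 9.57

# Since p and q were NOT estimated from this sample, df = 3 - 1 = 2;
# the 5% critical value is about 5.99, and 9.57 comfortably exceeds it.
```

Note how the rarest category ($E_{aa} = 8$) contributes the bulk of the statistic, even though its absolute deviation is the same as the $AA$ category's.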

This probability depends on the degrees of freedom ($df$), which you can think of as the number of independent categories that are "free to vary." For a goodness-of-fit test, you start with the number of categories, subtract 1 because the total count is fixed, and subtract another 1 for each parameter you had to estimate from the data. In our HWE test, where we estimated the allele frequency, we have 3 genotypes, so $df = 3 - 1 - 1 = 1$. This principle scales beautifully. For a gene with $k$ alleles, there are $\frac{k(k+1)}{2}$ genotypes, and we estimate $k-1$ allele frequencies, leaving $df = \frac{k(k+1)}{2} - 1 - (k-1) = \frac{k(k-1)}{2}$.

Peeking Under the Hood: Why the Chi-Square Test Works (And When It Doesn't)

The fact that the $\chi^2$ statistic follows a predictable distribution is not an accident; it's a deep consequence of the Central Limit Theorem (CLT). The CLT is one of the crown jewels of mathematics, and it states, in essence, that if you add up a large number of independent random variables, their sum will tend to be distributed according to a bell-shaped normal distribution, regardless of the original distribution of the variables.

A count in a category, like the number of aaaaaa individuals, can be thought of as the sum of many little indicators (1 if the individual is aaaaaa, 0 if not). For large samples, the CLT tells us that these counts will be approximately normally distributed. The χ2\chi^2χ2 statistic is, in fact, a sum of squared, standardized, approximately normal variables, and the distribution of such a sum is the chi-square distribution.

But notice the key phrase: "for large samples." The entire foundation of the test is an approximation. And like all approximations, it has its limits. This is the origin of the famous rule of thumb taught in introductory statistics classes: "all expected counts must be at least 5." Why 5? Why not 3, or 10?

The reason lies in the quality of the normal approximation. When an expected count is very small, say $E_{aa} = 0.1$, the observed count $O_{aa}$ can only be 0, 1, 2, and so on. Its distribution is not a smooth bell curve at all; it's a spiky, highly skewed, discrete distribution. The normal approximation is terrible. Consequently, the approximation of our test statistic's distribution by the smooth $\chi^2$ curve also fails dramatically.

What happens when it fails? The true sampling distribution of the $\chi^2$ statistic becomes "lumpier" than the theoretical curve. It develops a heavier tail, meaning that large values of the statistic become more likely than the theory predicts, just by random chance. This leads to an anticonservative test: you will reject the null hypothesis more often than you should. Your nominal 5% error rate might actually be 10% or 20%. You'll cry "Discovery!" when, in fact, you're just looking at statistical noise. The rule of thumb, $E_i \ge 5$, is a pragmatic safety check to ensure we are in a zone where the CLT's magic holds and our test is reliable.
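This inflation is easy to demonstrate by simulation (plain Python, no statistics library; the function and scenario are my own illustration). With comfortable expected counts, the rejection rate under a true null sits near the nominal 5%. With one expected count of only 0.4, a single observation of 2 or more in that rare category already pushes the statistic past the critical value, so the test rejects too often:

```python
import random

def chi_square_stat(observed, expected):
    """Pearson's chi-square statistic."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def rejection_rate(probs, N, critical, trials=10_000, seed=1):
    """Draw `trials` multinomial samples of size N under the null `probs`
    and report how often the chi-square statistic exceeds `critical`."""
    rng = random.Random(seed)
    expected = [N * p for p in probs]
    rejections = 0
    for _ in range(trials):
        counts = [0] * len(probs)
        for _ in range(N):
            r, cum = rng.random(), 0.0
            for i, p in enumerate(probs):
                cum += p
                if r < cum:
                    counts[i] += 1
                    break
            else:
                counts[-1] += 1  # guard against float round-off at cum ~ 1.0
        if chi_square_stat(counts, expected) > critical:
            rejections += 1
    return rejections / trials

# Three fully specified categories -> df = 2, 5% critical value about 5.99.
rate_big = rejection_rate([0.40, 0.40, 0.20], N=100, critical=5.99)
rate_small = rejection_rate([0.49, 0.49, 0.02], N=20, critical=5.99)
print(rate_big)    # near the nominal 0.05: expected counts are 40, 40, 20
print(rate_small)  # inflated: the last category's expected count is only 0.4
```

In the small-count scenario, $(2 - 0.4)^2 / 0.4 = 6.4$ already exceeds 5.99 on its own, which is exactly the "heavier tail" described above.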

For tests with one degree of freedom, like the standard HWE test, a continuity correction can sometimes be applied to improve the approximation. This involves subtracting 0.5 from the absolute difference $|O - E|$ before squaring. It's an attempt to bridge the gap between the discrete nature of the counts and the continuous chi-square curve. While it can help, it's not a panacea and can sometimes over-correct, making the test too conservative.

Beyond Approximations: The Modern Toolkit

So what do we do when our expected counts are stubbornly low, as is common in genetics when dealing with rare alleles or small sample sizes? Suppose we sample just 10 individuals and find 8 $AA$, 2 $Aa$, and 0 $aa$. Our expected count for $aa$ is a minuscule $0.1$. The chi-square test is clearly inappropriate. Do we just give up?

Of course not! This is where modern statistical thinking provides more powerful, exact tools. The problem with the HWE test is the unknown allele frequency, $q$, which we call a "nuisance parameter." An ingenious solution, developed by statisticians like R.A. Fisher, is to construct an exact test. The logic is as follows: we can reframe the question. Given that we observed exactly 2 copies of the 'a' allele in our sample of 20 alleles, what is the probability of them being arranged as two $Aa$ individuals and eight $AA$ individuals (the observed configuration), as opposed to, say, one $aa$ individual and nine $AA$ individuals (the only other possibility)?

By conditioning on the observed number of alleles (the "sufficient statistic" for the nuisance parameter), we can calculate the exact probability of every possible genotype configuration without any approximation whatsoever. This allows us to compute a precise $p$-value.
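For the 10-individual example, the exact conditional distribution can be enumerated directly. The sketch below implements the standard conditional probability of each heterozygote count given the allele counts (the distribution underlying the HWE exact test); the function and argument names are my own:

```python
from math import factorial

def hwe_exact_probs(n_A, n_a, N):
    """Exact conditional distribution of the heterozygote count n_Aa, given
    the allele counts n_A and n_a in a sample of N diploid individuals."""
    probs = {}
    # n_Aa must share the parity of n_a and cannot exceed either allele count.
    for n_Aa in range(n_a % 2, min(n_A, n_a) + 1, 2):
        n_aa = (n_a - n_Aa) // 2
        n_AA = (n_A - n_Aa) // 2
        # (multinomial ways to assign genotypes) * (2 allele orderings per
        # heterozygote), normalized over all arrangements of the 2N alleles.
        probs[n_Aa] = (factorial(N) * 2 ** n_Aa * factorial(n_A) * factorial(n_a)
                       / (factorial(n_AA) * factorial(n_Aa) * factorial(n_aa)
                          * factorial(2 * N)))
    return probs

# The text's tiny sample: 10 individuals, 18 copies of 'A' and 2 of 'a'.
probs = hwe_exact_probs(n_A=18, n_a=2, N=10)
print(probs[2])  # two Aa individuals: 18/19, about 0.947
print(probs[0])  # one aa individual:  1/19, about 0.053
```

The two probabilities sum to 1, as they must: with only 2 copies of 'a' among 20 alleles, those are the only possible configurations, and the observed one (two heterozygotes) is by far the more likely under random pairing.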

When enumerating all possibilities is too computationally intensive, we can turn to Monte Carlo methods. We use a computer to simulate thousands of datasets under the null hypothesis (either from the exact conditional distribution or from the HWE model with our estimated allele frequency). We then calculate our test statistic for each simulated dataset. This gives us an empirical distribution of the statistic under the null, a direct picture of the "world of what if." We can see exactly where our observed statistic falls in this distribution to get a remarkably accurate $p$-value.
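A minimal Monte Carlo version of the HWE test, applied to the earlier 132/210/58 sample, might look like the sketch below. It uses the parametric form, simulating genotypes from the HWE model at the estimated allele frequency; all names are mine:

```python
import random

def chi_square_hwe(n_AA, n_Aa, n_aa):
    """Chi-square statistic against HWE expectations, with the allele
    frequency estimated from the counts themselves."""
    N = n_AA + n_Aa + n_aa
    p = (2 * n_AA + n_Aa) / (2 * N)
    q = 1 - p
    expected = [N * p * p, 2 * N * p * q, N * q * q]
    return sum((o - e) ** 2 / e for o, e in zip((n_AA, n_Aa, n_aa), expected))

def monte_carlo_p_value(n_AA, n_Aa, n_aa, sims=2000, seed=42):
    """Simulate samples from the HWE model at the estimated allele frequency
    and report the fraction of simulated statistics at least as extreme
    as the observed one."""
    rng = random.Random(seed)
    N = n_AA + n_Aa + n_aa
    p = (2 * n_AA + n_Aa) / (2 * N)
    observed_stat = chi_square_hwe(n_AA, n_Aa, n_aa)
    hits = 0
    for _ in range(sims):
        counts = [0, 0, 0]  # indexed by number of A alleles: aa, Aa, AA
        for _ in range(N):
            counts[(rng.random() < p) + (rng.random() < p)] += 1
        if chi_square_hwe(counts[2], counts[1], counts[0]) >= observed_stat:
            hits += 1
    return hits / sims

# The sample from the text: 132 AA, 210 Aa, 58 aa.
print(chi_square_hwe(132, 210, 58))       # about 3.04
print(monte_carlo_p_value(132, 210, 58))  # roughly 0.08 in runs of this size
```

With a statistic of about 3.04 against the df = 1 critical value of 3.84, this sample does not quite reach the conventional 5% threshold, and the simulated $p$-value tells the same story without leaning on the chi-square approximation at all.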

The journey from a simple idea—the expected number of '101's—has taken us through the heart of scientific hypothesis testing. We've seen how expected counts form the bedrock of our predictions, how the chi-square test measures the gap between prediction and reality, and how a deep understanding of the test's theoretical foundations allows us to recognize its limits and deploy more powerful, exact methods when needed. This is the process of science in microcosm: building models, making predictions, checking them against reality, and constantly refining our tools to see the world more clearly.

Applications and Interdisciplinary Connections

Now that we have grappled with the machinery of expected counts and their role in hypothesis testing, you might be tempted to think of this as a somewhat dry, statistical exercise. Nothing could be further from the truth. In fact, what we have developed is one of the most versatile and powerful magnifying glasses in the entire toolkit of science. It is a universal method for asking one of the most fundamental questions: "Does this thing I'm looking at match the story I've been told about it?" The "story" is our null hypothesis, our model of how the world ought to behave if nothing special is going on. The "expected counts" are the concrete predictions of that story. By comparing these expectations to the "observed counts"—the stubborn facts of reality—we can quantify surprise. And science, at its heart, is the business of investigating surprises.

Let's take a walk through the vast landscape of knowledge and see where this simple idea—comparing observation to expectation—has allowed us to make remarkable discoveries.

The Blueprint of Life: Reading the Story in Our Genes

Perhaps the most classic and elegant application of this method is in population genetics. Imagine a large, randomly mating population where no evolutionary forces are at play: no natural selection, no mutation, no migration. What would its genetic makeup look like from one generation to the next? The Hardy-Weinberg principle gives us the answer. It provides a "null model" for genetic inertia, predicting genotype frequencies from allele frequencies with simple rules: $p^2$, $2pq$, and $q^2$. This is our baseline, our "expected" genetic state.

When we go out into the real world and sample a population, we can then ask: does it fit this idealized equilibrium? For instance, when ecologists study a species of wild grass re-colonizing land contaminated with heavy metals, they might hypothesize that strong natural selection is favoring a tolerance gene. To test this, they count the observed genotypes in the field. They then calculate the allele frequencies from their sample and use the Hardy-Weinberg rule to determine the expected number of tolerant, heterozygous, and sensitive plants they should have seen if no selection were occurring. A significant deviation between the observed and expected counts, measured by the $\chi^2$ statistic, becomes a smoking gun for evolution in action. The same logic applies whether we are studying wing shape in lab-grown fruit flies or any other trait in any other species.

But the story gets more subtle. Sometimes, the deviation itself tells a specific tale. Consider two isolated herds of mountain goats that begin to merge. If we naively treat them as a single, randomly mating population and calculate the expected genotype counts, we will find a significant mismatch. Specifically, we'll see fewer heterozygotes than expected. This isn't random error; it's a known phenomenon called the Wahlund effect. Our test hasn't just told us our "single population" model is wrong; the nature of the deviation points directly to the underlying reality of population substructure. The tool is more than a simple "yes" or "no" detector; it's a diagnostic instrument.

We can extend this thinking. Instead of looking at one gene, what about two? Do the alleles for two different genes on the same chromosome get passed down independently, like a coin flip? Or are they "linked," tending to travel together? Our null model is independence: the frequency of a haplotype (say, $AB$) should just be the frequency of allele $A$ times the frequency of allele $B$. We can calculate the expected counts of all four possible haplotypes ($AB$, $Ab$, $aB$, $ab$) under this assumption and compare them to what we actually observe in a population. A significant deviation, a state known as Linkage Disequilibrium (LD), tells us the genes are not independent, perhaps because they are physically close on a chromosome or because a specific combination is favored by selection. This very principle is the cornerstone of efforts to map the genes responsible for human diseases. And the method is so flexible, it can even be adapted to the peculiar genetics of haplodiploid insects like bees and ants, where males are haploid and females are diploid, by simply adjusting how we calculate our expectations.

The Universal Detective: From Fraud to Forensics

The power of comparing observed to expected is by no means confined to biology. It is a universal detective tool. Consider the strange and wonderful Benford's Law. It states that in many naturally occurring sets of numbers—financial transactions, street addresses, physical constants—the first digit is far more likely to be a "1" (about 30% of the time) than a "9" (less than 5% of the time). This law gives us a set of expected first-digit frequencies.

Now, imagine you are an auditor examining a company's expense reports. If the data are genuine, the leading digits should roughly follow Benford's Law. But if someone has fabricated the numbers, they are unlikely to reproduce this subtle, counter-intuitive distribution. They will likely use 5s, 6s, and 7s more often than they should. By counting the observed frequencies of each leading digit (1-9) and comparing them to the expected counts from Benford's Law, an auditor can spot a statistical red flag for fraud. The $\chi^2$ test quantifies just how "surprising" the deviation is.
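A sketch of such an audit in Python (the fabricated digit counts below are invented purely for illustration): Benford's expected count for digit $d$ out of $n$ numbers is $n \log_{10}(1 + 1/d)$, and the resulting statistic is compared against the df = 8 critical value of about 15.51:

```python
import math

def benford_expected_counts(n):
    """Expected first-digit counts for n values under Benford's law:
    P(d) = log10(1 + 1/d) for d = 1..9."""
    return [n * math.log10(1 + 1 / d) for d in range(1, 10)]

def chi_square_stat(observed, expected):
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

n = 1000
expected = benford_expected_counts(n)
print([round(e) for e in expected])  # leading 1s ~301 of 1000, leading 9s only ~46

# Hypothetical fabricated ledger: digit counts invented for this example,
# with middle digits overused the way naive fabricators tend to.
fabricated = [150, 120, 110, 110, 120, 120, 100, 90, 80]
stat = chi_square_stat(fabricated, expected)
print(stat)  # far above the df = 8, 5% critical value of about 15.51
```

Note that the nine Benford probabilities sum to exactly 1 (the product $\frac{2}{1}\cdot\frac{3}{2}\cdots\frac{10}{9}$ telescopes to 10), so the expected counts always sum to $n$.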

This same logic can enter the world of humanities. Is a newly discovered manuscript the lost work of a famous author? One clue lies in stylistic fingerprints. Different authors have characteristic, often unconscious, patterns in their writing, such as the relative frequency of vowels. An analyst can establish the known vowel distribution for an author from their confirmed works—this becomes the basis for the "expected" counts. Then, they count the vowels in the disputed manuscript (the "observed" counts) and perform a goodness-of-fit test. If the observed pattern is wildly different from the author's known style, it casts serious doubt on the attribution.

Validating Our Models of Reality

In the modern era of computation, we build fantastically complex models to understand the world—from the intricate dance of proteins in a cancer cell to the vast, churning dynamics of the global climate. A critical question always looms: is our model any good? Does it capture reality?

Once again, our trusty tool provides the answer. A climate model, for example, can generate predictions for the distribution of daily temperature anomalies over 30 years. These predictions become our "expected counts" for how many days should fall into various temperature bins (e.g., -2 to -1 degrees Celsius, -1 to 0, etc.). We can then compare this to 30 years of actual historical weather data—our "observed counts." The goodness-of-fit test gives us a rigorous way to score the model's performance. A significant deviation tells us the model is missing something important about the climate system and needs to be refined.

The same idea applies at the microscopic scale. Bioinformatics researchers scan entire genomes, which are millions or billions of base pairs long, looking for specific DNA sequences, or "motifs." For example, a bacterium might have a restriction enzyme that cuts the DNA at the sequence "CCWGG" (where W is A or T). To protect itself, the bacterium's own genome might evolve to have fewer of these sites than one would expect by chance. We can build a simple probabilistic model of the genome based on its overall GC content to calculate the expected number of times this motif should appear. If the observed count is dramatically lower, it's strong evidence of selective pressure and provides insight into the molecular arms race between the enzyme and the genome. Similarly, in cancer research, the principle of Hardy-Weinberg equilibrium can be cleverly repurposed. A sample of tumor cells should, in theory, follow HWE if all cells are genetically identical. A significant deviation can signal somatic mosaicism—the emergence and clonal expansion of new mutations within the tumor, a key process in cancer evolution.
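The motif calculation is the same linearity-of-expectation recipe from earlier, with one twist: the degenerate base W matches A or T, so it contributes $P(A) + P(T)$ per position. A sketch, using round genome figures (roughly E. coli-sized, 50% GC) purely for illustration:

```python
def expected_motif_count(L, gc_content, motif):
    """Expected occurrences of a DNA motif in a genome of length L, under a
    simple model where each base is drawn independently with
    P(G) = P(C) = gc/2 and P(A) = P(T) = (1 - gc)/2.
    The IUPAC code W (A or T) contributes P(A) + P(T)."""
    p_gc = gc_content / 2
    p_at = (1 - gc_content) / 2
    base_prob = {"A": p_at, "T": p_at, "G": p_gc, "C": p_gc, "W": 2 * p_at}
    prob = 1.0
    for base in motif:
        prob *= base_prob[base]
    return (L - len(motif) + 1) * prob

# Illustrative round numbers: a ~4.6 Mb genome at 50% GC content.
# P(CCWGG) = 0.25 * 0.25 * 0.5 * 0.25 * 0.25, summed over ~4.6 million starts.
print(expected_motif_count(4_600_000, 0.5, "CCWGG"))  # about 8984
```

If a real genome scan turned up far fewer sites than this null expectation, that gap, measured exactly as in the goodness-of-fit tests above, would be the quantitative evidence for the selective pressure described in the text.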

From the gene to the globe, from fraudulent ledgers to forgotten sonnets, the simple act of comparing what is with what ought to be is a thread of inquiry that unifies disparate fields of human knowledge. The "expected count" is not just a number in a formula; it is the embodiment of a hypothesis, a theory, a story we tell about the world. And the chi-square test is our way of holding that story up to the light of evidence.