Quantity Disagreement

SciencePedia
Key Takeaways
  • Total disagreement between two categorical maps can be perfectly partitioned into two distinct components: quantity disagreement (mismatch in category totals) and allocation disagreement (mismatch in spatial location).
  • This error decomposition provides more actionable insights for model improvement than single metrics like Overall Agreement or Cohen's Kappa, which can be misleading.
  • Quantity disagreement reveals errors in the predicted amount of each category, while allocation disagreement highlights errors in the predicted spatial pattern of those categories.
  • The framework is a powerful diagnostic tool applicable across diverse fields, including land use modeling, remote sensing, and machine learning, to diagnose issues like overfitting or model bias.

Introduction

When comparing two maps, such as a model's prediction against reality, the immediate question is, "How well do they agree?" The conventional answer often comes from a single metric like Overall Agreement, which simply counts the percentage of matching pixels. However, this simplicity masks a critical reality: not all disagreements are created equal. Lumping all errors into one number prevents us from understanding why the maps differ, hindering our ability to make targeted improvements to our models and analyses. This article addresses this knowledge gap by introducing a more powerful framework that separates total disagreement into two fundamental and interpretable parts: Quantity Disagreement and Allocation Disagreement.

This article will first delve into the foundational principles and mathematical mechanisms behind this decomposition in the "Principles and Mechanisms" chapter. We will explore how to calculate these components from a simple confusion matrix and see why they offer a more complete picture of error than traditional metrics. Subsequently, the "Applications and Interdisciplinary Connections" chapter will demonstrate how this lens provides actionable insights across fields like urban planning, environmental science, and even machine learning, transforming how we diagnose and improve our models of the world.

Principles and Mechanisms

When we compare two maps of the world, whether they are satellite images of land cover taken a decade apart or a new climate model's output versus reality, our first instinct is to ask a simple question: "How well do they agree?" The most straightforward answer is to lay one map on top of the other and count up all the places where the labels match. If we're looking at a grid of pixels, we sum up all the pixels that are correctly classified and divide by the total number of pixels. This gives us the Overall Agreement, a single percentage that seems to tell the whole story. Its complement, the proportion of pixels that don't match, is the Total Disagreement.

For a long time, this was the standard way of thinking. But as with so many things in science, the simple, obvious answer often hides a much richer and more beautiful reality. Is all disagreement created equal? Let’s play a game.

Beyond a Simple Match: Peeling the Onion of Disagreement

Imagine we have two geographers, Alice and Bob, who have each created a land cover map of a fictional, perfectly square island. The island consists of only two types of land: Forest and Desert.

In the first round, Alice’s map shows that 50% of the island is Forest and 50% is Desert. Bob’s map, however, claims that 60% is Forest and 40% is Desert. Before we even look at where they’ve placed their forests and deserts, we know something fundamental: their maps must disagree. At the very least, 10% of the island that Bob calls Forest, Alice must call something else (Desert). And 10% of the island that Alice calls Desert, Bob must call Forest. This mismatch is unavoidable, baked into the very totals of their categories. It is a disagreement in quantity.
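This minimum, quantity-driven disagreement can be computed directly from the category proportions, before looking at a single pixel location. A minimal Python sketch (the function name is our own, not from any library):

```python
# Quantity disagreement implied by two maps' category proportions alone:
# Q = 1/2 * sum(|p_i - q_i|) over all categories.

def quantity_disagreement(props_a, props_b):
    """Half the sum of absolute differences between category proportions."""
    return 0.5 * sum(abs(a - b) for a, b in zip(props_a, props_b))

# Round one: Alice (50% Forest, 50% Desert) vs Bob (60% Forest, 40% Desert).
q = quantity_disagreement([0.5, 0.5], [0.6, 0.4])
print(round(q, 3))  # 0.1 -- at least 10% of the island must disagree
```

Whatever Alice and Bob draw, no arrangement of pixels can push their disagreement below this floor.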

Now for the second round. Alice and Bob go back and revise their maps. This time, they both agree perfectly on the totals: 50% Forest and 50% Desert. A triumph for agreement? Not so fast. When we look at their maps, we see that Alice has painted the entire northern half of the island as Forest and the southern half as Desert. Bob has done the exact opposite. Every single pixel on their maps disagrees! Yet, their quantities for each category are identical. This is a purely spatial disagreement. The pixels are simply in the wrong place. This is a disagreement in allocation.

These two simple scenarios reveal a profound truth: the single number for Total Disagreement is an onion with at least two layers. To truly understand why two maps differ, we need to peel them apart and look at the error due to mismatched quantities and the error due to mismatched locations.

A Tale of Two Errors: Quantity versus Allocation

This intuitive idea can be made precise, and this is where the real beauty lies. When we compare two maps—let's call one the "reference" map and the other the "comparison" map—we summarize their relationship in a confusion matrix (or contingency table). It's a simple grid that tells us, for example, how many pixels that were Forest in the reference map were classified as Urban in the comparison map.

Let's use a real example. Imagine we are comparing two land cover maps with three categories. We tally up the pixels and get a confusion matrix like this one, where rows are the comparison map and columns are the reference map:

  410    50    40
   60   320    30
   20    35   265

The numbers on the main diagonal (410, 320, 265) represent agreement—pixels that were classified the same way on both maps. Everything off the diagonal represents disagreement. The total number of pixels is N = 1230. The total agreement is (410 + 320 + 265)/1230 = 995/1230. Therefore, the total disagreement is 1 − 995/1230 = 235/1230. Our goal is to split this total disagreement of 235 pixels into its quantity and allocation components.

Quantity Disagreement (Q) is the error that arises purely from the mismatch in the total number of pixels assigned to each category. We find these totals by summing the rows (for the comparison map) and the columns (for the reference map).

  • Comparison Map Totals (row sums): (500, 410, 320)
  • Reference Map Totals (column sums): (490, 405, 335)

For the first category, the comparison map has 500 pixels, but the reference map only has 490. There's a surplus of 10. For the second, the comparison map has 410 while the reference has 405, a surplus of 5. For the third, the comparison has 320 while the reference has 335, a deficit of 15. Notice that the total surplus (10 + 5 = 15) perfectly matches the total deficit (15). This has to be true, as the total number of pixels is the same.

The total number of pixels that must be in disagreement due to these imbalances is half the sum of the absolute differences:

Quantity Disagreement (pixels) = (1/2)(|500 − 490| + |410 − 405| + |320 − 335|) = (1/2)(10 + 5 + 15) = 15

The factor of 1/2 is crucial. Each mismatched pixel contributes to a deficit in one category and a surplus in another; summing the absolute differences without the 1/2 would count every single quantity error twice. As a proportion of the total, the Quantity Disagreement is Q = 15/1230.

Allocation Disagreement (A) is the rest of the error. It's the disagreement that happens because pixels are in the wrong place, even after we've accounted for the inevitable disagreement from quantity imbalances. We can think of it this way: for the first category, the comparison map has 500 pixels and the reference has 490. The maximum number of pixels that could possibly agree for this category is therefore min(500, 490) = 490. If we do this for all categories, the maximum possible agreement across the whole map, given the quantities, is min(500, 490) + min(410, 405) + min(320, 335) = 490 + 405 + 320 = 1215 pixels.

However, the actual number of pixels that agree is only 995. The shortfall, 1215 − 995 = 220 pixels, represents the pixels that could have agreed based on the numbers, but didn't because they were spatially misplaced. This is the allocation disagreement. As a proportion, the Allocation Disagreement is A = 220/1230.
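All of this bookkeeping fits in a few lines of code. A self-contained Python sketch for the worked 3×3 matrix (rows = comparison map, columns = reference map; the function name is our own):

```python
# Decompose total disagreement of a square confusion matrix into
# quantity and allocation components, expressed as proportions.

def decompose_disagreement(matrix):
    n = sum(sum(row) for row in matrix)
    k = len(matrix)
    row_totals = [sum(row) for row in matrix]                             # comparison map
    col_totals = [sum(matrix[i][j] for i in range(k)) for j in range(k)]  # reference map
    agreement = sum(matrix[i][i] for i in range(k))                       # diagonal

    total = 1 - agreement / n
    quantity = 0.5 * sum(abs(r - c) for r, c in zip(row_totals, col_totals)) / n
    # Maximum agreement the quantities permit, minus the actual agreement:
    allocation = (sum(min(r, c) for r, c in zip(row_totals, col_totals)) - agreement) / n
    return total, quantity, allocation

m = [[410, 50, 40],
     [60, 320, 30],
     [20, 35, 265]]
total, q, a = decompose_disagreement(m)
print(round(q * 1230), round(a * 1230), round(total * 1230))  # 15 220 235 pixels
assert abs(total - (q + a)) < 1e-12  # the partition is exact
```

The final assertion is the punchline: the two components always sum exactly to the total disagreement.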

The Perfect Partition: A Deeper Look at the Numbers

Now for the magic. Let's add our two components of disagreement:

Q + A = 15/1230 + 220/1230 = 235/1230

This is exactly equal to the Total Disagreement we calculated at the beginning! This is not a coincidence. It is a mathematical certainty that for any confusion matrix, the Total Disagreement is perfectly and completely partitioned into Quantity Disagreement and Allocation Disagreement. There are no gaps and no overlaps.

Total Disagreement = Q + A

This simple, elegant equation provides a far more powerful lens for understanding error than a single, monolithic number. It gives us a complete accounting of the nature of the disagreement.

From Numbers to Knowledge: What the Errors Tell Us

This decomposition isn't just a neat mathematical trick; it's a powerful diagnostic tool. Imagine you are a scientist modeling land use change in a coastal watershed. You compare your model's prediction for the year 2020 against a satellite-derived reference map for the same year. You find a total disagreement of 24%. Is the model good or bad? Where do you even begin to improve it?

By calculating Q and A, you get a much clearer picture. Let's say you find that Q = 0.10 and A = 0.14. This tells you that of the total 24% disagreement, a larger portion (14%) is due to allocation error than to quantity error (10%). In practical terms, this means the more significant problem with your model is not how much land it thinks has changed from, say, Forest to Urban, but where it is placing that new urban development. The model is creating spatial swaps: it might correctly predict a loss of Forest in the west and a gain of Urban in the east, but it misplaces them, putting the Urban in the west and leaving Forest in the east.

This insight provides a clear path for improvement. Since allocation error is dominant, you should prioritize improving the model's spatial features—perhaps using higher-resolution elevation data or incorporating road networks to better constrain where development can occur. The quantity error, while smaller, is still significant, indicating a secondary need to calibrate your model's overall tendency to, for example, overpredict Forest and underpredict Agriculture. By separating the errors, you can devise targeted, efficient strategies for making your model better.

The Illusion of Agreement: Why Old Metrics Can Fail Us

For decades, a popular metric for assessing agreement has been Cohen's Kappa (κ). Kappa was designed to improve upon Overall Agreement by attempting to correct for the agreement that would happen just by random chance. A high Kappa was thought to signify true, non-random agreement.

However, Kappa, like Overall Agreement, is a single number that bundles all sources of error together, and this can be dangerously misleading. Consider two hypothetical scenarios where we compare a classified map to a reference map:

  • Scenario X: The maps have perfectly balanced quantities for all classes. The Overall Agreement is a high 0.85, and Kappa is a "substantial" 0.775.
  • Scenario Y: The maps have a significant mismatch in quantities for two of the three classes. Yet, the Overall Agreement is also 0.85, and the Kappa is an almost identical 0.766.

Through the lens of Overall Agreement and Kappa, these two scenarios are virtually indistinguishable. An analyst would conclude that the maps have the same high level of agreement. But when we apply our new tools, a dramatically different story emerges.

  • In Scenario X, because the quantities are perfectly matched, the Quantity Disagreement Q is exactly 0. All 15% of the total disagreement is due to Allocation Disagreement (A = 0.15). The error is purely spatial swapping.
  • In Scenario Y, the quantity mismatch results in a Quantity Disagreement of Q = 0.10. This means that two-thirds of the total disagreement is from quantity, and only one-third is from allocation (A = 0.05).

Kappa was blind to this fundamental difference. It packaged two completely different error profiles into the same numerical score. This is like a doctor telling two patients they have the same "fever score" when one has a bacterial infection and the other has a broken leg. The single number masks the underlying cause and gives no hint as to the proper treatment. The Q/A decomposition, by contrast, reveals the true nature of the error, providing the deeper diagnosis that Kappa cannot.
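For concreteness, here is the standard Kappa calculation applied to the worked 3×3 matrix from earlier. Note that everything it returns is one opaque number, which is exactly the limitation discussed above (the function name is our own):

```python
# Cohen's kappa: (observed agreement - chance agreement) / (1 - chance agreement),
# where chance agreement comes from the marginal (row and column) totals.

def cohens_kappa(matrix):
    n = sum(sum(row) for row in matrix)
    k = len(matrix)
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(matrix[i][j] for i in range(k)) for j in range(k)]
    p_observed = sum(matrix[i][i] for i in range(k)) / n
    p_expected = sum(r * c for r, c in zip(row_totals, col_totals)) / n**2
    return (p_observed - p_expected) / (1 - p_expected)

m = [[410, 50, 40],
     [60, 320, 30],
     [20, 35, 265]]
print(round(cohens_kappa(m), 3))  # about 0.709
```

Nothing in that single score hints at whether the 235 mismatched pixels stem from quantity or from allocation.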

The Eye of the Beholder: How Our Categories Shape the Truth

The Q/A framework reveals one final, subtle truth: our definition of "agreement" depends entirely on our choice of categories. What happens if we decide that, for our purposes, "Shrubland" and "Grassland" are functionally similar, and we merge them into a single "Non-woody" class?

When we aggregate classes, something interesting happens. Any pixel that was previously considered an error because it was called Shrubland on one map and Grassland on the other is now considered an agreement—both fall into the new "Non-woody" category. Consequently, the Overall Agreement always goes up (or stays the same), and the Total Disagreement goes down.

Our framework allows us to see exactly where this "disappearing" disagreement went. When classes are merged, the quantity disagreement often changes very little. The major change is a reduction in allocation disagreement. The confusion between the now-merged classes was a form of spatial swapping—an allocation error. By changing our definitions, we have simply chosen to no longer see it as an error.
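This effect is easy to verify numerically. The sketch below merges categories 2 and 3 of the worked 3×3 matrix (as if they were, say, Shrubland and Grassland becoming one "Non-woody" class) and recomputes both components in pixels; the helper is written from scratch for illustration:

```python
# Pixel counts of quantity (Q) and allocation (A) disagreement for a
# square confusion matrix (rows = comparison map, columns = reference map).

def decompose(matrix):
    k = len(matrix)
    rows = [sum(r) for r in matrix]
    cols = [sum(matrix[i][j] for i in range(k)) for j in range(k)]
    diag = sum(matrix[i][i] for i in range(k))
    q = 0.5 * sum(abs(r - c) for r, c in zip(rows, cols))
    a = sum(min(r, c) for r, c in zip(rows, cols)) - diag
    return q, a

m3 = [[410, 50, 40], [60, 320, 30], [20, 35, 265]]
# Merge classes 2 and 3: their rows and columns are added together.
m2 = [[410, 50 + 40],
      [60 + 20, 320 + 30 + 35 + 265]]

print(decompose(m3))  # Q = 15, A = 220 pixels before merging
print(decompose(m2))  # Q = 10, A = 160 pixels after merging
```

Merging removes 60 pixels of allocation disagreement but only 5 of quantity disagreement: the confusion between the merged classes was overwhelmingly spatial swapping.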

This is not a flaw, but a feature. It shows us that the allocation component of disagreement is intimately tied to the thematic detail of our classification scheme. It reminds us that there is no single, objective "truth" about map agreement; there is only agreement as defined by the categories we choose to see. By providing a clear and complete accounting of how and why maps differ, the decomposition into quantity and allocation disagreement offers a more honest, insightful, and ultimately more useful way to understand our world.

Applications and Interdisciplinary Connections

In our previous discussion, we dissected the nature of disagreement, separating it into two fundamental components: a mismatch in the quantity of categories and a mismatch in their spatial allocation. On the surface, this might seem like a niche accounting trick for statisticians. But the real magic of a powerful idea is not in its complexity, but in its ability to clarify, to connect, and to reveal hidden truths across a surprising range of endeavors. Moving beyond a simple score of "percent correct" to asking how we are incorrect—in amount or in location—is like a physician moving from simply taking a patient's temperature to using a stethoscope to listen to the heart and lungs. Both tell you if something is wrong, but only the latter begins to tell you why.

Let's embark on a journey to see how this simple concept of quantity disagreement provides a sharper lens through which to view our world, from the sprawling growth of cities to the invisible logic of artificial intelligence.

The Geographer's Dilemma: Modeling a Changing World

Imagine you are a city planner, tasked with managing the inevitable growth of a metropolis. You have two different computer models, each attempting to predict which parcels of land will be developed in the next decade. After ten years, you compare the predictions to what actually happened. You find that both Model A and Model B achieved an overall accuracy of, say, 0.90. They both correctly predicted the fate of 90% of the land in the region. Are they equally good?

A traditional assessment might stop there. But we can do better. Let’s look closer. Model A, it turns out, correctly predicted the total amount of new development—the total number of newly urbanized acres—almost perfectly. Its quantity disagreement was near zero. However, it placed that development in all the wrong locations. It scattered new suburbs randomly across pristine forests, while in reality, the growth was concentrated along a new transit corridor. Its allocation disagreement was enormous.

Model B, in contrast, got the total amount of new growth wrong; it predicted far too much development. Its quantity disagreement was high. Yet, it correctly identified the transit corridor as the hotbed of activity. Its pattern of prediction was much better; its allocation disagreement was low.

Suddenly, the two models don't look equally good at all. They are failing for completely different reasons. Model A has a good "economic" component, correctly gauging the demand for new housing, but its "spatial suitability" rules are nonsense. Model B has a better grasp of the spatial logic of urban growth (suitability) but has a flawed understanding of economic demand. The decomposition of error into quantity and allocation disagreement gives the modeler a specific, actionable diagnostic. It tells them which part of their model's engine needs fixing. This isn't just about grading the model; it's about improving it.

This principle extends to nearly every corner of environmental science and geography. Whether we are modeling the spread of a forest fire, the retreat of a glacier, the conversion of rainforest to agricultural land, or the erosion of a coastline, we must always ask the two fundamental questions: Is our model getting the rate of change right? And is it getting the location of change right? Quantity and allocation disagreement provide the precise language to answer this.

A Deeper Look: The Anatomy of Error

This framework does more than just evaluate a final map; it provides a powerful tool for diagnosing a model's health throughout its development. Consider a common pitfall in all of modeling and machine learning: overfitting. This happens when a model, instead of learning the general rules of a process, simply "memorizes" the specific data it was trained on.

Let's say we build a land-use model using data from the 1980s. We tweak and tune it until its performance on the 1980s data is nearly perfect. Its quantity disagreement is tiny, and its allocation disagreement is tiny. We are very proud. Then, we use this "perfect" model to predict changes in the 1990s and compare it to the real 1990s map. The performance collapses. The overall accuracy plummets. But why?

By looking at the disagreement components, we might find that the quantity disagreement for the 1990s is now huge, and so is the allocation disagreement. This tells us something profound. Our model didn't learn the general principles of land change; it specifically memorized the rate of change from the 1980s (leading to high quantity disagreement in the 90s) and the unique spatial patterns of the 1980s (leading to high allocation disagreement in the 90s). The decomposition of error acts as a clear signal of overfitting, revealing exactly what aspects of the model failed to generalize.

This separation of error isn't just a convenient trick; it seems to be a fundamental property of comparing categorical patterns. In the world of remote sensing, scientists often use a metric called the "Figure of Merit" (FoM), which is identical to the Jaccard Index from statistics. It measures the accuracy of predicting change by dividing the correctly predicted changes (the intersection of prediction and reality) by the total area of predicted or observed change (their union). It turns out that the total error, 1 − FoM, can be mathematically decomposed perfectly into a term representing quantity disagreement and a term representing allocation disagreement. This shows that these two error types are not just ad-hoc inventions; they are the natural, built-in components of disagreement when we compare patterns.
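One common way to write this decomposition uses hits (H, correctly predicted change), misses (M, observed change that was not predicted), and false alarms (F, predicted change that did not occur). The mismatch in the amount of change is |M − F|, and the remaining paired miss/false-alarm swaps are allocation error. The notation and toy numbers below are our own, chosen only for illustration:

```python
# FoM = H / (H + M + F), and 1 - FoM splits into a quantity term and an
# allocation term, since M + F = |M - F| + 2 * min(M, F).

def fom_decomposition(h, m, f):
    union = h + m + f                    # predicted-or-observed change
    fom = h / union
    quantity = abs(m - f) / union        # mismatch in the amount of change
    allocation = 2 * min(m, f) / union   # paired miss/false-alarm swaps
    return fom, quantity, allocation

fom, q, a = fom_decomposition(h=60, m=25, f=15)
print(round(fom, 2), round(q, 2), round(a, 2))  # 0.6 0.1 0.3
```

Here 1 − FoM = 0.4 splits exactly into 0.1 of quantity error and 0.3 of allocation error.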

A Philosophical Debate Between Models

The power of this framework becomes even more apparent when we use it to compare not just different parameterizations of one model, but entirely different types of models, each representing a different philosophy of how the world works.

Imagine a debate among three scientists trying to model the evolution of a landscape.

  • The first is a proponent of Agent-Based Models (ABM). She argues that large-scale patterns emerge from the bottom-up decisions of countless individual "agents" (people, households, companies) interacting locally.
  • The second champions Cellular Automata (CA). He believes that change is governed by simple, fixed rules that a cell follows based on the state of its immediate neighbors.
  • The third is an econometrician who uses a CLUE-S type model. She insists that change is a top-down process, driven by aggregate economic demands that are allocated across the landscape based on a map of land suitability.

How can we possibly stage a fair comparison between such different worldviews? Quantity and allocation disagreement provide a common language. We can run all three models and evaluate them. We might find that the ABM produces incredibly realistic, clustered patterns of growth (low allocation disagreement) but struggles to match the overall quantity of change (moderate quantity disagreement). The CLUE-S model, by design, might perfectly match the overall quantity of change (zero quantity disagreement) but spread it across the landscape in an unnatural, dispersed way (high allocation disagreement). The CA model might fall somewhere in between.

The result is not a single winner, but a richer understanding. The evaluation tells us that the ABM philosophy is good at capturing spatial processes, while the econometric approach is best at capturing total demand. Perhaps the future lies in a hybrid model that combines the strengths of both. The QD/AD framework allows us to judge these different scientific paradigms on their own terms and see where each one shines or falters.

Echoes in Other Fields: The Universal Challenge of Imbalance

The fundamental problem that quantity disagreement helps to solve—the danger of being misled by an overall score when the underlying components are imbalanced—is not unique to geography. It echoes throughout science and technology.

Consider the field of machine learning and artificial intelligence. An AI is being trained to diagnose a rare disease from medical images. The dataset contains 990 healthy patients and 10 sick patients. A lazy AI that simply learns to always say "healthy" will achieve 99% accuracy! It is nearly always right, but it is completely useless, as its sole purpose is to find the 1% of patients who need help.

Machine learning experts have their own language for this problem. They compare metrics like the micro-average F1 score and the macro-average F1 score. The micro-F1 score behaves just like overall accuracy; it gives a high score to our useless "always healthy" classifier because it gives equal weight to every patient. The macro-F1 score, however, calculates the metric for each class ("healthy" and "sick") separately and then takes a simple average. In this case, the macro-F1 would be terrible, because the performance on the "sick" class is zero. It gives equal weight to every class, regardless of how rare it is.
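The "always healthy" classifier can be scored both ways in a few lines. A hand-rolled sketch (note that for a single-label problem like this, micro-average F1 equals overall accuracy):

```python
# 990 healthy + 10 sick patients; the classifier predicts "healthy" for everyone.

y_true = ["healthy"] * 990 + ["sick"] * 10
y_pred = ["healthy"] * 1000

def f1_per_class(y_true, y_pred, cls):
    tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
    fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
    fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0  # convention: F1 is 0 when there are no true positives
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Micro-average F1 for single-label classification reduces to accuracy.
micro = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
# Macro-average F1: unweighted mean of per-class F1 scores.
macro = (f1_per_class(y_true, y_pred, "healthy")
         + f1_per_class(y_true, y_pred, "sick")) / 2

print(round(micro, 3), round(macro, 3))  # 0.99 vs 0.497
```

The micro score flatters the useless classifier; the macro score exposes it, because the rare "sick" class gets an equal vote.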

This tension between micro and macro averaging is a perfect analogy for the insights provided by quantity disagreement. An overall accuracy score (like the micro-average) looks at the whole pile of pixels or instances. A decomposition into quantity and allocation disagreement (conceptually similar to a macro-average approach) forces us to look at the performance on each category, giving a voice to the rare but often most important ones.

A More Honest Way of Looking

From city planning to climate science to artificial intelligence, the same story unfolds. A single number, while simple and seductive, often hides more than it reveals. The concept of decomposing error, with quantity disagreement as a key component, is more than just a statistical tool. It is a philosophy. It encourages a more honest and critical engagement with our models and our data. It pushes us to move beyond asking "Is our prediction right?" and to ask the more insightful questions: "In what ways is it wrong? Is it wrong in amount or in pattern? And what does that tell us about the process we are trying to understand?" In the pursuit of those answers, true scientific discovery begins.