
In modern computing, the speed of a processor is fundamentally tied to its memory system, where multi-level cache hierarchies act as the critical bridge between the CPU and slow main memory. Efficiently managing data across these levels—from the small, fast L1 cache to the larger Last-Level Cache (LLC)—is paramount for performance. This gives rise to a fundamental design question: should the data in a lower-level cache also exist in the higher-level cache? This choice between an 'inclusive' and an 'exclusive' policy is not merely a technical detail but a deep architectural trade-off with far-reaching consequences. This article dissects this crucial choice, exploring the delicate balance between capacity, complexity, and speed. The first chapter, "Principles and Mechanisms," will break down the core mechanics of each policy, analyzing their impact on effective capacity, performance, and coherence in multicore systems. Subsequently, "Applications and Interdisciplinary Connections" will reveal how this single design decision ripples outward, influencing everything from software synchronization and hardware prefetching to system security and algorithm design.
In the grand theater of a computer's processor, data plays the leading role, and the caches are its many dressing rooms. An access to main memory is like a long, slow trip back to the trailer, while an access to the cache is a quick change just off-stage. The whole performance of the system hinges on keeping the right data in the right cache at the right time. But this raises a wonderfully simple and profound question: if we have multiple levels of caches—say, a small, lightning-fast Level 1 (L1) cache and a larger, slightly slower Level 2 (L2) cache—how should they organize their contents? Should the smaller cache's contents be a mirror of a small part of the larger one, or should they be completely different?
This is not just a question of tidiness; it's a fundamental choice between two opposing philosophies of data management: inclusion and exclusion. Understanding this choice peels back a crucial layer of computer architecture, revealing a beautiful tapestry of trade-offs between space, speed, and complexity.
Let's imagine you are a carpenter in a workshop. You have a small, tidy workbench right in front of you (the L1 cache) and a large tool chest a few steps away (the L2 cache).
The inclusive philosophy dictates that any tool on your workbench must also have a reserved spot in your tool chest. If you have a hammer on the bench, you have a hammer-shaped cutout in the chest that is also, in a sense, occupied. This is wonderfully organized. To find any tool you own, you only need to look in the big tool chest; if it's not there, you don't have it. The downside? The space in your tool chest taken up by duplicates of the tools already on your workbench is unusable for storing other, different tools. The total number of unique tools you can have ready is limited by the size of the tool chest.
The exclusive philosophy is thriftier with space. Your workbench and your tool chest hold entirely different sets of tools. If you want a screwdriver from the chest, you must make room on your bench by putting, say, your hammer back into the chest. You are constantly swapping. The great advantage is that no space is wasted on duplicates. The total number of unique tools you have at your disposal is the sum of the workbench's capacity and the tool chest's capacity.
Translating this back to caches, an inclusive hierarchy insists that the set of data blocks in L1 (call it B_L1) must be a subset of the blocks in L2 (B_L2), written B_L1 ⊆ B_L2. An exclusive hierarchy, in its ideal form, requires that their contents be disjoint, B_L1 ∩ B_L2 = ∅. This leads to a stark difference in effective capacity. For an inclusive system with an L1 cache of size C_L1 and an L2 of size C_L2, the total unique data it can hold is just C_L2. For an exclusive system, the effective capacity is C_L1 + C_L2.
This might seem like an abstract difference, but its performance impact can be staggering. Consider a program that needs to work with a set of data slightly larger than the L2 cache. Let's say our L1 can hold 512 blocks, our L2 can hold 4096 blocks, and our program's working set is 4300 blocks.
In an inclusive system, the total capacity is 4096 blocks. The 4300-block working set simply does not fit. As the program runs, it will constantly be evicting data that it needs just a moment later. Nearly every access will miss in both L1 and L2, forcing a slow fetch from main memory. The performance is disastrous.
In an exclusive system, the total capacity is 512 + 4096 = 4608 blocks. The 4300-block working set fits comfortably! After an initial warm-up period, every single access will be a hit somewhere within the L1-L2 hierarchy. The difference in speed is not just a few percent; it's the difference between a system that is gracefully executing and one that is perpetually stalled.
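This thrashing cliff is easy to reproduce with a toy model. The sketch below (a simplification: each hierarchy is modeled as a single LRU pool whose size equals its effective capacity, and the program cyclically scans its 4300-block working set) shows the inclusive configuration missing on every access while the exclusive one hits on every pass after the first:

```python
from collections import OrderedDict

def lru_hit_rate(capacity, working_set, passes=5):
    """Hit rate of an LRU cache of `capacity` blocks under a cyclic
    scan over `working_set` distinct blocks, repeated `passes` times."""
    cache = OrderedDict()
    hits = total = 0
    for _ in range(passes):
        for block in range(working_set):
            total += 1
            if block in cache:
                hits += 1
                cache.move_to_end(block)
            else:
                if len(cache) >= capacity:
                    cache.popitem(last=False)  # evict least recently used
                cache[block] = True
    return hits / total

# Inclusive hierarchy: effective capacity = C_L2 = 4096 blocks.
print(lru_hit_rate(4096, 4300))  # 0.0 -- LRU always evicts the next-needed block
# Exclusive hierarchy: effective capacity = 512 + 4096 = 4608 blocks.
print(lru_hit_rate(4608, 4300))  # 0.8 -- misses only on the warm-up pass
```

A cyclic scan is the worst case for LRU: whenever the working set exceeds capacity even slightly, the hit rate collapses to zero, which is exactly the cliff the inclusive configuration falls off here.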
This capacity advantage becomes even more critical for mixed workloads. Imagine a program working on a small "hot" set of data that lives permanently in L1, while also streaming through a large "cold" set of data. In an inclusive L2, a portion of its precious capacity is forever occupied by a useless copy of the hot set. This reduced available space might be too small to hold the streaming cold set, causing constant L2 misses. The exclusive L2, however, dedicates its entire capacity to the cold set, potentially turning a thrashing nightmare into a smooth ride.
So, with such a clear capacity advantage, why isn't every cache exclusive? Because, as the saying goes, "there ain't no such thing as a free lunch." The elegance of the exclusive design comes with its own set of costs.
The most direct cost is the swap. In an inclusive cache, an L1 miss that hits in L2 is simple: the data is copied from L2 to L1. In an exclusive cache, the same event triggers a more complex dance. The requested block is moved from L2 to L1, and to make space, a "victim" block must be evicted from L1 and moved down into L2. This swap operation takes more time and adds overhead to what would otherwise be a fast L2 hit. An L2 hit in an exclusive cache can actually be slower than an L2 hit in an inclusive one.
This creates a fascinating performance tension. The exclusive cache is better at avoiding the catastrophe of a main memory access, but it pays a small tax on some of its successful cache hits. The overall benefit depends on which effect dominates. We can capture this trade-off in a beautiful mathematical expression for the performance gain (G) of an exclusive cache over an inclusive one:

G = m_L1 · [(h_E − h_I) · T_mem − h_E · T_swap]
Let's break this down. m_L1 is the L1 miss rate. The gain only matters when L1 misses, so everything is scaled by that. Inside the brackets, we see the battle. The term (h_E − h_I) · T_mem is the benefit: h_E is the L2 hit rate for the exclusive cache and h_I for the inclusive. Since the exclusive cache has more capacity, h_E is often higher than h_I. This difference, h_E − h_I, represents the fraction of L1 misses that the exclusive design saves from going to main memory, a time savings of T_mem for each. This is the big prize. The term h_E · T_swap is the penalty: on all of the exclusive L2 hits, we pay the swap overhead tax, T_swap. Whether G is positive (a net gain for exclusive) depends on whether the enormous savings from avoiding memory trips outweighs the cumulative small costs of swapping.
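In code, the trade-off is a one-liner. The numbers below are hypothetical, chosen only to show a case where the capacity benefit outweighs the swap tax:

```python
def exclusive_gain(m_l1, h_excl, h_incl, t_mem, t_swap):
    """Per-access cycles saved by the exclusive design:
    G = m_L1 * ((h_E - h_I) * T_mem - h_E * T_swap)."""
    return m_l1 * ((h_excl - h_incl) * t_mem - h_excl * t_swap)

# Hypothetical numbers: 10% L1 miss rate, the exclusive L2 catches 85% of
# those misses vs 80% for the inclusive L2, memory costs 200 cycles, and
# each swap costs 4 extra cycles.
print(exclusive_gain(0.10, 0.85, 0.80, 200, 4))  # ~0.66 cycles saved per access
```

Note that if the hit rates are equal (h_E = h_I), the benefit term vanishes and G is strictly negative: the exclusive design then pays the swap tax for nothing.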
We can generalize this even further. Imagine a system with an application cache and an OS page cache, where we can choose an inclusive or exclusive policy. The exclusive policy is better only if its swap overhead, O, is below a certain threshold. What is this threshold? It can be shown that the maximum tolerable overhead, O_max, is given by a remarkably intuitive formula.
More formally, using a stack-distance model where P(d) is the probability of an access having a reuse distance of at most d, with cache sizes C_1 and C_2 and a slow (disk) access time T_disk:

O_max = T_disk · [P(C_1 + C_2) − P(C_2)] / [P(C_1 + C_2) − P(C_1)]
The numerator, T_disk · [P(C_1 + C_2) − P(C_2)], is the probability that an access hits precisely in the "extra" capacity provided by the exclusive design, multiplied by the slow disk-access time T_disk that each such hit saves. This is the total benefit. The denominator, P(C_1 + C_2) − P(C_1), is the probability of hitting in the second-level cache, which is exactly when the swap cost is paid. The formula beautifully expresses that the acceptable cost per swap depends on how many catastrophic misses you avoid for every swap you perform.
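As a sketch, with an invented uniform reuse-distance distribution (all sizes and the CDF are assumptions for illustration, not measurements):

```python
def max_swap_overhead(P, c1, c2, t_disk):
    """Maximum tolerable swap overhead for the exclusive policy:
    O_max = T_disk * (P(c1+c2) - P(c2)) / (P(c1+c2) - P(c1)),
    where P(d) is the cumulative reuse-distance distribution."""
    return t_disk * (P(c1 + c2) - P(c2)) / (P(c1 + c2) - P(c1))

# Hypothetical CDF: reuse distances uniform on [0, 10000].
P = lambda d: min(d, 10000) / 10000

# 512-entry first level, 4096-entry second level, 10 ms disk access (in ns).
print(max_swap_overhead(P, 512, 4096, 10_000_000))  # ~1.25e6 ns tolerable per swap
```

With a uniform CDF the threshold collapses to T_disk · C_1 / C_2: each swap may cost up to one eighth of a disk access here and the exclusive policy still wins.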
The differences run deeper than just capacity and swap times. The choice of policy fundamentally alters the relationship between the cache levels.
In an exclusive hierarchy, the L1 cache is its own master. Its hit and miss rates are determined by its own size and the program's behavior. Making the L2 cache bigger or smaller has no direct effect on whether a given access hits or misses in L1. The levels are wonderfully decoupled.
An inclusive hierarchy creates a tight, and sometimes troublesome, coupling. The inclusion rule (B_L1 ⊆ B_L2) has a powerful corollary: if a block is evicted from L2, it must also be removed from L1 to maintain the property. This is called a back-invalidation. Imagine the L2 cache controller, needing to make space, evicting a block. It then sends a command up to L1: "Get rid of block X." This can happen even if block X was the most recently used item in L1!
This interference from below effectively reduces the L1's ability to manage its own content, shrinking its effective capacity. The fascinating consequence is that in an inclusive system, increasing the size of the L2 cache can decrease the L1 miss rate. A larger L2 evicts blocks less often, which means it sends fewer disruptive back-invalidation commands, allowing L1 to do its job more effectively. This subtle ripple effect is a direct consequence of the inclusion contract.
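This ripple effect can be reproduced in a toy simulator (all sizes and the access pattern below are invented for illustration). The key mechanism: a hot block that always hits in L1 never refreshes its recency in L2, so a cold stream ages it out of L2, and the resulting back-invalidation forces an L1 miss. Enlarging L2 visibly lowers the L1 miss count:

```python
from collections import OrderedDict

class InclusiveL1L2:
    """Two-level inclusive LRU hierarchy. An L1 hit does not refresh the
    block's recency in L2, and an L2 eviction back-invalidates the block
    from L1 to preserve inclusion."""
    def __init__(self, s1, s2):
        self.l1, self.l2 = OrderedDict(), OrderedDict()
        self.s1, self.s2 = s1, s2
        self.l1_misses = 0

    def access(self, b):
        if b in self.l1:
            self.l1.move_to_end(b)          # L1 hit: invisible to L2
            return
        self.l1_misses += 1
        if b in self.l2:
            self.l2.move_to_end(b)          # L1 miss, L2 hit
        else:                               # miss in both: fill L2 first
            if len(self.l2) >= self.s2:
                victim, _ = self.l2.popitem(last=False)
                self.l1.pop(victim, None)   # back-invalidation into L1
            self.l2[b] = True
        if len(self.l1) >= self.s1:         # fill L1; the victim stays in L2
            self.l1.popitem(last=False)
        self.l1[b] = True

def hot_block_misses(s2, rounds=200):
    """One hot block reused every round, plus a cold stream of 4 fresh
    blocks per round; returns the total number of L1 misses."""
    c, cold = InclusiveL1L2(s1=6, s2=s2), 0
    for _ in range(rounds):
        c.access("hot")
        for _ in range(4):
            c.access(("cold", cold))
            cold += 1
    return c.l1_misses

# The 800 cold accesses always miss; the hot block's extra L1 misses come
# only from back-invalidations, and they shrink as L2 grows.
print(hot_block_misses(s2=8), hot_block_misses(s2=64))
```

The L1 is untouched between the two runs; only the L2 size changes, yet the L1 miss count drops, which is exactly the coupling the inclusion contract creates.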
The plot thickens dramatically when we move from a single processor to a modern chip with many cores, all sharing a last-level cache (LLC). Now, the question of "where is the data?" becomes a system-wide problem of coherence.
In an inclusive system, the shared LLC acts as a central clearinghouse. If Core 3 wants to read a piece of data, it checks the LLC. The LLC's directory not only knows if the data is present but also which other cores might have a copy. The search is bounded and orderly. The directory, which keeps track of these sharers, only needs to manage entries for blocks that are physically in the LLC.
In an exclusive system, chaos looms. The data that Core 3 wants could be in the LLC, or it could be hidden away in Core 1's private L1 cache, or Core 7's private L2. There is no single place to look. The directory must now track the location of every single cached block in the entire system, whether it's in the LLC or any of the numerous private caches.
This has a devastating impact on the directory storage overhead. For an inclusive cache, the directory size scales with the size of the LLC. For an exclusive cache, the directory needs to store not just sharer information but also the full address tags for all the blocks living exclusively in the private caches. This extra storage can be substantial. Even worse, the size of the directory for an exclusive system can scale with the square of the number of cores (N²), while for an inclusive system it scales linearly with N. For a chip with dozens or hundreds of cores, this quadratic scaling cost can become prohibitively expensive.
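A back-of-the-envelope model makes the scaling concrete. All parameters below are invented; the point is only the asymptotic shape. The exclusive directory carries an N² term because every private-cache block needs its own entry, the number of such blocks grows with the core count, and each entry holds a sharer bit-vector that also grows with the core count:

```python
def directory_bits(cores, llc_blocks, private_blocks_per_core, tag_bits=40):
    """Rough directory cost model (a sketch, not any real design).
    Inclusive: one sharer bit-vector (`cores` bits) per LLC block.
    Exclusive: the same, plus a tag + bit-vector for every block that
    may live only in a private cache."""
    inclusive = llc_blocks * cores
    exclusive = inclusive + cores * private_blocks_per_core * (tag_bits + cores)
    return inclusive, exclusive

for n in (8, 64):
    inc, exc = directory_bits(n, llc_blocks=32768, private_blocks_per_core=1024)
    print(n, inc, exc, round(exc / inc, 2))  # exclusive/inclusive ratio grows with n
```

The inclusive column grows linearly in the core count, while the exclusive column picks up the cores · private_blocks · cores product, which is the quadratic term described above.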
Furthermore, the inclusive policy, despite its other flaws, has a hidden benefit for multicore systems. When the LLC evicts a block that is shared by several cores, the back-invalidations it sends out enforce a clean, system-wide state. In an exclusive system, this cleanup is not automatic, adding yet another layer of complexity to the coherence protocol.
The choice, then, is a profound one. Exclusivity offers the tantalizing promise of more effective capacity—a simple, powerful idea. But as we scale up and face the complexities of a many-core world, the simplicity and order of the inclusive design, despite its apparent wastefulness, provides a much easier-to-manage foundation for the fiendishly complex problem of cache coherence. The journey from a simple workbench to a bustling factory floor of cores changes everything.
Now that we have explored the principles of inclusive and exclusive caches, we can begin to appreciate that this design choice is far from a mere technical footnote in a processor’s blueprint. It is a fundamental decision that sends ripples through the entire computing system, influencing everything from raw performance and power consumption to software design, system fairness, and even vulnerability to cyberattacks. Like a choice in the fundamental laws of a toy universe, the decision to enforce the inclusion property—or not—creates a cascade of consequences, revealing the beautiful and intricate interconnectedness of computer architecture. There is no single "correct" choice; there are only trade-offs, and understanding these trade-offs is what makes the art of computer design so fascinating.
Let's start with the most direct and tangible effects. At its core, the choice between inclusivity and exclusivity is a trade-off between orderliness and efficiency.
Imagine a small program whose working set—the data it needs at any moment—is just a little too large to fit into a core's private cache. With an inclusive hierarchy, where the shared Last-Level Cache (LLC) must contain a copy of everything in the private caches, the total number of unique data blocks the hierarchy can hold is limited by the size of the largest cache, the LLC. If the program's working set exceeds the LLC's capacity (or the capacity of the particular cache sets its addresses conflict in), the system begins to "thrash": it constantly evicts data from the LLC only to find it needs it again moments later, leading to a storm of slow memory accesses.
Now, consider an exclusive cache. Here, the LLC acts more like a spill-over area, or a "victim cache," for data evicted from the private caches. The total effective capacity is the sum of the private caches and the LLC. That same program, whose working set was just a bit too big for the inclusive LLC, might now fit perfectly within the combined space of the private cache and the exclusive LLC. By avoiding duplication, the exclusive cache offers a larger effective storage space, turning a thrashing nightmare into a smooth, efficient execution. This effect is not just theoretical; it can be demonstrated with carefully constructed access patterns that would cripple an inclusive cache system but run beautifully on an exclusive one.
But this extra capacity comes at the cost of complexity. The inclusive cache, for all its capacity constraints, has an elegant simplicity: to find a piece of data, you can just check the LLC. If it's not there, it's not anywhere on the chip. This orderliness, however, creates its own performance penalty. To maintain the strict inclusion property, whenever a line is evicted from the LLC, the system must send "back-invalidation" messages to all private caches to ensure no copies are left dangling. These messages are not free; they create additional coherence traffic on the chip's interconnect. While each message may only add a fraction of a cycle to memory access latency, this overhead, when multiplied over billions of operations, can lead to a noticeable degradation in overall performance, reflected in higher Average Memory Access Time (AMAT) and Cycles Per Instruction (CPI).
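In AMAT terms, the back-invalidation traffic shows up as a small additive tax on every access. A sketch with hypothetical latencies (all numbers invented for illustration):

```python
def amat(hit_l1, t_l1, hit_llc, t_llc, t_mem, back_inval_penalty=0.0):
    """Average memory access time in cycles. `back_inval_penalty` is the
    average extra latency per access attributed to inclusion-maintenance
    coherence traffic (a modeling assumption, not a measured value)."""
    miss_l1 = 1 - hit_l1
    return t_l1 + miss_l1 * (t_llc + (1 - hit_llc) * t_mem) + back_inval_penalty

# 95% L1 hits at 4 cycles, 80% LLC hits at 30 cycles, 200-cycle memory.
print(amat(0.95, 4, 0.80, 30, 200))       # ~7.5 cycles, no inclusion tax
print(amat(0.95, 4, 0.80, 30, 200, 0.3))  # ~7.8 cycles with a 0.3-cycle tax
```

A 0.3-cycle per-access tax looks negligible, but it is a ~4% AMAT increase here; multiplied across billions of memory operations, that is exactly the kind of CPI degradation the text describes.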
This tighter coupling in an inclusive system can also lead to more unpredictable performance. In a multi-core processor, one core's memory access patterns can interfere with another's. An aggressive application on one core might cause many evictions in the shared LLC. In an inclusive system, these evictions can trigger a cascade of back-invalidations that "reach into" another core's private cache, invalidating data that core was actively using. This creates a sudden, unexpected performance spike for the victim core, making system performance jittery and harder to predict. An exclusive cache, lacking this back-invalidation mechanism, better isolates the private caches from LLC eviction pressure, leading to more stable, predictable behavior.
Modern processors are not just a simple hierarchy of caches; they are a complex ecosystem of interacting components. The choice of cache policy has profound implications for how these other components behave.
Consider the hardware prefetcher, a clever assistant that tries to guess which data a program will need next and fetches it into the cache ahead of time. When the prefetcher is right, performance soars. But what if it's wrong? With an inclusive LLC, an over-aggressive or inaccurate prefetcher can become a liability. Imagine a prefetcher on one core flooding a shared LLC set with useless data. This "prefetch pollution" can displace useful data belonging to another core. The eviction of this useful data from the LLC then triggers a back-invalidation, purging it from the victim core's private cache. The result is that one core's well-intentioned but misguided prefetcher has actively harmed another core's performance—an outcome that would not occur in a non-inclusive system where LLC evictions do not automatically invalidate private cache lines.
This theme of unintended interactions extends to one of the most fundamental challenges in computing: synchronization. When multiple cores need to coordinate access to a shared resource, they often use a "spin lock." A naive implementation using a Test-and-Set (TAS) instruction causes each waiting core to repeatedly attempt a write to the lock variable. In a cache-coherent system, each attempt requires gaining exclusive ownership of the cache line, leading to a massive "invalidation storm" as the line is furiously passed from one core to another. Here, the cache policy dictates the path of this storm. With an inclusive LLC, all this frantic traffic must pass through the central LLC, potentially saturating its bandwidth. An exclusive cache, however, allows for more direct core-to-core transfers of the hot cache line, keeping the storm of traffic away from the LLC and preserving its bandwidth for other, more useful work. This subtle difference in traffic path can have a major impact on the scalability of parallel software. The choice of cache policy even influences the design of better locking algorithms like Test-and-Test-and-Set (TTAS), which are specifically designed to minimize this very invalidation traffic.
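A crude counting model shows why TTAS exists. The formulas below are illustrative approximations (one write-attempt invalidates all other cached copies; TTAS cores spin on local reads), not a coherence-protocol simulation:

```python
def invalidations(n_spinners, spins_per_core, policy):
    """Approximate cache-line invalidations while cores spin on a held lock.
    TAS: every spin attempt is a write, invalidating the line in the
    other n-1 private caches. TTAS: spins are local read hits; roughly
    one write happens at release and one at the eventual acquire."""
    if policy == "tas":
        return n_spinners * spins_per_core * (n_spinners - 1)
    if policy == "ttas":
        return 2 * (n_spinners - 1)
    raise ValueError(policy)

print(invalidations(16, 1000, "tas"))   # invalidation storm across the chip
print(invalidations(16, 1000, "ttas"))  # reads stay core-local while waiting
```

Under this model, TAS traffic grows with both the spin count and the square of the core count, while TTAS traffic is essentially constant per lock hand-off, which is why the cache policy's routing of that storm (through the LLC or core-to-core) matters so much for scalability.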
The delicate nature of modern processor features is perhaps best illustrated by Hardware Transactional Memory (HTM). HTM allows a programmer to execute a block of code as a single atomic "transaction," which either completes fully or is aborted and rolled back. A transaction is a fragile, speculative operation, like a house of cards. It maintains a "read-set" of all memory locations it has touched. If any of these lines are evicted or invalidated before the transaction completes, the transaction aborts. An inclusive LLC introduces a devastating new failure mode. If a transaction reads a large number of items that, by chance, all map to the same set in the shared LLC, the LLC set will overflow. This causes evictions from the LLC, which in turn trigger back-invalidations that kill the read-set lines in the private caches, causing the transaction to abort. An exclusive LLC, by providing the full capacity of the private caches to track the read-set without this LLC-induced conflict, offers much more "breathing room" for transactions to succeed.
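The failure mode reduces to a simple capacity check. This toy predicate (parameters invented, not a model of any specific HTM implementation) captures why, under inclusion, the LLC set's associativity rather than the private cache's size becomes the binding constraint on a transaction's read-set:

```python
def transaction_survives(reads_in_one_llc_set, llc_ways, private_capacity,
                         inclusive):
    """Toy model of HTM capacity aborts from LLC set conflicts.
    Inclusive: once conflicting reads exceed the LLC set's associativity,
    an LLC eviction back-invalidates a tracked line and aborts the
    transaction. Exclusive: only private-cache capacity limits the read-set."""
    if inclusive:
        return reads_in_one_llc_set <= llc_ways
    return reads_in_one_llc_set <= private_capacity

# 20 read-set lines that happen to map to one 16-way LLC set; the private
# cache could track 512 lines.
print(transaction_survives(20, llc_ways=16, private_capacity=512,
                           inclusive=True))   # False: inclusion aborts it
print(transaction_survives(20, llc_ways=16, private_capacity=512,
                           inclusive=False))  # True: private cache has room
```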
The consequences of cache design extend even beyond performance, into the realms of fairness, security, and even abstract algorithm design.
In the world of heterogeneous computing, processors often pair powerful "big" cores with efficient "little" cores. These cores frequently share parts of the cache hierarchy. What happens when they share an inclusive cache? The inclusion property dictates that the shared cache must "reserve" space to hold copies of everything in the private L1 caches. Since the big core has a much larger L1 cache than the little core, it implicitly claims a larger, disproportionate share of the shared cache's capacity. This can starve the little core of shared cache resources, unfairly penalizing its performance simply as a side effect of enforcing inclusion. An exclusive cache avoids this problem, naturally creating a more equitable sharing of resources.
Perhaps the most surprising consequence of cache inclusivity lies in the domain of computer security. The very mechanism of back-invalidation, which we have seen as a performance overhead, can be turned into a weapon. In a "Flush+Reload" side-channel attack, an attacker infers a victim's secret-dependent memory accesses by timing their own. The "Flush" step involves evicting a shared cache line. With an inclusive LLC, this is devastatingly effective: evicting the line from the LLC guarantees its removal from the victim's private cache via back-invalidation. The "Reload" step will then be slow if the victim didn't access the line, and fast if they did. Ironically, an exclusive cache, by being "messier" and allowing a line to linger in the victim's private cache even if evicted from the LLC, makes this attack much harder to execute. The clean, orderly nature of the inclusive cache becomes a security vulnerability.
Finally, the influence of cache policy reaches all the way to the design of high-performance numerical algorithms. Consider an algorithm like Tall-Skinny QR (TSQR), used in scientific computing to factorize massive matrices. These algorithms are designed to be "communication-avoiding," carefully managing what data resides in the small, fast levels of the memory hierarchy. The memory footprint of each step is critical. An inclusive cache, with its duplication of data, can have a significantly larger memory footprint for a given operation than an exclusive cache. For an algorithm designer, this means that on a machine with an inclusive cache, you might need a much larger fast memory to achieve the same performance, or you may be forced to use a deeper, slower version of the algorithm. This demonstrates that the architects of supercomputers and the mathematicians designing algorithms for them live in the same world, bound by the same fundamental trade-offs, right down to the choice of cache inclusivity.
From a simple rule—"duplicate or not?"—an entire universe of consequences unfolds. The choice of cache exclusivity is a testament to the profound principle that in any complex, unified system, there are no isolated decisions. Every choice matters, everywhere.