
No-Write-Allocate Cache Policy

Key Takeaways
  • Write-allocate brings data into the cache on a write miss, betting on future reuse, while no-write-allocate bypasses the cache to minimize initial memory traffic.
  • No-write-allocate is highly effective for streaming workloads (e.g., video encoding) and sparse writes, as it can cut memory traffic in half by avoiding unnecessary data reads.
  • Write-allocate is superior for data with high temporal locality, where the initial cost of fetching the cache line is justified by subsequent fast cache hits.
  • Modern processors use adaptive strategies, such as Write Combining memory types for I/O, to dynamically apply the most efficient policy for a given task.

Introduction

In the relentless pursuit of computational speed, the memory cache stands as a critical pillar of modern processor design. It acts as a high-speed buffer, holding frequently used data close to the CPU to avoid the long journey to main memory. While the strategy for handling read misses is straightforward—fetch the data—a more subtle and consequential question arises for write misses: when the CPU needs to write to a memory location not currently in the cache, what is the most efficient course of action? This fundamental design choice gives rise to two distinct philosophies: the **write-allocate** policy, which brings data into the cache before writing, and the **no-write-allocate** policy, which bypasses the cache entirely. This decision creates a fascinating trade-off between upfront costs and future benefits, with profound impacts on system performance and memory bandwidth.

This article explores the deep logic behind this critical choice. In the first chapter, **Principles and Mechanisms**, we will dissect the step-by-step operation of both policies, quantifying their performance trade-offs for different workloads like streaming and sparse writes. We will then transition to the real world in **Applications and Interdisciplinary Connections**, examining how this decision shapes the efficiency of video encoding, device communication, and multicore systems, revealing how modern processors artfully combine both strategies for optimal performance.

Principles and Mechanisms

Imagine you are a painter, and your cache is your palette. Most of your commonly used colors are already on the palette, ready for a quick dab of the brush—this is a cache hit, fast and efficient. But what happens when you need a color that isn't on your palette, say, a specific shade of cerulean blue? This is a cache miss. You have a choice. Do you go to your paint tubes (main memory), squeeze out a generous amount of blue onto an empty spot on your palette, and then apply it to the canvas? Or do you just dip your brush directly into the tube for this one stroke and leave your palette as it was?

This simple choice is the heart of a deep and important design decision in computer architecture: the choice between a **write-allocate** and a **no-write-allocate** policy. It's a classic tale of trade-offs, a story about when it pays to be prepared versus when it's better to be expedient. Let's explore the beautiful logic behind this choice.

The Standard Approach: Prepare the Palette with Write-Allocate

The most intuitive strategy, and the one often taught first, is **write-allocate**. The philosophy is simple: if the processor needs to work with a piece of data, that data should be brought into the cache first, regardless of whether the processor wants to read from it or write to it. This keeps the model of the cache simple and consistent.

Let's trace the journey of a single STORE instruction that misses the cache. Suppose your CPU wants to write 8 bytes of data to a memory address, but the corresponding 64-byte "container" for that data—the **cache line**—is not currently in the Level 1 (L1) cache. Here is the sequence of events under a write-allocate policy:

  1. **Miss and Make Space:** The CPU discovers the line is not in the L1 cache. If the cache set where the new line should go is already full, a "victim" line must be chosen for eviction. If this victim line contains modified data (i.e., it's "dirty"), it can't just be discarded. Its contents must first be written back to the next level of memory (like an L2 cache or main memory) to ensure no data is lost. This is a **write-back**.

  2. **Claim Ownership:** The L1 cache then sends a special request down the memory hierarchy. This isn't just a simple read request; it's a **Read-For-Ownership (RFO)**. An RFO is like saying, "Not only do I need a copy of this 64-byte line, but I also intend to modify it. Please give me an exclusive copy and invalidate any other copies that might exist elsewhere." This is a crucial step in ensuring that different parts of the system don't end up with conflicting versions of the same data.

  3. **Fill the Cache:** The L2 cache or main memory responds by sending the entire 64-byte line up to the L1 cache.

  4. **Perform the Write:** Now that the data's container is finally on the palette, the CPU performs its 8-byte write. The write is now a cache hit. The line is updated and marked as dirty, signifying that the L1 cache holds a newer version of this data than what's in main memory.

This process is robust and logical. It ensures that the cache is always the primary workspace. But look at all the steps involved—eviction, write-back, a full 64-byte read, and then the write. Is all this work always necessary?
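
The four steps above can be sketched in miniature. This is a toy Python model, not real hardware: the `WriteAllocateCache` class, the `next_level` dict, and the four-line capacity are all invented for illustration, and the victim is simply the oldest entry.

```python
LINE = 64  # cache line size in bytes

class WriteAllocateCache:
    """Toy model of an L1 data cache with a write-allocate policy."""

    def __init__(self, next_level, capacity=4):
        self.lines = {}               # line address -> {"data": ..., "dirty": ...}
        self.next_level = next_level  # L2/main memory, modelled as a dict
        self.capacity = capacity      # tiny, for illustration only

    def store(self, addr):
        # (The payload itself is omitted; we only track line state.)
        line_addr = addr - addr % LINE
        if line_addr not in self.lines:                        # 1. miss: make space
            if len(self.lines) >= self.capacity:
                victim, meta = next(iter(self.lines.items()))  # oldest entry
                if meta["dirty"]:                              # write-back if dirty
                    self.next_level[victim] = meta["data"]
                del self.lines[victim]
            # 2-3. RFO: fetch the whole line, with (notional) exclusive ownership
            data = self.next_level.get(line_addr, bytes(LINE))
            self.lines[line_addr] = {"data": data, "dirty": False}
        # 4. perform the write: now a cache hit, and the line becomes dirty
        self.lines[line_addr]["dirty"] = True
```

Even in this toy, the pattern is visible: a single small store to a cold line drags a full 64-byte line into the cache and may push another full line out.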

The Minimalist's Choice: No-Write-Allocate

What if the CPU is simply writing a large file to disk, or clearing a large block of memory to zeros? It's writing data, but it has no immediate plans to read that data back. In this situation, bringing the entire old cache line into the L1 cache just to overwrite it seems... wasteful.

This is the insight behind the **no-write-allocate** policy, sometimes called **write-around**. The philosophy here is to treat write misses as a special case. Instead of preparing the palette, you just dip the brush in the tube.

Let's revisit our STORE miss scenario with this new policy:

  1. **Miss and Bypass:** The CPU discovers the line is not in the L1 cache.
  2. **Forward the Write:** Instead of initiating an RFO, the L1 cache controller simply "steps aside" and forwards the 8-byte write request directly to the next level in the memory hierarchy.

That's it. The L1 cache's contents remain completely unchanged. No eviction, no RFO, no 64-byte line fill. The elegance is in its minimalism. It avoids polluting the cache with data that might not be needed again soon, leaving precious cache space available for more important data.
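
A matching sketch of the no-write-allocate path, under the same toy assumptions (the `NoWriteAllocateCache` class and `next_level` dict are invented names): on a write miss, the cache contents are untouched and only the small write moves down the hierarchy.

```python
LINE = 64  # cache line size in bytes

class NoWriteAllocateCache:
    """Toy model of the same L1, but with a no-write-allocate (write-around) policy."""

    def __init__(self, next_level):
        self.lines = {}               # cache contents: untouched by write misses
        self.next_level = next_level  # next level of the hierarchy, as a dict

    def store(self, addr, payload):
        line_addr = addr - addr % LINE
        if line_addr in self.lines:
            # Write hit: update the cached copy as usual and mark it dirty.
            self.lines[line_addr]["dirty"] = True
        else:
            # Write miss: no eviction, no RFO, no 64-byte line fill.
            # The small write is simply forwarded down the hierarchy.
            self.next_level[addr] = payload
```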

Quantifying the Trade-off: When is Minimalism Better?

So, which policy is superior? This is where the true beauty of computer architecture reveals itself. There is no single "best" answer; the right choice depends entirely on the pattern of memory accesses. It's a fascinating game of prediction and optimization.

The Streaming Workload: A Clear Win for "No-Write-Allocate"

Consider the case of a **streaming store**, where the processor writes to a large, contiguous block of memory from beginning to end—think saving a video file or running a scientific simulation that outputs a massive array.

Let's analyze the memory traffic for writing a total of S bytes of data.

  • With **write-allocate**, for every single cache line within that block, the processor first issues an RFO to read the old, soon-to-be-overwritten line from memory. After modifying it, the dirty line is eventually written back. This means for S bytes of useful data, we incur S bytes of read traffic and S bytes of write traffic, for a total of **2S bytes** moved across the memory bus.

  • With **no-write-allocate**, the processor simply sends the S bytes of new data to memory. There are no reads. The total traffic is just **S bytes**.

The conclusion is striking: for this extremely common workload, write-allocate generates twice the memory traffic! In a system where memory bandwidth is the bottleneck, this can translate directly into a **2x performance improvement** for the no-write-allocate policy.
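
The back-of-the-envelope accounting can be written out directly, assuming 64-byte lines and the simple cost model above (one RFO read plus one write-back per line for write-allocate):

```python
LINE = 64  # bytes per cache line

def stream_traffic(S):
    """Total bus traffic (bytes) for streaming S bytes of stores, by policy."""
    lines = S // LINE                   # full lines written (assume S is a multiple)
    write_allocate = lines * LINE * 2   # RFO read + eventual write-back per line
    no_write_allocate = lines * LINE    # just the new data, no reads at all
    return write_allocate, no_write_allocate

wa, nwa = stream_traffic(1 << 20)       # stream out 1 MiB
print(wa // nwa)                        # -> 2: write-allocate moves twice the bytes
```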

The Reused Data: Revenge of "Write-Allocate"

But the story isn't so simple. What if a program writes to a memory location, and then, a short time later, writes to that same location again? This pattern, known as **temporal locality**, is also very common.

  • With **write-allocate**, the first write is expensive. It pays the full "entry fee" of an RFO to bring the line into the cache. But the second, third, and fourth writes to that same line are now lightning-fast cache hits, generating zero traffic to main memory.

  • With **no-write-allocate**, the first write is cheap—just a small write sent to memory. But the second, third, and fourth writes are also misses that must be sent to memory. It avoids the entry fee but pays a "toll" on every single access.

There is a clear trade-off: write-allocate pays a high up-front cost for the potential of cheap future accesses, while no-write-allocate minimizes the cost of the first access at the expense of all subsequent ones. It's possible to model this and find a critical "reuse probability" where one policy becomes better than the other. If the probability of writing to the same line again soon is high enough, the initial investment of the write-allocate policy pays off handsomely.
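
One way to see the break-even point is a toy cost model built from the assumptions in the text: each store touches w bytes, a line receives n stores before eviction, and write-allocate pays one 64-byte RFO read plus one 64-byte write-back per line. These numbers are illustrative, not measurements of any real machine:

```python
LINE = 64  # bytes per cache line

def traffic(n_writes, w):
    """Memory traffic (bytes) for n_writes stores of w bytes to one line."""
    write_allocate = 2 * LINE         # one RFO line read + one write-back, total
    no_write_allocate = n_writes * w  # every store goes to memory individually
    return write_allocate, no_write_allocate

# With 8-byte stores, the policies tie at 2 * 64 / 8 = 16 writes per line:
print(traffic(16, 8))   # -> (128, 128)
print(traffic(32, 8))   # -> (128, 256): reuse makes write-allocate the winner
```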

The Sparse Write: The Cost of a Cannon to Kill a Fly

Another scenario where no-write-allocate shines is with **sparse writes**, where a program modifies just a few bytes here and there across a very large memory region.

Under write-allocate, writing even a single byte requires an RFO that fetches the entire 64-byte line. This is like ordering an entire pizza just to eat one pepperoni—the vast majority of the data read from memory is "wasted bandwidth" because it was fetched only to be immediately overwritten or ignored. No-write-allocate, by contrast, elegantly handles this by sending only the specific bytes that were actually modified, generating far less traffic.
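
Under the same simplified cost model, the pizza analogy reduces to a one-line "waste factor": the bytes write-allocate moves per byte actually modified.

```python
LINE = 64  # bytes per cache line

def waste_factor(w):
    """Bytes moved by write-allocate per byte actually modified."""
    # Write-allocate moves the whole line in (RFO) and back out (write-back);
    # no-write-allocate would move only the w modified bytes.
    return (2 * LINE) / w

print(waste_factor(1))   # -> 128.0: a single-byte write can cost 128 bytes of traffic
```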

The Art of Implementation: Refinements and Adaptations

Modern processors are masterpieces of engineering, and they employ several clever tricks to get the best of both worlds.

  • **Write-Combining:** Processors using no-write-allocate often employ **write-combining buffers**. If the CPU issues several small writes to the same cache line in quick succession, instead of sending each one to memory individually, this buffer collects them. Once it has a full cache line's worth of data (or after a short timeout), it sends all the data to memory in a single, efficient burst transaction. This is like waiting for your Amazon cart to be full before checking out, saving on "shipping costs" (bus overhead).

  • **Full-Line Write Optimization:** Even the write-allocate policy can be made smarter. If the processor is writing to an entire 64-byte cache line at once, the hardware knows there is no "old" data to preserve. In this special case, it can skip the read part of the RFO, effectively just allocating a line and filling it with the new data. This eliminates the wasteful read traffic for full-line stores, making write-allocate much more competitive for certain streaming workloads.

  • **The Adaptive Choice:** The most sophisticated processors don't have to be dogmatically committed to one policy. They can be **adaptive**. On each write miss, the processor can make an intelligent choice based on the circumstances. It can analyze the situation: "What is the expected cost of each policy from this point forward?" The cost of write-allocate is the RFO read plus the eventual write-back. The cost of no-write-allocate is the immediate write-through plus the cost of any future accesses that will now miss the cache. By using performance counters, instruction types, or even hints from the software, the processor can dynamically choose the policy that is expected to yield lower latency or lower memory traffic for the specific code it's running at that moment.
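
The write-combining idea can be sketched as a tiny buffer that merges small writes to one line and emits a single burst once the line is full. The `WCBuffer` class and its flush-on-conflict behavior are invented for illustration; real buffers also flush on timeouts, fences, and capacity pressure.

```python
LINE = 64  # bytes per cache line

class WCBuffer:
    """Toy single-entry write-combining buffer in front of memory."""

    def __init__(self, memory):
        self.memory = memory          # next level, modelled as a dict
        self.line_addr = None         # line currently being combined
        self.data = bytearray(LINE)
        self.filled = set()           # which byte offsets have been written

    def write(self, addr, payload):
        line_addr = addr - addr % LINE
        if self.line_addr not in (None, line_addr):
            self.flush()              # a write to a different line drains the buffer
        self.line_addr = line_addr
        off = addr % LINE
        self.data[off:off + len(payload)] = payload
        self.filled.update(range(off, off + len(payload)))
        if len(self.filled) == LINE:
            self.flush()              # full line collected: one burst transaction

    def flush(self):
        if self.line_addr is not None:
            self.memory[self.line_addr] = bytes(self.data)
        self.line_addr = None
        self.data = bytearray(LINE)
        self.filled.clear()
```

Eight consecutive 8-byte stores to one line would reach memory as a single 64-byte burst instead of eight separate transactions.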

This dynamic decision-making is the pinnacle of the design process. It acknowledges that there is no universal truth, only a set of trade-offs. The ultimate performance comes not from picking one rule, but from building a system that understands the rules of the game so well that it can choose the best strategy for any situation it encounters.

Applications and Interdisciplinary Connections

We have journeyed through the inner workings of cache write policies, dissecting the logical gears of [write-allocate](/sciencepedia/feynman/keyword/write_allocate) and no-write-allocate. But to truly appreciate the genius of these mechanisms, we must leave the abstract realm of diagrams and state machines and see them at work in the real world. Why would a chip designer—or a programmer—care about such a seemingly minute detail? The answer, as we will see, is that this choice has profound consequences, shaping everything from the smoothness of your video streaming to the efficiency of massive supercomputers. It is a beautiful illustration of how a simple, local decision can have far-reaching, global effects.

The Art of Sending a Message: Streaming Data

Imagine you are a video encoder, and your job is to create a massive video file, frame by frame. You are writing out a long, continuous stream of data. Once a piece of the frame is written, you have no intention of reading it back; your job is to send it on its way to its final destination in memory.

Now, consider what happens if your cache employs a write-allocate (WA) policy. When you write the first byte of a new, 64-byte segment of your video file, the cache says, "Hold on! I don't have that data." It then triggers a Read-For-Ownership (RFO), dutifully fetching the entire 64-byte block of old, garbage data from main memory. It brings this useless data all the way into its precious workspace, only for you to immediately paint over every single byte of it with your new video frame data. This is akin to painting a wall by first taking a detailed photograph of the old, peeling paint, developing it, bringing it back to the room, and then opening your new can of paint. It is pure, unadulterated waste.

This is precisely the scenario explored in computational workloads like video encoding. For every single cache line of output, a WA policy doubles the required memory bandwidth: one read for the RFO, and one eventual write-back of the dirty data. The no-write-allocate (WNA) policy, coupled with a write-through approach and a clever feature called a write-combining buffer, is the elegant solution. On a write miss, it simply says, "This isn't for me to keep." It bypasses the cache entirely, sending the write directly towards memory. The write-combining buffer gathers up all the little writes until a full cache line is ready, and then sends a single, efficient burst to memory. The result? The needless RFO is eliminated, cutting the memory traffic for the stream in half and freeing the cache from being polluted with data that is merely passing through.

This principle holds true not just for dense, byte-by-byte streams, but for any "write-only" or "fire-and-forget" workload. Whether a program writes to every byte in a cache line or just sparsely updates elements with a large stride, if the data is not going to be read back soon, fetching the old line is a fool's errand. For such tasks, WNA is the undisputed champion of efficiency.

Talking to the Outside World: I/O and Device Communication

The world of computing is not just about the CPU talking to memory. It's a bustling ecosystem where the processor must communicate with a vast array of peripheral devices: network cards, graphics processors, storage controllers, and more. This communication often happens through a clever trick called Memory-Mapped I/O (MMIO). From the CPU's perspective, it's just writing to a memory address. But in reality, that address is a "doorbell" or a "mailbox" for an external device.

When your computer sends a packet over the network, the CPU might write a descriptor to a specific MMIO address. This write is not a request to store data; it's a command to the network card: "Send this data packet, now!" Caching this write would be nonsensical. You don't want to keep a local copy of the command; you want the command to be sent.

This is a perfect application for a no-write-allocate policy. Modern systems define certain regions of memory, like those used for PCIe devices, with a "Write Combining" memory type. This tells the CPU to use a WNA policy for any stores to that region. The writes bypass the cache and are funneled into a write-combining buffer. This buffer acts as an intelligent staging area, merging many small, consecutive command writes into a single, large, efficient transaction on the PCIe bus. WNA prevents cache pollution, while the write-combining buffer ensures the underlying hardware is used efficiently. It is a beautiful, symbiotic partnership.

This principle also dramatically simplifies the fiendishly complex problem of keeping memory consistent when multiple agents are at play. Consider a CPU writing data to a buffer in memory while a Direct Memory Access (DMA) engine—a specialized hardware block that can move data without CPU intervention—wants to write to that same location. If the CPU used write-allocate, its write would be sitting in its private L1 cache. The DMA's write would go to main memory. Now the system has two different versions of the data! Reconciling this requires complex and slow cache coherence protocols.

With a no-write-allocate policy for the DMA buffer, the situation is vastly simpler. The CPU's write doesn't enter the cache; it goes into the write buffer. When the DMA engine initiates its write, the system simply has to check this small, well-defined write buffer for a conflicting address and cancel the CPU's pending write. The coherence problem is constrained and easily managed, preventing a race condition where a stale CPU write could overwrite fresh DMA data.

Keeping the Peace in a Multicore World

The challenge of keeping data consistent explodes in a multicore processor. If each core has its own private cache, how does a write by one core become visible to the others? This is the domain of cache coherence protocols. Here again, the choice of write policy has profound implications.

Imagine a scenario where a core is initializing a large block of private data that no other core will touch. Every single write is to a "cold line"—a line not present in any cache. With a write-allocate policy, each write miss triggers an RFO. The core must broadcast its request across the shared interconnect, creating traffic and consuming bandwidth, just to fetch data from memory that it's about to completely overwrite. For a system with dozens of cores all trying to initialize their data, this creates a "traffic jam" of unnecessary RFOs on the critical shared bus.

A no-write-allocate policy acts as a "good neighbor." For these private, cold writes, the core can simply send its data towards memory without allocating a line and without broadcasting a disruptive RFO. It keeps its activity quiet, leaving the interconnect free for more meaningful communication. The write may still be snooped by other cores to maintain coherence, but the costly step of fetching the data from memory is avoided. In large-scale systems, this simple policy choice can lead to a dramatic reduction in system-wide memory traffic, improving overall performance.

The Delicate Dance of System Interactions

Lest we conclude that write-allocate is always the villain, it is crucial to understand that performance is a delicate dance of interacting components. WA is the "optimist's policy": it bets that data being written will be read again soon, so it eagerly brings it into the cache. For many workloads, this is exactly the right bet and yields huge performance gains.

However, this optimism can lead to shockingly bad outcomes in certain corner cases. Consider a single instruction that tries to store 16 bytes of data, but its target address is misaligned such that 8 bytes fall in one cache line and 8 bytes fall in the adjacent line. This is called a "split-line store." With a no-write-allocate policy, this is no big deal; the CPU simply sends two small 8-byte writes to the memory subsystem. But with write-allocate, the result can be a catastrophe. The single instruction triggers two separate cache misses. Each miss might evict a dirty line, causing two 64-byte write-backs to memory. Then, each miss triggers a 64-byte RFO to fetch the two old lines. In the worst case, a single 16-byte store can generate 64 + 64 + 64 + 64 = 256 bytes of memory traffic! WNA's minimalist approach proves far more robust against such architectural landmines.
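
The worst-case accounting for the split-line store works out as follows, using the same 64-byte-line cost model (a sanity check of the arithmetic, not a measurement of any real CPU):

```python
LINE = 64  # bytes per cache line

def split_store_worst_case():
    """Worst-case traffic for a misaligned 16-byte store spanning two lines."""
    write_backs = 2 * LINE  # two dirty victims evicted, each written back
    rfo_reads = 2 * LINE    # two full lines fetched via RFO
    return write_backs + rfo_reads

print(split_store_worst_case())  # -> 256 bytes moved for a 16-byte store
```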

The dance becomes even more intricate when we introduce other performance-enhancing features, like hardware prefetchers. A prefetcher tries to guess what data the CPU will need soon and fetches it into the cache ahead of time. When a prefetcher works well, it's magical. But when it's wrong, it pollutes the cache with useless data. This pollution can have a sinister interaction with a write-allocate policy. An inaccurate prefetch evicts a potentially useful line from the cache. If the victim line happened to be dirty (perhaps because of a previous write-allocate operation), the prefetcher's mistake has just triggered a completely unnecessary 64-byte write-back to memory. The system's attempt to be helpful in one area causes unintended, costly consequences in another.

A Tale of Two Philosophies

Ultimately, the choice between write-allocate and no-write-allocate represents two different philosophies. Write-allocate is the philosophy of the tidy workshop: it assumes any data you touch should be brought into your workspace (the cache) because you're likely to work on it again. No-write-allocate is the philosophy of the minimalist messenger: it recognizes that some tasks are simply about sending a package, and it's best not to clutter the workshop with things that are just passing through.

The true beauty of a modern processor is that it doesn't blindly follow just one philosophy. It has learned to embrace both. Through mechanisms like memory typing, the software can give hints to the hardware about its intent. By marking a region of memory as "Write Combining," the programmer tells the CPU, "This is a fire-and-forget mailbox for a device." The CPU, in its wisdom, then automatically applies the no-write-allocate policy for that region. For all other "normal" memory, it uses its default write-allocate policy, betting on temporal locality. The art of high-performance computing lies in this collaboration, in understanding these fundamental trade-offs and guiding the hardware to make the smartest choice for the task at hand.