
Non-Uniform Memory Access

Key Takeaways
  • NUMA architecture addresses the "memory wall" by creating nodes with local memory, resulting in faster local access but slower remote access.
  • Performance in NUMA systems critically depends on data locality—the physical proximity of data to the processor that uses it.
  • Operating system policies like "first-touch" automatically place data, making programmer awareness of initialization crucial for performance.
  • NUMA-aware programming, including strategic data placement and algorithm design, can significantly improve scalability in fields from scientific computing to databases.
  • By intelligently partitioning data to maximize local access, NUMA systems can outperform theoretically simpler UMA machines.

Introduction

In the world of computing, we often start with a simple, elegant model: memory is a single, unified space where every location is equally fast to access. This concept, known as Uniform Memory Access (UMA), served as the foundation for computer architecture for decades. However, as processors grew exponentially faster than memory, this model began to buckle under the strain, creating a "memory wall" where CPUs spent more time waiting for data than processing it. The rise of multicore processors further exacerbated this issue, creating a bottleneck at the central memory controller and limiting scalability. This gap between the simple model and physical reality necessitated a new approach to memory design.

This article delves into the solution: Non-Uniform Memory Access (NUMA). We will journey from the flat, predictable world of UMA to the rich, complex geography of NUMA, where the location of data is paramount. You will learn not only what NUMA is but why it exists and how its principles fundamentally reshape our approach to programming. The first chapter, "Principles and Mechanisms," will uncover the core concepts of NUMA, quantify the cost of remote memory access, and explain the crucial role of operating system policies like "first-touch." The subsequent chapter, "Applications and Interdisciplinary Connections," will explore how to harness NUMA principles in the real world, from building smarter data structures to scaling massive computations in scientific domains and database systems. By the end, you will understand that non-uniformity is not a flaw, but a feature to be leveraged for building truly powerful and efficient systems.

Principles and Mechanisms

To write a program, we often begin with a wonderfully simple mental model: the computer’s memory is a single, vast, and orderly warehouse of information. We imagine a gigantic array of numbered boxes, where we can store or retrieve any piece of data with equal ease and speed. This elegant abstraction is known as Uniform Memory Access (UMA). For many years, this model was not just a convenience; it was a fairly accurate reflection of reality. A single processor (CPU) communicated with a single bank of memory through a dedicated memory controller. The time it took to fetch data from address 100 was, for all practical purposes, the same as from address 1,000,000.

But a quiet revolution was happening. Processors were becoming astonishingly fast, while the speed of memory struggled to keep pace. This growing gap, often called the "memory wall," meant that our lightning-fast CPUs were spending more and more of their time just waiting for data to arrive. The solution? If one CPU is waiting, why not have many CPUs working at once? This led to the era of multicore and multi-socket processors. But this created a new kind of traffic jam. Imagine dozens of workers all trying to get tools from a single storeroom through one door. The storeroom itself might be efficient, but the queue at the door becomes the real bottleneck.

In a computer, this bottleneck is the memory controller. Even if the controller is very fast, serving requests at a rate of, say, $\mu$, if the total request rate $\Lambda$ from all the cores approaches $\mu$, the average time to get a piece of data skyrockets. This isn't because the controller is slow, but because of the waiting time, or contention, in the queue. The beautiful, simple UMA model begins to break down under this load.

A New Geography: The NUMA Principle

Nature, and computer architecture, often solves a bottleneck problem not by making the bottleneck wider, but by getting rid of it entirely and decentralizing. What if instead of one giant, central warehouse, we built several smaller, local depots? This is the core idea behind Non-Uniform Memory Access (NUMA).

In a NUMA system, the machine is divided into a small number of "nodes" or "sockets." Each node is like a self-contained neighborhood: it has its own set of processor cores and its own bank of local memory. All these nodes are connected by a high-speed interconnect. The defining principle of NUMA is this: a processor can access its own local memory very quickly. However, if it needs to access memory belonging to another node—a remote access—it must send a request over the interconnect. This journey takes extra time. Access time is no longer uniform; it depends on the location of the data relative to the processor accessing it.

It's like having a library in your own neighborhood versus having to take a bus across town to the central branch. The local library is much faster to get to. This non-uniformity isn't a flaw; it's a deliberate and brilliant trade-off. We have traded the simplicity of the UMA model for a system that can scale to a huge number of cores without a single, overwhelming memory bottleneck. We have introduced a geography to memory.

Quantifying the NUMA Effect: The Price of Distance

So, how much does this "distance" cost? We can capture this with a beautifully simple formula for the Average Memory Access Time (AMAT). An access is either local or remote. If the probability of an access being local is $p_{\text{local}}$ and the time for that access is $t_{\text{local}}$, and the probability of it being remote is $p_{\text{remote}}$ with time $t_{\text{remote}}$, then the average time is simply a weighted average:

$$AMAT = p_{\text{local}} \cdot t_{\text{local}} + p_{\text{remote}} \cdot t_{\text{remote}}$$

The latencies, $t_{\text{local}}$ and $t_{\text{remote}}$, are characteristics of the hardware. For a typical machine, $t_{\text{local}}$ might be around 80 nanoseconds, while $t_{\text{remote}}$ could be 140 nanoseconds or more. The crucial insight is that software has the power to influence the probabilities, $p_{\text{local}}$ and $p_{\text{remote}}$.

Let's look at this another way. Let $p$ be the fraction of remote accesses, so $1-p$ is the fraction of local ones. Let's also define a remote access penalty factor, $\alpha$, as the ratio of remote to local latency, $\alpha = t_{\text{remote}} / t_{\text{local}}$. For our example numbers, $\alpha = 140/80 = 1.75$. We can then rewrite the average access time in terms of the baseline local latency:

$$AMAT = (1-p) \cdot t_{\text{local}} + p \cdot t_{\text{remote}} = t_{\text{local}} \left[ (1-p) + p \cdot \alpha \right]$$

This form tells us everything. The performance penalty you pay is directly proportional to two things: the fraction of times you have to "go across town" ($p$) and how much longer that trip is ($\alpha$). If just 10% of your memory accesses are remote ($p = 0.1$), the average access time becomes $80~\text{ns} \times (0.9 + 0.1 \times 1.75) = 80~\text{ns} \times 1.075 = 86~\text{ns}$. The slowdown seems small. But what if the workload is a long chain of dependent operations, like chasing pointers through a linked list? Each step adds latency, and the remote accesses cause this "latency stacking" to be amplified, potentially crippling performance. Your program's performance is now inextricably linked to the geography of its data.
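The arithmetic above takes only a few lines to check. The sketch below uses the illustrative 80 ns / 140 ns latencies from the text (not measurements of any particular machine) and verifies that the two forms of the formula agree:

```python
def amat(p_remote, t_local=80.0, t_remote=140.0):
    """Average memory access time in ns: a weighted mix of local and remote latency."""
    return (1 - p_remote) * t_local + p_remote * t_remote

def amat_alpha(p_remote, t_local=80.0, alpha=1.75):
    """Equivalent form: t_local * [(1 - p) + p * alpha], with alpha = t_remote / t_local."""
    return t_local * ((1 - p_remote) + p_remote * alpha)

for p in (0.0, 0.1, 0.5, 1.0):
    print(f"p_remote={p:.1f}: AMAT={amat(p):.1f} ns")
# p_remote=0.1 gives 86.0 ns, matching the worked example in the text.
```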

The Unseen Hand: How Data Gets Placed

If data location is so critical, who decides where a newly created piece of data should live? In most modern operating systems, the answer is an elegantly simple policy called first-touch. When a program asks for a chunk of memory, the OS doesn't immediately assign it a physical home. It waits until a processor core actually tries to write to that memory for the first time. At that moment, the OS allocates a physical page of memory from the node where the writing core resides. The first one to touch it, claims it.

This simple rule has profound consequences. Imagine a program designed to process a massive 64 GiB dataset. A programmer, thinking in the old UMA world, might write a simple initialization routine that runs on a single thread to prepare the data. Because of the first-touch policy, all 64 GiB of that data will be allocated on the memory of the node where that single thread ran. Now, what happens when the main computation starts, and 32 threads spread across two nodes begin to work on the data? The 16 threads on the "home" node will enjoy fast, local memory access. But the other 16 threads on the other node will find that all of their data is remote. Their performance is immediately throttled by the slower remote memory bandwidth. The aggregate performance of the system is hamstrung, not by the CPU speed or total memory bandwidth, but by a single NUMA-oblivious line of code.

The solution is as elegant as the policy itself: NUMA-aware programming. Instead of a single-threaded initialization, we use a parallel one. Each of the 32 threads first writes to the portion of the data it will be responsible for processing. This way, the data is automatically placed on the correct node for the thread that needs it most, maximizing locality and unleashing the full power of the machine. It’s a beautiful dance between the hardware's geography, the operating system's policy, and the application's logic.
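The first-touch effect can be illustrated with a toy placement model. Everything here is hypothetical bookkeeping (a two-node machine, 1024 pretend pages); real systems make this decision inside the kernel's page-fault handler:

```python
# Toy model of first-touch placement: a page lands on the node of the
# core that first writes it. Compare serial vs. parallel initialization.
NUM_NODES = 2
PAGES = 1024  # pretend the dataset spans 1024 pages

def first_touch(touching_node_for_page):
    """Map each page to the node of the thread that first wrote it."""
    return [touching_node_for_page(p) for p in range(PAGES)]

# Serial init: one thread on node 0 touches everything.
serial = first_touch(lambda p: 0)

# Parallel init: each node's threads touch the half they will later process.
parallel = first_touch(lambda p: 0 if p < PAGES // 2 else 1)

def local_fraction(placement, owner_node_for_page):
    """Fraction of pages that end up local to the node that processes them."""
    local = sum(1 for p, node in enumerate(placement)
                if node == owner_node_for_page(p))
    return local / PAGES

owner = lambda p: 0 if p < PAGES // 2 else 1  # node that will compute on page p
print(local_fraction(serial, owner))    # 0.5: half the workers see only remote pages
print(local_fraction(parallel, owner))  # 1.0: every page is local to its worker
```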

When Locality Is Not Everything

Is the goal, then, always to achieve perfect locality? Not necessarily. The beauty of science is in the nuance. Consider a program that doesn't just scan through a large array, but instead accesses memory in a seemingly random pattern, like looking up values in a large hash table.

For a linear scan, placing contiguous chunks of data local to the thread processing them is clearly optimal. But for a random access pattern, it's hard to predict which thread will need which piece of data next. If we use the first-touch policy to place all the data locally for one group of threads, any other thread needing that data is penalized.

In such cases, a different strategy might be better: page interleaving. Instead of trying to group pages, the OS deliberately scatters them across all NUMA nodes in a round-robin fashion. Page 0 goes to Node 0, Page 1 to Node 1, Page 2 to Node 0, Page 3 to Node 1, and so on. Now, any thread accessing a large amount of data will find that about half its accesses are local and half are remote. No single memory controller is overwhelmed, and every thread sees roughly the same average performance. We sacrifice the potential for perfect locality in exchange for predictable and balanced performance. The right strategy depends entirely on the problem you are trying to solve.
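A round-robin interleave is just modular arithmetic on the page number. This sketch assumes the two-node layout used in the example:

```python
def interleaved_node(page, num_nodes=2):
    """Round-robin page placement: page i lives on node i mod num_nodes."""
    return page % num_nodes

print([interleaved_node(p) for p in range(6)])  # [0, 1, 0, 1, 0, 1]

# With uniformly random accesses, a thread on any node finds roughly
# 1/num_nodes of its pages local, regardless of which node it runs on:
def expected_local_fraction(num_nodes):
    return 1 / num_nodes

print(expected_local_fraction(2))  # 0.5
```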

The Deeper Implications of Non-Uniformity

The principle that "location matters" runs deeper than just the time it takes to read a byte. It permeates the entire system, forcing us to rethink concepts we thought were settled.

Synchronization: Consider a simple spinlock, a mechanism to ensure only one thread enters a critical section at a time. A basic "ticket lock" uses a single shared variable to coordinate waiting threads. On a UMA machine, this is fine. On a NUMA machine, it's a disaster. When the lock is released, the holder writes to this shared variable. The cache coherence protocol then sends invalidation messages to every other node where a waiting thread has a cached copy of that variable. This creates a broadcast "storm" across the high-latency interconnects for every single lock handoff. A NUMA-aware algorithm, like the MCS lock, is completely different. Each thread waits by spinning on its own local variable. The handoff becomes a targeted, point-to-point write from the releasing thread to its direct successor. The communication pattern respects the machine's geography.
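The MCS queue discipline—each waiter spins on its own flag, and release is one targeted write to the successor—can be sketched in Python. The atomic swap and compare-and-swap are emulated with a small guard lock, and the spin loops yield the interpreter, so this shows only the algorithm's structure, not real NUMA cache behavior:

```python
import threading, time

class QNode:
    """Per-thread queue node; a waiter spins only on its own `locked` flag."""
    __slots__ = ("locked", "next")
    def __init__(self):
        self.locked = False
        self.next = None

class MCSLock:
    def __init__(self):
        self.tail = None
        self._guard = threading.Lock()  # emulates atomic exchange / CAS

    def acquire(self, node):
        node.locked, node.next = True, None
        with self._guard:               # atomic swap: enqueue at the tail
            prev, self.tail = self.tail, node
        if prev is not None:
            prev.next = node            # link behind our predecessor
            while node.locked:          # spin on OUR node only ("local" memory)
                time.sleep(0)           # yield; a real MCS lock busy-waits

    def release(self, node):
        if node.next is None:
            with self._guard:           # CAS(tail, node -> None): no successor
                if self.tail is node:
                    self.tail = None
                    return
            while node.next is None:    # a successor is mid-enqueue; wait for link
                time.sleep(0)
        node.next.locked = False        # targeted handoff: one write, one waiter

counter = 0
lock = MCSLock()
def worker():
    global counter
    node = QNode()
    for _ in range(1000):
        lock.acquire(node)
        counter += 1                    # critical section
        lock.release(node)

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print(counter)  # 4000
```

Note how the release path never broadcasts: at most one other thread's flag is written, which is exactly the property that keeps interconnect traffic point-to-point on a real NUMA machine.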

Scalability: Amdahl's Law teaches us that the serial portion of a program limits its parallel speedup. NUMA introduces a new, insidious source of serialization. The time spent waiting for remote memory accesses is time that doesn't shrink as you add more processors. This communication overhead acts as an effective serial fraction, fundamentally limiting the scalability of a NUMA-oblivious program.

The OS and Virtualization: This principle even affects the lowest levels of the operating system and virtualization. When the OS needs to change a virtual-to-physical address mapping, it may have to invalidate cached translations on other cores using Inter-Processor Interrupts (IPIs). On a NUMA machine, the cost of sending an IPI to a core on a remote node is higher than sending one locally, and the coordination overhead multiplies. Even in a virtual machine that thinks it's running on a simple UMA system, the reality of the underlying NUMA host will bleed through. If the host's hypervisor is forced to place some of the virtual machine's memory on a remote node, the guest application will mysteriously slow down, its performance dictated by the physical reality it cannot see.

The journey from the simple, flat world of UMA to the rich, structured geography of NUMA is a tale of how physical constraints give birth to beautiful complexity. This non-uniformity is not a problem to be lamented, but a characteristic to be understood. By designing software—from algorithms to operating systems—that respects the physical layout of the machine, we can create systems that are not only more powerful but also more elegant, working in harmony with the fundamental nature of modern computation.

Applications and Interdisciplinary Connections

In our previous discussion, we delved into the principles of Non-Uniform Memory Access, uncovering the fundamental reason for its existence: the inescapable reality of physical distance. A processor cannot be everywhere at once, and so some memory will always be closer than other memory. This creates a "geography" within the computer, with local suburbs and distant provinces.

But knowing the map is only the first step. The real adventure lies in exploring this terrain and learning to live in it. How does this non-uniformity affect the programs we write every day? And how can we, as scientists and engineers, turn this apparent complication into an advantage? This is not merely a technical exercise; it is a journey into the beautiful and intricate dance between software and hardware, between abstract algorithms and their physical embodiment.

The Foundation: Locality of Data and Computation

Let's start with the simplest of tasks: processing a very large list of numbers, say, an array in memory. Imagine our computer has two processors, or "sockets," each with its own local memory. Our task is to use threads on both sockets to work on a giant array. What is the best way to do this?

The answer, as you might guess, is to ensure that the threads on each socket work on data that is stored in their local memory. If we carefully place the first half of the array in the memory of the first socket, and the second half in the memory of the second, and then assign each socket's threads to their corresponding half, we achieve maximum performance. Each socket hums along at its full local memory bandwidth, like two independent workshops, each with all its tools at hand. This is the ideal, the NUMA-aware utopia.

What happens if we are careless? If we, for instance, place all the data on one socket's memory but have both sockets work on it, then the second socket is in for a long commute. Its threads must constantly reach across the interconnect to fetch data, dramatically slowing them down. The entire job now runs at the speed of this throttled, remote-accessing socket. An even worse, almost perverse, scenario is to place the data in partitioned halves but then have the sockets work on the opposite halves. Now, everyone is making a long commute, and the performance is dismal. These simple scenarios teach us the cardinal rule of NUMA: keep your data and your computation together.

But the plot thickens. The cost of remoteness is not always so straightforward. Consider an "out-of-place" algorithm, which reads from an input array and writes the results to a separate output array. Let's say our thread is running on socket 0. The input array is local, which is good. But the output array is on remote socket 1. You might think, "Well, the writes are remote, that's a penalty, but at least the reads are fast."

Here, the subtle machinery of modern processors plays a trick on us. Most caches follow a "write-allocate" policy. Intuitively, this is like saying: before you can write on a blank line in a notebook, you must first have the notebook page in front of you. If the notebook is in another room (on a remote NUMA node), a simple write operation forces you to first go fetch the entire page, bring it to your desk (the local cache), make your small annotation, and only then can you proceed. This "read-for-ownership" request must traverse the interconnect, wait for the remote memory to respond, and then transfer a full cache line back—all just to perform a write. This hidden remote read can turn your write operations into a major bottleneck, dominated by the interconnect's limited bandwidth. This teaches us a deeper lesson: NUMA awareness isn't just about the obvious access patterns; it's about understanding the subtle behavior of the underlying hardware.
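The hidden cost of write-allocate can be made concrete with a back-of-envelope traffic count. The 64-byte cache line is typical but hardware-specific, and the "no-allocate" case stands in for techniques like non-temporal stores:

```python
LINE = 64  # bytes per cache line (typical, but hardware-specific)

def remote_write_traffic(bytes_written, write_allocate=True):
    """Interconnect bytes moved by a streaming write to a remote array."""
    lines = -(-bytes_written // LINE)              # ceiling division
    fetch = lines * LINE if write_allocate else 0  # read-for-ownership pulls each line first
    writeback = lines * LINE                       # dirty lines eventually travel back
    return fetch + writeback

mb = 1 << 20
print(remote_write_traffic(64 * mb) // mb)         # 128: twice the data crosses the link
print(remote_write_traffic(64 * mb, False) // mb)  # 64: skipping the allocate halves the traffic
```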

Building Smart Data Structures

Understanding these fundamentals is an invitation to be clever. If the placement of data is so critical, can we design our fundamental tools—our data structures—to be aware of their own location and purpose?

Consider a circular queue, a fixed-size ring buffer used for communication between different parts of a program. If we know that threads on one socket tend to add items (enqueue) and threads on another tend to remove them (dequeue), we have a predictable access pattern. The "head" of the queue is primarily accessed by one node, and the "tail" by another. We can then intelligently lay out the queue's underlying array in memory. Perhaps we partition the array into blocks and assign those blocks to NUMA nodes in a pattern that matches the movement of the head and tail pointers. By choosing the right block size and starting offset, we can statically optimize the layout so that most accesses are local for the threads performing them, minimizing the cost of this predictable "chase" around the ring.
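One way to express such a static layout is a simple index-to-node map. The block size, node count, and offset below are illustrative parameters, not a prescription for any particular queue:

```python
def slot_node(index, block_size=64, num_nodes=2, offset=0):
    """Map a queue slot to a NUMA node: consecutive blocks of slots alternate nodes."""
    return ((index // block_size) + offset) % num_nodes

# With block_size=64, a pointer sweeping the ring stays inside one node's
# block for 64 consecutive slots before crossing a placement boundary.
print([slot_node(i) for i in (0, 63, 64, 127, 128)])  # [0, 0, 1, 1, 0]
```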

But what if we don't know the access pattern in advance? Can a data structure learn and adapt? Absolutely. Imagine a dynamic array, one that grows as you add elements. On a NUMA system, each time the array needs to be resized, it has a choice: where should the new, larger block of memory be located? A "NUMA-aware" dynamic array can keep track of which socket has been accessing it most frequently. When it's time to resize, it doesn't just ask for more memory; it makes an intelligent decision to migrate to the NUMA node that has become its "home," the place where it is used most often. The cost of moving is paid once during the resize, but the benefit is then enjoyed for countless future accesses. It is as if the data structure itself decides to move to a new city to be closer to its most frequent users, a beautiful example of online, adaptive optimization.
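A minimal sketch of the adaptive idea, with NUMA placement reduced to a recorded "home" label (a real implementation would bind the new allocation with libnuma or similar, and migrate only when the imbalance justifies the copy cost):

```python
class NumaAwareArray:
    """Toy dynamic array that migrates to its busiest node when it grows."""
    def __init__(self, num_nodes=2):
        self.data = []
        self.home = 0                        # node the backing memory "lives" on
        self.access_counts = [0] * num_nodes

    def get(self, i, caller_node):
        self.access_counts[caller_node] += 1
        return self.data[i]

    def append(self, value, caller_node):
        self.access_counts[caller_node] += 1
        self.data.append(value)
        # On a real resize, choose the new allocation's node from observed demand.
        busiest = max(range(len(self.access_counts)),
                      key=self.access_counts.__getitem__)
        if busiest != self.home:
            self.home = busiest              # "migrate" to the hot node

arr = NumaAwareArray()
for _ in range(10):
    arr.append(0, caller_node=1)             # node 1 does all the work...
print(arr.home)                              # 1 ...so the array moves there
```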

Scaling Up: High-Performance and Scientific Computing

These ideas of data placement and algorithmic awareness scale up from single data structures to the largest computations in science and engineering. When simulating everything from the collision of galaxies to the folding of a protein, we are often manipulating enormous grids or matrices of data on supercomputers with thousands of cores spread across many NUMA nodes.

A classic example is matrix multiplication, $C = A \times B$. If the matrices are too large to fit on one node, we must partition them. This is called domain decomposition. Let's say we split our two-socket machine's memory, giving each socket half the rows of matrix $A$ and half the columns of matrix $B$. To compute the corresponding half of the output matrix $C$, a socket will need its local rows of $A$ and various columns of $B$. The key insight is that with a smart work assignment, a socket uses its local rows of $A$ for all its calculations. The communication is then reduced to only fetching the necessary columns of $B$ from the other socket. This strategic decomposition, which co-locates computation with at least one of the large input data structures, is a cornerstone of high-performance computing.
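The row decomposition can be sketched in a few lines of plain Python, with "socket" placement modeled simply as which half of the rows each worker computes (real placement would rely on first-touch or explicit binding):

```python
# Toy 2-socket decomposition of C = A x B: each socket owns half the rows
# of A and computes the matching rows of C, fetching columns of B as needed.
def matmul_rows(A, B, row_lo, row_hi):
    """Compute rows [row_lo, row_hi) of the product A x B."""
    n, m = len(B), len(B[0])
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(m)]
            for i in range(row_lo, row_hi)]

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]

half = len(A) // 2
C_socket0 = matmul_rows(A, B, 0, half)        # touches only A's "local" rows
C_socket1 = matmul_rows(A, B, half, len(A))
C = C_socket0 + C_socket1
print(C)  # [[19, 22], [43, 50]]
```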

This intricate dance between computation and data also occurs at a much finer scale, right inside the loops of our code. Compilers often use an optimization called loop tiling (or blocking) to improve cache performance, breaking a large loop over an array into smaller loops over "tiles" of the array that fit in cache. On a NUMA system, this has another dimension. If a tile of data happens to cross the boundary between two NUMA domains, the thread processing that tile will be forced to pay the remote access penalty for a portion of its work. A careful analysis can precisely predict how many cache lines will be fetched remotely and quantify the exact performance tax paid at these boundaries.

In a real-world scientific code, all of these layers come together in a grand synthesis. Consider a simulation for magnetohydrodynamics used in astrophysics. It might use a hybrid parallel model: MPI (Message Passing Interface) to decompose the problem domain across many nodes in a cluster, and OpenMP to use multiple threads within each NUMA node. To achieve good performance, the programmers must be masters of locality on all scales. They must use topology-aware mapping to place communicating MPI ranks on physically close nodes in the cluster. Within each node, they must use techniques like first-touch allocation and thread pinning to ensure their OpenMP threads stay on one socket and operate on local data. Failure at any level—from the network switch down to the local memory bus—can cripple the performance of the entire simulation.

The World of Data: Databases and Distributed Systems

Lest you think this is only the concern of scientists simulating the universe, these same principles are the bedrock of the systems that manage the world's information.

Consider a massive key-value store, the kind that powers social media feeds and online shopping carts, sharded across the NUMA nodes of a large server. The performance is directly tied to the rate of remote memory accesses. We can even model this with a simple, elegant equation. If a fraction $\alpha$ of the requests have "locality" (meaning they are known to target keys on the thread's home node), and the system has $M$ nodes, the overall probability of a remote access is simply $P(R) = (1-\alpha) \frac{M-1}{M}$. This formula beautifully reveals how the application's intrinsic data access patterns ($\alpha$) directly dictate the physical performance on the machine.
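The formula is a one-liner, and checking its edge cases makes the model's meaning clear:

```python
def remote_probability(alpha, M):
    """P(remote access): a fraction alpha of requests hit the home node;
    the rest land uniformly across all M nodes, of which M-1 are remote."""
    return (1 - alpha) * (M - 1) / M

print(remote_probability(0.0, 4))   # 0.75: no locality, 3 of 4 nodes are remote
print(remote_probability(1.0, 4))   # 0.0: perfect locality, nothing is remote
print(remote_probability(0.5, 2))   # 0.25
```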

Furthermore, this remote access probability has a nasty habit of being amplified. In many database systems, a single logical read operation might trigger multiple underlying memory accesses to traverse an index structure or fetch scattered data blocks. This is called read amplification. If a logical read requires $A$ physical memory accesses, the total performance penalty from NUMA is proportional to the product of $A$ and the remote access fraction $f$. This means that as an application's internal logic becomes more complex (higher $A$), its sensitivity to NUMA locality grows immensely. Optimizing for locality isn't just a minor tweak; its value is multiplied by the complexity of the application itself.
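The amplification argument is itself just a product. Symbols follow the text ($A$ physical accesses per logical read, remote fraction $f$), with the earlier 80 ns / 140 ns latencies reused as an assumed per-access penalty:

```python
def numa_penalty_per_read(A, f, t_local=80.0, t_remote=140.0):
    """Extra ns per logical read due to remote accesses: A * f * (t_remote - t_local)."""
    return A * f * (t_remote - t_local)

print(numa_penalty_per_read(A=1, f=0.25))   # 15.0 ns for a simple read
print(numa_penalty_per_read(A=8, f=0.25))   # 120.0 ns: 8x amplification, 8x penalty
```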

So, is NUMA just a headache, a penalty we must always pay? Or can it be an advantage? This leads us to our final, and perhaps most profound, example: a graph database. Imagine running queries on a huge network graph, first on a UMA machine with a uniform memory latency of $\ell_U$, and then on a NUMA machine with a faster local latency $\ell_L$ and a slower remote latency $\ell_R$. If we naively partition the graph by simply hashing the vertex IDs, our traversals will frequently jump between nodes, incurring the high remote latency $\ell_R$. The average performance will be poor, likely worse than the simpler UMA machine.

But what if we are clever? What if we partition the graph by finding its natural communities—tightly connected clusters of vertices—and place each community entirely within a single NUMA node? Now, a traversal that starts in a community is very likely to stay within that community, and thus within local memory. The vast majority of our memory accesses will now enjoy the faster local latency $\ell_L$. Because $\ell_L$ is faster than the UMA system's uniform latency $\ell_U$, our average query time on the NUMA machine can actually become lower than on the UMA machine. We have won.
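The crossover is easy to see numerically. The latencies below are hypothetical (a 60 ns local / 150 ns remote NUMA machine against a 90 ns UMA machine), as are the locality fractions for the two partitioning schemes:

```python
def avg_latency(p_remote, l_local=60.0, l_remote=150.0):
    """Average per-hop latency on the NUMA machine for a given remote fraction."""
    return (1 - p_remote) * l_local + p_remote * l_remote

L_UMA = 90.0  # hypothetical uniform latency of the UMA machine (ns)

# Hash partitioning on 2 nodes: roughly half of all traversal hops go remote.
print(avg_latency(0.5))  # 105.0 ns -- worse than the UMA machine

# Community partitioning: say 90% of hops stay inside a node-local community.
print(avg_latency(0.1))  # beats the UMA machine's 90 ns
```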

This is the ultimate lesson of Non-Uniform Memory Access. It is not a bug or a flaw; it is a true reflection of physical reality. By understanding this geography of memory, and by designing algorithms and data structures that respect it, we can build systems that are not just patched to work, but are fundamentally faster and more efficient because of it. The quest for performance becomes a quest for locality, a beautiful and rewarding challenge at the heart of modern computing.