Buddy Allocation

Key Takeaways
  • The buddy system manages memory by recursively splitting large power-of-two sized blocks to satisfy requests and aggressively merging (coalescing) them upon deallocation.
  • It uses a single, highly efficient bitwise XOR operation to instantly locate a block's "buddy" for merging, reflecting its underlying binary tree structure.
  • While elegant, the system introduces internal fragmentation by rounding requests up to a power of two and is susceptible to external fragmentation under certain workloads.
  • It is a critical component in modern operating systems, used for allocating physically contiguous memory for DMA and working in partnership with slab allocators.
  • The principles of buddy allocation are adapted for modern challenges, including managing memory on GPUs and in complex heterogeneous memory systems.

Introduction

In the complex world of computer science, managing memory is a foundational yet persistent challenge. Operating systems and applications constantly request and release blocks of memory in varying sizes, creating a dynamic puzzle. A naive approach can quickly lead to memory fragmentation, where free space becomes so scattered into small, unusable pieces that even simple requests fail. This article tackles this problem by exploring one of the most elegant and influential solutions: the buddy allocation system. First, we will delve into the "Principles and Mechanisms," uncovering the simple power-of-two logic, the recursive dance of splitting and coalescing, and the inherent trade-offs that define its behavior. Following that, in "Applications and Interdisciplinary Connections," we will see how this fundamental algorithm extends beyond theory, playing a critical role in everything from low-level hardware communication to the sophisticated memory management of modern programming languages.

Principles and Mechanisms

Imagine you are the librarian of a vast, empty library. Your job is to hand out sections of shelving to patrons who request them. Some ask for a tiny shelf, others for an entire aisle. When they're done, they return the space to you. How do you keep track of what's free without your catalog becoming a complete mess of tiny, unusable gaps? This is, in essence, the challenge of memory management, and one of the most elegant solutions ever devised is the buddy system.

At first glance, the buddy system seems to impose a rigid, almost tyrannical rule: all blocks of memory must have a size that is a power of two—1, 2, 4, 8, 16, and so on. This might seem terribly inefficient. If someone asks for 9 kilobytes of memory, you have to give them a 16-kilobyte block! But as we'll see, this rigid discipline is the very source of the system's profound simplicity and speed.

A World Built on Two

The beauty of the buddy system begins with its hierarchical view of memory. It sees a large, contiguous memory pool not as a simple line of bytes, but as a complete binary tree. At the very top, the root of the tree is the entire memory block, say of size $2^M$. This block can be split into two "children" blocks, each of size $2^{M-1}$. Each of these can be split further, and so on, until you reach the smallest possible block size you wish to manage.

This structure dictates the two fundamental operations of the system: a dance of splitting and merging.

Allocation: The Art of Splitting

When a request for a certain amount of memory, say size $S$, arrives, the allocator first determines the necessary block size. It rounds $S$ up to the next power of two, let's call it $B = 2^k$. For example, a request for 180,000 bytes would be rounded up to $2^{18} = 262{,}144$ bytes.

The allocator then looks for a free block of this exact size. If one is available, great! It's handed over. But what if there isn't? This is where the splitting begins. The allocator finds the smallest available free block that is larger than the required size $B$. Let's say it finds a block of size $2^j$ where $j > k$. It takes this block and splits it in half, creating two "buddy" blocks of size $2^{j-1}$. One of these buddies is placed on the free list for its size, while the other is taken for the next step. This process repeats—split, put one buddy on the free list, take the other—cascading down the levels of the tree until a block of the perfect size $2^k$ is carved out. The process is deterministic and wonderfully efficient; it's like navigating from the root of the tree down to a specific leaf.

Deallocation: The Magic of Coalescing

The real genius of the buddy system reveals itself when memory is returned. A naive system might just mark the returned block as "free," leading to a fragmented landscape of small, useless chunks. The buddy system is smarter. When a block is freed, it immediately looks for its buddy—the other half it was split from.

If the buddy is also free, they instantly coalesce, or merge, reforming their original parent block of double the size. But it doesn't stop there. This newly formed parent block then looks for its buddy. If that buddy is also free, they too merge. This chain reaction, a cascade of coalescing, continues up the hierarchy as far as possible, aggressively reassembling larger and larger contiguous blocks of free memory. The system constantly strives to heal the fragmentation it creates, aiming to restore the memory to its initial, pristine state of a single large free block.

The Secret of the Buddy: A Bit of Magic

This all sounds wonderful, but it hinges on one crucial question: how does a block find its buddy? Does it need to consult a complex directory or search through long lists? The answer is a piece of computational poetry and the reason the buddy system is so beloved for its elegance. The address of a buddy can be found with a single, lightning-fast operation: the bitwise exclusive-or (XOR).

For a block of size $2^k$ at address $A$, its buddy's address is simply $A \oplus 2^k$.

Why does this work? Remember the tree structure. When a parent block at some address is split, it creates two children. One child's address is the same as the parent's, and the other's is offset by the size of the child block. For example, a 64 KiB block at address 0 splits into a 32 KiB block at address 0 and another at address 32768. In binary, their addresses differ by exactly one bit—the bit corresponding to the value 32768. The XOR operation simply flips this one specific bit, instantly jumping from a block's address to its sibling's in the tree. This is not just a clever hack; it's a deep reflection of the system's binary-tree nature, implemented with the most fundamental operations of a computer.

The Inevitable Trade-offs: A Tale of Two Fragmentations

No system is perfect, and the buddy system's elegant discipline comes at a cost, which manifests in two forms of fragmentation.

Internal Fragmentation: The Price of Rounding Up

The first cost is obvious. When we round up a request of size $S$ to the next power of two, $B = 2^k$, the space $B - S$ inside the allocated block is wasted. This is called internal fragmentation. It seems like this could be a huge problem. What if someone requests $2^{k-1} + 1$ bytes? We give them a block of size $2^k$, and nearly half of it is wasted!

Remarkably, the situation is never worse than that. By the very nature of the power-of-two rounding, the requested size $S$ is always greater than half the size of the block it's given ($S > B/2$). This leads to a simple and powerful guarantee: the fraction of wasted space, $(B-S)/B$, is always less than $0.5$, or 50%. While some models show the average fragmentation is even lower, around 25%, this worst-case guarantee of less than 50% is a cornerstone of the system's predictability.

External Fragmentation: The Checkerboard of Doom

The second, more subtle, cost is external fragmentation. This occurs when there is enough total free memory to satisfy a request, but it's scattered in small, non-contiguous chunks. The buddy system's aggressive coalescing is designed to fight this. But can it be defeated?

Consider this thought experiment. Imagine we fill the entire memory with blocks of the smallest possible size, say 16 bytes. Now, we go through and free every other block, creating a "checkerboard" pattern of allocated and free 16-byte blocks. What happens? Each free block looks for its buddy to coalesce. But its buddy is the adjacent 16-byte block, which, by our design, is still allocated! Coalescing fails everywhere.

The result is a catastrophe of fragmentation. Half the total memory is free, but it exists entirely as isolated 16-byte blocks. The system has plenty of free space, but if a request for just 32 bytes arrives, it will fail. This pathological case illustrates the fundamental limitation: the allocator's ability to satisfy a request depends not on the total free memory, but on the size of the largest single contiguous free block.

From Ideal Model to Messy Reality

This beautiful, self-contained model of memory as a perfect power-of-two-sized block is an idealization. Real-world physical memory is often messy. It can have "holes"—regions reserved for hardware that the operating system cannot use. This breaks the single contiguous block assumption.

Does this mean we must abandon our elegant buddy system? Not at all. The principle is robust. Instead of managing one giant memory pool, the OS can run an independent buddy system for each contiguous zone of physical memory. A zone of 64 MiB gets its own buddy allocator, as does a separate 32 MiB zone elsewhere. Coalescing is confined within each zone's boundaries. The buddy-finding logic might need a small tweak to work with relative addresses within each zone, but the core XOR principle remains. The elegant tree-based division of space proves to be a flexible and powerful concept, capable of being applied to the disjointed regions of a real computer's memory, turning a complex problem into a collection of simpler, solvable ones.

Applications and Interdisciplinary Connections

Having journeyed through the elegant mechanics of the buddy system—its recursive splits and eager coalescing—one might be tempted to view it as a tidy, self-contained piece of algorithmic art. But its true beauty, like that of any fundamental principle in science, lies not in its isolation but in its profound and often surprising connections to the world around it. The buddy system is not just an algorithm; it is a recurring answer to a fundamental question: how do we impose a simple, predictable order on a resource that is constantly being broken apart and pieced back together? This simple idea echoes through the layers of modern computing, from the bare metal of hardware interfaces to the abstract realms of programming languages and beyond.

The Last Mile: Bridging Virtual Dreams and Physical Reality

In our modern computing world, we are blessed with the beautiful illusion of virtual memory. Each program believes it has an enormous, pristine, and perfectly contiguous stretch of memory all to itself. The Memory Management Unit (MMU) works tirelessly behind the scenes, like a master translator, patching together scattered fragments of physical memory to uphold this magnificent lie. A contiguous block of virtual addresses your program sees might correspond to a chaotic jumble of physical page frames scattered far and wide.

But this illusion has its limits. Some hardware components, particularly high-performance devices like network cards or image signal processors, are not privy to the CPU's sophisticated games. For speed and simplicity, they often perform Direct Memory Access (DMA), writing directly to physical memory. If such a device doesn't have its own translator (an IOMMU), it needs a buffer that is contiguous not in the virtual dreamscape, but in cold, hard physical reality.

Suddenly, the OS is faced with a difficult task. The memory, which looks like a smooth road to the CPU, is in fact a collection of disconnected islands to the DMA device. How can the OS find a single, unbroken chain of, say, sixteen pages (64 KiB) for a high-definition video frame when its free memory is a patchwork of single-page gaps? This is the buddy system's most classic and vital role. It is the specialist called in to combat external fragmentation—the state where plenty of free memory exists in total, but none of it is in one place. By constantly trying to merge adjacent free blocks, the buddy system strives to maintain larger contiguous regions, increasing the chance that a request for a large, physically contiguous buffer can be satisfied. Of course, it's not a magic bullet; on a system that has been running for a long time, finding a large contiguous block becomes increasingly unlikely, but the buddy system gives the OS a fighting chance. For truly critical tasks, modern systems even employ specialized techniques like the Linux Contiguous Memory Allocator (CMA), which reserves a large region of memory at boot and manages it with buddy-like principles to guarantee that huge contiguous blocks remain available for devices like 4K video cameras.

The Art of Partnership: Engineering Real-World Allocators

The buddy system is a master of managing large, page-aligned blocks. But what about the countless tiny data structures a kernel needs—network packet headers, file descriptors, process control blocks? Using a buddy allocator to hand out a 4 KiB page for a 60-byte object would be fantastically wasteful. This is where we see the buddy system not as a lone hero, but as a crucial partner in a two-level strategy.

Imagine a workload that alternates between needing a large, contiguous DMA buffer and then suddenly needing thousands of small objects. The buddy system handles the large request. For the small objects, the kernel employs a slab allocator. The slab allocator's genius is to request a reasonably large chunk of memory—a "slab" of one or more pages—from the buddy system. It then acts like a precision machinist, carving this slab into many small, perfectly-sized slots for the tiny objects it manages. This partnership is beautiful: the buddy system manages the coarse-grained, physical page landscape, while the slab allocator handles the fine-grained world within those pages, eliminating internal fragmentation for small objects.

This principle of a hybrid approach is so powerful that it's a standard pattern in general-purpose heap allocators used by programs every day. An allocator like jemalloc might use one strategy, such as segregated free lists, for small requests (e.g., under 512 bytes) and switch to a buddy-like algorithm for larger requests. This is a masterful piece of engineering, recognizing that there is no single "best" allocator, only the right tool for the right job, combined into a harmonious whole.

Echoes in a Wider World: Data Structures and Languages

The influence of the buddy system's power-of-two nature extends far beyond the kernel, creating subtle but important interactions with the data structures we build in our applications. Consider the humble dynamic array, the foundation of C++'s std::vector or Python's list. A common strategy for these structures is to double their capacity whenever they run out of space.

What happens when this doubling strategy meets a buddy allocator? Suppose a dynamic array needs to grow and requests a new block of memory. The array's size doubles, a power-of-two growth. The buddy allocator provides blocks in power-of-two sizes. This can lead to a wonderful synergy, where the array's needs align perfectly with the allocator's offerings, minimizing waste. However, a slight mismatch—say, an array of 8-byte elements growing to a capacity that requires just over 1024 bytes—can force the buddy system to allocate a 2048-byte block, creating significant internal fragmentation. The performance of a high-level data structure, including its amortized cost of adding elements, is thus invisibly tied to the behavior of the low-level allocator it runs on.

This connection becomes even more profound in the world of managed languages like Java, C#, or Go. Their runtimes feature sophisticated garbage collectors (GCs) that automatically manage memory. A common strategy is to separate objects by size. Small objects might be handled by a "copying" collector that moves them around to compact memory. But copying very large objects (e.g., a multi-megabyte image or a machine learning model) is prohibitively expensive. Instead, these are often placed in a special "Large Object Space" (LOS) where they are allocated and never moved. And what is the perfect tool for managing a memory region for large, variably-sized, immovable blocks? The buddy system. It provides the allocation and deallocation services for the LOS, coexisting with the rest of the GC machinery in a complex, finely-tuned dance to keep the application running smoothly.

The Modern Frontier: Taming New and Exotic Hardware

The principles of the buddy system, conceived decades ago, have proven remarkably resilient and are now being applied to solve the challenges of the most advanced hardware.

Take the Graphics Processing Unit (GPU). A GPU contains dozens of processing cores, each with its own tiny, ultra-fast "scratchpad" memory. To achieve maximum performance, this precious resource must be carefully partitioned among the many threads (warps) running in parallel. A buddy system is a natural choice for this partitioning. Here, the consequences of the allocator's behavior are immediate and dramatic. If a warp requests 0.9 KiB of memory, the buddy allocator might round this up to a 1 KiB block. This seemingly small amount of waste—the internal fragmentation—when multiplied across hundreds of warps, can mean that fewer thread blocks can fit onto the GPU's cores. This directly lowers the GPU's "occupancy," a critical metric for hiding latency and achieving high throughput. The abstract concept of fragmentation becomes a concrete bottleneck in high-performance parallel computing.

Or consider the rise of heterogeneous memory systems. A modern server might have a small amount of ultra-fast DRAM paired with a vast amount of slightly slower but cheaper Non-Volatile Memory (NVM). The operating system's job is to act as a smart traffic cop, ensuring that "hot," frequently accessed data resides in DRAM, while "cold," inactive data is moved to NVM. To do this, the OS needs to manage the free space in both pools. And so we find separate buddy allocators, one for DRAM and one for NVM, working in concert. When a new, hot object arrives and DRAM is full, the OS can consult its buddy allocators, perform a cost-benefit analysis of migrating a cold object from DRAM to an available NVM block, and then use the newly freed DRAM block for the hot object. The buddy system becomes a key enabler in a sophisticated, system-wide optimization engine.

In this journey from the physical needs of a DMA controller to the performance tuning of a GPU, we see the buddy system for what it is: a simple, powerful idea with the flexibility to adapt. It is a testament to the enduring beauty of computer science, where an elegant solution to one problem can ripple outwards, providing clarity and order in countless other contexts, revealing the deep and satisfying unity of the field.