
Managing a computer's memory is like working with a block of clay; as pieces are requested and returned, the once-pristine block becomes a collection of small, unusable chunks. This problem, known as fragmentation, can cripple a system's performance by making it impossible to find large, contiguous blocks of memory even when plenty is available in total. To combat this chaos, computer science developed elegant strategies, one of the most fundamental being the buddy allocator. It imposes a rigid but remarkably efficient discipline that has become a cornerstone of modern computing.
This article delves into the beautiful simplicity of the buddy allocator. In the following chapters, we will first explore its "Principles and Mechanisms," uncovering how its strict power-of-two rule, cascading splits, and a clever bitwise trick enable fast allocation and deallocation. We will then transition to its "Applications and Interdisciplinary Connections," revealing how this core algorithm functions as a vital component within operating system kernels, GPUs, and other complex systems, proving that sometimes, the most powerful solutions are born from elegant constraints.
Imagine you are given a large, pristine block of sculptor's clay. One person asks for a small piece, another for a medium one, and a third for a large, oddly-shaped chunk. You cut them off. Later, they return their pieces. Now you're left with a collection of used, misshapen lumps. When the next person asks for a large piece, you might have enough clay in total, but it's all in small, separate chunks. You can't just glue them back together seamlessly. This is the classic problem of fragmentation, a plague for any system that needs to manage a resource, especially a computer's memory.
The Buddy Allocator is a beautifully simple, almost ruthlessly disciplined approach to solving this problem. It doesn't try to accommodate every arbitrary request. Instead, it imposes a strict set of rules that, at first glance, seem wasteful. But in this rigidity, we find a profound elegance and a remarkable efficiency that has made it a cornerstone of modern operating systems.
The buddy system's foundational rule is radical in its simplicity: all blocks of memory must have a size that is a power of two. A system might manage a total memory pool of $2^m$ bytes, and the only permissible block sizes are $2^0, 2^1, 2^2, \dots, 2^m$. There are no blocks of size 37, 100, or 500. Everything is regimented.
Initially, the entire memory is a single, large free block of order $m$ (size $2^m$). When a smaller block is needed, this large block isn't just arbitrarily carved. It is split perfectly in half, creating two "buddy" blocks of order $m-1$. If a still smaller block is needed, one of these buddies is itself split in half, and so on. This creates a natural binary tree structure, where every block (except the smallest possible) has a parent and two children.
This rigid, hierarchical structure is the key. It trades flexibility for order, and in that order, it finds its power.
Let's see this discipline in action. Suppose a program requests a block of memory of size $s$. The buddy allocator's first step is to be pessimistic. It knows it can't provide a block of exactly size $s$ unless $s$ happens to be a power of two. So, it rounds the request up to the next power of two. A request for 100 bytes, for example, is treated as a request for a 128-byte block ($128 = 2^7$). In some systems, a small metadata overhead is also added before rounding, to store information about the block itself.
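This rounding step can be sketched in a few lines of Python. The function name is our own, and a real allocator would do this in C, but the bit trick is the same:

```python
def round_up_to_power_of_two(n: int) -> int:
    """Smallest power of two >= n, for n >= 1."""
    # (n - 1).bit_length() is the number of bits needed to represent n - 1,
    # so shifting 1 left by that amount yields the next power of two.
    return 1 << (n - 1).bit_length()

print(round_up_to_power_of_two(100))  # 128
print(round_up_to_power_of_two(128))  # 128 (exact powers of two pass through)
print(round_up_to_power_of_two(1))    # 1
```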
Let's say the rounded-up size is $2^k$. The allocator consults its free lists, which are simply lists of available blocks, neatly organized by their order (size). Is there a free block of order $k$? If so, wonderful! The allocator hands it over, and the dance is done.
But what if there isn't? The allocator then looks for a free block of order $k+1$. If it finds one, it performs a single, clean split. The order-$(k+1)$ block is cleaved into two order-$k$ buddies. One is given to the requester, and its twin, its buddy, is placed on the free list for order $k$.
If there's no block of order $k+1$ either, it looks for one at order $k+2$ and splits it, creating a free block of order $k+1$. Then it splits that block to finally get the desired order-$k$ block. This creates a beautiful cascade of splits, starting from the smallest available block that is large enough and proceeding downwards until the perfect size is achieved. The process is deterministic and predictable: always split the lowest-address block, and keep splitting the lower-address half, putting the higher-address half on the free list.
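The search-then-split cascade can be sketched with a toy model in Python. The pool size, the list-per-order layout, and the names here are illustrative assumptions, not any particular kernel's implementation:

```python
# Toy model: free_lists[k] holds start addresses of free blocks of size 2**k.
MAX_ORDER = 10  # one 1024-byte pool, an assumed size for this sketch
free_lists = {k: [] for k in range(MAX_ORDER + 1)}
free_lists[MAX_ORDER] = [0]  # initially a single free block covers everything

def allocate(order: int) -> int:
    """Return the address of a free 2**order block, splitting as needed."""
    # Find the smallest free block that is large enough.
    k = order
    while k <= MAX_ORDER and not free_lists[k]:
        k += 1
    if k > MAX_ORDER:
        raise MemoryError("no block large enough")
    addr = free_lists[k].pop()
    # Cascade of splits: halve until we reach the requested order, keeping
    # the lower-address half and freeing the higher-address half each time.
    while k > order:
        k -= 1
        free_lists[k].append(addr + (1 << k))  # upper half goes on the free list
    return addr

a = allocate(4)       # request one 16-byte block from the fresh pool
print(a)              # 0: the lower-address half survives every split
print(free_lists[4])  # [16]: its buddy, freed during the cascade
```

Notice how a single request leaves one free buddy on every order's list between the requested size and the original pool, exactly the cascade described above.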
So, we've split blocks, creating a diaspora of siblings scattered across memory. How do we ever put them back together? This is where the true genius of the buddy system shines, a trick of pure binary arithmetic.
For any block of size $2^k$ starting at an address $a$, its buddy's address is given by a startlingly simple formula:
$$\text{buddy\_address} = a \oplus 2^k$$
Here, $\oplus$ is the bitwise Exclusive OR (XOR) operation. It’s a piece of computational magic. Why does this work? Because of the alignment invariant: a block of size $2^k$ must always start at an address that is a multiple of $2^k$. This means the lower $k$ bits of its address are all zero. The number $2^k$ is just a 1 in the $k$-th bit position, followed by $k$ zeros.
When you XOR the address with $2^k$, you are simply flipping the $k$-th bit of the address. If that bit was 0, it becomes 1. If it was 1, it becomes 0. This instantly jumps you to the address of the other block that was created from the same parent. Finding a block's buddy isn't a search; it's a single, constant-time calculation. This is the heart of the buddy system's efficiency.
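A minimal sketch of the trick, with addresses as plain integers:

```python
def buddy_of(addr: int, order: int) -> int:
    """Flip bit `order` of a block's address to get its buddy's address."""
    return addr ^ (1 << order)

# Two order-4 (16-byte) blocks carved from the same 32-byte parent at 32:
print(buddy_of(32, 4))  # 48: the buddy is the upper half of the parent
print(buddy_of(48, 4))  # 32: the relation is symmetric, as XOR always is
```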
Armed with this XOR trick, the process of freeing memory, called coalescing, becomes an elegant reunion dance. When a block of order $k$ at address $a$ is freed, the allocator doesn't just add it to a list. It first asks a critical question: "Is my buddy also free?"
It calculates the buddy's address, $a \oplus 2^k$, and checks the free list for order $k$. If the buddy is there, a merge occurs! The allocator removes the buddy from the free list, and the two are fused back into their single parent block of order $k+1$.
But the dance doesn't stop. This new, larger block might also have a free buddy. So the process repeats: the allocator calculates the new block's buddy address and checks again. This recursive coalescing continues up the hierarchy until it finds a buddy that's still in use or it has merged all the way back to the single, original block of memory.
This eager, "maximal coalescing" guarantees a crucial invariant: there can never be two free blocks that are buddies. If such a pair could exist, the free operation would have already merged them. This constant battle against fragmentation is built into the very fabric of the algorithm.
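Putting the XOR trick and the reunion loop together, a toy version of the free path might look like this (the data structures and names are our own, not a real kernel's):

```python
# Toy model: free_lists[k] holds addresses of free blocks of size 2**k.
MAX_ORDER = 10
free_lists = {k: set() for k in range(MAX_ORDER + 1)}

def free_block(addr: int, order: int) -> None:
    """Free a block, eagerly merging with its buddy as long as possible."""
    while order < MAX_ORDER:
        buddy = addr ^ (1 << order)
        if buddy not in free_lists[order]:
            break                        # buddy still in use: stop merging
        free_lists[order].remove(buddy)  # reunion: fuse into the parent
        addr = min(addr, buddy)          # parent starts at the lower address
        order += 1                       # and is one order larger
    free_lists[order].add(addr)

free_block(0, 4)    # free one 16-byte block: its buddy is busy, so it waits
free_block(16, 4)   # free the buddy: they merge into one 32-byte block
print(free_lists[4], free_lists[5])  # set() {0}
```

The loop enforces the invariant from the text: when it terminates, no two free buddies can remain on the same list.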
This beautiful system is not without its cost. The rigid rule of power-of-two sizes leads to a specific kind of waste: internal fragmentation. When a program requests 65 bytes, it gets a 128-byte block. The leftover 63 bytes inside that allocated block are wasted. They've been handed to the program, but the program didn't ask for them.
How bad can this waste be? At first glance, it seems like it could be terrible. But a simple argument reveals a surprisingly tight bound. Consider a request for size $s$, which is given a block of size $2^k$. If a smaller block size, say $2^{k-1}$, had been sufficient, the allocator would have used it. Therefore, for the allocator to have chosen block size $2^k$, the request size $s$ must have been larger than $2^{k-1}$, the next size down.
The wasted space is $2^k - s$. Because $s$ is always strictly greater than $2^{k-1}$, the wasted space, $2^k - s$, must always be strictly less than $2^k - 2^{k-1} = 2^{k-1}$. The fractional waste is $(2^k - s)/2^k$, which must be less than $2^{k-1}/2^k = 1/2$.
So, no matter what, the buddy system guarantees that for any single allocation, the amount of internal fragmentation will never reach 50% of the block's size. The worst-case scenario is a request for a size just over a power of two, like $2^{k-1} + 1$, which wastes almost half the block.
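The bound is easy to check empirically with a small script (the helper name is ours):

```python
def waste_fraction(request: int) -> float:
    """Fraction of the allocated power-of-two block left unused."""
    block = 1 << (request - 1).bit_length()  # round up to the next power of two
    return (block - request) / block

print(waste_fraction(65))   # 0.4921875: just over a power of two, near-worst case
print(waste_fraction(128))  # 0.0: an exact power of two wastes nothing

# Exhaustive check over a range of request sizes: always strictly under 50%.
print(max(waste_fraction(s) for s in range(1, 4097)) < 0.5)  # True
```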
The buddy allocator's primary goal is to fight external fragmentation—the problem of free memory being scattered in small, non-contiguous chunks. And it does a good job. But it cannot eliminate it entirely.
Imagine a scenario where we fill the entire memory with the smallest possible blocks, say size 16 bytes. Now, we free every other block. We have freed exactly half the memory! But look at what's left: a checkerboard pattern of allocated and free 16-byte blocks. Now, if we ask for a block of size 32 bytes, the request will fail. Why? Half of all memory is free, but every free block's buddy is still allocated. No coalescing can occur. The largest contiguous block we can offer is 16 bytes.
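The checkerboard scenario is easy to verify with the XOR trick: every free block's buddy lands on an allocated block, so no merge is ever possible. This toy script assumes a 256-byte pool of 16-byte blocks:

```python
# Checkerboard: a 256-byte pool fully carved into 16-byte blocks, then
# every other block freed. Half the memory is free, yet nothing coalesces.
BLOCK = 16
free_addrs = {addr for addr in range(0, 256, 2 * BLOCK)}  # 0, 32, 64, ...

# Every free block's buddy (address XOR 16) is one of the allocated blocks:
print(all((addr ^ BLOCK) not in free_addrs for addr in free_addrs))  # True

# So a 32-byte request must fail, even though half the pool is free:
print(len(free_addrs) * BLOCK)  # 128 bytes free, largest usable block: 16
```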
This is a pathological case, but it demonstrates a fundamental truth: even with the buddy system's clever coalescing, it's possible to have a large amount of "unusable-free" memory. The allocator's strict rule that only buddies can merge prevents it from combining two adjacent free blocks that happen to be strangers from different parent splits.
So, with these trade-offs, where does the buddy allocator fit? Its speed and predictable behavior make it a star performer inside the operating system kernel. When the kernel needs memory, it needs it fast. The worst-case number of splits or merges is directly proportional to the difference in orders between the block found and the block requested (at most $m$ for a pool of order $m$), which bounds the latency of any single operation. To guarantee responsiveness, some advanced systems even defer parts of the splitting and merging process to background threads, capping the maximum time any single operation can take.
Perhaps its most vital role is in managing physical memory for virtual memory systems. Your computer's CPU lives in a world of clean, contiguous virtual addresses. But the physical memory (RAM) is a chaotic place. The OS uses the buddy allocator to manage the physical page frames (e.g., 4 KiB chunks). When your program asks for a large, 48 KiB block of contiguous virtual memory, the OS can satisfy this by finding 12 physical pages from the buddy allocator. These pages might be scattered all over the physical RAM (e.g., at frames 100, 305, 101, ...).
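A toy page table makes the scattering concrete. The frame numbers below are invented for illustration, and real page tables are hardware-defined structures rather than Python dictionaries, but the arithmetic of the translation is the same:

```python
# A contiguous 48 KiB virtual region backed by 12 scattered physical frames.
PAGE = 4096
frames = [100, 305, 101, 87, 512, 9, 44, 260, 13, 700, 2, 31]  # illustrative
page_table = {vpage: frame for vpage, frame in enumerate(frames)}

def translate(vaddr: int) -> int:
    """Roughly what an MMU does: split vaddr into page number and offset."""
    vpage, offset = divmod(vaddr, PAGE)
    return page_table[vpage] * PAGE + offset

print(translate(0))     # 409600: virtual page 0 lives in physical frame 100
print(translate(4096))  # 1249280: the next virtual page jumps to frame 305
```

The program sees addresses 0 through 49151 as one smooth run; only the page table knows how scattered the physical reality is.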
The Memory Management Unit (MMU), a piece of hardware, translates the CPU's neat virtual addresses into the messy physical reality on the fly, completely hiding the physical fragmentation from the program. However, other parts of the system, like a network card using Direct Memory Access (DMA), often speak in physical addresses. Without special hardware (an IOMMU), that device would see the scattered pages and fail, forcing the OS to find physically contiguous blocks or perform expensive data copying to "bounce buffers".
This illustrates the beautiful layering of modern computer systems. The buddy allocator, with its simple but powerful rules, provides a foundational layer of order in the chaotic world of physical memory, enabling the seamless illusions of virtual memory that all modern software depends on. It's a perfect example of a deep, beautiful idea in computer science: by embracing a strict, elegant constraint, we can conquer a world of complexity.
After our journey through the elegant mechanics of splitting and coalescing, you might be tempted to file the buddy allocator away as a clever but niche algorithm. Nothing could be further from the truth! Its simple, powerful idea—managing space by halving and doubling—is not just an academic curiosity. It is a fundamental building block, a recurring pattern that nature, or at least computer science, seems to love. Like the standardized bricks a mason uses to construct everything from a simple wall to a grand cathedral, the buddy allocator’s power-of-two blocks provide the orderly foundation for some of the most complex software systems we have ever built.
Let's explore where this beautiful idea comes to life. We will see that its applications are not just numerous, but are woven into the very fabric of modern computing, from the core of the operating system to the frontiers of high-performance and future hardware.
The most natural home for the buddy allocator is at the very bottom of the software stack: the operating system kernel, managing the physical memory of the machine. The memory in your computer is a vast sea of billions of bytes, but the OS prefers to think in terms of "pages"—fixed-size chunks, often 4 KiB apiece. The buddy system is the perfect tool for doling out these pages, providing power-of-two runs of pages as needed.
But here, we immediately run into a profound challenge: the tyranny of contiguity. Imagine you need a very large, single, unbroken block of memory—say, for a high-resolution video frame or a buffer for a hardware device. The buddy allocator can, in principle, provide this by coalescing many small free blocks into a large one. But what if a single, tiny, unmovable page is allocated right in the middle of a vast expanse of free space? The buddy system’s rigid rules, which rely on specific address alignments, are thwarted. The two large free regions on either side of the offending page are not "buddies" and can never be merged. This is the specter of external fragmentation: you have enough total free memory, but it's in so many small, non-adjacent pieces that your large request fails.
This isn't just a theoretical worry. A system under load is a dynamic and messy place. Small, long-lived kernel data structures can act like immovable rocks in a stream, preventing the formation of large, contiguous free regions. A classic example is a "pinned" page that a hardware device is using for Direct Memory Access (DMA); the OS dares not move it. In such a scenario, the buddy allocator alone may be powerless to create a large block, even if 99% of the memory is free.
So, how does the OS escape this tyranny? It brings in more powerful tools. One is memory compaction, a process that is like playing a game of Tetris with the computer’s memory. The OS painstakingly moves all the "movable" pages into one contiguous region, squeezing out the free space into a single, large block. This is a powerful but expensive operation, a last resort for when the buddy system's simple coalescing fails.
A more proactive approach is reservation. For certain critical tasks, like providing a large buffer for a legacy hardware device that cannot handle fragmented memory, the OS can reserve a large pool of memory at boot time. This memory can be "lent out" for other uses, but its pages can always be reclaimed to form the guaranteed contiguous block when needed. This is the principle behind mechanisms like the Contiguous Memory Allocator (CMA) found in the Linux kernel—a safety net that ensures the buddy allocator's fragmentation doesn't prevent critical hardware from functioning.
We've seen that the buddy allocator struggles with external fragmentation caused by small, persistent allocations. We've also seen that its power-of-two rounding rule can be wasteful for very small requests: allocating a 4096-byte page for a 64-byte object is not exactly efficient! This suggests a beautiful division of labor. Why not use a different tool for small objects?
This is the insight behind the slab allocator. The buddy allocator remains the manager of the coarse-grained resource: pages. But instead of servicing every small request itself, it hands over entire pages (or small, contiguous groups of pages) to the slab allocator. The slab allocator then treats this page like a block of clay, expertly carving it up into many small, fixed-size slots perfectly tailored for objects like file descriptors or network packet headers.
This two-level hierarchy is a masterpiece of symbiotic design. The slab allocator solves the internal fragmentation problem for small objects and, by containing them within their own slabs, prevents them from littering the global page pool. This, in turn, helps the buddy allocator by reducing the external fragmentation it has to deal with, making it easier to find and coalesce large blocks.
The relationship can even be dynamic and intelligent. Imagine the slab allocator needs a new page. It could ask the buddy system for a 2-page block or two 1-page blocks. Which is better? If the system is low on 2-page blocks, forcing the buddy allocator to split an even larger block might be a bad idea, as it could prevent a future, genuinely large request from succeeding. Using OS-level heuristics like "low watermarks" on the free lists, the slab allocator can make an intelligent request, perhaps asking for a 1-page block instead to avoid stressing the system. This is a constant, quiet dialogue between the layers of the memory manager, all working to keep the system in a healthy, unfragmented state.
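A hedged sketch of such a heuristic, with an invented threshold (real kernels use far more elaborate watermark logic than this):

```python
# Sketch of the watermark idea described above: prefer a 2-page block, but
# fall back to a 1-page request when order-1 blocks are running low.
LOW_WATERMARK = 2  # an assumed threshold, not taken from any real kernel

def pick_request_order(free_counts: dict) -> int:
    """free_counts[k] = number of free blocks of order k (2**k pages)."""
    if free_counts.get(1, 0) > LOW_WATERMARK:
        return 1  # plenty of 2-page blocks: take one
    # Order-1 blocks are scarce; a 2-page request would force the buddy
    # allocator to split a larger block. Settle for a single page instead.
    return 0

print(pick_request_order({0: 50, 1: 10}))  # 1: 2-page blocks are plentiful
print(pick_request_order({0: 50, 1: 1}))   # 0: spare the larger blocks
```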
The buddy system's principles are so fundamental that they have found fertile ground far beyond the confines of the OS kernel.
One of the most exciting arenas is in Graphics Processing Units (GPUs). A modern GPU is a parallelism monster, with thousands of threads running concurrently. These threads communicate and cooperate using a tiny, extremely fast on-chip memory called "shared memory." Allocating this precious resource efficiently is critical for performance. The buddy allocator, with its speed and simplicity, is a fantastic candidate for this job. Here, the abstract concept of fragmentation has a direct, measurable cost. Every byte wasted to the allocator's power-of-two rounding is a byte that can't be used by another thread. This directly reduces the number of thread blocks that can run simultaneously on the hardware—a key metric called occupancy. Lower occupancy often means lower performance. Thus, in the world of high-performance computing, the buddy allocator's behavior is not just a matter of memory efficiency, but of computational throughput.
Let's jump to another world: managed language runtimes, like those for Java or C#. These systems feature automatic garbage collectors (GC) that reclaim memory from objects that are no longer in use. Many high-performance collectors move objects around to compact memory, a strategy called copying or compacting GC. However, copying very large objects is prohibitively expensive. The solution? A hybrid approach. Small objects live in a region managed by a moving collector, while giant objects are placed in a special "Large Object Space" (LOS) where they are never moved. And what is the perfect allocator for a space of large, immovable objects of varying sizes? The buddy system, of course! Its ability to handle variable-sized large blocks and combat external fragmentation through coalescing makes it an ideal choice for managing the giants of the managed heap.
Finally, let's look to the near future of computer architecture: heterogeneous memory systems. Your next computer might have multiple kinds of memory—a small amount of ultra-fast DRAM and a larger pool of slightly slower but persistent Non-Volatile Memory (NVM). How does an OS manage such a hierarchy? A common strategy is to run a separate buddy allocator for each memory pool. A high-level policy engine then acts as a traffic cop. It observes which data is "hot" (frequently accessed) and which is "cold." When the fast DRAM is full and a new, hot object arrives, the OS can choose a cold object, migrate it to the slower NVM, and place the new hot object in the freshly freed DRAM space. This decision is a complex cost-benefit analysis, weighing access latencies against migration costs. In this sophisticated dance, the buddy allocators provide the underlying mechanism for managing the space within each tier, enabling the OS to make intelligent, performance-optimizing placement decisions.
From its humble origins as a way to organize memory, the buddy allocator has proven to be an astonishingly versatile and enduring idea. Its elegance lies not in solving every problem perfectly, but in providing a simple, fast, and predictable foundation upon which more complex and intelligent systems can be built. It is a testament to the power of a good idea, a simple pattern of splitting and joining that brings a necessary order to the beautiful chaos of computation.