Popular Science

Systolic Arrays

Key Takeaways
  • Systolic arrays are parallel computing architectures that solve the memory bottleneck by rhythmically pumping data through a grid of simple processing elements.
  • Classified as Single Instruction, Multiple Data (SIMD), they achieve massive parallelism by having all processors execute the same command on different data streams.
  • The architecture's design emphasizes data reuse, where a single data element is used in multiple computations as it flows through the array, drastically improving arithmetic intensity and efficiency.
  • Systolic arrays are highly effective for applications with regular, repetitive computations like matrix multiplication in AI, convolutions in DSP, and sequence alignment in bioinformatics.

Introduction

In the quest for greater computational power, modern computing faces a fundamental obstacle: the "memory bottleneck," where powerful processors lie idle while waiting for data. Traditional von Neumann architectures, with their centralized processing units, are increasingly strained by the demands of data-intensive tasks. This article explores a revolutionary solution: the systolic array, a parallel processing architecture inspired by the rhythmic pumping of the human heart. Instead of a single powerful processor, systolic arrays use a grid of simple, synchronized processing elements through which data flows, enabling massive parallelism and unprecedented efficiency.

This exploration is divided into two main sections. In "Principles and Mechanisms," we will dissect the core workings of systolic arrays, from their rhythmic data flow and classification as SIMD machines to the concepts of data reuse and computational wavefronts that make them so powerful. We will also examine practical design challenges, including hardware utilization and network topology. Following this, the "Applications and Interdisciplinary Connections" section will reveal how this architectural paradigm has become the driving force behind breakthroughs in diverse fields. We will see how the simple "multiply-accumulate" pattern at the heart of systolic computation powers modern artificial intelligence, digital signal processing, genomic analysis, and robotics, while also understanding the types of problems for which this architecture is not a good fit.

Principles and Mechanisms

Imagine trying to build a machine that can perform a colossal number of calculations, like multiplying two enormous matrices. The traditional approach, pioneered by John von Neumann, is to have a powerful, centralized processor—a single, brilliant mind—that fetches data from a vast library (memory), performs a calculation, and writes the result back. This works wonderfully, but when the task is immense, our brilliant mind spends most of its time waiting for books to be delivered from the library. This "memory bottleneck" is one of the great limiting factors in modern computing.

Systolic arrays propose a radically different, and altogether beautiful, solution. Instead of one brilliant mind, why not have an army of simple, synchronized workers? And instead of having them run back and forth to the library, why not have the data flow past them on a conveyor belt? This is the core idea. The architecture is named after the systole of the heart, the rhythmic contraction that pumps blood through the body. In a systolic array, it is data that is rhythmically pumped through a grid of simple Processing Elements (PEs).

The Heart of the Machine: A Rhythmic Pulse of Data

Let's make this concrete. Picture a one-dimensional assembly line of workers, our PEs. Each worker has a very simple job. In a common Digital Signal Processing (DSP) task like implementing a Finite Impulse Response (FIR) filter, the job might be to take an input value from the worker to its left, multiply it by a weight stored at its own station, add that product to a partial sum also arriving from the left, and then pass the new sum along to the worker on its right. The input signal itself also moves down the line from left to right, one worker at a time, with each tick of a global clock.

This is precisely the scenario modeled in a digital circuit. Each PE in the line performs two concurrent operations every clock cycle: an input data value $X_{\text{in}}$ is passed along to the next PE, becoming $X_{\text{out}}$, while an accumulating sum $Y_{\text{in}}$ is updated by the product of $X_{\text{in}}$ and the PE's local, static weight $W$. The new sum becomes the output $Y_{\text{out}}$. The operation is defined by the register transfers:

  1. $X_{\text{out}} \leftarrow X_{\text{in}}$
  2. $Y_{\text{out}} \leftarrow Y_{\text{in}} + (X_{\text{in}} \times W)$

If you chain these PEs together, an input signal $X(k)$ entering the first PE will "march" down the line, one PE per clock cycle. At each step, it contributes to a partial sum that is also marching down the line, but accumulating results as it goes. By the time the final PE is reached, it has accumulated the weighted sum of several time-delayed inputs—exactly what an FIR filter does. The data flows, the computation is local and repeated, and the entire structure operates in a perfect, rhythmic lockstep. This is the systolic action: simple, local, and incredibly efficient.
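
To make the rhythm tangible, here is a minimal cycle-accurate simulation of this PE chain in Python (a sketch with invented names, not from the source). One assumption is worth flagging: for the partial sums to line up with the correctly delayed samples, this weight-stationary variant gives the X path two registers per PE, a standard refinement that the simplified description above glosses over.

```python
def systolic_fir(x, w):
    """Cycle-accurate sketch of a weight-stationary systolic FIR filter.
    Per clock, every PE computes Y_out <- Y_in + X_in * W and forwards X.
    Assumption (a classic design variant): the X path has TWO registers
    per PE, so each tap sees the correctly delayed sample."""
    taps = len(w)
    x1 = [0.0] * taps   # first x register of each PE
    x2 = [0.0] * taps   # second x register of each PE
    y = [0.0] * taps    # y register following each PE
    out = []
    # Feed the signal plus enough zeros to drain the pipeline.
    for x_in in list(x) + [0.0] * (2 * taps):
        nx1, nx2, ny = [0.0] * taps, [0.0] * taps, [0.0] * taps
        for i in range(taps):
            xi = x_in if i == 0 else x2[i - 1]   # X_in from the left
            yi = 0.0 if i == 0 else y[i - 1]     # partial sum from the left
            nx1[i] = xi                          # first delay stage
            nx2[i] = x1[i]                       # second delay stage
            ny[i] = yi + xi * w[i]               # Y_out <- Y_in + X_in * W
        x1, x2, y = nx1, nx2, ny                 # one synchronous clock tick
        out.append(y[-1])                        # tap the last PE's output
    return out
```

Comparing against a direct convolution confirms that the chain produces the FIR output with a latency of `taps - 1` cycles: the pipeline must fill before the first valid result emerges.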

An Army of Simpletons: The Power of SIMD

How should we classify this strange and wonderful machine? Computer architects have a useful classification system called Flynn's Taxonomy, which categorizes parallel computers based on their instruction and data streams. A traditional CPU that executes one instruction on one piece of data at a time is Single Instruction, Single Data (SISD). A large supercomputer where many processors run their own programs on their own data is Multiple Instruction, Multiple Data (MIMD).

Where does a systolic array fit? Let's consider a two-dimensional array for matrix multiplication. It's a grid of identical PEs. A single control unit broadcasts the same command—for example, "perform multiply-accumulate"—to every single PE in the array. They all execute this command in lockstep. This is the very definition of a Single Instruction stream.

However, the data each PE operates on is different. At any given moment, a PE at position $(i, j)$ might be multiplying element $a_{i,k}$ with $b_{k,j}$, while its neighbor at $(i, j+1)$ is working on a completely different pair of data elements. Streams of data from the input matrices flow across the rows and down the columns, ensuring that each PE receives a unique sequence of operands over time. We therefore have Multiple Data streams.

Putting it together, the systolic array is a quintessential example of a Single Instruction, Multiple Data (SIMD) architecture. It's not a collection of independent thinkers; it's a highly disciplined army of simple workers all doing the same task, but on the different pieces of the problem that flow to them. This specialization is its strength, trading away generality for massive parallelism on specific tasks.
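
The SIMD lockstep can be simulated directly. The sketch below is illustrative Python with invented names; it assumes the common output-stationary dataflow, in which partial sums stay put while A streams in from the left and B from the top, each skewed by one cycle per row or column.

```python
def systolic_matmul(A, B):
    """Cycle-accurate sketch of an output-stationary systolic array
    multiplying an n x k matrix A by a k x m matrix B. Every PE runs
    the same multiply-accumulate each cycle: one instruction stream,
    many data streams."""
    n, k = len(A), len(A[0])
    m = len(B[0])
    acc = [[0.0] * m for _ in range(n)]     # stationary partial sums
    a_reg = [[0.0] * m for _ in range(n)]   # a value held in PE (i, j)
    b_reg = [[0.0] * m for _ in range(n)]   # b value held in PE (i, j)
    for t in range(k + n + m - 2):          # issue + fill + drain cycles
        new_a = [[0.0] * m for _ in range(n)]
        new_b = [[0.0] * m for _ in range(n)]
        for i in range(n):
            for j in range(m):
                # Row i of A enters at the left edge, delayed i cycles;
                # column j of B enters at the top edge, delayed j cycles.
                a_in = (A[i][t - i] if j == 0 and 0 <= t - i < k
                        else a_reg[i][j - 1] if j > 0 else 0.0)
                b_in = (B[t - j][j] if i == 0 and 0 <= t - j < k
                        else b_reg[i - 1][j] if i > 0 else 0.0)
                acc[i][j] += a_in * b_in    # same op in every PE
                new_a[i][j] = a_in          # forward a to the right
                new_b[i][j] = b_in          # forward b downward
        a_reg, b_reg = new_a, new_b
    return acc
```

Because of the skew, PE $(i, j)$ sees $a_{i,s}$ and $b_{s,j}$ arrive in the same cycle for every $s$, so each accumulator converges to the correct dot product without any global communication.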

The Shape of Computation: Wavefronts and Efficiency

When you turn on a systolic array, it doesn't instantly operate at full capacity. The computation spreads across the grid of PEs like a wave emanating from where the data first enters. This is called the computational wavefront. The time it takes for this wave to reach the furthest PE and get it started on useful work is called the pipeline fill time. For a square $P \times P$ array, this initial setup takes $2P - 2$ cycles. Similarly, once the last inputs have entered the array, it takes time for the final results to be completed by the last PEs and "drain" out. This pipeline drain time is also typically $2P - 2$ cycles.

The total time to compute a tile of data of size $P \times P$ with an inner dimension of $K_t$ isn't just $K_t$ cycles; it's closer to $K_t + 2P - 2$ cycles, accounting for this fill and drain overhead. This means that not all PEs are doing useful work for the entire duration. At the beginning and end, many are idle. This unavoidable inefficiency is a consequence of the array's spatial nature.
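
In code, this tile-time model and its efficiency consequence are one-liners (a sketch; the function names are invented):

```python
def tile_cycles(P, Kt):
    """Cycles to process one tile on a P x P array: Kt issue cycles
    plus 2P - 2 cycles of combined pipeline fill/drain overhead."""
    return Kt + 2 * P - 2

def tile_efficiency(P, Kt):
    """Fraction of the tile's duration spent issuing useful inputs
    rather than filling or draining the pipeline; a rough proxy for
    PE utilization during the tile."""
    return Kt / tile_cycles(P, Kt)
```

For a 128 x 128 array and a tile with $K_t = 128$, the overhead dominates: 382 total cycles, of which only about a third are issue cycles. Deep inner dimensions (large $K_t$) amortize the fill and drain.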

This brings us to a crucial, practical point: what happens when the problem size doesn't perfectly match the hardware size? Imagine you need to multiply two $192 \times 192$ matrices, but your systolic array is a fixed $128 \times 128$ grid. You must break the problem into smaller tiles. Some of these tiles will be full $128 \times 128$ chunks, but at the edges you'll have partial tiles, perhaps $64 \times 128$ or $64 \times 64$. When the hardware processes a partial tile, the PEs that fall outside the active region do no useful work, yet the entire array is still clocked for the full duration of that tile's computation.

This effect directly impacts the overall efficiency, or utilization, of the hardware. The utilization can be expressed as the ratio of the true work area to the "paid for" padded area. For an $r \times c$ problem on an $m \times n$ array, the utilization is given by the elegant formula:

$$U = \frac{rc}{mn \left\lceil \frac{r}{m} \right\rceil \left\lceil \frac{c}{n} \right\rceil}$$

This expression beautifully captures the efficiency loss from this mismatch. It tells us that the effective performance of an accelerator depends not just on its peak speed, but on how well the shape of the problem maps to the shape of the hardware.
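
The formula translates directly into code (a small sketch; the function name is invented):

```python
from math import ceil

def utilization(r, c, m, n):
    """U = rc / (mn * ceil(r/m) * ceil(c/n)): the ratio of the true
    r x c work area to the padded area actually clocked through an
    m x n array."""
    return (r * c) / (m * n * ceil(r / m) * ceil(c / n))
```

For the $192 \times 192$ problem on a $128 \times 128$ array described above, each dimension needs two tiles, so the hardware is clocked as if the problem were $256 \times 256$: $U = 192^2 / 256^2 = 0.5625$. Nearly half the PE-cycles are spent on padding.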

The Art of Juggling: Data Reuse and Beating the Memory Wall

The primary reason for all this architectural effort is to conquer the memory wall—the ever-widening gap between processor speed and memory speed. A systolic array's design is a masterclass in minimizing data movement.

The key principle is data reuse. In many algorithms, like matrix multiplication or convolution, a single input value is needed for many different output calculations. A traditional processor would have to fetch this value from memory again and again. A systolic array, by contrast, fetches the value from main memory just once. It is then passed from PE to PE along a row or column, participating in a new calculation at every step. The on-chip connections between PEs become the primary mechanism for data reuse, almost completely eliminating redundant, expensive off-chip memory accesses.

This rhythmic flow also leads to another profound advantage: performance determinism. A conventional processor, like a DSP, often relies on a cache to hide memory latency. But if the data isn't in the cache (a cache miss), the processor must stall for many cycles while it fetches the data from main memory. Because cache misses can be unpredictable, the performance becomes variable and non-deterministic. A systolic array, however, is designed to be fed by a steady, predictable stream of data. Its performance, once the initial pipeline is filled, is constant and completely predictable, depending only on the array's size and clock frequency, not the whims of memory access patterns. This is a huge benefit for real-time applications and system analysis.

Of course, this steady stream doesn't appear by magic. In a real System-on-Chip (SoC), this is accomplished through careful orchestration. Large problems are broken into tiles that can fit in fast on-chip memory (like BRAMs on an FPGA). While the systolic array is busy computing with the data for Tile N, a Direct Memory Access (DMA) engine works in the background, like a tireless stagehand. It fetches the data for the next tile (Tile N+1) from slow off-chip DRAM and simultaneously writes the results of the previous tile (Tile N-1) back to DRAM. This technique, called double-buffering, hides the long memory-access latency behind the computation time. The system's overall speed is then limited by the slower of the two tasks: computation or data transfer.
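
A back-of-the-envelope model of double-buffering (a deliberately simplified sketch: it assumes a single DMA engine that can complete the next tile's load and the previous tile's store within one `t_dma` window):

```python
def total_time(num_tiles, t_compute, t_dma):
    """With double-buffering, after the first load every tile costs
    max(t_compute, t_dma): the DMA work for tiles N+1 and N-1 overlaps
    the compute for tile N, so the slower task sets the pace."""
    return t_dma + num_tiles * max(t_compute, t_dma)
```

If computing a tile takes 100 units but moving its data takes 140, the array is memory-bound and idles 40 units per tile; shrink the data (for example, with lower precision) or widen the memory bus and computation becomes the limiter again.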

This reveals the intricate dance of hardware design. The size of the array ($S \times S$) you can build is limited by the total chip area and power budgets you can afford. The size of the tiles you can work with is limited by the amount of fast on-chip memory you have. And the rate at which you can process tiles is limited by the bandwidth to external memory. Designing a systolic array accelerator is an art of balancing all these physical and architectural constraints to create a machine that is both powerful and well-fed.

The Roads Between the Cells: A Deeper Look

We've pictured the PEs as a simple grid. But what if we connect the last PE in each row and column back to the first, forming a torus? This can shorten the average distance data needs to travel. However, this elegant wrap-around connection introduces a subtle and dangerous new problem: deadlock.

Imagine a circular road where traffic is so dense that every car is waiting for the car in front of it to move. Nobody can move, and the system is frozen. The same can happen in the network of a torus. If the buffers in the channels connecting the PEs all fill up, a cycle of dependencies can form where every packet is waiting for a buffer that is held by the next packet in the cycle.

The solution to this is just as elegant as the problem is subtle. One common technique is to use virtual channels. Think of this as adding a second, parallel lane to our circular road. We then establish a rule: you must drive in Lane 0 for most of your journey, but if you cross a specific, designated point—a "dateline"—you must switch to Lane 1 and are forbidden from switching back. This simple rule makes it impossible to complete a full circle in the same lane. It breaks the cyclic dependency in the resource graph, guaranteeing that traffic gridlock—deadlock—can never occur. This allows us to keep the performance benefits of the torus topology without succumbing to its hidden dangers. It is a beautiful illustration of how deep theoretical concepts from network theory become critical for building robust, high-performance computing hardware.

Applications and Interdisciplinary Connections

So, we have this marvelous idea of a systolic array—a beautiful, rhythmic machine where data pulses through a line of processors like a heartbeat. We've explored the principles that make it tick, the elegant simplicity of its local communication, and the immense parallelism it unlocks. But the most exciting part of any scientific idea isn't just its internal beauty; it's what happens when it meets the real world. What is this machine good for? Where does this rhythmic pulse of computation find its rhythm in the universe of problems we want to solve?

You might be surprised. It turns out that the simple, repetitive "multiply and accumulate" pattern, which systolic arrays do so well, is not some obscure mathematical curiosity. It is, in fact, one of the fundamental refrains of computation, appearing in a dazzling variety of fields. Let's take a journey and see where this elegant piece of architecture shows up, from the artificial minds of our computers to the very code of life itself.

The Heartbeat of Artificial Intelligence

If there is one domain that has been utterly transformed by systolic arrays, it is artificial intelligence. Modern neural networks, especially those that "see" and "hear," are gargantuan structures built on a foundation of two key operations: matrix multiplication and convolution. And as it happens, these operations are a perfect match for the systolic dataflow.

Why? The secret lies in a concept we can call arithmetic intensity—a fancy term for a simple, profound idea: for every piece of data you fetch from the faraway, slow world of main memory, how much useful work do you get done? A naive approach to, say, a convolution might involve a processor fetching a small patch of an image and a set of filter weights, multiplying them together, and then... throwing them away to fetch the next patch and weights. This is incredibly wasteful! It's like a carpenter running to the lumber yard for every single nail. The processor spends most of its time waiting for data, not computing.

A systolic array turns this on its head. Imagine an image tile flowing in from one side of the array and the neural network's weights flowing in from another. As a piece of data enters a processing element (PE), it's used in a calculation. But instead of being discarded, it's passed to its neighbor. That neighbor uses it for its calculation and passes it on again. A single data element, fetched only once from memory, gets reused again and again as it marches across the array. The result is a staggering increase in arithmetic intensity. The processors are kept constantly busy, performing hundreds of operations for each byte they originally requested from memory. This is precisely the principle that allows architectures like Google's Tensor Processing Units (TPUs) to achieve mind-boggling performance on AI workloads, leaving more general-purpose processors far behind.
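
The arithmetic-intensity argument can be quantified with a small model (a sketch with invented names; it idealizes the reuse, assuming each operand crosses the chip boundary exactly once):

```python
def arithmetic_intensity(P, Kt, bytes_per_elem=2):
    """Multiply-accumulates per byte of off-chip traffic for one
    P x P output tile with inner dimension Kt, assuming the A and B
    operand tiles are each fetched once and the C tile written once
    (idealized reuse)."""
    macs = P * P * Kt
    traffic = bytes_per_elem * (P * Kt + Kt * P + P * P)
    return macs / traffic
```

With $P = 128$ and $K_t = 128$ at two bytes per element, each byte fetched feeds roughly 21 multiply-accumulates; without reuse, the same computation would refetch operands for every single MAC. Note also that intensity grows with tile size, which is one reason large arrays pay off.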

This isn't just a simple trick for simple convolutions, either. The world of AI is filled with an ever-growing zoo of operations. Consider something like a depthwise separable convolution, a more efficient type of layer common in mobile and edge AI models. It breaks the problem into two distinct steps. How do you map this onto a rigid array of processors? Here, the art of architecture design comes into play. Should you use a long, one-dimensional array or a square, two-dimensional one? It turns out the choice has deep consequences for how data can be reused and how much on-chip memory is needed to store intermediate results. One layout might be brilliant for the first step but awkward for the second, while another offers a better balance. Designing a Domain-Specific Architecture (DSA) for these tasks is a delicate dance between the algorithm's dataflow and the hardware's topology, a puzzle of minimizing data movement at every turn.

The Rhythms of Nature and Engineering

But it would be a mistake to think systolic arrays are only for AI. That same computational pattern echoes in many other corners of science and engineering.

Think about digital signal processing (DSP), the foundation of our digital world of audio and communication. A Finite Impulse Response (FIR) filter, used for everything from cleaning up noisy audio to shaping radio signals, is essentially a sliding dot product—the very same "multiply and accumulate" pattern. It's no surprise, then, that a systolic array can execute an FIR filter with incredible efficiency. When we map a filter to this architecture, we see the same principles at play: the input signal stream flows across the array, interacting with the stationary filter coefficients stored in each processor. This connection also forces us to think about the practicalities of computation, such as the trade-offs between different number formats, like the traditional fixed-point arithmetic of DSPs versus the more exotic bfloat16 format used in AI, and how these choices affect the final accuracy of our results.

Let's turn from the world of electronics to the world of biology. How do scientists compare DNA sequences to find similarities that might indicate an evolutionary relationship or a genetic disease? One of the most powerful tools is the Smith-Waterman algorithm, a method from the world of dynamic programming. It involves filling out a large grid where each cell's value depends on its neighbors. If you look closely at the dependencies, you'll notice something wonderful: all the cells along any given anti-diagonal can be computed at the same time! They don't depend on each other. This creates a "wavefront" of computation that can sweep across the grid.

What kind of hardware is perfect for this? A one-dimensional systolic array! You can assign each processor to one cell on the wavefront. In the first cycle, one processor works. In the next, two processors work. The wave of activity grows, sweeps across the grid, and then shrinks, looking just like a ripple expanding and contracting. This perfect mapping of algorithm to architecture makes systolic arrays a cornerstone of high-speed bioinformatics and genomic analysis.
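
The anti-diagonal wavefront is easy to see in code. This sketch fills the Smith-Waterman score matrix diagonal by diagonal in plain Python (sequential here, but every cell in the inner loop is independent of the others on its diagonal, so each could run on its own PE; the scoring parameters are illustrative defaults, not from the text):

```python
def smith_waterman_wavefront(a, b, match=2, mismatch=-1, gap=-1):
    """Fills the Smith-Waterman score matrix by anti-diagonals: each
    cell depends only on the previous two diagonals, so all cells on
    one diagonal could be computed simultaneously. Returns the best
    local-alignment score."""
    n, m = len(a), len(b)
    H = [[0] * (m + 1) for _ in range(n + 1)]
    best = 0
    for d in range(2, n + m + 1):                 # anti-diagonal index i + j
        for i in range(max(1, d - m), min(n, d - 1) + 1):
            j = d - i
            s = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0,
                          H[i - 1][j - 1] + s,    # extend from the diagonal
                          H[i - 1][j] + gap,      # gap in b
                          H[i][j - 1] + gap)      # gap in a
            best = max(best, H[i][j])
    return best
```

The number of cells per diagonal grows from one, plateaus at the length of the shorter sequence, then shrinks back to one: exactly the expanding-and-contracting ripple described above.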

Or consider the world of robotics. A robot navigating through space, a drone flying, or a self-driving car needs to constantly figure out where it is. It does this by fusing data from various sensors—cameras, GPS, inertial sensors—using an algorithm called the Kalman Filter. At the core of this filter lies a series of matrix operations, including the critical step of inverting a matrix. As we've seen, systolic arrays excel at structured matrix algebra. But not all algorithms are created equal. Trying to solve the system using a classic method like Gauss-Jordan elimination is a poor fit; it requires broadcasting rows of data across the entire array, breaking the "local communication only" rule and creating a bottleneck. However, a more sophisticated method like Cholesky factorization, which works for the special symmetric matrices that appear in this problem, can be decomposed into a series of triangular solves that map beautifully onto a systolic array with only nearest-neighbor communication. The choice of algorithm and architecture are deeply intertwined, and the right pairing can mean the difference between a robot that can react in real-time and one that can't.

The Art of the Misfit: When the Rhythm Breaks

Now, it is a very important lesson in science that no idea, no matter how clever, is a solution to all problems. We must also ask: what are systolic arrays not good for? The answer teaches us something deep about the nature of algorithms.

Consider the Fast Fourier Transform (FFT), one of the most important algorithms ever discovered. It's used everywhere, from signal processing to image compression. It's incredibly fast, but it has a peculiar communication pattern. To compute the FFT, you need to combine data points that are far apart in the input array. The algorithm has a "global" nature.

What happens if you try to map this onto a 1D systolic array, where processors can only talk to their immediate neighbors? It's a disaster. To get a data point from one end of the array to the other, it has to be passed step-by-step through every processor in between. The communication cost, the time spent just moving data around, becomes astronomically high, completely overwhelming the time spent on actual computation. The asymptotic cost of communication on the 1D array grows as $\Theta(n^2)$, while the computation is only $\Theta(n \log n)$. This is a terrible mismatch. It's like trying to have a conversation across a crowded stadium by passing a note down the line. The systolic rhythm is broken. This teaches us a crucial lesson: the efficiency of a parallel architecture depends on how well its communication topology matches the communication graph of the algorithm.
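
A toy cost model makes the mismatch vivid (a sketch with invented names; it counts only operand movement for one natural mapping, one element per PE on a 1D chain):

```python
import math

def fft_hop_cost(n):
    """Nearest-neighbour hops needed to bring together every butterfly
    pair of a radix-2 FFT with one element per PE on a 1D chain: each
    of the log2(n) stages has n/2 pairs sitting `stride` PEs apart."""
    hops, stride = 0, n // 2
    while stride >= 1:
        hops += (n // 2) * stride   # n/2 pairs, each `stride` hops apart
        stride //= 2
    return hops                     # equals (n/2)(n-1): Theta(n^2)

def fft_butterfly_count(n):
    """Useful work: (n/2) * log2(n) butterflies, i.e. Theta(n log n)."""
    return (n // 2) * int(math.log2(n))
```

At $n = 1024$ this mapping spends 523,776 hops moving data in order to perform only 5,120 butterflies, and the ratio worsens linearly with $n$.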

Systolic Thinking: A Universal Paradigm

So we see that the systolic array is a specific piece of hardware, a domain-specific architecture that is brilliant for some problems and ill-suited for others. But does the story end there? Not at all.

Perhaps the most profound connection is this: we can separate the idea of a physical systolic array from the more general principle of systolic thinking. What if you don't have a special accelerator chip? What if you just have a regular multicore processor, with a grid of general-purpose cores connected by a network? You can still apply systolic thinking!

You can partition your problem, say a large convolution, into tiles and assign each tile to a core. Each core computes its own patch and then communicates the boundary data—the "halo"—to its neighbors, much like the PEs in a hardware array. This dataflow mimics a systolic computation. Of course, it won't be as efficient. The communication is handled by software messages over a general-purpose network, which has much higher latency and lower bandwidth than the dedicated wires in a custom chip. The overhead of this communication is significant. But the principle is the same: organize the computation to maximize local data reuse and create a regular, rhythmic flow of information.
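
Here is systolic thinking without systolic hardware, sketched in plain Python (invented names; a real multicore version would exchange the halos as messages between cores, but the data discipline is identical):

```python
def tiled_blur_1d(signal, tile_size):
    """Systolic-style tiling of a 3-point moving average: each 'core'
    gets one tile plus a one-element halo from each neighbour, computes
    purely locally, and never touches data outside its tile + halo.
    Edges are handled by clamping to the nearest sample."""
    n = len(signal)
    out = [0.0] * n
    for start in range(0, n, tile_size):
        end = min(start + tile_size, n)
        # Halo exchange: one boundary element from each neighbour.
        lo = max(start - 1, 0)
        hi = min(end + 1, n)
        local = signal[lo:hi]                    # tile + halo, local copy
        for i in range(start, end):
            k = i - lo
            left = local[k - 1] if i > 0 else local[k]
            right = local[k + 1] if i < n - 1 else local[k]
            out[i] = (left + local[k] + right) / 3.0
    return out
```

Each tile reads only its own slice plus one halo element per side, so the pattern maps directly onto per-core local memories and neighbour-to-neighbour messages, at whatever communication cost the underlying network imposes.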

And this, in the end, reveals the true power and beauty of the systolic concept. It is more than just a clever arrangement of silicon. It is a fundamental paradigm for parallel computation. It teaches us to think about not just the operations themselves, but about the flow of data through the machine. It reminds us that in the world of high-performance computing, the most expensive thing you can do is not to multiply two numbers, but to move a number from one place to another. By minimizing that movement, by keeping data constantly in motion and constantly at work, the systolic pulse brings a profound efficiency to an incredible range of problems, unifying the worlds of AI, signal processing, bioinformatics, and robotics under one simple, powerful rhythm.