Available Expressions Analysis

Key Takeaways
  • Available Expressions Analysis is a conservative "must analysis" that ensures an expression is available for reuse only if it has been computed on all preceding execution paths.
  • It operates on a Control-Flow Graph, using GEN and KILL sets to track how each block of code generates new expressions or invalidates existing ones through variable assignments.
  • The primary application is Common Subexpression Elimination (CSE), but its correctness is complicated by pointers (requiring alias analysis), function side effects, and concurrent execution.
  • This forward data-flow analysis is part of a larger family of techniques and has a conceptual dual in backward analyses like Anticipable Expressions, highlighting the unified principles of program analysis.

Introduction

In the quest to transform human-readable source code into efficient machine instructions, compilers employ a vast arsenal of optimization techniques. One of the most intuitive yet challenging optimizations is the elimination of redundant work. If a program calculates the same value multiple times, can a compiler be smart enough to compute it just once and reuse the result? The difficulty lies in certainty. An incorrect optimization that reuses a stale value can lead to catastrophic bugs. A compiler needs a way to prove, beyond any doubt, that a previously computed value is still valid and available for reuse.

This is the domain of ​​Available Expressions Analysis​​, a foundational data-flow analysis technique that provides the rigorous guarantees needed for safe optimization. This article delves into this powerful method. The first chapter, ​​Principles and Mechanisms​​, will explore the core theory, explaining how the analysis uses concepts like GEN/KILL sets and set intersection to track the flow of information through a program. The second chapter, ​​Applications and Interdisciplinary Connections​​, will demonstrate how this analysis enables critical optimizations like Common Subexpression Elimination and how it interacts with modern programming challenges like pointers, objects, and concurrency.

Principles and Mechanisms

Imagine you are a meticulous baker following a complex recipe. You measure out 1 cup of flour and 1/2 cup of sugar, mix them in a bowl, and set it aside. A few steps later, the recipe calls for 1 cup of flour and 1/2 cup of sugar again. Do you re-measure everything? Of course not. You've already done the work. You simply grab the bowl you prepared earlier.

This is the essence of a compiler optimization called ​​Common Subexpression Elimination (CSE)​​. A compiler, like a smart baker, wants to avoid redoing work. If it has already computed a value, like x + y, it would love to just reuse the result instead of computing it again.

But this is a dangerous game. What if, between the first and second time you needed the dry mix, your mischievous assistant tossed in a handful of salt? Your pre-mixed bowl is now ruined, or "killed." If you used it, you'd bake a disastrously salty cake. For a compiler, using a stale value results in a program that produces wrong answers—a catastrophic bug.

So, the central question is: How can a compiler know, with absolute certainty, that a previously computed result is still valid and "available" for reuse? This is the detective work performed by a powerful technique called ​​Available Expressions Analysis​​.

The Detective's Golden Rule: Proof Beyond All Doubt

To safely reuse a computed expression, the compiler must be a scrupulously careful detective. It cannot operate on hunches or probabilities. It needs proof beyond any doubt. In the world of compiler analysis, this is called a ​​must analysis​​. An expression is only considered "available" at a certain point in the program if it is guaranteed to have the same value, no matter which execution path the program took to get there.

Think of a program's execution paths as different routes a person could take to a meeting point. If we want to be sure that everyone arriving at the meeting has a secret password, the password must have been given out on every single route leading to that point. If even one route fails to provide the password, we cannot assume a person arriving at the meeting will have it.

This "all paths" requirement is mathematically captured by the ​​set intersection​​ operator (∩). Let's say at a junction, two paths merge—one from Town A and one from Town B. If the set of available expressions from the Town A path is {"x+y", "a*b"} and the set from the Town B path is {"x+y", "c-d"}, what can we say is available for sure at the junction? Only the expression that is common to both sets: {"x+y"}. We take the intersection of the information from all incoming paths, because that's the only information that must be true regardless of the path taken.

This conservative, all-paths approach is the bedrock of safety for optimizations like CSE. We would rather miss an opportunity to optimize (be a less efficient baker) than perform an incorrect optimization (bake a salty cake).

The Flow of Information: A Ledger of GEN and KILL

To track which expressions are available, the compiler first creates a map of the program, called a ​​Control-Flow Graph (CFG)​​. In this map, basic blocks of straight-line code are the locations, and the potential jumps and branches are the roads connecting them. The analysis then works by figuring out how each block of code affects the set of available expressions. Each block can do two things:

  • ​​GEN (Generate):​​ When a block executes a statement like t := a * b, it generates the expression a * b. A new piece of information becomes available. We can add a * b to our list of available expressions at the exit of this block.

  • ​​KILL:​​ When a block executes a statement that changes a variable, like a := 10, it potentially invalidates any expression that uses a. The old value of a * b is no longer trustworthy. We say the assignment to a kills the expression a * b. We must remove it from our list.

This GEN and KILL logic forms a simple but powerful accounting system. As we move from one program point to the next, we update our ledger of available expressions using a transfer function:

OUT = GEN ∪ (IN ∖ KILL)

This equation is wonderfully intuitive. The set of expressions available at the ​​OUT​​put of a block is whatever the block newly ​​GEN​​erates, plus whatever was available at its ​​IN​​put, minus anything the block ​​KILL​​s.
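The equation translates almost verbatim into set operations. Below is a minimal sketch of the transfer function, assuming expressions are represented simply as strings like "a*b" (the sample sets are illustrative):

```python
def transfer(in_set, gen, kill):
    """OUT = GEN ∪ (IN ∖ KILL): what the block newly makes
    available, plus whatever survives from its input."""
    return gen | (in_set - kill)

# A block that computes t := a * b and then reassigns c:
# it generates "a*b" and kills any expression over c.
out = transfer(in_set={"x+y", "c-d"}, gen={"a*b"}, kill={"c-d"})
# "a*b" is generated, "x+y" survives, "c-d" is killed.
```

The function is pure bookkeeping: no control flow yet, just one block's effect on the ledger.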

Let's trace this through a small example. Consider the expression e = a * b in the following program map:

  1. ​​Block B1​​ computes t1 := a * b. It generates e, so e is now available. The program then branches to either B2 or B3.
  2. ​​Path to B2​​: B2 contains the statement a := a + 1. This assignment kills e because it changes an operand of a * b. So, at the exit of B2, e is no longer available.
  3. ​​Path to B3​​: B3 simply computes t2 := a * b again. It also generates e, which was already available. So at the exit of B3, e is still available.
  4. ​​Merge at B4​​: Now, the paths from B2 and B3 merge to enter B4. To find out what's available at the start of B4, we apply our golden rule: we take the intersection of what's available from the incoming paths.
    • From B2: e is not available.
    • From B3: e is available.
    • The intersection of "not available" and "available" is not available.
  5. ​​Conclusion​​: The compiler must conclude that a * b is not available at the start of B4, because there is a path to get there (through B2) where its value was invalidated. The optimization is unsafe.

This same logic applies to subexpressions. If one path computes x+y and (x+y)*z, but another path only computes x+y, only x+y will survive the intersection at the merge point and be deemed available. Each expression is its own detective case.
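The whole four-block example can be run to a fixed point with a few lines of Python. This is a sketch, not a production analysis: the block encoding and field names are illustrative, and the universe of tracked expressions is just {"a*b"}.

```python
ALL = {"a*b"}  # the universe of expressions we track

blocks = {
    "B1": {"gen": {"a*b"}, "kill": set(),   "preds": []},
    "B2": {"gen": set(),   "kill": {"a*b"}, "preds": ["B1"]},
    "B3": {"gen": {"a*b"}, "kill": set(),   "preds": ["B1"]},
    "B4": {"gen": set(),   "kill": set(),   "preds": ["B2", "B3"]},
}

# Must-analysis initialization: the entry starts empty (nothing has
# been computed yet); every other set starts "optimistic" (the full
# universe) so that intersection can only whittle it down.
IN = {b: (set() if b == "B1" else set(ALL)) for b in blocks}
OUT = {b: set(ALL) for b in blocks}

changed = True
while changed:  # iterate until a fixed point is reached
    changed = False
    for name, blk in blocks.items():
        if blk["preds"]:
            # Golden rule: intersect over all incoming paths.
            new_in = set(ALL)
            for p in blk["preds"]:
                new_in &= OUT[p]
            IN[name] = new_in
        new_out = blk["gen"] | (IN[name] - blk["kill"])
        if new_out != OUT[name]:
            OUT[name] = new_out
            changed = True
```

Running this confirms the trace above: a*b is killed along the B2 path, so the intersection at the entry of B4 comes up empty.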

The Complication of Pointers

In real-world programming, the KILL set can get tricky. A simple assignment like x := 5 clearly kills any expression involving x. But what about *p := 5, where p is a pointer? This statement modifies whatever memory location p is pointing to. To be safe, the compiler must know all the possible variables p could be pointing to (its ​​alias set​​).

If a separate ​​alias analysis​​ tells the compiler that p may point to either x or z, our conservative detective has no choice but to assume the worst. The statement *p := 5 is treated as potentially modifying both x and z. As a result, it kills all expressions containing x AND all expressions containing z. This is a prime example of how different compiler analyses must collaborate to ensure safety.
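This collaboration can be sketched in a few lines. The code below assumes a prior alias analysis has produced may-point-to sets; the variable names and the expression encoding (each expression mapped to the set of variables it reads) are illustrative:

```python
def kill_for_store(pointer, may_point_to, tracked_exprs):
    """Return the expressions killed by a store like `*p := 5`.

    tracked_exprs maps each expression to the variables it reads,
    e.g. {"x*y": {"x", "y"}}. Any expression reading a variable the
    pointer may target must conservatively be killed.
    """
    clobbered = may_point_to.get(pointer, set())
    return {expr for expr, reads in tracked_exprs.items()
            if reads & clobbered}

exprs = {"x*y": {"x", "y"}, "z+1": {"z"}, "a-b": {"a", "b"}}
# Alias analysis says p may point to x or z, so *p := 5 must be
# assumed to clobber both: everything over x OR z is killed.
killed = kill_for_store("p", {"p": {"x", "z"}}, exprs)
```

Note the asymmetry of conservatism: a *may*-alias fact produces a *must*-kill decision, because safety demands assuming the worst.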

Defining the Boundaries of the Analysis

To ensure our analysis is sound, we must be precise about its boundaries—where it starts, what it looks at, and what it's allowed to look at.

The Empty Notebook: A Sound Start

Where does our analysis begin? At the entry point of the program. What expressions are available there? Absolutely none. Before the program has executed a single instruction, nothing has been computed. Therefore, the analysis must begin with the IN set of the entry block being the empty set, ∅. Starting with the optimistic assumption that everything is available is unsound and can lead to disaster, like assuming a cake is baked before you've even turned on the oven.

Atomic Statements: What the Analysis Sees

What does the analysis consider a single "step"? Typically, it operates on the boundaries between statements or basic blocks. It treats each statement as an atomic, "black box" operation. This means that if you have a single statement like t := (x + y) + (x + y), the analysis can't see the redundancy within that statement. It only registers that after the statement has executed, the expression x + y is now available for subsequent statements. To spot the intra-statement redundancy, the compiler would first need to break the complex statement down into a finer-grained representation, like temp := x + y; t := temp + temp;, creating a program point between the two computations.

The Safety Clause: Pure Functions Only

Most importantly, what kind of expressions can we even track? Can we try to eliminate a call to a function that prints to the screen, like printf("Hello")? Absolutely not. The "expression" is the function call, and its "value" is not just what it returns, but its ​​side effects​​—actions that change the state of the world outside the computation, like modifying a global variable or performing I/O.

If we have a computation e := f(g(x)), and g is a ​​pure function​​ (no side effects, always returns the same output for the same input) but f has side effects, we cannot treat f(g(x)) as a candidate for common subexpression elimination. Eliminating a second call to f would also eliminate its side effects, fundamentally and incorrectly changing the program's behavior. Therefore, Available Expressions Analysis is restricted to the world of pure, side-effect-free computations. The optimization it enables is about saving computational work, not changing the program's observable actions.

Expanding the Jurisdiction: Crossing Function Boundaries

Modern programs are built from many functions calling one another. Does our analysis break down at a function call? No, it adapts beautifully. Instead of re-analyzing a function like callee() every time it's called, the compiler can perform the analysis once and create a ​​summary​​. This summary elegantly captures the GEN/KILL behavior of the entire function:

  1. ​​Must-Generate Summary​​: What expressions (in terms of its own parameters) does callee() guarantee it computes on all of its internal paths?
  2. ​​Kill Summary​​: What global variables or aliased caller variables might callee() modify?

When the main program calls callee(), it simply consults the summary. An expression x+y can be available after a call to f(x, y) in two ways:

  • ​​Preservation​​: x+y was already available before the call, and f's kill summary shows it doesn't modify x or y.
  • ​​Generation​​: f's must-generate summary guarantees that it computes the equivalent expression on all its internal paths.

This modular approach allows the simple, elegant principles of GEN and KILL to scale from single blocks to entire programs, forming a unified theory of information flow.
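Consulting a summary at a call site can be sketched as follows. This assumes the summaries have already been computed; the field names `must_gen` and `may_kill` are illustrative, not a real compiler API:

```python
def available_after_call(avail_before, summary, expr, expr_vars):
    """Check the two cases from the text: an expression is available
    after a call if it is preserved across it, or if the callee is
    guaranteed to generate it on all internal paths."""
    preserved = (expr in avail_before
                 and not (expr_vars & summary["may_kill"]))
    generated = expr in summary["must_gen"]
    return preserved or generated

# f(x, y) computes x+y on all paths and may write global g.
summary_f = {"must_gen": {"x+y"}, "may_kill": {"g"}}

# Preservation: x+y was available before, and f touches neither x nor y.
ok = available_after_call({"x+y"}, summary_f, "x+y", {"x", "y"})
```

If the kill summary instead listed x, preservation would fail and only the generation case could save the expression.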

A Look in the Mirror: The Duality of Analysis

The beauty of these data-flow principles is their symmetry. Available Expressions is a ​​forward analysis​​ that asks, "Looking back from this point, what has been computed on all paths?"

We can flip the entire analysis on its head and run it in reverse. A ​​backward analysis​​ can ask, "Looking forward from this point, what will be computed on all future paths before its inputs are changed?" This is known as ​​Anticipable Expressions​​ (or Very Busy Expressions) analysis. It's also a "must" analysis and uses intersection, but it propagates information backward from the program's exit. It's used for different optimizations, like hoisting code out of loops.

The existence of this dual analysis reveals a deeper unity. By defining a question ("what is available?"), a direction (forward), and a standard of proof (must/intersection), we construct a powerful analytical machine. Change the question or the direction, and the same machine reveals entirely new, equally valuable truths about the program. It is this underlying mathematical elegance that transforms the practical task of code optimization into a journey of discovery.

Applications and Interdisciplinary Connections

After our tour through the principles and mechanisms of data-flow analysis, you might be left with a feeling of neat, abstract satisfaction. We have built a tidy logical machine. But what is it for? Does this elegant formalism actually do anything useful? The answer is a resounding yes. It is not merely a theoretical curiosity; it is the silent, workhorse intelligence behind the curtain of modern software, making our code faster, safer, and more efficient. In this chapter, we will embark on a journey to see where this road leads, from the heart of the compiler to the wild frontiers of parallel computing. We will see that this one simple idea—of carefully tracking what the computer knows—has surprisingly far-reaching consequences.

The Art of Not Repeating Yourself

At its heart, available expressions analysis is about a principle we all learn as children: don't do the same work twice if you can help it. If you have just calculated that 7 × 6 = 42, and someone immediately asks you again for the product of 7 and 6, you don't re-multiply; you just state the answer you already have. You have, in essence, "memoized" the result.

A compiler can be taught this same common sense. When it sees an instruction like t := hash(x,y), it can think of this as populating a cache entry for the expression hash(x,y) with its current value. If it later encounters another instruction to compute hash(x,y), it can ask: "Have the values of x or y changed since I last did this?" If the answer is no, it can skip the re-computation and reuse the old result. But if an intervening instruction like x := x + 1 has occurred, the compiler must be smart enough to recognize that its cache is now stale. The old result is invalid, and the expression must be re-evaluated.

This is precisely what available expressions analysis formalizes. The GEN set of a basic block tells the compiler which expressions it has just "cached," while the KILL set tells it which previously cached expressions have been "invalidated" by assignments to their variables. This simple bookkeeping allows an optimization called Common Subexpression Elimination (CSE) to safely replace redundant computations with uses of previously stored results, directly translating the analysis into faster code.
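Within a single basic block, this cache-and-invalidate bookkeeping is enough to perform CSE directly. Here is a toy local CSE pass over straight-line three-address statements; the tuple encoding `(dest, op, lhs, rhs)` is an illustrative stand-in for a real intermediate representation:

```python
def local_cse(stmts):
    """Rewrite redundant computations in straight-line code into
    plain copies of the earlier result."""
    cache = {}  # (op, lhs, rhs) -> variable currently holding it
    out = []
    for dest, op, lhs, rhs in stmts:
        key = (op, lhs, rhs)
        reused = key in cache
        if reused:
            out.append((dest, "copy", cache[key], None))  # reuse it
        else:
            out.append((dest, op, lhs, rhs))
        # The assignment to `dest` kills every cached expression that
        # reads `dest`, and any value previously cached in `dest`.
        cache = {k: v for k, v in cache.items()
                 if dest not in (k[1], k[2]) and v != dest}
        if not reused and dest not in (lhs, rhs):
            cache[key] = dest  # the new result is now "available"
    return out

prog = [("t1", "+", "a", "b"),
        ("a",  "+", "a", "1"),   # kills a+b
        ("t2", "+", "a", "b"),   # NOT redundant: a has changed
        ("t3", "+", "a", "b")]   # redundant with t2
```

The pass correctly refuses to reuse t1 for t2 (the intervening write to a invalidated the cache) but replaces the t3 computation with a copy of t2.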

But perhaps more importantly, this analysis also tells the compiler when not to perform an optimization. Consider an expression inside a loop. It's tempting to hoist the computation out of the loop so it's only performed once. But is this always safe? If a variable in the expression is modified inside the loop, the value of the expression might change with each iteration. Hoisting it would be a disastrous error. Available expressions analysis, by tracking whether an expression is killed on the loop's back edge, provides the rigorous check needed to prevent such mistakes, ensuring that optimizations like loop-invariant code motion are applied safely and correctly.

A Symphony of Optimizations

Available expressions analysis does not act alone; it is a player in a grand orchestra of compiler optimizations, and its performance is intricately linked to the other sections. What one optimization does can profoundly affect what another can see.

For instance, a purely syntactic analysis might see the expressions x + 0 and x as entirely different things. If a program computes x + 0 on one path and simply uses x on another, a standard available expressions analysis would conclude that the expression x + 0 is not available at the merge point, as it wasn't computed on all paths. The compiler would miss an optimization. However, if a "constant folding" pass runs first, it might transform all instances of x + 0 into just x. Suddenly, the optimization becomes trivial! This reveals a deep truth: the power of an analysis is tied to the representation of the program it sees. More advanced techniques like Global Value Numbering (GVN) go even further, focusing on the semantic value of computations rather than their syntactic form, allowing them to see equivalences that a classical analysis would miss.
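A canonicalization pass of the kind described above can be sketched in a few lines. Expressions are encoded as nested tuples like `("+", "x", 0)`; both the encoding and the chosen identities are illustrative:

```python
def simplify(expr):
    """Recursively apply a few algebraic identities so that
    syntactically different but equivalent expressions converge
    on one canonical form."""
    if not isinstance(expr, tuple):
        return expr  # a variable name or constant: nothing to do
    op, lhs, rhs = expr
    lhs, rhs = simplify(lhs), simplify(rhs)
    if op == "+" and rhs == 0:
        return lhs            # x + 0  ->  x
    if op == "+" and lhs == 0:
        return rhs            # 0 + x  ->  x
    if op == "*" and rhs == 1:
        return lhs            # x * 1  ->  x
    return (op, lhs, rhs)

# One path computed x + 0, the other just used x; after
# simplification both yield the same expression, so it survives
# the intersection at the merge point.
path_a = simplify(("+", "x", 0))
path_b = "x"
```

Running canonicalization before availability is a small instance of the larger point: the analysis only sees what the representation lets it see.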

The structure of the compiler itself also creates fascinating trade-offs. What if a useful computation, say x + y, is hidden inside a function call, f(x, y)? A standard intraprocedural analysis, which looks at only one function at a time, would treat the call as an opaque "black box." It cannot know that x + y was computed inside. But if the compiler chooses to inline the function—replacing the call with the function's body—the computation x + y suddenly becomes visible to the analysis! The expression becomes available, and a subsequent computation of x + y can be eliminated. Here we see a classic engineering trade-off: inlining increases the precision and power of the analysis, but at the cost of increasing the code size and analysis time. The concept of availability remains the same, but its effectiveness depends on the scope of its vision. This core idea is so robust that it has been adapted to work with modern compiler representations like Static Single Assignment (SSA) form, where the notion of availability elegantly maps onto the structure of φ-nodes that merge variable versions at control flow joins.

The Real World of Pointers and Objects

So far, our variables have been simple, well-behaved scalars. Real-world programs are far messier. They are filled with pointers, memory indirection, and complex objects. Can our simple analysis cope?

Consider the expression strlen(s), which computes the length of a string pointed to by s. Suppose we compute this length, and then on one path of our program, we execute the statement t[0] := '\0', where t is another pointer. Is the original value of strlen(s) still valid? The answer is a resounding "it depends!" If s and t point to different memory locations (they are not aliased), then the write to t has no effect on s, and the length is unchanged. But what if they point to the same string? Then the write has just changed the length of s to zero!

A compiler, faced with this ambiguity, must be conservative. Unless it can prove that s and t do not alias, it must assume the worst: that they might. Therefore, it must conclude that the write to t[0] kills the availability of the expression strlen(s). This is a beautiful example of how available expressions analysis, when extended to handle memory, becomes deeply intertwined with the challenging field of alias analysis. The correctness of the optimization now depends on a sophisticated understanding of memory locations.

The world of object-oriented programming introduces another layer of subtlety. In a language with dynamic dispatch, the expression x + y might not be a simple addition. It could be a method call, like x.plus(y), where the actual method executed depends on the dynamic type of the object x. Now, imagine we compute t := x + y when x is of type A. Then, on one path, we change x to be an object of type B, even if its numeric value is preserved. When we later encounter u := x + y, is the expression available? A classical analysis says no, for two profound reasons. First, syntactically, there was an assignment to x, which kills any expression containing it. Second, and more deeply, the expression x + y might now mean something completely different—it might invoke B.plus(y) instead of A.plus(y). The analysis must conservatively assume the computation is not the same, and thus not available. This shows how the analysis must respect the rich semantics of the programming language itself.

The Final Frontier: Concurrency

Our journey has, until now, taken place in a sequential world, where instructions follow one another in a predictable line. The ultimate test of any program analysis is whether it can survive contact with the chaotic, interleaved world of concurrent execution.

Imagine two threads. Thread 1 computes sum(x, y), then releases the locks protecting variables x and y. It does some other work, then re-acquires the locks and computes sum(x, y) again. Can it reuse the first result? In a sequential world, yes. But in a concurrent world, that interval when the locks were released is a window of opportunity. Thread 2 might have swooped in, acquired the locks, and modified x or y.

To reason about this, the analysis must be elevated, connecting with the theory of concurrency and memory models. It must understand the "happens-before" relationship. The act of Thread 1 releasing a lock and Thread 2 later acquiring that same lock establishes a flow of time. It ensures that everything Thread 1 did before the release is visible to Thread 2, and everything Thread 2 does before its corresponding release is visible to Thread 1 when it re-acquires the lock.

Therefore, in an execution where Thread 2 modifies x between Thread 1's release and re-acquire, that modification happens-before Thread 1's second computation. For the purposes of our analysis, this is no different from a modification in the same thread. The availability of sum(x, y) is killed. Because such an execution is possible, the compiler must conclude that the expression is not available. To do otherwise would be to risk a catastrophic bug based on stale data.

Here, at the intersection of compiler theory and parallel systems, we see the true power and unity of our simple idea. The meticulous bookkeeping of GEN and KILL sets, which started as a way to avoid re-calculating a * b, has become a tool for reasoning about the very flow of information and causality in a complex, multi-threaded universe. It is a testament to the fact that in science and engineering, the most beautiful ideas are often those that are not only simple, but also surprisingly, profoundly, and universally applicable.