
In many scientific and computational fields, from physics to finance, we often encounter complex systems whose behavior is described by a probability distribution. While we may know the mathematical shape of this distribution—like having a formula for the altitude of a mountain range at any point—we often lack a direct method to generate random data that follows this specific shape. How can we simulate particle hits, financial fluctuations, or system failures when we can't simply 'draw' a random number from the complex function governing them? This gap between knowing a distribution's form and being able to sample from it presents a fundamental challenge in simulation and modeling.
This article introduces rejection sampling, an elegant and powerful algorithm designed to solve exactly this problem. We will explore its foundational principles and diverse applications through a structured journey. The first section, Principles and Mechanisms, will demystify the algorithm's core logic using intuitive analogies, explain how to optimize its efficiency, and provide a clear mathematical proof of why it works. The second section, Applications and Interdisciplinary Connections, will showcase the algorithm's remarkable versatility, demonstrating its use in fields ranging from computer graphics and engineering to the simulation of complex, time-varying processes.
Imagine you want to create a perfect map of a mountain range. You have a powerful satellite that can tell you the exact altitude, f(x), at any given coordinate x, but you don't have a complete topographical survey. Your goal isn't to draw the map yourself, but to teach a computer how to randomly place "virtual hikers" on the terrain, such that the chance of a hiker landing at any spot is proportional to the altitude of that spot. In other words, you want to generate random points that follow the probability distribution described by the shape of the mountains, f(x).
This is a common problem in science and engineering. We often know the shape of a probability distribution—the target—but we don't have a direct way to pull random numbers from it. Rejection sampling offers an elegant and wonderfully intuitive solution.
Let's stick with our mountain range. While we don't have a full map, we do have a very simple tool: a weather balloon that can drop a "probe" at any random coordinate with equal probability over a large rectangular area that completely encloses our mountain range. This is our proposal distribution, g(x). In this simple case, it's a uniform distribution.
Now, we add one more element. Let's build a giant, flat glass ceiling high above the entire mountain range. The height of this ceiling is a constant, M. The crucial rule is that this ceiling must be high enough that no mountain peak ever touches or pokes through it. In mathematical terms, the height of our target mountain, f(x), must always be less than or equal to the height of our ceiling, M, for all locations x. If our proposal distribution g(x) weren't uniform (imagine the wind making some drop zones more likely), the rule becomes slightly more general: f(x) ≤ M·g(x) for all x. This expression, M·g(x), is our envelope distribution. It forms a "safety canopy" over our target distribution.
The algorithm is now as simple as a game of chance:

1. Draw a candidate location x from the proposal distribution g(x).
2. Draw a uniform random number u between 0 and 1.
3. Accept the candidate if u ≤ f(x) / (M·g(x)); otherwise, reject it and go back to step 1.
Think about it: at a location where the mountain is very high (large f(x)), the fraction of the vertical space below the mountain is large, so there's a high chance of acceptance. Where the mountain is low (small f(x)), the chance of acceptance is small. This simple rule ensures we are more likely to keep samples from the high-altitude regions, which is exactly what we wanted! The rejected samples are simply discarded, and we try again.
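The whole game fits in a few lines of code. Here is a minimal sketch (the function and parameter names are our own): it takes a target density `f`, a function `propose` that draws from the proposal density `g`, and an envelope constant `M` satisfying f(x) ≤ M·g(x).

```python
import random

def rejection_sample(f, propose, g, M, n):
    """Draw n samples from density f by rejection sampling.

    `propose` draws a candidate from the proposal density g, and M is an
    envelope constant satisfying f(x) <= M * g(x) everywhere.
    """
    samples = []
    while len(samples) < n:
        x = propose()                      # drop a probe from the proposal
        u = random.random()                # uniform "height" in [0, 1)
        if u <= f(x) / (M * g(x)):         # is the probe under the mountain?
            samples.append(x)              # accept
        # otherwise reject and try again
    return samples
```

For instance, `rejection_sample(lambda x: 2*x, random.random, lambda x: 1.0, 2.0, 1000)` draws from the ramp density f(x) = 2x on [0, 1].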
The constant M is not just a mathematical formality; it's the heart of the algorithm's efficiency. Imagine our glass ceiling is absurdly high. We'd be dropping probes and most of them would land in the vast empty space between the mountaintops and the ceiling. These are all rejected samples—wasted effort. To make our process efficient, we want to lower the ceiling as much as possible without it ever being pierced by a mountain peak.
This means we must choose the smallest possible value of M that still satisfies the condition f(x) ≤ M·g(x) for all x. This optimal constant is found by looking at the ratio of the target to the proposal and finding its peak value:

M = sup_x [ f(x) / g(x) ]
The "sup" (supremum) is just a fancy mathematical term for the least upper bound, which in most practical cases is simply the maximum value.
Let's see this in action. Suppose we want to sample from a Beta(2,2) distribution, which is used to model probabilities and has a nice bell shape defined by f(x) = 6x(1 − x) on the interval [0, 1]. A simple choice for a proposal is the uniform distribution on that same interval, g(x) = 1. To find the optimal M, we just need to find the maximum value of f(x)/g(x) = 6x(1 − x) on [0, 1]. A little bit of calculus shows the peak occurs at x = 1/2, where the function's value is 6 · (1/2) · (1/2) = 3/2. So, our optimal constant is M = 3/2.
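The Beta(2,2) example can be sketched directly (an illustrative implementation; names are our own). Counting attempts alongside accepted samples lets us see the efficiency we derive below:

```python
import random

def sample_beta22(n):
    """Rejection-sample Beta(2,2): target f(x) = 6x(1-x) on [0, 1],
    uniform proposal g(x) = 1, optimal envelope constant M = 3/2."""
    M = 1.5
    samples = []
    attempts = 0
    while len(samples) < n:
        attempts += 1
        x = random.random()                  # candidate from Uniform(0, 1)
        u = random.random()
        if u <= 6 * x * (1 - x) / M:         # accept with prob f(x) / (M * g(x))
            samples.append(x)
    return samples, attempts
```

Roughly two out of every three candidates should be accepted, since the acceptance probability works out to 1/M = 2/3.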
Sometimes, the target distribution isn't given to us fully normalized. For instance, a physical model might suggest a distribution's shape is proportional to, say, x² on [0, 1]. Before we can find M, we first have to find the proper scaling to make it a true probability density function (one that integrates to 1). The integral of x² from 0 to 1 is 1/3, so the normalized target PDF is f(x) = 3x². Using a uniform proposal g(x) = 1, the optimal M is simply the maximum of 3x² on [0, 1], which is 3.
The choice of the proposal distribution is an art. It doesn't have to be uniform. If we want to sample from a half-normal distribution (often used for modeling magnitudes), we might find that an exponential distribution makes a better proposal, as it has a similar shape. The process remains the same: find the peak of the ratio to determine the most efficient . Sometimes, this can involve more advanced functions, like using a Laplace distribution to sample from a logistic distribution, but the principle is identical.
So, we've carefully chosen our proposal and found the tightest possible envelope constant M. What is the probability that any given candidate we draw from g(x) will be accepted? The answer is astonishingly simple: the acceptance probability is exactly 1/M.
Why? The total "volume" under our envelope curve, M·g(x), is simply M (since g(x) integrates to 1). The total volume under our target curve, f(x), is 1 (since it's a probability density). The acceptance process is essentially picking a random point under the envelope curve and checking if it's also under the target curve. The probability of this happening is the ratio of the volumes: 1/M.
This beautiful result gives M a direct, tangible meaning: M is the average number of attempts required to get one accepted sample.
If we are simulating defect positions in a material modeled by a density on [0, 1] whose peak value is 4, with a uniform proposal, we find that the optimal M = 4. This means the efficiency of our simulation is only 1/4, or 0.25. On average, we'll have to throw away three samples for every one we keep. In contrast, for the Beta(2,2) example where M = 3/2, the efficiency is a much healthier 2/3 ≈ 0.67.
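We can check these efficiencies empirically. The sketch below estimates the acceptance rate for any target on [0, 1] under a uniform proposal; as a concrete stand-in for the defect model we use f(x) = 4x³, whose peak is 4 (our own choice of density, assumed for illustration):

```python
import random

def acceptance_rate(f, M, trials=200_000):
    """Estimate the acceptance probability for a target density f on [0, 1]
    under a uniform proposal with envelope constant M (requires f(x) <= M)."""
    accepted = 0
    for _ in range(trials):
        x = random.random()
        if random.random() <= f(x) / M:
            accepted += 1
    return accepted / trials
```

With the stand-in f(x) = 4x³ and M = 4 the estimate hovers around 0.25; with the Beta(2,2) target f(x) = 6x(1 − x) and M = 3/2 it hovers around 2/3.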
This also tells us about the patience we'll need. Each attempt is an independent event with a success probability of 1/M. The number of trials needed to get the first success follows a geometric distribution. If you're curious about the probability of having to wait a while, the chance of needing more than n attempts is simply the probability of failing n times in a row: (1 − 1/M)^n. This underscores the importance of choosing a good proposal distribution that "fits" the target well, as this leads to a smaller M and a far more efficient algorithm.
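The waiting-time formula is simple enough to compute directly. A tiny sketch:

```python
def tail_probability(M, n):
    """P(more than n attempts before the first acceptance) = (1 - 1/M) ** n."""
    return (1.0 - 1.0 / M) ** n
```

With M = 4 the chance of needing more than 10 attempts is 0.75**10 ≈ 0.056; with M = 3/2 it collapses to (1/3)**10 ≈ 1.7 × 10⁻⁵ — a vivid illustration of how much a tight envelope buys you.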
At this point, a skeptical student might ask: "This is a clever trick, but how do we know the samples we keep actually follow the original target distribution f(x)? Aren't we biasing the results by throwing some away?" This is the most crucial question, and the answer reveals the true elegance of the method.
Let's find the probability density of an accepted sample, which we'll call f_acc(x). By the laws of conditional probability, the probability of observing an accepted value in a tiny interval near x is the probability of proposing a value near x multiplied by the probability of accepting it.
So, the joint probability of proposing a value in [x, x + dx] and having it accepted is:

g(x) dx · [ f(x) / (M·g(x)) ] = [ f(x) / M ] dx
Look closely at that result! The proposal distribution g(x) has completely canceled out. The probability density for an accepted sample is simply proportional to the target density f(x). The proportionality constant is 1/M.
To get the final, normalized probability density of the accepted samples, f_acc(x), we just need to divide this by the total probability of accepting any sample, which we already know is 1/M:

f_acc(x) = [ f(x) / M ] / (1/M) = f(x)
And there it is. The distribution of the accepted samples is exactly the target distribution f(x). The proposal distribution g(x) acts as a temporary scaffold; once its job is done, it vanishes from the final result, leaving behind a perfect sample from our desired distribution. It's a beautiful piece of mathematical reasoning.
The magic trick only works if the fundamental rule—f(x) ≤ M·g(x)—is obeyed everywhere. What happens if, by mistake, our "glass ceiling" is too low and a mountain peak pokes through?
Consider a simple triangular target, f(x) = 2x on [0, 1]. Its peak is at x = 1, with value 2. Using a uniform proposal g(x) = 1, the correct optimal constant is M = 2. But suppose we make a mistake and set, say, M = 1.5.
The algorithm can no longer "see" the true height of the distribution in this region. It treats all values of x between 0.75 and 1 as equally likely to be accepted, even though the target density is still rising. The result is a distorted distribution of accepted samples. Instead of generating samples from the desired triangular distribution f(x) = 2x, we end up sampling from a strange, composite shape that is part ramp, part plateau. This serves as a critical lesson: the envelope condition is not a suggestion; it is the absolute guarantee of the algorithm's correctness. Without it, the magic fails.
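We can watch the distortion happen numerically. This sketch (using the ramp target f(x) = 2x assumed above) caps the acceptance probability at 1, which is exactly what a too-low ceiling silently does:

```python
import random

def sample_ramp(M, n=50_000):
    """Sample the ramp target f(x) = 2x on [0, 1] with a uniform proposal
    and envelope constant M.  If M < 2 the envelope is pierced, and the
    acceptance probability silently saturates at 1 for x > M / 2."""
    out = []
    while len(out) < n:
        x = random.random()
        if random.random() <= min(2 * x / M, 1.0):
            out.append(x)
    return out
```

With M = 2 the sample mean settles near the true value 2/3; with M = 1.5 it drifts down toward 0.65, because the plateau under-weights the region x > 0.75 relative to the true density.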
After our journey through the nuts and bolts of rejection sampling, you might be thinking it's a clever but perhaps niche mathematical trick. Nothing could be further from the truth. This simple idea of "propose and check" is one of the most versatile tools in the scientist's and engineer's toolkit. It appears, sometimes in disguise, in an astonishing range of fields, revealing a beautiful unity in how we can interrogate complex systems. Its power lies not in complicated machinery, but in its elegant simplicity—a testament to the idea that sometimes the most profound methods are the most direct.
Let's begin our tour of applications with the most intuitive one: geometry. Imagine you are a computer graphics artist tasked with simulating the sparkle of a diamond, or a physicist modeling particle deposition inside a strangely shaped container. You need to generate points uniformly, but inside a complex volume like a cone, a torus, or something even more bizarre. How do you do it? You could try to invent a complicated system of coordinates for every new shape, but that's a headache. Rejection sampling offers a wonderfully simple alternative.
Think of a right circular cone. It’s hard to generate points directly inside it. But it’s incredibly easy to generate points inside the cylinder that just barely encloses it. So, that's what we do! We generate a random point in the larger, simpler cylinder and then ask a single question: "Is this point also inside the cone?" If it is, we keep it. If not, we discard it and try again. It's that simple. What's remarkable is that the collection of points we keep will be perfectly, uniformly distributed inside the cone. The efficiency of this process, meaning the probability that we accept any given proposal, is simply the ratio of the two volumes. For a cone inside its minimal bounding cylinder, this turns out to be exactly 1/3, a neat and tidy result that is independent of the cone's specific dimensions. This same principle allows us to sample from far more complex regions, like the area caught between a circle and a parabola, where the only thing we need is a way to test for inclusion in the target set. We just need a bounding box and a test—the algorithm does the rest.
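A sketch of the cone sampler (our own construction: base radius r at z = 0, apex at z = h). The only geometry-specific ingredient is the inclusion test:

```python
import math
import random

def sample_in_cone(n, r=1.0, h=1.0):
    """Uniform points inside a cone (base radius r at z = 0, apex at z = h),
    drawn by rejection from its minimal bounding cylinder."""
    pts, attempts = [], 0
    while len(pts) < n:
        attempts += 1
        # A uniform point in the bounding cylinder
        theta = random.uniform(0.0, 2.0 * math.pi)
        rad = r * math.sqrt(random.random())   # sqrt makes the disk uniform
        z = random.uniform(0.0, h)
        # Inclusion test: the cone's radius shrinks linearly from r to 0
        if rad <= r * (1.0 - z / h):
            pts.append((rad * math.cos(theta), rad * math.sin(theta), z))
    return pts, attempts
```

Counting attempts confirms the 1/3 efficiency: about three cylinder points are drawn for every one that lands inside the cone.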
This geometric intuition is the perfect springboard to the next level: sculpting probability itself. Many problems in science don't involve uniform distributions in weird shapes, but rather non-uniform distributions in simple shapes. Imagine a square detector plate where a particle is more likely to hit the center than the edges. The "shape" we are sampling from is no longer just geometric; it's a landscape of probability, with peaks where events are likely and valleys where they are not.
Rejection sampling handles this beautifully. We can, for example, sample from a simple uniform distribution over the square (our "block of marble") and then use the height of the target probability landscape at that point to decide whether to keep the sample. We propose a point (x, y) and accept it with a probability proportional to the target density f(x, y). This is like throwing a dart at the square and then, based on the probability value at that spot, deciding whether that "hit" is a real event. This allows us to simulate everything from the distribution of particle hits on a detector to sampling from distributions confined to unusual domains, like a triangle, where the probability also varies from point to point.
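As a concrete sketch, take a hypothetical detector response proportional to sin(πx)·sin(πy) on the unit square (our own choice of target, peaked at the centre; its maximum is 1, so the acceptance test needs no extra constant):

```python
import math
import random

def detector_hits(n):
    """Simulated hits on the unit square, denser at the centre.  Target
    density proportional to sin(pi*x) * sin(pi*y) (a hypothetical detector
    response); uniform proposal; accept with prob f(x, y) / f_max."""
    hits = []
    while len(hits) < n:
        x, y = random.random(), random.random()
        if random.random() <= math.sin(math.pi * x) * math.sin(math.pi * y):
            hits.append((x, y))
    return hits
```

Accepted hits cluster toward (0.5, 0.5), exactly as the probability landscape dictates.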
The real power of this approach shines when the target distribution is born from a complex physical process. Consider a system with two components that fail one after another, like in a satellite or a deep-sea probe. If each component's lifetime follows a simple exponential distribution, the total system lifetime follows a more complex distribution—the convolution of the two. Writing down and sampling from this new distribution directly can be a mathematical nightmare. But with rejection sampling, we can propose a lifetime from a simpler, related distribution (like one of the original exponentials) and use the ratio of the true density to the proposal density as our acceptance rule. This allows an engineer to simulate system reliability without ever needing to invert a complicated cumulative distribution function. The algorithm elegantly sidesteps the analytical difficulty.
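Here is a hedged sketch of the two-component case, assuming independent exponential lifetimes with rates λ1 < λ2 (so the total lifetime follows a hypoexponential density). With an Exp(λ1) proposal, the ratio of target to proposal is bounded by M = λ2/(λ2 − λ1), and the acceptance probability f(t)/(M·g(t)) simplifies to 1 − e^(−(λ2−λ1)t):

```python
import math
import random

def system_lifetime(lam1, lam2, n):
    """Rejection-sample the sum of two independent exponential lifetimes
    (rates lam1 < lam2), i.e. a hypoexponential target, using a single
    Exp(lam1) proposal.  M = lam2 / (lam2 - lam1), and the acceptance
    probability f(t) / (M * g(t)) reduces to 1 - exp(-(lam2 - lam1) * t)."""
    assert lam1 < lam2
    out = []
    while len(out) < n:
        t = random.expovariate(lam1)              # propose from Exp(lam1)
        if random.random() <= 1.0 - math.exp(-(lam2 - lam1) * t):
            out.append(t)
    return out
```

The mean lifetime of the accepted samples should match the theoretical value 1/λ1 + 1/λ2, with no CDF inversion ever performed.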
So far, we have been sampling points or vectors—static snapshots. But what about processes that unfold in time? Here, rejection sampling reveals its full power. Consider the arrivals of data packets at a network router or customers at a store. Often, the rate of arrival changes with time; perhaps traffic is high at noon and low at midnight. This is a "non-homogeneous Poisson process." How can we possibly simulate it?
The solution, a technique known as "thinning," is just rejection sampling in a clever disguise. We start by finding the absolute maximum arrival rate, λ_max, that ever occurs. Then, we generate a stream of "potential" events from a much simpler homogeneous Poisson process that has this constant, maximum rate. This is like a firehose of potential arrivals. For each potential arrival time t, we check the true arrival rate λ(t) at that moment. We then "thin" the stream by accepting the potential event with probability λ(t)/λ_max. The events that survive this thinning process form a perfect realization of our complex, time-varying process. This single, beautiful idea is used to model everything from neural spikes to financial trades to incoming calls at a data center.
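A minimal thinning sketch (function names are our own; `rate` is any intensity function bounded above by `rate_max`):

```python
import random

def thin_poisson(rate, rate_max, horizon):
    """Simulate a non-homogeneous Poisson process on [0, horizon] by thinning.
    `rate(t)` is the time-varying intensity, with rate(t) <= rate_max."""
    events, t = [], 0.0
    while True:
        # Next candidate from the homogeneous rate_max "firehose"
        t += random.expovariate(rate_max)
        if t > horizon:
            return events
        if random.random() <= rate(t) / rate_max:  # keep with prob lambda(t)/lambda_max
            events.append(t)
```

For an intensity λ(t) = t on [0, 10] (with λ_max = 10), the surviving events number about ∫λ(t)dt = 50 per run, and they bunch up toward the end of the interval where the rate is highest.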
Of course, there is no free lunch. The efficiency of rejection sampling depends entirely on how well our simple proposal distribution matches the complex target. If our proposal "box" is too large and ill-fitting, we will be rejecting samples almost all the time, wasting immense computational effort. The number of proposals we need to get a certain number of accepted samples is itself a random variable, and its variance can be punishingly large if the acceptance probability is low.
This very limitation has spurred the development of more intelligent, adaptive methods. In "Adaptive Rejection Sampling" (ARS), we don't use a fixed proposal. Instead, we learn about the target distribution as we go. For a special class of distributions (log-concave ones, common in statistics), we can build an envelope not from a simple box, but from a series of tangent lines that hug the distribution's shape much more tightly. Each time we reject a sample, we use that information to add a new tangent line, refining our proposal envelope and making future samples more likely to be accepted. It's a sampling algorithm that learns!
Finally, let us take a truly breathtaking leap. What if the thing we want to sample is not a number, or a vector, but an entire function—an infinite-dimensional object? Imagine trying to simulate the possible paths of a stock price, or the quantum-mechanical path of a particle. These are functions of time. Suppose we want to generate paths that not only follow the basic rules of the process but also satisfy some global constraint, for example, that the path's average squared distance from zero remains below a certain threshold. This is a problem of incredible complexity. Yet, the philosophy of rejection sampling provides a path forward. We can use a simpler, unconstrained process as our proposal mechanism to generate entire random paths. Then, for each complete path we generate, we check if it satisfies our global constraint. If it does, we keep it; if not, we discard the entire function and try again. This allows us to sample from extraordinarily complex conditional distributions on function spaces, a cornerstone of modern statistical physics and quantitative finance.
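The idea can be sketched with a simple random-walk discretization of a Brownian path (entirely illustrative; the constraint and threshold are our own choices). Each complete path is the proposal; the global constraint is the accept/reject test:

```python
import random

def constrained_paths(n_paths, n_steps=100, bound=0.5):
    """Sample random-walk approximations to Brownian paths on [0, 1],
    keeping only those whose mean squared value stays below `bound`.
    The proposal is the unconstrained walk; the whole path is accepted
    or rejected at once."""
    dt = 1.0 / n_steps
    kept = []
    while len(kept) < n_paths:
        w, path = 0.0, []
        for _ in range(n_steps):
            w += random.gauss(0.0, dt ** 0.5)   # Brownian increment
            path.append(w)
        if sum(v * v for v in path) / n_steps < bound:  # global constraint
            kept.append(path)
    return kept
```

Every surviving path is, by construction, a draw from the conditional law of the walk given the constraint — the same cancellation argument as in the one-dimensional case, lifted to function space.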
From drawing points in a cone to simulating the intricate dance of stochastic processes, rejection sampling is a golden thread running through computational science. Its enduring appeal lies in its simplicity and generality. It reminds us that by combining a source of simple randomness with a clear criterion for success, we can explore and understand worlds of staggering complexity.