
Rejection Sampling

SciencePedia
Key Takeaways
  • Rejection sampling generates samples from a complex target distribution by using a simpler proposal distribution and an accept-reject rule.
  • The method's efficiency critically depends on how tightly the proposal envelope fits the target, with an average of M proposals needed per accepted sample.
  • Although it produces independent samples, rejection sampling becomes catastrophically inefficient in high-dimensional problems due to the curse of dimensionality.
  • The technique is widely applied in fields like physics, Bayesian statistics, and even for model comparison in complex systems through Approximate Bayesian Computation (ABC).

Introduction

Generating random numbers that follow a specific, complex probability distribution is a foundational challenge across science, from physics to finance. Standard methods often fail when faced with these "strangely shaped" distributions for which direct sampling is impossible. This article addresses this gap by providing a comprehensive guide to rejection sampling, an elegant and intuitive Monte Carlo method that solves the problem with a clever "accept-reject" strategy. First, in the Principles and Mechanisms chapter, we will dissect the method's core logic, explore the mathematics that guarantees its correctness, and examine its critical limitations, such as the curse of dimensionality. Subsequently, the Applications and Interdisciplinary Connections chapter will tour the diverse fields where this technique is applied, from simulating particle physics to powering modern Bayesian statistics and cutting-edge computational biology. Let's begin by uncovering the simple yet powerful idea at the heart of this statistical magic trick.

Principles and Mechanisms

Imagine you want to simulate something a bit more complex than flipping a fair coin or rolling a standard die. What if you needed to draw numbers from a probability distribution that looks like a strangely shaped hill? You can’t just pick a number from a list, because some numbers are much more likely than others. Direct sampling seems impossible. How do you generate numbers that faithfully trace the contours of this specific, peculiar landscape? This is a fundamental problem in science, from modeling particle energies in physics to predicting market fluctuations in finance. The answer lies in a wonderfully clever idea known as rejection sampling. It’s a method that feels like a magic trick, but one whose inner workings are as beautiful as they are logical.

A Genius Idea: The Accept-Reject Game

Let’s visualize the problem. Suppose the probability distribution you want to sample from, your target distribution p(x), looks like a smooth hill over a flat plane. For instance, consider the Beta(2,2) distribution, whose density on the interval [0, 1] is the neat little hump p(x) = 6x(1−x). The peak of this hill is at x = 0.5. We can't easily draw samples that follow this curve directly.

However, we can easily sample from a much simpler distribution, like the uniform distribution, which is just a flat line. This will be our proposal distribution, q(x). Let's build a simple rectangular "roof" that completely covers our target hill. To do this, we find a constant, which we'll call M, such that the function M·q(x) is always greater than or equal to our target p(x) for all values of x. This new function, M·q(x), is our "envelope". To make our life as easy as possible, we should choose the smallest M that works—this is like making the roof just touch the highest peak of the hill. For our example p(x) = 6x(1−x) and a uniform proposal q(x) = 1, the peak of p(x) is at x = 0.5, where p(0.5) = 6(0.5)(0.5) = 1.5. So the tightest possible envelope has height M = 1.5.

Now, the game begins. It's like throwing darts at a rectangular board that has our target hill shape painted on it. The game has two simple steps:

  1. Propose: We first generate a random point along the horizontal axis, call it x′, from our simple proposal distribution q(x). This is like choosing where our dart lands horizontally. Since our q(x) is uniform, any spot between 0 and 1 is equally likely.

  2. Accept or Reject: Next, we need to decide whether this dart "hits" the target area under the hill's curve. We generate a second random number, u, from a uniform distribution between 0 and 1. We accept our proposed point x′ if this random number satisfies the condition u ≤ p(x′) / (M·q(x′)). Otherwise, we reject x′ and start over from step 1.

This process might seem strange, but think about it visually. The ratio p(x′) / (M·q(x′)) represents the fraction of the envelope's height that is filled by the target distribution at position x′. Where the hill p(x′) is high (close to the peak), this ratio is close to 1, and we are very likely to accept the sample. Where the hill is low, the ratio is small, and we will almost certainly reject it. We are simply throwing darts randomly at the entire rectangular area defined by M·q(x) and only keeping the ones that land under the curve of p(x).
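The two-step game above takes only a few lines of code. Here is a minimal Python sketch for the Beta(2,2) example, with the uniform proposal and envelope height M = 1.5 (the function name is purely illustrative):

```python
import random

def sample_beta22(n_samples, seed=None):
    """Rejection-sample from p(x) = 6x(1-x) on [0, 1] using a
    uniform proposal q(x) = 1 and envelope constant M = 1.5."""
    rng = random.Random(seed)
    M = 1.5
    samples = []
    while len(samples) < n_samples:
        x = rng.random()                   # step 1: propose x' ~ q
        u = rng.random()                   # step 2: accept if u <= p(x')/(M q(x'))
        if u <= 6 * x * (1 - x) / M:
            samples.append(x)
    return samples

samples = sample_beta22(10_000, seed=42)
mean = sum(samples) / len(samples)         # Beta(2,2) has mean 0.5
```

With M = 1.5, roughly two of every three proposals are accepted, consistent with the 1/M acceptance rate discussed later.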

The Magic Trick: Why Does This Even Work?

At this point, you should be a little skeptical. We're using one simple distribution to generate candidates and then throwing most of them away. How can the remaining points possibly follow the complicated shape of our original target distribution? This is where the quiet beauty of the mathematics shines through. The proof is so simple and elegant it feels like a revelation.

Let’s ask: what is the probability that a sample we generate and accept falls within a tiny interval around some point x?

The probability of proposing a value in this tiny interval [x, x+dx] is given by our proposal distribution: P(propose x) = q(x) dx.

Given that we've proposed x, the probability of accepting it is the rule we just defined: P(accept | propose x) = p(x) / (M·q(x)).

The total probability of both these things happening—proposing a value in the interval and having it be accepted—is the product of these two probabilities: P(accepted sample is near x) = q(x) dx × p(x) / (M·q(x)).

Now, watch closely. The proposal distribution q(x) appears in both the numerator and the denominator, so it cancels out: P(accepted sample is near x) = p(x) dx / M.

This is a stunning result. The distribution of the samples we keep no longer depends on the proposal distribution q(x) we used to generate them. Instead, it is directly proportional to our target distribution p(x). The factor 1/M is just a constant. When we consider all the accepted points together, they form a distribution that is an exact replica of p(x). The procedure works perfectly. It isn't an approximation; it generates true, unadulterated samples from our target distribution.
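We can even check the cancellation numerically. In the sketch below (an illustrative Python experiment, not from the article), the proposal is deliberately skewed toward x = 1, q(x) = 2x, yet the accepted samples still reproduce the symmetric target p(x) = 6x(1−x). Here p(x)/q(x) = 3(1−x), so the tightest constant is M = 3 and the acceptance ratio simplifies to 1−x:

```python
import random

def sample_with_skewed_proposal(n, seed=0):
    """Target p(x) = 6x(1-x); proposal q(x) = 2x (drawn as sqrt(U)).
    Since p(x)/q(x) = 3(1-x), the envelope constant is M = 3 and
    the acceptance ratio p/(M q) simplifies to (1 - x)."""
    rng = random.Random(seed)
    out = []
    while len(out) < n:
        x = rng.random() ** 0.5            # draw from q(x) = 2x on [0, 1]
        if rng.random() <= 1 - x:          # accept with prob p(x)/(M q(x))
            out.append(x)
    return out

samples = sample_with_skewed_proposal(20_000)
mean = sum(samples) / len(samples)
```

Despite the proposal piling candidates near x = 1, the accepted sample mean comes out near 0.5, exactly as the cancellation predicts.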

The Price of Simplicity: Efficiency and Optimization

The method is correct, but is it practical? The cost of this clever trick is the number of rejected samples. If our envelope M·q(x) is a very loose fit for our target p(x)—like a giant circus tent over a tiny shed—we will be rejecting an enormous number of proposals for every one we accept.

The overall probability of accepting any given proposal can be found by integrating the probability of acceptance over all possible values of x. As we saw, this is simply the area under the target distribution (which is 1) divided by the area under the envelope (which is M). Thus, the acceptance probability is simply 1/M. This means that, on average, we will need to make M proposals to get a single accepted sample.

This immediately tells us that to make our sampler efficient, our goal must be to make M as small as possible. We need to find a proposal distribution q(x) and a constant M that form the "tightest possible" envelope around p(x).

Often, our proposal distribution q(x) comes from a family of distributions with adjustable "knobs" or parameters. By tuning these parameters, we can change the shape of q(x) to better match p(x) and thereby reduce M. Consider a more complex example: sampling from a Rayleigh distribution (often used in communications) using a Gamma distribution as a proposal. The Gamma proposal has a tunable scale parameter, call it θ. For each choice of θ, the shape of the proposal changes, and we get a different minimum envelope constant M(θ). Using calculus, we can solve for the optimal value of θ that minimizes M. This optimization is a powerful part of the art of designing a good rejection sampler, and the same principle applies whether we are tuning an exponential proposal for a Gamma target or a Laplace proposal for a Normal target. It transforms the method from a brute-force tool into a tailored, efficient strategy.
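As an illustration of this tuning step, take the Laplace-proposal/Normal-target pairing just mentioned. For a standard Normal target and a Laplace proposal with scale b, the supremum of p(x)/q(x) works out to M(b) = (2b/√(2π))·exp(1/(2b²)), and a simple grid search recovers the calculus answer, b = 1 with M = √(2e/π) ≈ 1.316 (a sketch under those assumptions):

```python
import math

def envelope_constant(b):
    """M(b) = sup_x p(x)/q(x) for a standard Normal target p and a
    Laplace(scale=b) proposal q; the supremum occurs at |x| = 1/b."""
    return 2 * b / math.sqrt(2 * math.pi) * math.exp(1 / (2 * b * b))

# Grid search over the proposal's "knob" b for the tightest envelope
grid = [0.5 + 0.001 * i for i in range(2000)]
best_b = min(grid, key=envelope_constant)
best_M = envelope_constant(best_b)
# Calculus gives b = 1 exactly, where M = sqrt(2e/pi) ~ 1.3155
```

An average of about 1.32 proposals per accepted sample is about as good as a Laplace roof over a Gaussian hill can get.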

The Achilles' Heel: The Curse of Dimensionality

So far, rejection sampling seems like a perfect solution. It's exact, and we can even optimize its efficiency. But it harbors a deep and devastating weakness, one that becomes apparent when we move from one or two dimensions to many. This weakness is famously known as the curse of dimensionality.

Let's return to our geometric intuition. Imagine we want to sample points uniformly from a sphere that is perfectly inscribed inside a cube. This is a rejection sampling problem in disguise: our target region is the sphere, and our proposal region is the easy-to-sample-from cube. The acceptance probability is simply the ratio of the volume of the sphere to the volume of the cube.

  • In two dimensions (a circle in a square), the acceptance rate is πr² / (2r)² = π/4 ≈ 0.785. We accept about 78.5% of samples. Very efficient!
  • In three dimensions (a sphere in a cube), the rate is (4/3)πr³ / (2r)³ = π/6 ≈ 0.524. Still a respectable 52.4%.

But something strange and profoundly counter-intuitive happens as we increase the number of dimensions, d. While the inscribed hypersphere and the surrounding hypercube seem to be snug fits, the volume of the hypersphere shrinks to almost nothing compared to the volume of the hypercube.

  • In 10 dimensions, the acceptance rate plummets to about 0.0025, or roughly 1 in 400.
  • In 20 dimensions, the rate is a staggering 2.5 × 10⁻⁸. You would need tens of millions of points just to expect to get one inside the hypersphere.
  • By 50 dimensions, the acceptance rate (around 10⁻²⁸) is so close to zero that you would likely never accept a single sample in the lifetime of the universe.
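These rates follow directly from the volume formula for a d-dimensional ball of radius r, V_d(r) = π^(d/2) r^d / Γ(d/2 + 1). A few lines of Python reproduce the whole list:

```python
import math

def acceptance_rate(d):
    """Volume of the unit d-ball divided by the volume of the
    enclosing hypercube [-1, 1]^d: pi^(d/2) / (Gamma(d/2 + 1) * 2^d)."""
    return math.pi ** (d / 2) / (math.gamma(d / 2 + 1) * 2 ** d)

for d in (2, 3, 10, 20, 50):
    print(f"d = {d:2d}: acceptance rate = {acceptance_rate(d):.3e}")
```

The decay is faster than exponential in d, which is why no amount of patience rescues the naive sampler in high dimensions.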

This phenomenon is the curse of dimensionality. In high-dimensional spaces, "volume" behaves weirdly. Most of the volume of a hypercube is concentrated in its "corners," far away from its center where the hypersphere lies. Our simple accept-reject strategy becomes catastrophically inefficient. For many modern problems in machine learning and physics that involve thousands or millions of dimensions, this makes simple rejection sampling a non-starter.

A Different Path: Independence versus The Long Walk

The downfall of rejection sampling in high dimensions forces us to consider other philosophies of sampling. The most prominent alternative is a family of methods called Markov Chain Monte Carlo (MCMC).

The fundamental difference between these two approaches lies in the properties of the samples they produce.

  • Rejection Sampling gives you independent and identically distributed (i.i.d.) samples. Each accepted point is a completely fresh, independent draw from the true target distribution. This is the statistical gold standard.

  • MCMC methods, like the famous Metropolis-Hastings algorithm, generate samples by taking a kind of "random walk." The algorithm starts at some point and then iteratively proposes a new point based on its current location. This generates a sequence x₁, x₂, x₃, …, where each sample is correlated with the one before it. The walk is cleverly designed so that, over the long run, the path it traces explores the target distribution's landscape correctly. But the samples are not independent.

So we have a trade-off. When rejection sampling is feasible (in low dimensions and with a good, tight proposal), it's often preferred because it delivers perfect, independent samples. When it fails due to the curse of dimensionality or the difficulty of finding an envelope, MCMC provides a powerful, if different, way forward.
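To make the contrast concrete, here is a minimal random-walk Metropolis sketch for a standard Normal target (illustrative Python, not a production sampler). Note how a rejected proposal does not restart a fresh draw, as in rejection sampling, but repeats the current state—exactly what makes consecutive samples correlated:

```python
import math
import random

def metropolis_normal(n_steps, step=1.0, seed=0):
    """Random-walk Metropolis for a standard Normal target.
    A rejected proposal repeats the current state in the chain,
    so consecutive samples are correlated (not i.i.d.)."""
    rng = random.Random(seed)
    x = 0.0
    chain = []
    for _ in range(n_steps):
        x_new = x + rng.uniform(-step, step)        # propose near current x
        delta = (x * x - x_new * x_new) / 2         # log p(x_new) - log p(x)
        if math.log(1.0 - rng.random()) <= delta:   # Metropolis accept rule
            x = x_new
        chain.append(x)                             # rejected -> x repeats
    return chain

chain = metropolis_normal(50_000)
mean = sum(chain) / len(chain)
var = sum((c - mean) ** 2 for c in chain) / len(chain)
```

Over a long run the chain's mean and variance approach the target's 0 and 1, even though no single step is an independent draw.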

Interestingly, these two worlds are not entirely separate. In a beautiful piece of mathematical unity, it can be shown that under certain conditions, the MCMC acceptance rule can become identical to the rejection sampling rule. This hints at a deep and shared logic underlying the diverse methods we've developed to explore the intricate landscapes of probability.

Applications and Interdisciplinary Connections

Now that we've peered under the hood and understood the elegant mechanics of rejection sampling, you might be wondering, "What is this clever trick actually good for?" The answer, delightfully, is almost everything. The genius of rejection sampling lies in its beautiful generality. It doesn't need to know the intimate secrets of the distribution it's sampling from; it doesn't require a neat, analytic formula that can be inverted. All it asks is that we can evaluate the target function's height at any given point and that we can find some simpler "roof" function to draw our proposals from. This simple "propose-and-check" strategy, like an artist using a stencil, allows us to carve out the most complex shapes from a simple block of random numbers. It has become a universal dartboard, appearing in the toolkits of physicists, statisticians, biologists, and engineers.

Let's take a tour through some of these fields to see this principle in action, watching it morph and adapt to solve an astonishing variety of problems.

The Physicist's Toolkit: From Atoms to Nuclei

Imagine you are a computational physicist trying to simulate the growth of a new material, one atom at a time. Your theories tell you that atoms settling onto a surface aren't just spread out evenly. Perhaps they prefer the center, with their probability distribution forming a shape like a semi-circle. How do you tell a computer to place atoms according to this specific curve? You can't just tell it, "pick a point from a semi-circle distribution." This is a perfect, and perhaps the most intuitive, job for rejection sampling. We simply draw a bounding box around our semi-circular curve, generate random points (x, y) uniformly within this box, and keep only the points that fall under the curve. The accepted x coordinates will now perfectly follow the desired distribution, giving us a realistic simulation of matter taking form, one virtual atom at a time.

The same logic that governs the position of an atom can also describe the energy of a particle emitted from an atomic nucleus. In the nuclear beta decay of a neutron, the emitted electron doesn't always come out with the same energy. Its kinetic energy follows a specific probability distribution predicted by the fundamental laws of particle physics. This distribution, which might look something like a symmetric mound described by f(T) = T²(Q−T)², is another one of those "non-standard" shapes. To simulate this process, we can again use our trusty rejection sampler. By throwing darts at a rectangle enclosing this energy spectrum, we are, in effect, letting the computer roll the dice in exactly the way nature does, generating electron energies with the correct quantum mechanical probabilities.
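As a concrete illustration, the toy spectrum f(T) = T²(Q−T)² on [0, Q] peaks at T = Q/2 with height Q⁴/16, so the bounding rectangle is easy to write down (a Python sketch with an illustrative choice of Q = 1):

```python
import random

def sample_beta_spectrum(n, Q=1.0, seed=0):
    """Rejection-sample kinetic energies from the toy spectrum
    f(T) = T^2 (Q - T)^2 on [0, Q] (unnormalized).
    The curve peaks at T = Q/2, where f = Q^4 / 16."""
    rng = random.Random(seed)
    f_max = Q ** 4 / 16
    energies = []
    while len(energies) < n:
        T = rng.uniform(0, Q)              # dart's horizontal position
        y = rng.uniform(0, f_max)          # dart's height in the box
        if y <= T * T * (Q - T) ** 2:      # keep darts under the curve
            energies.append(T)
    return energies

energies = sample_beta_spectrum(20_000)
mean_T = sum(energies) / len(energies)     # symmetric spectrum: mean near Q/2
```

This is literally the dartboard picture from earlier: here the second uniform draw plays the role of u·M·q(x).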

But rejection sampling is not the only game in town. In the sprawling world of statistical mechanics, we often want to simulate the collective behavior of countless interacting particles, like the tiny magnetic spins in an Ising model. We could try to use rejection sampling to pick an entire configuration of spins, but this is often terribly inefficient, like trying to guess a password by typing random letters. The vast majority of states are high-energy and thus have a vanishingly small probability. An alternative is to use a "smarter" method like the Metropolis algorithm, which takes a random walk through the space of configurations, preferentially moving towards lower-energy states. Comparing the two methods for a simple system shows that while rejection sampling is straightforward, its efficiency can plummet at low temperatures where only a few "ground states" are probable. The Metropolis algorithm, by contrast, is designed to find and explore these important regions more effectively. This doesn't make rejection sampling useless; it simply places it in a broader context, as one tool among many in the physicist's computational arsenal.

The Statistician's Secret Weapon: Taming Intractable Probabilities

Let's now move from the tangible world of particles to the abstract world of inference and belief. Here, the "shapes" we want to sample are not physical distributions but probability distributions that represent our state of knowledge. This is the realm of Bayesian statistics, and it is where rejection sampling truly shines as a workhorse.

Suppose you're an engineer testing a new electronic component, and you observe it fails at a certain time t₀. You have some prior beliefs about the component's failure rate, λ, and you want to update these beliefs with your new data. Bayes' theorem tells you how to do this, but very often the resulting "posterior" distribution—the shape of your new belief—is a messy, complicated function with no familiar name. This is what happens when a non-conjugate prior meets the data. But to our rejection sampler, a name doesn't matter! As long as we can calculate the height of this posterior curve at any λ, we can sample from it. We pick a simpler, easy-to-sample proposal distribution, find a constant M that makes our proposal's envelope cover the posterior everywhere, and start throwing darts. The collection of accepted samples for λ gives us a direct, tangible representation of our updated knowledge, allowing us to calculate probabilities and credible intervals for a parameter we can never know for certain.
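Here is a sketch of how that looks in code, with illustrative choices not taken from the article: a component observed to fail at t₀ = 2, an exponential failure model with likelihood λ·exp(−λt₀), and a non-conjugate Lognormal(0, 1) prior. Using the prior itself as the proposal makes q cancel, so the envelope only has to cap the likelihood, whose maximum sits at λ = 1/t₀:

```python
import math
import random

def sample_posterior(n, t0=2.0, seed=0):
    """Rejection sampling from a posterior over a failure rate lam.
    Likelihood of a failure observed at t0: lam * exp(-lam * t0).
    Prior (illustrative, non-conjugate): Lognormal(0, 1).
    With the prior as proposal, accept with prob L(lam) / L_max."""
    rng = random.Random(seed)
    L_max = (1 / t0) * math.exp(-1)        # likelihood's peak, at lam = 1/t0
    draws = []
    while len(draws) < n:
        lam = math.exp(rng.gauss(0, 1))    # propose lam ~ prior
        L = lam * math.exp(-lam * t0)      # likelihood at this lam
        if rng.random() <= L / L_max:
            draws.append(lam)
    return draws

draws = sample_posterior(5_000)
post_mean = sum(draws) / len(draws)
```

The accepted draws are exact samples from the (unnamed, non-conjugate) posterior, ready for histograms or credible intervals.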

The modularity of this technique means it can even be plugged in as a component within larger, more complex algorithms. Consider a sophisticated financial model with many interacting parameters. A powerful method called Gibbs sampling tries to understand the joint distribution of all parameters by sampling each one individually from its "full conditional" distribution—its distribution given the current values of all the others. But what if one of these conditional distributions is a non-standard shape? No problem! We can simply embed a rejection sampler inside our Gibbs loop to handle that specific step. A particularly clever and efficient variant called Adaptive Rejection Sampling (ARS) can be used if the target happens to be log-concave (a very common property!). ARS automatically constructs a tight, piecewise linear "roof" over the log of the function, making it an incredibly fast and effective "sampler-within-a-sampler" for many modern statistical models, from finance to machine learning.

At the Frontiers of Science: When You Can't Even Write Down the Formula

The true power and generality of the rejection principle become apparent when we face problems so complex that we cannot even write down the probability function we want to sample from. This sounds impossible, but it is a common challenge in fields like population biology, epidemiology, and cosmology, where the models are massive simulations. The solution is a profound extension of rejection sampling called Approximate Bayesian Computation, or ABC.

Imagine you are an ecologist with field data on a fish population, and you have two competing models of population dynamics—say, a Ricker model and a Beverton-Holt model. The models are stochastic, meaning they involve randomness, which makes calculating the likelihood of your observed data under either model analytically intractable. How can you decide which model is better? ABC provides a brilliant workaround. You simply use the models to generate a huge number of simulated datasets. For each simulation, you compare it to your real dataset. If the simulation looks "close enough" (based on some summary statistics and a tolerance ε), you "accept" the parameters that generated it. If not, you "reject" them.

After running this process thousands of times for both models, you can compare how many accepted simulations each model produced. The ratio of acceptance counts provides a direct estimate of the Bayes factor, a formal measure of evidence favoring one model over the other. This is rejection sampling on a heroic scale. We are no longer rejecting points under a curve; we are rejecting entire simulated universes that fail to match our own. The underlying principle is identical to our simple dartboard, yet it allows us to perform statistical inference in the complete absence of a computable likelihood function, pushing the very boundaries of what is scientifically knowable.
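Here is a toy version of that logic in Python (an illustrative example, not the ecological models from the text): two candidate models for the mean of 50 Gaussian measurements, compared by counting which one more often simulates data landing within a tolerance ε of the observed summary statistic:

```python
import random

def abc_model_comparison(observed_mean, n_sims=20_000, eps=0.05, seed=0):
    """Toy ABC model comparison. Two models for the mean of 50
    Gaussian observations (model A: mu = 0, model B: mu = 1).
    Simulate datasets, accept those whose summary statistic (the
    sample mean) lands within eps of the observed one, and compare
    acceptance counts to estimate the evidence for each model."""
    rng = random.Random(seed)
    counts = {"A": 0, "B": 0}
    for _ in range(n_sims):
        model = rng.choice(["A", "B"])          # prior: models equally likely
        mu = 0.0 if model == "A" else 1.0
        sim_mean = sum(rng.gauss(mu, 1) for _ in range(50)) / 50
        if abs(sim_mean - observed_mean) < eps:  # "close enough" -> accept
            counts[model] += 1
    return counts

counts = abc_model_comparison(observed_mean=0.1)
# Data with mean 0.1 should make counts["A"] dwarf counts["B"]
```

The ratio of the two counters is the ABC estimate of the Bayes factor; shrinking ε sharpens the estimate at the cost of more rejected "universes."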

A Different Dimension: Sampling Events in Time

Finally, let's see how this versatile idea can be used to sample not just values, but the timing of events. In many physical and chemical systems, events like chemical reactions or radioactive decays don't happen at a constant rate. For example, a reaction might be driven by a laser pulse, meaning its rate R(t) changes dramatically over time. If we want to simulate this, we need to answer the question: given that we're at time t₀, when will the next event occur?

This is a tricky problem because the waiting-time distribution is non-trivial. The solution is an elegant algorithm known as "thinning," which is rejection sampling applied to the time domain. First, we find a constant rate R̄ that is always greater than or equal to our true, time-varying rate R(t). We then generate a proposal event time from a simple, homogeneous process with this constant rate R̄ (the waiting time for which is just an exponential random variable). This gives us a candidate time, t′. Now we perform our rejection step: we "accept" this proposed time with probability equal to the ratio of the true rate to our majorant rate at that instant, p_acc(t′) = R(t′) / R̄. If we accept, an event happens at time t′. If we reject, it's a "phantom" event; nothing happens in our system, but time advances to t′, and we start over from there. In this way, we are "thinning out" a simple stream of proposal events, keeping only a fraction of them to create a new process that perfectly mimics the complex, time-inhomogeneous behavior of the real system.
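The thinning loop fits in a few lines (a Python sketch; the linear "ramp" rate below is just an illustrative stand-in for a laser-driven R(t)):

```python
import random

def thinning(rate_fn, rate_max, t_end, seed=0):
    """Sample event times on [0, t_end] from an inhomogeneous
    Poisson process with rate R(t) = rate_fn(t), by thinning a
    homogeneous process of constant rate rate_max >= R(t)."""
    rng = random.Random(seed)
    t, events = 0.0, []
    while True:
        t += rng.expovariate(rate_max)       # next candidate time t'
        if t > t_end:
            break
        if rng.random() <= rate_fn(t) / rate_max:   # accept w.p. R(t')/R_max
            events.append(t)                 # real event; otherwise a phantom
    return events

# A rate that ramps up, like a pulse switching on: R(t) = 5t on [0, 1]
events = thinning(lambda t: 5 * t, rate_max=5.0, t_end=1.0, seed=1)
```

Phantom events cost computation but not correctness: rejected candidates simply advance the clock, exactly as described above.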

From placing atoms, to weighing evidence for scientific theories, to orchestrating the dance of molecules in time, the core principle of rejection sampling proves its worth again and again. It is a testament to the power of simple, intuitive ideas in computational science—a universal method for making correct choices by being very, very good at saying "no."