
Distribution of a Function of a Random Variable

SciencePedia
Key Takeaways
  • The distribution of a transformed variable Y = h(X) is found by expressing the event {Y ≤ y} in terms of X and using X's known distribution.
  • The Probability Integral Transform shows that for any continuous variable X, the transformed variable Y = F_X(X) is always uniformly distributed on [0, 1].
  • The inverse transform method leverages this principle to generate random numbers from complex distributions, forming the basis of Monte Carlo simulations.
  • Physical laws and engineering principles often act as functions that transform one random distribution into another, modeling phenomena from stock prices to signal processing.

Introduction

Random variables are the building blocks of probability theory, providing a framework for quantifying uncertainty. In nearly every scientific and engineering discipline, we encounter processes where a random quantity is transformed by a function—a signal passes through a filter, a financial asset grows over time, a physical measurement is converted to a different unit. This transformation creates a new random variable with a new, derived probability distribution. The central challenge, and the focus of this article, is to understand and calculate the distribution of this new variable based on the original. This article will equip you with the essential tools for this task. The first chapter, "Principles and Mechanisms," will lay the theoretical groundwork, introducing the core methods like the CDF approach and the change-of-variables formula. Following this, the "Applications and Interdisciplinary Connections" chapter will demonstrate how this mathematical machinery is a powerful language for modeling, simulating, and understanding complex systems in fields ranging from physics to finance and computer science.

Principles and Mechanisms

We have met the idea of a random variable—a number whose value is subject to the whims of chance. We learned to characterize it through its distribution, a kind of probabilistic identity card. But what happens when we take this number and change it? What if we square it, take its logarithm, or pass it through some electronic circuit that transforms it? We get a new random variable, with a new identity. The fascinating question is, how is the new identity card related to the old one? This journey of transformation is not just a mathematical curiosity; it is the heart of how we model the world, from the energy of a physical signal to the logic of a computer simulation.

The Master Key: The Cumulative Distribution Function

Before we can understand transformations, we must have a solid grasp of the single most important tool for describing any random variable: the cumulative distribution function (CDF), which we denote F_X(x). Its definition is deceptively simple: F_X(x) is the probability that the random variable X will take a value less than or equal to x.

F_X(x) = P(X ≤ x)

Think of it as an accountant of probability. As you move from left to right along the number line (increasing x), this function keeps a running total of all the probability you've accumulated. It always starts at 0 (far to the left, you haven't accumulated any probability) and ends at 1 (far to the right, you've accumulated all of it).

For a simple discrete variable, the CDF is a staircase. Imagine a digital memory cell that can be in state 0 or 1. If it has a probability p of being in state 0, its CDF is flat at 0 until you reach x = 0. At that exact point, the function suddenly jumps up by an amount p, because you've just included the "lump" of probability sitting at 0. It then stays flat at this new level until you reach x = 1, where it jumps again by the remaining probability, 1 − p, to a final height of 1. We can see the same staircase structure in the output of a simple Digital-to-Analog Converter (DAC) that randomly outputs one of four voltages. The CDF jumps at each possible voltage value, with the height of each step corresponding to the probability of that specific outcome. This staircase picture is the signature of a discrete random variable.
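The staircase CDF is easy to compute directly. Here is a minimal Python sketch; the four voltage levels and their probabilities are made-up values for illustration:

```python
import bisect

def discrete_cdf(values, probs):
    """Build the staircase CDF F(x) = P(X <= x) for a discrete variable."""
    pairs = sorted(zip(values, probs))
    xs = [v for v, _ in pairs]
    cum = []
    total = 0.0
    for _, p in pairs:
        total += p          # running total of accumulated probability
        cum.append(total)

    def F(x):
        i = bisect.bisect_right(xs, x)  # number of support points <= x
        return cum[i - 1] if i > 0 else 0.0

    return F

# Hypothetical 4-level DAC output: voltages and their probabilities
F = discrete_cdf([0.0, 1.0, 2.0, 3.0], [0.1, 0.2, 0.3, 0.4])
```

Evaluating F between two support points returns the height of the current step, exactly the flat part of the staircase.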

A Simple Stretch: The First Transformation

Now, let's start transforming things. Suppose we have a random variable X that can only be 0 or 1 (a Bernoulli variable), and we create a new variable Y by a simple linear transformation, say Y = 2X. What is the distribution of Y?

This is wonderfully straightforward. The possible outcomes for X were 0 and 1. The new outcomes for Y are simply 2 × 0 = 0 and 2 × 1 = 2. The probabilities themselves don't change; they are just carried over to these new values. If P(X = 0) = 2/3, then P(Y = 0) = 2/3. If P(X = 1) = 1/3, then P(Y = 2) = 1/3. The distribution has been "stretched".

To find the CDF of Y, F_Y(y) = P(Y ≤ y), we simply translate the question back into the language of X:

P(Y ≤ y) = P(2X ≤ y) = P(X ≤ y/2) = F_X(y/2)

This simple equation contains the essence of all transformations. To find a probability for the new variable Y, you must find the set of corresponding values for the original variable X and then use the known distribution of X.
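The translation step can be written down directly. A minimal sketch for the Bernoulli example above, with P(X = 0) = 2/3:

```python
from fractions import Fraction

def F_X(x):
    """CDF of the Bernoulli variable X with P(X=0) = 2/3, P(X=1) = 1/3."""
    if x < 0:
        return Fraction(0)
    if x < 1:
        return Fraction(2, 3)
    return Fraction(1)

def F_Y(y):
    """CDF of Y = 2X: translate {Y <= y} into {X <= y/2} and reuse F_X."""
    return F_X(y / 2)
```

The CDF of Y is just the CDF of X evaluated at y/2, so its jumps sit at 0 and 2 instead of 0 and 1.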

The Continuous World: Stretching Probability Density

For continuous variables, where probability is not in lumps but is spread smoothly like butter on bread, we use the probability density function (PDF), f_X(x). The area under the PDF curve between two points gives the probability that X falls in that interval. Now, when we transform a continuous variable with a function Y = h(X), we are essentially stretching and compressing the number line on which this probability "butter" is spread.

Imagine drawing a picture on a rubber sheet and then stretching it. Where the sheet is stretched, the ink thins out. Where it's compressed, the ink becomes denser. The PDF behaves just like this ink density. The mathematical rule for this is the change of variables formula. For a transformation Y = h(X) that is monotonic (always increasing or always decreasing), the new density f_Y(y) is related to the old density f_X(x) by:

f_Y(y) = f_X(x) |dx/dy|

Here, x is the value that produces y (i.e., x = h⁻¹(y)), and the term |dx/dy| is the "stretching factor." It tells us how much an infinitesimal interval around x is stretched or compressed when it is mapped to an interval around y.

Let's see this in action. In physics, the energy of a signal is often proportional to the square of its amplitude. Suppose a random energy X follows a chi-squared distribution with one degree of freedom, and we are interested in the amplitude Y = √X. The transformation is monotonic for positive energies. Applying the change of variables formula, we find that the PDF for the amplitude Y is given by f_Y(y) = √(2/π) exp(−y²/2) for y > 0. This is a beautiful result: the square root of a chi-squared variable with one degree of freedom is a "half-normal" variable. Energy and amplitude have this deep probabilistic connection.
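We can check this result by simulation. A chi-squared variable with one degree of freedom is the square of a standard normal, so Y = √X behaves like |Z| for Z ~ N(0, 1): the half-normal, whose mean is √(2/π) ≈ 0.798. A quick Monte Carlo sketch:

```python
import math
import random

random.seed(0)
N = 200_000
# X ~ chi-squared(1) is Z^2 for Z ~ N(0,1), so Y = sqrt(X) = |Z|
samples = [abs(random.gauss(0.0, 1.0)) for _ in range(N)]
empirical_mean = sum(samples) / N
half_normal_mean = math.sqrt(2.0 / math.pi)  # theoretical E[Y] ~ 0.798
```

The sample mean lands within Monte Carlo error of the half-normal mean, confirming the derived density.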

The same principle applies to any monotonic transformation, like a simple linear shift and scale, Y = 5X − 2. Even for a distribution as peculiar as the Cauchy distribution, this method works perfectly, showing how the new density is a scaled and shifted version of the old one.

A Universal Truth: The Probability Integral Transform

Now for a piece of pure mathematical magic. We've used various functions to transform variables. What if we choose a very special function: the variable's own CDF, F_X(x)? Let's define a new variable Y = F_X(X).

Let that sink in. We are mapping each value x to the total probability accumulated up to that point. What kind of variable is Y? The astounding answer is that, for any continuous random variable X, the new variable Y is always uniformly distributed on the interval [0, 1].

This is the probability integral transform, and it's a universal law of probability. The proof is as elegant as the result itself. We want to find the CDF of Y, which is G(y) = P(Y ≤ y).

G(y) = P(Y ≤ y) = P(F_X(X) ≤ y)

Since the CDF F_X is an increasing function, we can apply its inverse F_X⁻¹ to both sides of the inequality inside the probability:

G(y) = P(X ≤ F_X⁻¹(y))

But the probability that X is less than or equal to some value is, by the very definition of the CDF, just the CDF evaluated at that value! So,

G(y) = F_X(F_X⁻¹(y)) = y

The CDF of our new variable Y is simply G(y) = y for y ∈ [0, 1]. This is precisely the CDF of a uniform distribution on [0, 1]. It's as if the transformation F_X "flattens out" all the bumps and valleys of the original distribution, spreading the probability perfectly evenly.

This is no mere party trick. It is the engine behind most of the random numbers used in scientific computing. Computers are good at generating uniform random numbers. If we want a number from, say, an exponential distribution, we can use this principle in reverse. We start with a uniform random number U and apply the inverse of the exponential CDF to it. For an exponential distribution with rate 1, the CDF is F(y) = 1 − exp(−y). The transformation Y = −ln(1 − U), where U is uniform, will produce a variable Y that is perfectly exponentially distributed.
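A minimal sketch of this recipe, checking that the sample mean comes out near 1 (the mean of a rate-1 exponential):

```python
import math
import random

random.seed(1)
N = 100_000
# Inverse transform: U ~ Uniform(0,1), Y = -ln(1 - U) has CDF 1 - exp(-y)
samples = [-math.log(1.0 - random.random()) for _ in range(N)]
sample_mean = sum(samples) / N  # should be close to 1
```

Every call consumes one uniform number and emits one exponential one; no rejection or special machinery is needed.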

Beyond Monotonicity: When Functions Fold Back

Our simple change-of-variables rule worked for monotonic functions. But what about functions like Y = cos(X) or Y = X²? These functions "fold back" on themselves; multiple values of X can lead to the same value of Y. For instance, cos(π/3) = cos(5π/3) = 0.5.

In these cases, we must return to the fundamental definition of the CDF. Let's consider the phase of a signal, X, being uniformly random on [0, 2π], and we measure the amplitude Y = cos(X). To find F_Y(y) = P(Y ≤ y), we must solve the inequality cos(X) ≤ y. We look at the graph of the cosine function and identify all the regions of x in [0, 2π] that satisfy this condition. For any y in [−1, 1], this region consists of the interval [arccos(y), 2π − arccos(y)]. The total probability is the length of this interval divided by the length of the total space, 2π. This careful, direct approach allows us to find the CDF even when the transformation is complex and non-monotonic.
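Dividing the interval length (2π − 2 arccos(y)) by 2π gives the CDF F_Y(y) = 1 − arccos(y)/π, which a simulation readily confirms:

```python
import math
import random

random.seed(2)
N = 200_000
# Uniform phase on [0, 2*pi], observed amplitude Y = cos(X)
samples = [math.cos(random.uniform(0.0, 2.0 * math.pi)) for _ in range(N)]

def F_Y(y):
    # length of [arccos(y), 2*pi - arccos(y)] divided by 2*pi
    return 1.0 - math.acos(y) / math.pi

empirical = sum(1 for s in samples if s <= 0.5) / N  # F_Y(0.5) should be 2/3
```

At y = 0.5 the formula gives 1 − (π/3)/π = 2/3, and the empirical fraction matches it.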

Sometimes, a transformation can even turn a continuous variable into one that has discrete components. A function might map an entire interval of X values to a single point for Y, creating a "lump" of probability at that point and turning Y into a mixed discrete-continuous variable. The key, as always, is to patiently ask: for my desired range of Y values, what are all the corresponding X values?

A Different Angle: The Quantile Function

Finally, let's look at the problem from a completely different direction. We have the CDF, F(x), which takes a value x and gives a probability p. Its inverse, the quantile function Q(p) = F⁻¹(p), does the opposite: it takes a probability p and gives the value x below which that much probability is accumulated.

It turns out there's a deep connection between the PDF and the quantile function. The PDF, f(x), is related to the derivative of the quantile function:

f(x) = 1 / Q′(p), where p = F(x)

This makes intuitive sense. If Q(p) is changing rapidly (a large Q′(p)), it means you have to go far along the x-axis to accumulate a little more probability. This implies that the probability density f(x) must be low in that region. Conversely, if Q(p) is flat (a small Q′(p)), you accumulate a lot of probability over a small range of x, meaning the density f(x) must be high. Knowing the quantile function gives us an elegant, alternative path to finding the density of our random variable.
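A quick numerical check of this relation for the rate-1 exponential, where F(x) = 1 − e⁻ˣ, Q(p) = −ln(1 − p), and f(x) = e⁻ˣ:

```python
import math

def F(x):
    return 1.0 - math.exp(-x)      # exponential(1) CDF

def Q(p):
    return -math.log(1.0 - p)      # its quantile function

def f(x):
    return math.exp(-x)            # its PDF

x = 0.7
p = F(x)
h = 1e-6
Q_prime = (Q(p + h) - Q(p - h)) / (2 * h)  # central-difference derivative
recovered_density = 1.0 / Q_prime          # should equal f(0.7)
```

Here Q′(p) = 1/(1 − p) = eˣ, so 1/Q′(F(x)) = e⁻ˣ, exactly the density.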

From simple stretches to universal truths and back to first principles, the study of transformed random variables is a beautiful demonstration of unity in probability theory. It all boils down to one idea: carefully tracking how regions of probability are mapped, stretched, compressed, or folded from one space to another.

Applications and Interdisciplinary Connections

Now that we have explored the machinery of transforming random variables, we might ask, "What is all this good for?" It is a fair question. Are these just elegant mathematical exercises, or do they connect to the world in a meaningful way? The answer, you will be delighted to find, is that this machinery is not just connected to the world; it is a fundamental language for describing it. The universe, it turns out, is a grand master of applying functions to random variables. From the decay of a subatomic particle to the fluctuations of the stock market, nature is constantly transforming probability distributions. By understanding this process, we gain a powerful lens to view, model, and even predict the behavior of complex systems across an astonishing range of disciplines.

The Art of Creation: Forging Reality in a Computer

One of the most immediate and powerful applications of our topic is in the world of simulation. Often, we need to study a system that is too complex, too expensive, or too dangerous to experiment with directly. The solution is to build a replica inside a computer—a Monte Carlo simulation. But to do that, we need a way to generate random numbers that behave just like the random processes in the real system.

The beautiful trick is that we don't need a special machine for every type of randomness. We can start with the simplest, most "boring" kind of randomness imaginable: a number picked uniformly from the interval [0, 1], which we denote U ~ U(0, 1). Think of this as our primordial clay. From this single, simple ingredient, we can sculpt almost any distribution we desire. The tool for this sculpting is the inverse transform method.

Suppose we want to simulate the waiting time until a radioactive atom decays, a process that follows an exponential distribution. How do we do it? We start with our uniform random number, u, and apply a specific "recipe": a transformation function. In one such case, the transformation Y = −2 ln(X), where X ~ U(0, 1), was shown to produce a random variable Y that follows an exponential distribution with rate λ = 1/2. By simply taking the natural logarithm of a uniform random number and scaling it, we have "created" a virtual particle decay time. The general method involves finding the inverse of the cumulative distribution function (CDF) we wish to simulate. For example, to generate a variable from a specific Beta distribution, one can derive the transformation X = 1 − (1 − u)^(1/β) to convert a uniform variate u into a realization of the desired Beta variate X.

This technique is incredibly versatile. Want to generate a number from the strange and wonderful Cauchy distribution, known for its heavy tails and undefined mean? There's a recipe for that too. A clever transformation involving the tangent function, X = tan(π(U − 1/2)), will mold a uniform variable U into a perfect specimen of a Cauchy variable. This ability to generate arbitrary forms of randomness on demand is the bedrock of computational statistics, machine learning, and quantitative finance.
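A sketch of the tangent recipe. Since the Cauchy mean is undefined, we check the CDF instead: for a standard Cauchy, P(X ≤ 1) = 1/2 + arctan(1)/π = 0.75.

```python
import math
import random

random.seed(3)
N = 200_000
# Inverse-CDF transform for the standard Cauchy distribution
samples = [math.tan(math.pi * (random.random() - 0.5)) for _ in range(N)]
frac_below_one = sum(1 for x in samples if x <= 1.0) / N  # target: 0.75
```

Comparing quantiles or CDF values, rather than sample means, is the right validation strategy for heavy-tailed distributions whose moments don't exist.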

The Universe as a Grand Transformation

Beyond our computer simulations, the principles we've discussed are at play all around us. Physical laws and natural processes often act as functions that transform one type of randomness into another.

Consider a particle physics experiment where particles decay at random distances from a source. If the decay distance X follows a simple exponential law, what can we say about the intensity I of the energy pulse we measure? The intensity isn't random in the same way; it's governed by a physical law, the inverse-square law I = α/X². Here, the law of physics itself is the function g(X) that transforms the distribution of distances into a new distribution for intensities. By applying our change-of-variables formula, physicists can predict the probability of observing any given signal intensity, a crucial step in designing detectors and interpreting experimental results.
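A simulation sketch of this transformation, taking the decay distance X as Exp(1) and α = 1 (both made-up values): I = α/X² ≤ t exactly when X ≥ √(α/t), so P(I ≤ t) = exp(−√(α/t)).

```python
import math
import random

random.seed(4)
N = 200_000
alpha = 1.0  # hypothetical proportionality constant in I = alpha / X^2
# Decay distances X ~ Exp(1), transformed by the inverse-square law
samples = [alpha / random.expovariate(1.0) ** 2 for _ in range(N)]

t = 4.0
empirical = sum(1 for i in samples if i <= t) / N
theoretical = math.exp(-math.sqrt(alpha / t))  # about 0.607
```

Short decay distances map to huge intensities, which is why the intensity distribution acquires a heavy right tail even though the distance distribution is tame.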

This same idea is the cornerstone of modern financial mathematics. The price of a stock is notoriously unpredictable, but its movement is not entirely without structure. A widely used model, geometric Brownian motion, proposes that the stock price Y(t) at time t is the result of an exponential function applied to a random walk, or Brownian motion, W(t). That is, Y(t) = exp(W(t)). The underlying randomness W(t) represents the cumulative effect of countless small, unpredictable shocks to the market. The exponential function transforms this randomness into a model of compound growth, resulting in the famous log-normal distribution for stock prices. Understanding this transformation is the first step toward pricing financial derivatives and managing risk in a volatile world.
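At a fixed time t, W(t) is normal with variance t, so Y = exp(W(t)) is log-normal. A sketch at t = 1, checking the log-normal mean e^(1/2) ≈ 1.649:

```python
import math
import random

random.seed(5)
N = 400_000
# W(1) ~ N(0, 1), so Y = exp(W(1)) is log-normal with mean exp(1/2)
samples = [math.exp(random.gauss(0.0, 1.0)) for _ in range(N)]
sample_mean = sum(samples) / N  # target: exp(0.5)
```

Note that the mean exceeds exp(E[W]) = 1: exponentiating a symmetric random variable skews the distribution upward, a fact with real consequences for compound growth.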

Sometimes, these transformations reveal surprising and deep connections. In signal processing, a critical metric is the signal-to-noise ratio (SNR), often modeled as the ratio of two random variables, Z = X₁/X₂. If both the signal X₁ and the noise X₂ are modeled by the well-behaved, bell-shaped normal distribution, what does their ratio look like? One might intuitively guess it would also be "nice." The mathematics, however, reveals a shock: the ratio Z follows a Cauchy distribution. This is the same "wild" distribution we learned to simulate earlier, with heavy tails that account for unexpectedly large values. This result is profound. It tells engineers that even when dealing with well-behaved noise and signal sources, the ratio can be prone to extreme spikes, a fact with critical implications for designing robust communication systems.
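This surprise is easy to reproduce: divide two independent standard normals and compare the empirical CDF with the standard Cauchy's, for which P(Z ≤ 1) = 1/2 + arctan(1)/π = 0.75.

```python
import math
import random

random.seed(6)
N = 200_000
# Ratio of two independent standard normals
samples = [random.gauss(0.0, 1.0) / random.gauss(0.0, 1.0) for _ in range(N)]
frac_below_one = sum(1 for z in samples if z <= 1.0) / N
cauchy_cdf_at_one = 0.5 + math.atan(1.0) / math.pi  # = 0.75
```

Plotting a histogram of these samples would also reveal the occasional enormous spike, the signature of the Cauchy's heavy tails.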

The Logic of Extremes: Order from Chaos

Nature often selects from a collection of random outcomes. A chain is only as strong as its weakest link. A convoy of ships is only as fast as its slowest vessel. These simple truths are, in fact, statements about the distribution of a function of random variables—specifically, the minimum or the maximum.

In reliability engineering, a system might consist of two components working in series. The system fails as soon as the first component fails. If the lifetimes of the two components, X₁ and X₂, are independent random variables, the lifetime of the system is Y = min(X₁, X₂). If the components have exponential lifetimes, a remarkable thing happens: the system's lifetime, the minimum of the two, is also exponentially distributed, but with a faster failure rate equal to the sum of the individual rates. This elegant result is a cornerstone of survival analysis and helps engineers design more robust systems.
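A sketch with two hypothetical failure rates, λ₁ = 1 and λ₂ = 2: the minimum should be exponential with rate λ₁ + λ₂ = 3, hence mean 1/3.

```python
import random

random.seed(7)
N = 200_000
lam1, lam2 = 1.0, 2.0  # hypothetical component failure rates
# System lifetime is the minimum of the two component lifetimes
lifetimes = [min(random.expovariate(lam1), random.expovariate(lam2))
             for _ in range(N)]
mean_lifetime = sum(lifetimes) / N  # target: 1 / (lam1 + lam2) = 1/3
```

The derivation behind this is one line: P(Y > y) = P(X₁ > y) P(X₂ > y) = e^(−λ₁y) e^(−λ₂y) = e^(−(λ₁+λ₂)y).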

The flip side of this is the maximum. Imagine a multi-core processor running N parallel tasks. The entire job is not finished until the last task is complete. If the time for each task X_i is a random variable, the total time for the job is Y = max(X₁, X₂, …, X_N). Even if each individual task time is uniformly distributed (meaning any time up to a maximum is equally likely), the distribution of the total job time Y is anything but uniform. The probability becomes heavily skewed towards the maximum possible time, because it only takes one "straggler" task to delay the entire computation. This simple model helps computer scientists understand and mitigate bottlenecks in parallel computing, a critical challenge in the age of big data.
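For task times uniform on [0, 1], the CDF of the maximum of N independent tasks is F_Y(y) = yᴺ, giving E[Y] = N/(N + 1). A sketch with a made-up N = 8:

```python
import random

random.seed(8)
trials = 100_000
N_tasks = 8  # hypothetical number of parallel tasks
# Job time is the slowest of the N task times
job_times = [max(random.random() for _ in range(N_tasks))
             for _ in range(trials)]
mean_job_time = sum(job_times) / trials  # target: 8/9
```

With only 8 tasks the expected job time is already 8/9 of the worst case, which quantifies the straggler effect.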

The Frontier: Guiding Discovery with Probability

Perhaps the most exciting application of our concept is not just in describing the world, but in actively making decisions to learn more about it. This is the domain of Bayesian optimization, a cutting-edge technique used in fields from drug discovery to materials science and protein engineering.

Imagine you are an engineer trying to design a new enzyme with the highest possible activity. Testing each possible protein sequence is impossible. Instead, you test a few, and use the results to build a statistical model (a Gaussian process) of the "sequence-to-function" landscape. For any new sequence you haven't tested, your model doesn't give you a single predicted value of its fitness; it gives you a whole probability distribution for its fitness, say Y(x) ~ N(μ(x), σ²(x)).

Now comes the crucial question: which sequence should you test next? You want to choose the one that is most likely to be better than the best you've found so far, y⋆. This leads to the idea of "Improvement," defined as I(x) = max{0, Y(x) − y⋆}. Notice that because Y(x) is a random variable, the Improvement I(x) is also a random variable. We don't know what the improvement will be, but we can calculate its expected value, known as the Expected Improvement, EI(x) = E[I(x)]. This calculation is a direct application of our core topic: finding the expectation of a function of a random variable. The final formula,

EI(x) = (μ(x) − y⋆) Φ((μ(x) − y⋆)/σ(x)) + σ(x) φ((μ(x) − y⋆)/σ(x)),

beautifully balances exploiting known high-performance regions (high μ(x)) with exploring uncertain ones (high σ(x)). By calculating the EI for all candidate sequences and choosing the one with the highest value, scientists can intelligently navigate vast search spaces, dramatically accelerating the pace of discovery.
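The closed form can be cross-checked against a direct Monte Carlo estimate of E[max(0, Y − y⋆)]; the μ, σ, and y⋆ values below are arbitrary illustrative choices.

```python
import math
import random

def norm_pdf(z):
    """Standard normal density phi(z)."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def norm_cdf(z):
    """Standard normal CDF Phi(z), via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def expected_improvement(mu, sigma, y_best):
    """Closed-form EI for Y ~ N(mu, sigma^2) against incumbent y_best."""
    z = (mu - y_best) / sigma
    return (mu - y_best) * norm_cdf(z) + sigma * norm_pdf(z)

random.seed(9)
mu, sigma, y_best = 1.0, 0.5, 1.2  # arbitrary illustrative values
N = 200_000
mc_estimate = sum(max(0.0, random.gauss(mu, sigma) - y_best)
                  for _ in range(N)) / N
ei = expected_improvement(mu, sigma, y_best)  # about 0.115
```

Even though μ < y⋆ here, EI is strictly positive: the uncertainty σ leaves a real chance of improvement, which is exactly the exploration incentive the formula encodes.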

From the heart of a computer to the heart of a star, from the price of a stock to the design of a life-saving protein, the transformation of random variables is a unifying thread. It is a testament to how a single, elegant mathematical idea can provide a powerful and versatile toolkit for understanding, simulating, and interacting with a world steeped in randomness and uncertainty.