
In science, engineering, and finance, we constantly work with quantities that exhibit randomness. While we might model a single phenomenon with a random variable, the questions we truly need to answer often involve combinations or transformations of these variables. For instance, what is the distribution of total error from multiple independent sources, or how does the price of an asset (a function of its returns) behave over time? The core challenge lies in moving from the known probability distribution of an input variable to the unknown distribution of the output function.
This article provides a comprehensive guide to the essential mathematical tools used to solve this problem. It is structured to take the reader on a journey from direct, intuitive methods to more powerful and abstract techniques. The first chapter, "Principles and Mechanisms," introduces the direct change of variables method for simple transformations and then reveals the elegance and power of generating functions—the PGF, MGF, and Characteristic Function—which transform difficult convolutions into simple multiplication. The second chapter, "Applications and Interdisciplinary Connections," demonstrates how this theoretical machinery is used to build realistic models of complex systems across diverse fields, from signal processing to mathematical finance, showing how simple probabilistic building blocks can be composed into rich, descriptive models of the real world.
Imagine you are a physicist, an engineer, or a biologist. The world you study is teeming with randomness. The thermal noise in a circuit, the measurement error from a sensitive instrument, the lifetime of a radioactive atom—all these are quantities we can't predict with certainty. Instead, we describe them with random variables and their probability distributions. We might know, for instance, the distribution for the temperature in a room, let's call it $T$. But what we really care about might be the pressure, which is a function of that temperature, say $P = g(T)$. Or perhaps we are interested in the total error from several independent sources, $E = E_1 + E_2 + \dots + E_n$.
Our central question is this: if we know everything about our original random variables, how can we describe the behavior of the new variables we create from them? This is not just an academic exercise. It is the heart of modeling complex systems. The journey to the answer reveals a beautiful and profound principle in mathematics and science: sometimes, the most direct path is not the easiest. Sometimes, a clever detour through an abstract world makes an impossible problem surprisingly simple.
Let's start with the most direct approach. If we have a continuous random variable $X$ with a known probability density function (PDF), $f_X(x)$, and we create a new variable $Y = g(X)$, can we find its PDF, $f_Y(y)$? Yes, we can, using a method called the change of variables technique.
Think of the PDF as describing how "probability mass" is spread out over the possible values of $X$. When we apply the function $g$, we are stretching and squeezing this number line. If the function is steep in a certain region, a narrow range of $x$ values gets stretched over a wide range of $y$ values, spreading the probability mass thin and making the density there lower. If the function is flat, a wide range of $x$ values gets compressed into a narrow range of $y$ values, concentrating the mass and making the density higher.
The mathematics formalizes this intuition. For a strictly monotonic (always increasing or always decreasing) function $g$, the formula is:

$$f_Y(y) = f_X\!\left(g^{-1}(y)\right) \left| \frac{d}{dy}\, g^{-1}(y) \right|$$
Here, $g^{-1}$ is the inverse function—it tells us which $x$ produced a given $y$. The term $\left| \frac{d}{dy}\, g^{-1}(y) \right|$ is the "stretching factor." It’s the absolute value of the derivative of the inverse function, and it precisely accounts for how much the probability density needs to be adjusted.
A classic example comes from finance. The price of a stock is always positive. A common model assumes that its continuously compounded return, $X$, follows a normal distribution (the famous "bell curve"). What, then, is the distribution of the price ratio itself, $Y = e^X$? Here, $X$ can be positive or negative, but $Y$ must be positive. Using our formula, we find that the PDF of $Y$ is the celebrated log-normal distribution. This shows how a symmetric distribution for returns (the bell curve) can lead to a skewed distribution for prices, a phenomenon observed in real markets.
Another example comes from signal processing. The energy of a random signal might follow a chi-squared distribution with one degree of freedom, $X \sim \chi^2(1)$. What is the distribution of the signal's amplitude, $Y = \sqrt{X}$? By applying the change of variables formula, we discover the amplitude follows a half-normal distribution. This direct method works beautifully for these one-to-one transformations. But what if we want to add two random variables together? The "brute force" method for that involves a messy calculation called a convolution, which can quickly become a mathematical nightmare. We need a more elegant weapon.
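To make the half-normal claim concrete, here is a minimal Monte Carlo sketch (illustrative, not part of the original discussion): samples of $Y = \sqrt{X}$ with $X \sim \chi^2(1)$ should match the density $f_Y(y) = \sqrt{2/\pi}\, e^{-y^2/2}$ predicted by the change-of-variables formula.

```python
# Sketch: check the change-of-variables result Y = sqrt(X), X ~ chi-squared(1).
# The predicted density is half-normal: f_Y(y) = sqrt(2/pi) * exp(-y^2/2), y >= 0.
import math
import random

random.seed(0)
n = 200_000
# A chi-squared(1) variable is the square of a standard normal,
# so Y = sqrt(X) = |Z| for Z ~ N(0, 1).
samples_y = [abs(random.gauss(0.0, 1.0)) for _ in range(n)]

def half_normal_pdf(y):
    return math.sqrt(2.0 / math.pi) * math.exp(-y * y / 2.0)

# Compare the empirical density in a few histogram bins to the formula
width = 0.2
for y0 in (0.1, 0.5, 1.0, 2.0):
    count = sum(1 for y in samples_y if y0 - width / 2 <= y < y0 + width / 2)
    empirical = count / (n * width)
    print(f"y={y0}: empirical {empirical:.3f} vs formula {half_normal_pdf(y0):.3f}")
```

The agreement in each bin is a direct numerical confirmation of the "stretching factor" at work.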
Here is where the magic begins. Instead of working with the probability distributions directly, we can transform our entire random variable into a special kind of function. This idea might seem strange at first. Why make things more abstract? The reason is that operations that are hard in the world of random variables (like addition) often become incredibly easy in the world of their transformed functions (like multiplication). This is a trick that physicists and engineers have used for centuries with tools like the Fourier and Laplace transforms to solve problems in waves, heat flow, and circuits.
We are going to explore a family of such transforms for probability: the generating functions. Each acts as a unique fingerprint or signature for a random variable. If two variables have the same generating function, they have the same distribution. But their true power lies in how they behave under algebraic manipulation.
Let's meet the three most important members of this family. Each is defined as an expected value, which is a probability-weighted average of some function.
For a random variable $X$ that can only take on non-negative integer values $0, 1, 2, \dots$, the Probability Generating Function (PGF) is defined as:

$$G_X(s) = E[s^X] = \sum_{k=0}^{\infty} P(X = k)\, s^k$$
Look closely at this definition. It’s a power series in a dummy variable $s$, and the coefficient of $s^k$ is precisely the probability that the random variable equals $k$! The PGF is literally a catalogue of all the probabilities of $X$, neatly packaged into a single function.
Imagine a simple 2-bit computer register that holds a value from $\{0, 1, 2, 3\}$ with equal probability. The PGF for the value in the register would be $G(s) = \frac{1}{4}(1 + s + s^2 + s^3)$. The function itself contains all the probabilistic information in its coefficients.
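As a small illustration (a Python sketch with illustrative names, not from the text), we can store this PGF as its list of coefficients and read both the total probability $G(1) = 1$ and the mean $E[X] = G'(1)$ straight off it:

```python
# Sketch: a PGF represented by its coefficient list, for the 2-bit register.
# pgf[k] is the coefficient of s^k, i.e. P(X = k).
pgf = [0.25, 0.25, 0.25, 0.25]

def pgf_eval(coeffs, s):
    # Evaluate G(s) = sum_k P(X = k) * s^k
    return sum(c * s**k for k, c in enumerate(coeffs))

total = pgf_eval(pgf, 1.0)                   # G(1) = 1: probabilities sum to one
mean = sum(k * c for k, c in enumerate(pgf)) # G'(1) = E[X]
print(total, mean)  # 1.0 and 1.5
```

Evaluating at $s = 1$ recovers the normalization, and the derivative at $s = 1$ hands us the mean with no integration at all.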
The PGF is great, but it only works for non-negative integers. We need something more general. Enter the Moment Generating Function (MGF), defined for both discrete and continuous variables:

$$M_X(t) = E[e^{tX}]$$
Where does the name come from? This function is a "machine" for generating the moments of a distribution (like the mean $E[X]$, the variance-related second moment $E[X^2]$, and so on). If you take the derivatives of $M_X(t)$ with respect to $t$ and then set $t = 0$, you get the moments! Specifically, the $n$-th derivative at zero gives $E[X^n]$.
Let’s start with the simplest possible case: a "random" variable that isn't random at all. Imagine a manufacturing process so perfect that every component has a characteristic value of exactly $c$. The random variable is just the constant $X = c$. Its MGF is $M_X(t) = E[e^{tc}] = e^{ct}$. It's a simple exponential.
Now for a real coin flip, a Bernoulli trial where $X = 1$ with probability $p$ (heads) and $X = 0$ with probability $1 - p$ (tails). The MGF is the weighted average of $e^{tX}$ for these two outcomes: $M_X(t) = (1 - p) + p e^t$. This compact function elegantly encodes the two-pronged nature of the event.
The MGF truly shines when we perform linear transformations. If you have a variable $X$ with MGF $M_X(t)$ and create a new variable $Y = aX + b$, you don't need to go back to the PDF. The new MGF is simply $M_Y(t) = e^{bt} M_X(at)$. This simple rule is incredibly powerful for analyzing scaled and shifted data.
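A quick numerical sanity check (an illustrative sketch; the values $a = 2$, $b = 3$, $p = 0.3$, $t = 0.7$ are arbitrary choices) confirms the rule $M_{aX+b}(t) = e^{bt} M_X(at)$ for a Bernoulli variable by computing both sides exactly:

```python
# Sketch: verify M_{aX+b}(t) = e^{bt} * M_X(at) for X ~ Bernoulli(p).
import math

p, a, b, t = 0.3, 2.0, 3.0, 0.7

def mgf_bernoulli(t):
    # M_X(t) = (1 - p) + p * e^t
    return (1 - p) + p * math.exp(t)

# Direct MGF of Y = aX + b: Y = b with prob 1-p, Y = a+b with prob p
lhs = (1 - p) * math.exp(b * t) + p * math.exp((a + b) * t)
# The shortcut rule
rhs = math.exp(b * t) * mgf_bernoulli(a * t)
print(lhs, rhs)  # identical up to rounding
```

The two sides agree to machine precision, which is exactly what the algebraic identity promises.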
The MGF is a fantastic tool, but it has a small catch: for some distributions with very "heavy tails," the expectation might not exist because the exponential grows too fast. To solve this, we introduce the king of all generating functions: the Characteristic Function (CF). It is defined as:

$$\varphi_X(t) = E[e^{itX}]$$
The only difference is that tiny letter $i$, the imaginary unit. By venturing into the world of complex numbers, we have created a tool that is universally applicable—the characteristic function always exists for any random variable. This is because $e^{itX} = \cos(tX) + i \sin(tX)$, and since sine and cosine are bounded between -1 and 1, the expectation is always finite.
The CF is, in fact, the Fourier transform of the probability density function. This deep connection to one of the most fundamental tools in physics and engineering tells us we are on the right track. Many properties carry over from the MGF. For our degenerate variable $X = c$, the CF is $\varphi_X(t) = e^{ict}$.
The complex nature of the CF reveals profound structural information about the distribution. For instance, what is the CF of $-X$? It is $\varphi_{-X}(t) = E[e^{-itX}] = \varphi_X(-t)$. Using the properties of complex numbers, this is also equal to the complex conjugate of the original CF, $\overline{\varphi_X(t)}$.
This leads to a beautiful insight: what if a distribution is symmetric around the origin (like the normal distribution)? This means $X$ and $-X$ have the same distribution, so they must have the same CF. This implies that $\varphi_X(t) = \varphi_{-X}(t)$. But we just saw that $\varphi_{-X}(t) = \overline{\varphi_X(t)}$. Therefore, for a symmetric distribution, we must have $\varphi_X(t) = \overline{\varphi_X(t)}$, which is the definition of a real number. The characteristic function of any symmetric random variable must be purely real-valued! The imaginary parts, generated by the sine components, perfectly cancel out over the distribution.
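The simplest symmetric example makes the cancellation visible. For $X = \pm 1$ with equal probability, the CF is $\tfrac{1}{2} e^{it} + \tfrac{1}{2} e^{-it} = \cos t$, purely real. A tiny sketch (illustrative, not from the text) computes it with complex arithmetic and watches the imaginary part vanish:

```python
# Sketch: the CF of the symmetric two-point variable X = +/-1 (prob 1/2 each)
# is 0.5*e^{it} + 0.5*e^{-it} = cos(t), so its imaginary part cancels exactly.
import cmath
import math

def cf_symmetric_pm1(t):
    return 0.5 * cmath.exp(1j * t) + 0.5 * cmath.exp(-1j * t)

for t in (0.3, 1.0, 2.5):
    z = cf_symmetric_pm1(t)
    print(t, z.real, z.imag)  # real part is cos(t); imaginary part is ~0
```

The sine contributions from $+1$ and $-1$ are equal and opposite, exactly as the conjugate-symmetry argument predicts.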
We have now assembled our toolkit of generating functions. Why did we go on this long, abstract detour? Here is the spectacular payoff: sums of independent random variables.
Suppose $X$ and $Y$ are independent. What is the distribution of their sum, $Z = X + Y$? Let's look at the MGF of $Z$:

$$M_Z(t) = E[e^{tZ}] = E[e^{t(X+Y)}] = E[e^{tX} e^{tY}]$$
Because $X$ and $Y$ are independent, the expectation of their product is the product of their expectations:

$$M_Z(t) = E[e^{tX}]\, E[e^{tY}] = M_X(t)\, M_Y(t)$$
This is the punchline. The difficult operation of convolution in the original space has become simple multiplication in the transform space. To find the distribution of a sum of independent variables, you just multiply their generating functions.
Let's see this in action. Consider the total number of flipped bits in a sequence of $n$ transmissions, where each bit has an independent probability $p$ of flipping. The total number of flips, $X = X_1 + X_2 + \dots + X_n$, is the sum of $n$ independent Bernoulli variables. We already know the CF for a single Bernoulli trial is $\varphi_{X_k}(t) = (1 - p) + p e^{it}$. Since the trials are independent, the CF for the sum is just the product:

$$\varphi_X(t) = \left[ (1 - p) + p e^{it} \right]^n$$
This is the characteristic function for the Binomial distribution. We have derived it not through complex combinatorial arguments, but with a few lines of simple algebra. This elegance and power is the reason generating functions are a cornerstone of probability theory. They transform daunting problems of addition into trivial ones of multiplication, revealing a hidden simplicity in the laws of chance. It is this very principle that, when taken to its limit, gives rise to the most important result in all of probability: the Central Limit Theorem.
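The derivation above can be checked numerically (an illustrative sketch; $n = 8$, $p = 0.3$, $t = 1.1$ are arbitrary choices): the product of $n$ Bernoulli CFs should agree with the CF computed directly from the Binomial probability mass function.

```python
# Sketch: compare [(1-p) + p*e^{it}]^n against the CF computed term by term
# from the Binomial(n, p) pmf, sum_k C(n,k) p^k (1-p)^(n-k) e^{itk}.
import cmath
from math import comb

n, p, t = 8, 0.3, 1.1

product_form = ((1 - p) + p * cmath.exp(1j * t)) ** n
direct_form = sum(
    comb(n, k) * p**k * (1 - p) ** (n - k) * cmath.exp(1j * t * k)
    for k in range(n + 1)
)
print(abs(product_form - direct_form))  # ~0: the two forms coincide
```

The "few lines of simple algebra" and the brute-force combinatorial sum give the same complex number, as they must.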
We have spent some time learning the formal machinery for dealing with functions of random variables—the rules of the game, so to speak. We have our change-of-variable formulas, our Jacobian determinants, and our powerful moment-generating and characteristic functions. But what is it all for? Why do we bother with this elaborate calculus of probability? The answer, and it is a truly beautiful one, is that this machinery is our toolkit for building mathematical descriptions of the real world.
The world is not, in general, made of simple, elementary random events. It is a grand, composite structure. The noise in a communication signal is not a single entity; it is the superposition of countless tiny thermal agitations. The number of insurance claims a company receives in a year is not governed by one simple parameter, but by a complex interplay of risk factors, some continuous and some discrete. The price of a stock is not just a random number, but the result of a long, meandering random walk through time.
The art and science of modeling this complexity lies in understanding how to compose it from simpler, more manageable probabilistic building blocks. The mathematics of functions of random variables is the language of this composition. It allows us to take simple, well-understood distributions—the Gaussian, the Poisson, the Exponential—and combine them through addition, multiplication, division, and even more exotic transformations, to generate new distributions that capture the richness of reality. Let us now embark on a journey to see how this works, to witness how these abstract tools breathe life into models across science, engineering, and finance.
Perhaps the most basic way to combine things is through arithmetic. What happens when we add, subtract, or divide random quantities?
Imagine you are tracking two independent processes that involve discrete counts—for example, the number of particles of a certain type created ($N_1$) versus annihilated ($N_2$) in a small volume of space over a minute, or the number of goals scored by the home team versus the away team in a soccer league. If both of these counts can be modeled by independent Poisson processes, what can we say about the difference $D = N_1 - N_2$? Using the algebra of moment-generating functions, we can find the MGF of $D$ by simply multiplying the MGF of $N_1$ with the MGF of $N_2$ evaluated at $-t$. The resulting distribution, known as the Skellam distribution, gives us a precise way to calculate the probability of any given net difference, a tool immensely useful in fields from physics to sports analytics.
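A simulation sketch (illustrative; the rates $\mu_1 = 3$ and $\mu_2 = 2$ are arbitrary) exhibits the behaviour the MGF algebra predicts: the Skellam difference has mean $\mu_1 - \mu_2$ and variance $\mu_1 + \mu_2$.

```python
# Sketch: difference of two independent Poisson counts (Skellam distribution).
import math
import random

random.seed(1)

def poisson(mu):
    # Knuth's multiplicative method, fine for small rates
    L, k, prod = math.exp(-mu), 0, 1.0
    while True:
        prod *= random.random()
        if prod <= L:
            return k
        k += 1

mu1, mu2, n = 3.0, 2.0, 100_000
diffs = [poisson(mu1) - poisson(mu2) for _ in range(n)]
mean = sum(diffs) / n
var = sum((d - mean) ** 2 for d in diffs) / n
print(mean, var)  # close to mu1 - mu2 = 1.0 and mu1 + mu2 = 5.0
```

Variances add even under subtraction, because the MGF of $-N_2$ contributes $\operatorname{Var}(N_2)$ with a positive sign.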
Now, let’s try a different operation: division. In signal processing, noise is a constant companion. A common model for noise in a communication channel involves two independent components, an "in-phase" component $X$ and a "quadrature" component $Y$, both of which fluctuate around zero according to a standard normal (Gaussian) distribution. An engineer might be interested in the "noise aspect ratio," $Z = X / Y$. What does the distribution of this ratio look like? We start with two of the most well-behaved distributions imaginable, the iconic bell curves. But their ratio, as revealed by the change-of-variables formula, is something entirely different and rather wild: the Cauchy distribution. This distribution has the peculiar property that its mean and variance are undefined! It has "heavy tails," meaning that extreme values are far more likely than for a Gaussian. This is a profound lesson: combining simple, well-behaved components can lead to complex systems with surprising, "pathological" behavior. This insight is crucial for designing robust systems that can handle occasional but very large noise spikes.
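Because the Cauchy mean is undefined, a simulation cannot be validated against a sample mean; a quantile works instead. For a standard Cauchy, $P(|Z| \le 1) = 1/2$, and the ratio of two simulated standard normals reproduces this (an illustrative sketch):

```python
# Sketch: the ratio of two independent standard normals is standard Cauchy.
# We check the quantile P(|Z| <= 1) = 1/2 rather than the (undefined) mean.
import random

random.seed(2)
n = 200_000
ratios = []
while len(ratios) < n:
    y = random.gauss(0.0, 1.0)
    if y != 0.0:  # guard against (vanishingly unlikely) division by zero
        ratios.append(random.gauss(0.0, 1.0) / y)

frac_within_1 = sum(1 for z in ratios if abs(z) <= 1.0) / n
largest = max(abs(z) for z in ratios)
print(frac_within_1, largest)  # ~0.5, and a strikingly large extreme value
```

The occasional enormous ratio (when the denominator lands near zero) is the heavy tail in action.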
Reality is often a mix of the continuous and the discrete. Consider a system that has some baseline, continuous random behavior, but is also subject to sudden, discrete "shocks." This could model an insurance portfolio with a steady stream of small claims ($X$, a Gamma-distributed variable) plus the possibility of a single, massive catastrophic claim ($Y$, a Bernoulli variable that is either 0 or 1). The total loss is $L = X + Y$. Again, because the two sources of risk are independent, the MGF of the total loss is simply the product of the individual MGFs. This allows actuaries to construct a precise model of their total risk, blending a continuous process with a discrete event, all through the simple multiplication of their corresponding MGFs.
Sometimes, applying a function to a random variable doesn't just combine things—it reveals an entirely new and unexpected mathematical structure, a hidden symmetry of the random world.
Consider a point chosen on a circle by picking an angle $\Theta$ uniformly at random from $[0, 2\pi)$. What can we say about its x-coordinate, $X = \cos\Theta$? This seems like a simple geometric transformation. Yet, if we compute the characteristic function of $X$, a calculation that involves a simple integral over the uniform distribution of the angle, we find something remarkable. The result is no elementary function, but $J_0(t)$, the Bessel function of the first kind of order zero. These Bessel functions appear everywhere in physics and engineering, describing the modes of a vibrating drumhead, the diffraction of light through a circular aperture, and the propagation of electromagnetic waves in a cylindrical guide. It is astonishing that this fundamental function, central to the physics of waves and oscillations, emerges directly from the simple act of projecting a random point on a circle.
Let's look at another transformation: squaring. In physics, the kinetic energy of a particle is proportional to the square of its velocity. If the velocity components of gas molecules in a container are modeled by centered normal distributions, what is the distribution of their kinetic energy? This question leads us to study the variable $Y = X^2$, where $X \sim N(0, \sigma^2)$. By computing the moment-generating function, we find that $Y$ follows a Gamma distribution, which is a scaled version of the famous Chi-squared distribution. This Chi-squared distribution is the bedrock of modern statistical testing, used for everything from checking if a die is fair (goodness-of-fit tests) to making inferences about the variance of a population. The link is direct: the randomness of position or velocity (Gaussian) is transformed into the randomness of energy or squared error (Chi-squared).
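A short simulation (an illustrative sketch; $\sigma = 1.5$ is an arbitrary choice) confirms the moments this Gamma result implies: $E[X^2] = \sigma^2$ and $\operatorname{Var}(X^2) = 2\sigma^4$.

```python
# Sketch: if X ~ N(0, sigma^2), then Y = X^2 is sigma^2 times a chi-squared(1)
# variable, i.e. Gamma-distributed, with mean sigma^2 and variance 2*sigma^4.
import random

random.seed(3)
sigma, n = 1.5, 200_000
ys = [random.gauss(0.0, sigma) ** 2 for _ in range(n)]

mean = sum(ys) / n
var = sum((y - mean) ** 2 for y in ys) / n
print(mean, var)  # close to sigma^2 = 2.25 and 2*sigma^4 = 10.125
```

Note how squaring turns a symmetric bell curve into a strictly positive, right-skewed distribution, the same qualitative change we saw with the log-normal.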
Our models so far have assumed that the parameters—the $\lambda$ of a Poisson or the $\sigma$ of a Gaussian—are fixed, known numbers. But what if we are uncertain about the parameters themselves? What if the "rate" of an event is not constant, but fluctuates randomly?
This leads to the powerful idea of hierarchical, or mixed, models. Imagine modeling traffic accidents at an intersection. We might start by assuming the number of accidents per month, $N$, follows a Poisson distribution with rate $\lambda$. But is the "riskiness" of every intersection the same? Of course not. Some are inherently more dangerous than others. So, we can model the rate parameter $\lambda$ itself as a random variable, drawn from, say, a Gamma distribution, which is a flexible distribution for positive quantities. We now have a two-level model: $\lambda$ is drawn from a Gamma distribution, and then $N$ is drawn from a Poisson distribution with that specific $\lambda$.
To find the unconditional distribution of $N$, we must average over all possible values of the rate parameter $\lambda$. The law of total expectation provides an elegant way to do this with moment-generating functions. The result of this Poisson-Gamma mixture is a new distribution: the Negative Binomial distribution. This distribution has a larger variance than a simple Poisson, a property called "overdispersion," which is exactly what we observe in countless real-world datasets where the underlying rate is not constant. This hierarchical approach is a cornerstone of modern Bayesian statistics, allowing us to build far more realistic and robust models of complex phenomena.
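The overdispersion is easy to see in simulation (an illustrative sketch; a Gamma with shape $k = 2$ and scale $\theta = 1.5$ is an arbitrary choice): drawing $\lambda$ from the Gamma and then $N$ from a Poisson with that $\lambda$ yields counts with mean $k\theta$ but variance $k\theta(1 + \theta)$, strictly larger than the mean.

```python
# Sketch of the Poisson-Gamma hierarchy: lambda ~ Gamma(shape, scale),
# then N ~ Poisson(lambda). The mixture is Negative Binomial (overdispersed).
import math
import random

random.seed(4)

def poisson(mu):
    # Knuth's multiplicative method, fine for small rates
    L, k, prod = math.exp(-mu), 0, 1.0
    while True:
        prod *= random.random()
        if prod <= L:
            return k
        k += 1

shape, scale, n = 2.0, 1.5, 100_000
counts = [poisson(random.gammavariate(shape, scale)) for _ in range(n)]

mean = sum(counts) / n
var = sum((c - mean) ** 2 for c in counts) / n
print(mean, var)  # mean ~ 3.0 but variance ~ 7.5: var > mean, unlike a pure Poisson
```

For a pure Poisson the mean and variance coincide; the gap between them here is the signature of the randomly fluctuating rate.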
So far, we have mostly considered static random variables. But many phenomena unfold in time: the random jitter of a particle in water, the fluctuating price of a financial asset. These are described by stochastic processes, which are essentially random variables with a time index. The mathematics of functions of random variables extends beautifully to this dynamic realm.
The archetypal continuous-time process is the Wiener process, or Brownian motion, $W(t)$. One of its defining features is that for any time $t > 0$, the random variable $W(t)$ is normally distributed with mean 0 and variance $t$. This leads to a fascinating scaling property. If we define a new random variable by scaling the process at time $t$ like so: $Z = W(t)/\sqrt{t}$, a simple change-of-variables calculation shows that $Z$ is a standard normal variable, $Z \sim N(0, 1)$. This means that the process looks statistically the same at all time scales, a property known as self-similarity. A graph of Brownian motion over a day looks qualitatively just like a graph of it over a second, just stretched out.
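Self-similarity can be checked directly (an illustrative sketch; the time points and step counts are arbitrary): build $W(t)$ from independent Gaussian increments and verify that $W(t)/\sqrt{t}$ has unit variance at wildly different time scales.

```python
# Sketch: W(t) built from N(0, dt) increments; Z = W(t)/sqrt(t) should have
# variance 1 regardless of the time scale t.
import math
import random

random.seed(5)

def brownian_at(t, steps=20):
    dt = t / steps
    return sum(random.gauss(0.0, math.sqrt(dt)) for _ in range(steps))

n = 20_000
vars_by_t = {}
for t in (0.25, 4.0, 100.0):
    zs = [brownian_at(t) / math.sqrt(t) for _ in range(n)]
    vars_by_t[t] = sum(z * z for z in zs) / n
print(vars_by_t)  # each estimated variance is close to 1.0
```

Whether we look at a quarter of a time unit or a hundred of them, the rescaled process is statistically identical.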
We can apply more complex functions. What if we are interested not just in the position at time $t$, but in the total accumulated area under the random path up to that time? This corresponds to the stochastic integral $I(t) = \int_0^t W(s)\, ds$. This is a function of the entire path history of the process. Since the integral is a linear operation and the underlying process is Gaussian, the resulting random variable will also be Gaussian. Its mean is zero, and a more involved (but beautiful) calculation using the covariance of the Wiener process shows that its variance is $t^3/3$. The characteristic function, $\varphi_{I(t)}(u) = e^{-u^2 t^3/6}$, immediately follows. This tells us precisely how the uncertainty in this accumulated quantity grows with time—the variance grows not linearly with $t$, but much faster, as $t^3$. Such integrated processes are vital in mathematical finance for pricing exotic financial instruments whose payoff depends on the average price of an asset over a period of time.
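A path simulation (an illustrative sketch; $t = 2$ and the discretization are arbitrary choices) approximates $I(t)$ by a Riemann sum over simulated paths and recovers the $t^3/3$ variance:

```python
# Sketch: approximate I(t) = integral_0^t W(s) ds by a Riemann sum along a
# simulated Brownian path, and check that Var[I(t)] is close to t^3 / 3.
import math
import random

random.seed(6)
t, steps, n = 2.0, 200, 20_000
dt = t / steps

integrals = []
for _ in range(n):
    w, area = 0.0, 0.0
    for _ in range(steps):
        w += random.gauss(0.0, math.sqrt(dt))  # advance the path
        area += w * dt                          # accumulate the area under it
    integrals.append(area)

var = sum(x * x for x in integrals) / n  # mean of I(t) is zero
print(var, t**3 / 3)  # the estimate sits close to the theoretical value
```

The cubic growth is dramatic: doubling the horizon multiplies the variance of the accumulated area by eight, not by two.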
Finally, let us take a step back and ask a very fundamental question. We have been happily calculating distributions for things like "the ratio of two variables" or "the rank of a matrix." But what gives us the right to assume that these are well-defined random variables in the first place? What guarantees that we can meaningfully ask, "What is the probability that the rank of a random matrix is equal to $r$?"
This is not a trivial question. For a quantity to be a random variable, the sets of outcomes corresponding to certain events must be "measurable"—they must belong to the $\sigma$-algebra on which our probability measure is defined. In simpler terms, they must be "well-behaved" sets for which we can assign a probability. So, is the function $M \mapsto \operatorname{rank}(M)$ a random variable on the space of matrices?
The answer is yes, and the reason is quite elegant. The set of all matrices whose rank is less than or equal to $r$ can be defined by a clear condition: it is the set of matrices where the determinants of all possible $(r+1) \times (r+1)$ sub-matrices (the minors) are zero. Since the determinant is a polynomial (and thus continuous) function of the matrix entries, the set where a minor is zero is a closed set. The set of matrices with rank at most $r$ is the intersection of a finite number of these closed sets, and is therefore itself a closed set. Closed sets are always "well-behaved" Borel sets. Since the sets $\{\operatorname{rank}(M) \le r\}$ are measurable for all $r$, the rank is indeed a measurable function, a legitimate random variable. This result gives us the "license to operate" for the entire field of Random Matrix Theory, a subject with profound applications in nuclear physics, number theory, and wireless communications.
From the simple to the complex, from the concrete to the abstract, the theory of functions of random variables is the essential bridge that connects our idealized probability models to the messy, composite, and dynamic world we seek to understand. It is a testament to the unifying power of mathematics, revealing a deep structural coherence in the nature of randomness itself.