Predictive Parity

SciencePedia
Key Takeaways
  • Predictive parity is a fairness criterion requiring that the Positive Predictive Value (PPV) of a model is equal across all demographic groups.
  • A model's PPV is determined not only by its accuracy but also by the prevalence (base rate) of the condition, which often differs between groups.
  • "Impossibility theorems" in algorithmic fairness prove that when base rates differ, it is mathematically impossible to simultaneously satisfy predictive parity and other key metrics like equalized odds.
  • Choosing between fairness criteria like predictive parity and equal opportunity involves an explicit ethical trade-off between the reliability of alerts and the equal detection of true cases.

Introduction

As artificial intelligence becomes integral to critical decisions in fields like medicine and finance, ensuring these systems are fair is paramount. One of the most intuitive fairness criteria is predictive parity, the idea that a positive prediction from an AI tool should be equally trustworthy for every demographic group. However, this seemingly simple goal masks a profound challenge: different, equally valid notions of fairness often stand in direct mathematical opposition to one another. This conflict creates a critical knowledge gap for practitioners who must choose which definition of 'fair' to implement, often without a clear understanding of the inevitable trade-offs. This article demystifies this complex landscape. The first section, Principles and Mechanisms, will dissect the mathematical foundation of predictive parity, revealing why it is so sensitive to real-world population differences. The following section, Applications and Interdisciplinary Connections, will then explore the tangible consequences of these mathematical truths in high-stakes domains, demonstrating how the choice of a fairness metric is not just a technical decision, but an ethical one with profound human impact.

Principles and Mechanisms

Imagine you are a doctor in a busy emergency room. A new AI tool flashes an alert: "Patient X is at high risk for sepsis!" What is the very first, most practical question you would ask? It's likely not about the algorithm's architecture or its training data. It's simply: "How often is this alert actually correct?"

This question, about the reliability of a positive prediction, is the heart of what we call Positive Predictive Value (PPV). It is the probability that a patient who gets a positive alert ($\hat{Y}=1$) truly has the condition ($Y=1$). Now, let's add a layer of fairness. It seems only natural to demand that the alert's reliability doesn't depend on the patient's demographic group. An alert for a patient from Group A should carry the same weight, the same degree of certainty, as an alert for a patient from Group B. This beautifully simple and intuitive notion is called predictive parity. It mandates that the PPV is equal across all groups.

If a tool's alerts are 60% reliable for one group but only 45% for another, it means clinicians will experience more "false alarms" for the second group. This can lead to alert fatigue, where crucial warnings are ignored, and it can subject patients in the less-favored group to unnecessary, costly, and potentially risky follow-up procedures. The quest for predictive parity is a quest to ensure an AI's warning is equally meaningful for everyone it serves.

But as we peel back the layers, we find that this simple ideal of fairness collides with the stubborn realities of probability in a way that is both fascinating and deeply challenging.

The Recipe for a Prediction's Power

To understand why predictive parity is so elusive, we need to look under the hood at the engine that determines the Positive Predictive Value. It's not magic; it's a direct consequence of Bayes' theorem, a fundamental rule of probability. The formula for PPV is like a recipe with three crucial ingredients:

$$\mathrm{PPV} = \frac{\mathrm{TPR} \cdot \pi}{\mathrm{TPR} \cdot \pi + \mathrm{FPR} \cdot (1 - \pi)}$$

Let's break this down without getting lost in the symbols.

  1. Sensitivity, or the True Positive Rate (TPR): This is the test's ability to correctly identify people who do have the disease. A TPR of 0.90 means the test catches 90% of all true cases. It is the first factor in the numerator, $\mathbb{P}(\hat{Y}=1 \mid Y=1)$.

  2. False Positive Rate (FPR): This is the rate at which the test mistakenly flags people who are healthy. It is defined as $\mathbb{P}(\hat{Y}=1 \mid Y=0)$. A low FPR is desirable. (You might be more familiar with Specificity, which is just $1 - \mathrm{FPR}$.)

  3. Prevalence ($\pi$): This is the base rate of the disease in a given population, $\mathbb{P}(Y=1)$. How common is the condition before we even run the test?

The numerator, $\mathrm{TPR} \cdot \pi$, represents the fraction of the whole population that is sick and correctly flagged by the test. The denominator, $\mathrm{TPR} \cdot \pi + \mathrm{FPR} \cdot (1 - \pi)$, represents everyone who gets flagged, both the correct flags (true positives) and the incorrect ones (false positives). So the PPV is simply the ratio: of all the people the test flags, what fraction are actually sick?
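This recipe translates directly into a few lines of code. Here is a minimal sketch in Python (the function name is ours, not from any library) that computes PPV from the three ingredients:

```python
def ppv(tpr: float, fpr: float, prevalence: float) -> float:
    """Positive predictive value via Bayes' theorem.

    tpr        -- sensitivity, P(flagged | sick)
    fpr        -- false positive rate, P(flagged | healthy)
    prevalence -- base rate of the condition, P(sick)
    """
    true_positives = tpr * prevalence          # sick and correctly flagged
    false_positives = fpr * (1 - prevalence)   # healthy but flagged anyway
    return true_positives / (true_positives + false_positives)

# Sensitivity 0.90, FPR 0.15, base rate 20%:
print(round(ppv(0.90, 0.15, 0.20), 2))  # 0.6
```

Nothing in the function cares about which group is being scored; everything flows from the three inputs, which is exactly why prevalence differences matter so much.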

The Cruel Arithmetic of Prevalence

The presence of prevalence, $\pi$, in this recipe is the source of much of the trouble. Let's see why with a thought experiment, inspired by a real-world screening scenario.

Suppose we have an excellent screening test for depression. We've ensured it works identically well for two subgroups, A and B, meaning it has the same sensitivity (let's say 0.90) and the same specificity (0.85, which means $\mathrm{FPR} = 0.15$) for both. This seems eminently fair.

Now, let's say depression is more common in subgroup A ($\pi_A = 0.20$) than in subgroup B ($\pi_B = 0.08$). What happens to the reliability (the PPV) of a positive test result?

For subgroup A:

$$\mathrm{PPV}_A = \frac{(0.90)(0.20)}{(0.90)(0.20) + (0.15)(1 - 0.20)} = \frac{0.18}{0.18 + 0.12} = 0.60$$

A positive test for someone in subgroup A means there's a 60% chance they truly have depression.

For subgroup B:

$$\mathrm{PPV}_B = \frac{(0.90)(0.08)}{(0.90)(0.08) + (0.15)(1 - 0.08)} = \frac{0.072}{0.072 + 0.138} \approx 0.343$$

For someone in subgroup B, the very same positive test result from the very same instrument means there's only a 34.3% chance they have depression.

This is a startling and profound result. Even when a test's intrinsic properties (sensitivity and specificity) are perfectly equal across groups, the meaning of its prediction changes dramatically. The simple fact that the condition is rarer in Group B means that a random positive test is more likely to be a false alarm. This isn't a flaw in the algorithm; it's an inherent feature of probability. Predictive parity is violated not because the tool is biased, but because the world is.
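One way to make this result tangible is to tally expected counts in a concrete cohort. The sketch below (a hypothetical 10,000-person cohort per subgroup, using the numbers from the thought experiment above) counts true and false alarms directly:

```python
def flagged_counts(n, prevalence, tpr, fpr):
    """Expected true-positive and false-positive counts in a cohort of n people."""
    sick = n * prevalence
    healthy = n - sick
    return tpr * sick, fpr * healthy  # (true alarms, false alarms)

for name, prev in [("A", 0.20), ("B", 0.08)]:
    tp, fp = flagged_counts(10_000, prev, tpr=0.90, fpr=0.15)
    print(f"Subgroup {name}: {tp:.0f} true alarms, {fp:.0f} false alarms, "
          f"PPV = {tp / (tp + fp):.3f}")
```

Out of 10,000 people in subgroup B the test raises 2,100 alarms, of which only 720 are genuine, while in subgroup A 1,800 of 3,000 alarms are genuine: the same instrument, a very different meaning.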

The Fairness Dilemma: Robbing Peter to Pay Paul

If nature won't give us predictive parity, can we force it with a smarter algorithm? We can certainly try. An algorithm can be tuned to satisfy a specific fairness goal. But this is where we encounter the deep trade-offs at the heart of algorithmic fairness.

Let's look at a hypothetical pneumonia-detection model that has been explicitly designed to achieve predictive parity for two groups, A (high prevalence) and B (low prevalence). On a validation set, the engineers succeed: the PPV is exactly 0.60 for both groups. Predictive parity is achieved! But what was the cost? Let's examine the other performance metrics that resulted from this tuning:

  • Group A (High Prevalence): Sensitivity ($\mathrm{TPR}_A$) = 0.30
  • Group B (Low Prevalence): Sensitivity ($\mathrm{TPR}_B$) = 0.15

To make the alarms equally reliable, the model had to become much less sensitive for the low-prevalence group. It now correctly identifies only 15% of pneumonia cases in Group B, while catching 30% in Group A. To achieve fairness in one dimension (equal reliability of alerts), we have introduced a dramatic unfairness in another (unequal ability to detect the disease). Sick patients in Group B are now twice as likely to be missed by the system.
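The mechanics of that trade can be made explicit by rearranging the PPV formula: for a fixed target PPV, the false positive rate a group can "afford" shrinks as its prevalence falls. A small sketch (the prevalences 0.20 and 0.08 are our assumption for illustration; the text says only "high" and "low"):

```python
def required_fpr(target_ppv, tpr, prevalence):
    """FPR needed to hit a target PPV, from rearranging
    PPV = TPR*p / (TPR*p + FPR*(1 - p))."""
    return tpr * prevalence * (1 - target_ppv) / (target_ppv * (1 - prevalence))

# Assumed prevalences, not stated in the text:
print(round(required_fpr(0.60, tpr=0.30, prevalence=0.20), 4))  # 0.05
print(round(required_fpr(0.60, tpr=0.15, prevalence=0.08), 4))  # 0.0087
```

To push its false positive rate that low, the model must raise its alert threshold for the low-prevalence group, and a higher threshold drags sensitivity down with it.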

This brings up a competing notion of fairness: equal opportunity. This criterion demands that the True Positive Rate be equal for all groups. It prioritizes ensuring that everyone who is sick has an equal chance of being identified and helped by the system. In the triage scenario above, a clinically conservative approach might prefer to equalize the TPR, accepting that the PPV will differ, to avoid the grave error of disproportionately missing true cases of pneumonia in one group.

A Fundamental Unity: The Impossibility Theorems

These are not just isolated examples. They are manifestations of a deep and unifying mathematical principle. Groundbreaking work by researchers like Alexandra Chouldechova and Jon Kleinberg revealed what are now known as "impossibility theorems" in algorithmic fairness.

The theorems state, in essence, that for any imperfect classifier, it is mathematically impossible to simultaneously satisfy three desirable fairness properties when the underlying base rates of the outcome differ between groups:

  1. Predictive Parity (Equal PPV)
  2. Equalized Odds (Equal TPR and Equal FPR)
  3. Calibration (A risk score of $s$ means an $s$ probability of being positive, for all groups)

The conflict we saw between predictive parity and equal opportunity (equal TPR) is a direct consequence of this. If you enforce equalized odds (which includes equal TPR), the PPV formula we saw earlier, $\mathrm{PPV}_g = \frac{\mathrm{TPR} \cdot p_g}{\mathrm{TPR} \cdot p_g + \mathrm{FPR} \cdot (1 - p_g)}$, shows that PPV becomes a direct function of the prevalence $p_g$. If the prevalences $p_A$ and $p_B$ are different, the PPVs must be different. You simply cannot have both.

Even the seemingly unassailable property of calibration (a risk score of 0.7 should mean a 70% risk for everyone) cannot save us. In fact, it is one of the conflicting ingredients. If a score is calibrated for two groups with different base rates, it's impossible for it to also satisfy equalized odds across those groups for a single decision threshold.

This is a beautiful, if somewhat sobering, piece of mathematics. It unifies our observations into a single, powerful statement: there is no single, perfect definition of fairness. We are forced to choose. The different notions of fairness are not just different programming goals; they are different ethical stances. Do we prioritize making our predictions equally reliable for all groups (predictive parity), or do we prioritize making our system equally effective at finding those in need (equal opportunity)? The mathematics doesn't give us the answer. It only reveals, with brilliant clarity, the choice that we must make.
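A quick simulation makes the calibration conflict concrete. In the sketch below (the score distributions are invented for illustration), both groups receive perfectly calibrated risk scores, meaning $\mathbb{P}(Y=1 \mid s) = s$, but their base rates differ; thresholding both at 0.5 then yields unequal TPRs and FPRs:

```python
import random

random.seed(0)

def simulate_group(n, alpha, beta, threshold=0.5):
    """Draw calibrated scores s ~ Beta(alpha, beta), outcomes Y ~ Bernoulli(s),
    and measure TPR / FPR at the given threshold."""
    tp = fn = fp = tn = 0
    for _ in range(n):
        s = random.betavariate(alpha, beta)  # the model's risk score
        y = random.random() < s              # calibration: P(Y=1 | s) = s
        flagged = s >= threshold
        if y and flagged:
            tp += 1
        elif y:
            fn += 1
        elif flagged:
            fp += 1
        else:
            tn += 1
    return tp / (tp + fn), fp / (fp + tn)

# Group A: Beta(2, 2) scores, base rate 0.5.  Group B: Beta(1, 4), base rate 0.2.
tpr_a, fpr_a = simulate_group(100_000, 2, 2)
tpr_b, fpr_b = simulate_group(100_000, 1, 4)
print(f"Group A: TPR={tpr_a:.2f}, FPR={fpr_a:.2f}")  # TPR near 0.69, FPR near 0.31
print(f"Group B: TPR={tpr_b:.2f}, FPR={fpr_b:.2f}")  # TPR near 0.19, FPR near 0.03
```

Both groups' scores are honest probabilities, yet equalized odds fails; the only escapes are equal base rates or a perfect predictor, exactly as the theorems state.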

Applications and Interdisciplinary Connections

Having journeyed through the mathematical principles of predictive parity, we now arrive at the most crucial part of our exploration: seeing these ideas come to life. The concepts of fairness we've discussed are not mere abstractions confined to blackboards and textbooks. They are the very tools we must use to scrutinize and shape a world where algorithms increasingly make decisions that touch every facet of our lives, from the doctor's office to the insurance underwriter's desk. This is where the mathematical machinery meets the messy, beautiful, and complex reality of human society.

The Doctor's New Assistant: Algorithms in Medicine

Nowhere are the stakes of algorithmic fairness higher than in healthcare. Imagine an AI model designed to assist in a hospital's emergency department. Its job is to analyze a patient's data and raise an alarm if it detects a high risk of a life-threatening condition like sepsis. Or consider a system that reads medical images, looking for the faint, early signs of cancer, or a tool that screens for depression during a routine check-up. These are not science fiction; they are the present and future of medicine.

How do we ensure these digital assistants are fair to everyone? We must first define what we mean by "fair." As we've seen, fairness can wear many hats. Is it ensuring that the AI gives a positive prediction at the same rate for all demographic groups (demographic parity)? Is it ensuring the tool is equally good at identifying the condition in everyone who is actually sick (equal opportunity)? Or perhaps that it makes errors at the same rate for all groups (equalized odds)? Each of these criteria formalizes a distinct, and often noble, ethical intuition.

Predictive parity, the focus of our discussion, introduces another powerful idea of fairness: a positive prediction should mean the same thing for everyone, regardless of their group. If an AI flags a patient as "high-risk" for a suicide attempt, the actual probability that the patient is truly at high risk should be the same whether that patient belongs to a minoritized community or the majority population. This is the essence of predictive parity: it demands that the positive predictive value (PPV), or $\mathbb{P}(Y=1 \mid \hat{Y}=1)$, is constant across groups. In other words, the trustworthiness of a positive result should not depend on your demographic background.
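In practice, this criterion is checked empirically: group the model's positive predictions by demographic and compare the fraction that turned out correct. A minimal auditing sketch (the function name and toy data are ours):

```python
from collections import defaultdict

def ppv_by_group(y_true, y_pred, groups):
    """Per-group positive predictive value: of each group's positive
    predictions, what fraction were truly positive?"""
    flagged = defaultdict(int)   # positive predictions per group
    correct = defaultdict(int)   # true positives per group
    for y, yhat, g in zip(y_true, y_pred, groups):
        if yhat == 1:
            flagged[g] += 1
            correct[g] += int(y == 1)
    return {g: correct[g] / flagged[g] for g in flagged}

# Toy audit data, invented for illustration:
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 1, 1, 0, 0, 1, 1, 1]
groups = ["A", "A", "A", "B", "B", "B", "B", "B"]
print(ppv_by_group(y_true, y_pred, groups))  # PPV of 2/3 for A, 1/3 for B
```

Predictive parity holds exactly when the returned per-group values are (approximately) equal.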

The Inescapable Trade-Off

Here we stumble upon a discovery of profound importance, one that emerges not from ethical debate alone, but from the unyielding logic of probability. Let us consider an AI-driven screening program for cervical cancer. The population is diverse, and due to factors like access to HPV vaccination, the prevalence of precancerous conditions differs between a vaccinated group and an unvaccinated one.

Suppose we design our AI tool to be impeccably "fair" in the sense of equalized odds. That is, its sensitivity—the ability to detect cancer in those who have it—is identical for both groups. And its specificity—the ability to correctly clear those who are healthy—is also identical. This sounds perfectly equitable. The test itself works equally well for everyone.

But a surprising and mathematically necessary consequence arises: the positive predictive value will not be the same. A positive result for a person from the high-prevalence (unvaccinated) group will indicate a higher probability of actual disease than a positive result for a person from the low-prevalence (vaccinated) group. The test satisfies equalized odds, but it violates predictive parity.

Why must this be so? The answer lies in Bayes' rule, the engine that connects a test result to the underlying probability of disease. The positive predictive value, PPV, is not just a function of the test's intrinsic accuracy (its sensitivity and specificity); it is also a function of the disease's base rate, or prevalence ($\pi$), in the population being tested. As the formula reveals, $\mathrm{PPV} = \frac{\mathrm{TPR} \cdot \pi}{\mathrm{TPR} \cdot \pi + \mathrm{FPR} \cdot (1 - \pi)}$. If you hold sensitivity and the false positive rate (FPR) constant for two groups but their prevalences ($\pi$) differ, their PPVs must also differ (unless the test is perfect or completely useless). This isn't a flaw in the algorithm; it's a law of probability. You simply cannot, in general, satisfy both equalized odds and predictive parity at the same time when base rates are unequal.

When Metrics Meet Reality: The Pathways of Harm

This mathematical tension is not a mere academic puzzle. It has grave, real-world consequences. Let's return to the suicide risk prediction model. An analysis might reveal that the model, while satisfying predictive parity (a positive flag means a 30% chance of an attempt for everyone), simultaneously violates equal opportunity. For instance, it might have a true positive rate of 0.90 for one group but only 0.75 for another.

What does this mean in human terms? It means that for every 100 at-risk individuals in the first group, the model correctly identifies 90. But for every 100 at-risk individuals in the second, it only identifies 75, leaving 25 people without the life-saving intervention they need. This is a disparity in the distribution of benefit: a harm of undertreatment.

At the same time, the analysis might show that the model has a higher false positive rate in the second group. This creates a different harm pathway: members of this group who are not at risk are more likely to be incorrectly flagged, subjecting them to unnecessary, stressful, and potentially coercive interventions. Satisfying one fairness metric, like predictive parity, can hide or even create other disparities. There is no single "fairness" button to push; there are only trade-offs to be understood and navigated with wisdom and care.

Beyond the Hospital: Insuring Our Future

The same principles and trade-offs extend far beyond medicine. Consider the world of insurance, where AI models are increasingly used to set premiums and decide on coverage. Here, predictive parity takes on a clear financial meaning: if a model places you in a "high-risk" category, the expected financial cost you represent to the insurer should be the same, regardless of your demographic group. Equalized odds would mean that the rates of being mis-categorized as high-risk (when you are not) or low-risk (when you are not) are the same across groups. Just as in medicine, if the underlying base rates of claims differ between groups, an insurer cannot simultaneously achieve both forms of fairness. This forces a societal conversation: What kind of fairness do we want our financial systems to embody?

A Deeper Question: Fairness in Prediction vs. Fairness in Outcome

This brings us to the deepest question of all. We have spent our time trying to make the predictions of our algorithms fair. But what if fairness is ultimately about outcomes?

Let's go back to the sepsis prediction model one last time. Suppose we have two distinct patient groups: postpartum patients, for whom missing a sepsis case (a false negative) is utterly catastrophic, and patients with severe drug allergies, for whom an unnecessary empiric treatment (a false positive) can trigger a dangerous reaction. The cost of an error is not the same for these two groups.

A strict procedural fairness might demand we use the same risk threshold for both. But a decision-theoretic approach, one grounded in minimizing expected harm, leads to a startling conclusion. To achieve the best possible outcome for everyone—to minimize the total burden of suffering—we should use a lower threshold for the postpartum patient (being very quick to alert) and a higher threshold for the allergy-prone patient (being more cautious).
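This decision-theoretic logic has a classic closed form: if a false positive costs $c_{FP}$ and a false negative costs $c_{FN}$, expected harm is minimized by alerting whenever the predicted risk $p$ exceeds $c_{FP} / (c_{FP} + c_{FN})$. A sketch with invented relative costs:

```python
def optimal_threshold(cost_fp, cost_fn):
    """Alert when risk p satisfies p * cost_fn > (1 - p) * cost_fp,
    i.e. p > cost_fp / (cost_fp + cost_fn)."""
    return cost_fp / (cost_fp + cost_fn)

# Invented harms for illustration: for a postpartum patient a missed sepsis
# case (false negative) is far costlier; for a drug-allergic patient a
# needless empiric treatment (false positive) carries real danger of its own.
print(round(optimal_threshold(cost_fp=1, cost_fn=20), 3))   # 0.048 -> alert readily
print(round(optimal_threshold(cost_fp=10, cost_fn=20), 3))  # 0.333 -> more cautious
```

The group-specific thresholds here are not a bias to be eliminated; they are the direct consequence of taking each group's stakes seriously.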

Here, treating everyone "the same" by using one threshold would be demonstrably unfair, because it would lead to worse outcomes. True fairness, in this view, is not about the equality of statistics, but about the equity of consequences. It requires us to distinguish fairness in the process from fairness in the result. This doesn't make our journey through fairness metrics any less important. On the contrary, it makes it more so. Only by understanding the precise behavior of our tools, their inherent trade-offs, and their real-world implications can we begin to make the wise choices needed to build a future where our powerful technologies serve the welfare of all humanity.