Scarce Resource Allocations That Rely On Machine Learning
Should Be Randomized

Shomik Jain    Kathleen Creel    Ashia Wilson
Abstract

Contrary to traditional deterministic notions of algorithmic fairness, this paper argues that fairly allocating scarce resources using machine learning often requires randomness. We address why, when, and how to randomize by proposing stochastic procedures that more adequately account for all of the claims that individuals have to allocations of social goods or opportunities.

fairness, randomness, systemic exclusion

1 Introduction

Sometimes resources or opportunities are scarce: jobs, welfare benefits, or life-saving medicines cannot be divided among all those who deserve them. Worse yet, it is often unclear which individuals are most deserving. Perhaps they all are. Decision-makers hope to use algorithmic systems to allocate scarce resources and goods fairly. But without careful attention, it is easy for algorithms to replicate or amplify the biases and inequalities in their training data.

The fair machine learning community has developed sophisticated theoretical and formal tools to reduce algorithmic bias, increase fairness, and promote justice. However, these tools are almost exclusively deterministic. For example, employers with more qualified applicants than job openings often rely on hiring algorithms to screen applicants for interviews (Raghavan et al., 2020). These algorithms assign a score or ranking to candidates. Employers then threshold these scores or rankings to deterministically pick candidates to interview. Similarly, healthcare providers often have a limited supply of life-saving medical resources such as ventilators, therapeutics, or organs. Patients are often triaged based on algorithms that predict their survival rate or life expectancy post-treatment (Chin et al., 2023). Most existing work on algorithmic fairness relies on deterministic algorithms to incorporate fairness. Once algorithmic bias has been reduced to the extent possible, the algorithm allocates resources to the top candidate(s). If Alice is the top-ranked candidate for every job or has the most expected quality-adjusted life-years, she should deterministically receive the job offer or organ every time.

Recent works on arbitrariness and fairness suggest that even counterfactual non-determinism can be unfair. If there exist many possible models with similar predictive performance but slightly different decisions on individuals, a state of affairs called “predictive multiplicity” (Marx et al., 2020) or “model multiplicity” (Black et al., 2022), it is unfair to naively pick one of the models for our decision-making algorithm (Hsu & Calmon, 2022). Instead, we should reduce multiplicity by altering the training process to reduce the variance that leads to diverging predictions (Cooper et al., 2023), especially on under-represented individuals (Ganesh et al., 2023), iterate until predictions agree about individuals (Roth et al., 2023) or even abstain from making predictions on some people altogether (Cooper et al., 2023).

While sharing the goal of reducing bias and increasing fairness, this work argues that the fair machine learning community has underutilized non-determinism and randomization as tools to achieve fairness. In some settings that involve algorithmic decision-making, we contend that non-determinism is required for fair outcomes. In what follows, we first motivate why and when fairness requires randomization. We adopt philosopher John Broome’s concept of the value of lotteries in fairness to argue that randomization is needed in scarce resource settings to respect the claims that individuals have to resources, even if they do not receive them, by giving each person with a claim a chance.

Second, we argue that because algorithmic predictions involve uncertainty, it is unfair to those on whom we make mistakes to deterministically commit to those mistakes. This is especially true in multi-shot contexts in which each individual is affected by multiple decision-makers or a series of decisions over time. Across an ecosystem of multiple decision-makers, consistently allocating scarce resources and goods to the same candidate(s) can be sub-optimal, as it prevents the decision-makers from learning (Kleinberg & Raghavan, 2021; Peng & Garg, 2023), or unfair, as it means that many decision-makers reject or make mistakes on the same individuals (Ajunwa, 2021; Creel & Hellman, 2022; Bommasani et al., 2022; Toups et al., 2023; Jain et al., 2024). And when individuals receive a series of decisions over time, biased early allocations often affect the data available for the next decision-maker. An initial negative judgement leads to a series of further negative judgements, forming a “patterned inequality” or compounding injustice (Eidelson, 2021; Hellman, 2018). For these reasons, deterministic judgements under uncertainty can reinforce structural injustices. Therefore, we take the position that many scarce resource allocations that rely on machine learning[1] should be randomized.

[1] We consider any data-driven algorithmic decision-making process to be under the umbrella of the term “machine learning”.

Having motivated randomizing allocations (section 3), we then formalize how to randomize to bring about more fair outcomes. We consider two settings: when claims are known (subsection 4.1), and when claims are uncertain (subsection 4.2). Finally, we discuss why existing deterministic methods may not be enough to achieve both fair and efficient allocations (section 5).

2 Related Work

The idea that randomizing decisions might promote fairness when resources are scarce is not new. Lotteries have been used to admit students to some public schools (Hastings et al., 2006) and medical schools (Cohen-Schotanus et al., 2006). In healthcare, lotteries were also used to allocate COVID-19 treatments (McCreary et al., 2023).

We build on work that extends this concept to machine learning and advocates for using randomness to increase fairness in algorithmic decision-making. For example, concerned with decision quality and the loss of diversity in the decision-making process that comes from relying on machine learning, Grgić-Hlača et al. (2017) propose randomizing among models in classifier ensembles obtained by retraining multiple times. We second their concern, connecting it to the growing literature on the homogenization of outcomes that results from automated decision-making (Kleinberg & Raghavan, 2021; Ajunwa, 2021; Creel & Hellman, 2022; Bommasani et al., 2022; Toups et al., 2023; Jain et al., 2024) and from the use of foundation models (Bommasani et al., 2021). We extend the idea of randomizing among models with similar performance to the setting of fair allocations based on claims.

Agarwal & Deshpande (2022) are concerned with the loss of accuracy that results when we impose fairness constraints and introduce a randomized framework for classification to address this concern. We follow them in demonstrating that many randomization procedures have minimal impact on accuracy while improving fairness. Furthermore, Singh et al. (2021) argue that evaluating candidate merit without accounting for uncertainty is unfair in rankings. To address this unfairness, they introduce the notion of a “posterior merit distribution,” suggesting that goods should be allocated based on the probability that an individual is among the top $k$ candidates. We agree that fairness sometimes requires quantifying uncertainty and extend the argument to specify when and how to incorporate randomness based on uncertainty.

The reinforcement learning literature also considers randomization under uncertainty in multi-shot contexts. However, these works focus on stochastic methods that can help a single decision-maker learn and improve their own utility over time (Agrawal & Goyal, 2012; Joseph et al., 2016b; Li et al., 2020). We consider the broader multi-shot context that may involve multiple decision-makers making any number of allocations, and center fairness and individual claims as our motivation for randomness. Past work also centers individual fairness under uncertainty (Joseph et al., 2016a) and the distribution of errors across individuals (Sharifi-Malvajerdi et al., 2019).

3 Why and When To Randomize Allocations

In this section, we motivate randomization by arguing that (weighted) lotteries better achieve fairness than deterministic allocation algorithms in certain settings. We give two reasons. First, relying on John Broome’s characterization of fairness (Broome, 1990), we show that when more individuals deserve a good than can receive it, the best way to respect each person’s claim to that good is to give them a chance to receive it by holding a lottery.

Second, deterministic decision-making over-represents the certainty of our predictions. Many ML-based social allocation problems have uncertain parameters. We cannot be sure that we have formulated the problem-to-be-solved well, that we have chosen appropriate variables or parameters, or that our data is accurate (Passi & Barocas, 2019). Indeed, in many cases we suspect that our problem formulation, variable choice, and data gathering all may have been systematically skewed by social conditions. Our uncertainty leads us to underestimate the claims of some to the good and overestimate the claims of others, which has moral implications in situations when not all claims can be satisfied.

3.1 Fairness and Individual Claims

To motivate why lotteries are sometimes needed to ensure fairness, we turn to John Broome’s influential theory of fairness (Broome, 1990). Broome’s theory is based on the moral concept of a “claim.” An individual has a claim to a good, resource, or opportunity when she is owed it for reasons of fairness (Broome, 1990, p.96). For example, if a kidney is being allocated and there are two individuals who have been on the waiting list for an equal amount of time and are in all other respects equivalent, both individuals have a claim on the kidney.

Some claims stem from desert: the person deserves the good (Broome, 1990, p.93). All claimants on kidneys deserve the chance at life, so all have a claim, even if there is disagreement about what factors attenuate the strength of their claims. While claims based on a right to life always exist, other claims only arise when a good is being allocated (Broome, 1990, p.97). The best candidate for a job does not have a claim to work at a company with no openings, but once there is an opening she may have stronger claims than others based on her merit. Desert, need, and merit can all ground claims.

Claims are different from both the more familiar utility calculations and side-constraints such as rights. Utilities can be weighed against each other: a kidney is allocated according to utilitarian principles such that its allocation produces the greatest overall benefit. Once the allocation is calculated, nothing remains to be said about the individual who did not receive a kidney. Unlike utilities, however, claims linger. The otherwise-identical individual who did not receive a kidney still had a claim to that kidney, and she was not fairly treated if her claim was simply “overridden” by a deterministic allocation (Broome, 1990, p.98). Claims are also unlike side-constraints such as rights. If someone has a right to a good, that right cannot be discharged or outweighed: it “directly … determines what ought to be done” (Broome, 1990, p.91). Rights are not comparative between individuals: they simply mandate what must be done to respect the right.

Claims, by contrast, are essentially comparative, as they are a matter of fairness. If both people in need of a kidney are equal in all morally relevant senses, then they have equally strong claims on the kidney. It would be unfair to deterministically allocate the kidney to one person because doing so would override or ignore the equivalent claim that the other patient has on the same kidney (Broome, 1990, p.95). But what if Person $a$ has a slightly stronger claim than Person $b$? Broome argues that if “fairness requires everyone to have an equal chance when their claims are exactly equal, then it is implausible it should require some people to have no chance at all when their claims fall only a little below equality” (Broome, 1990, p.99). In other words, $b$’s claim does not go away just because $a$’s claim is marginally stronger. This motivates Broome’s proposal to allocate scarce and indivisible goods using a lottery weighted by the strength of claims. A weighted lottery allows stronger claims to have a proportionately stronger chance while not overriding weak claims. By giving everyone with a claim a chance, lotteries and other randomization techniques give a “surrogate satisfaction” to the claimants, the next best thing to actually receiving the good (Broome, 1990, p.99). In summary, according to Broome, the following two conditions should be met in order for an allocation to be considered fair:

  1. BF.1: The chance of a positive outcome should be greater for those with stronger claims.

  2. BF.2: Stronger claims should not completely override weaker claims.

3.2 Multi-Shot Contexts: Systemic Denial of Claims

The perspective of individual fairness concerns whether each allocation respects individual claims, but we should also consider whether the structure of allocations as a whole is fair. Concerns of structural injustice arise when certain individuals find their claims repeatedly denied, whether by multiple decision-makers at the same time (systemic exclusion) or by multiple decision-makers across time (patterned inequality). In conditions of systemic exclusion, decision-makers across an ecosystem are correlated in their decisions such that they make mistakes on the same people (Creel & Hellman, 2022; Bommasani et al., 2022; Toups et al., 2023). For example, different companies attempting to hire candidates in the same sector often rely on the same third-party vendors for their automated hiring tools. In fact, over 60% of Fortune 100 companies use the same vendor (HireVue) (Nawrat, 2023). Relying on the same vendor can lead to identical outcomes if different companies use the same underlying model to rank candidates, or correlated outcomes if each company personalizes their model. In either case, correlation between decision-makers can lead to the same individuals being “algorithmically blackballed” and excluded from opportunities (Ajunwa, 2021, 681).

Patterned inequality (Eidelson, 2021) is another form of systemic injustice. It occurs when receiving one allocation increases your likelihood of receiving future allocations (and likely also your claims to those allocations) such that clear patterns in social inequality emerge as a result of initial conditions. This situation is also referred to as the “Matthew effect,” in which the rich (or otherwise advantaged) get richer over time due to their starting condition of advantage. Patterned inequality is visible in domains such as healthcare, where allocations of life-saving medical services are often conditioned on projections of life expectancy or evaluations of current health. Both of these, however, are influenced by past access to treatment and health insurance, for which there are well-documented inequalities between socioeconomic groups (Schmidt et al., 2021). Algorithmic decision-making can exacerbate these inequalities by recognizing the current gap in health without recognizing the unequal starting conditions of health, wealth, and stability that gave rise to them (Eidelson, 2021; Hellman, 2018; Jain et al., 2024). Though both these forms of structural injustice involve a myriad of dynamics, we argue that randomization can help to address both concerns.

3.3 Inherent Uncertainties in Predicting Claims

While Broome argues that a lottery weighted on the strength of claims is fair, he acknowledges that it is not always clear which particular reasons are claims and which are not (Broome, 1990, p.93). This makes it difficult to compare the strength of claims, leading some to reject the claims framework altogether (Kirkpatrick & Eastwood, 2015). However, this so-called “calculation objection” applies to any problem formulation for allocating resources, including utilitarian and rights-based frameworks (Passi & Barocas, 2019; Mitchell et al., 2021). Given the inherent uncertainty in any problem formulation, deterministic allocations on any basis will be unfair to some individuals, especially if we acknowledge that claims exist and some will not have been fairly calculated.

Let us assume there exists some problem formulation for the strength of claims (e.g. worker productivity in hiring, life-expectancy in healthcare). Many formulations require certainty about what an individual will do (e.g. individual risk). However, individual future outcomes are fundamentally unknowable, especially since the events in question are typically realized only once (Dawid, 2017; Dwork et al., 2021; Roth et al., 2023). Instead, decision-makers often must estimate what an individual is likely to do based on their features and data about what other people with similar features have done in the past. For instance, in the kidney allocation, we might determine that a patient’s claim should be based on how much longer they are expected to live. If we have data on prior patients, we could develop an algorithm to gauge the strength of a patient’s claim based on features such as their age, medical history, and lifestyle.

When posed as a supervised learning problem, the choice of features, training data, and model class each contribute additional uncertainty to our estimates of the strength of claims. First, a person’s features may or may not be predictive of their claim or even measurable. In many social settings, the vast majority of people remain inseparable on the basis of the features that can be measured (also referred to as there being no “margin”). For example, in the canonical New Adult Census dataset, 95% of individuals have feature representations for which there exist examples in both prediction classes of high and low income. An individual could have a strong claim that is predicted to be weak because there were only a few examples of similar individuals in the dataset and they all happened to have weak claims. Deterministically picking the strongest predicted claims may constitute a kind of stereotyping in this situation. Likewise, people may “look risky” because the features measured in the data are not adequate to evaluate their claims. In both cases, people are systemically denied opportunities they deserve. We argue that in these situations randomizing can increase the fairness of allocations.

Moreover, the uncertainty in predictions of claims may be higher for some individuals than for others. Consider the phenomenon of predictive multiplicity, wherein there exist multiple models with similar accuracy that yield different predictions for certain individuals (Marx et al., 2020; Black et al., 2022). For individuals with high variance in their predictions, it seems unfair to weight their chances based on the prediction of a naively chosen model. The related concept of leave-one-out unfairness highlights how some individuals can receive radically different predictions due to the inclusion or removal of a single other person in the training data (Black & Fredrikson, 2021; Broderick et al., 2020). This may be due to the fact that some individuals are outliers based on the selected features or under-represented in the training data. Uncertainty quantification methods such as conformal prediction can help to identify these individuals (Angelopoulos & Bates, 2021). As we describe further in subsection 4.2, these methods offer a way to account for varying levels of uncertainty in a weighted lottery using predicted claims.

4 How To Randomize Allocations

We now formalize how randomization can help to address the ethical demand of satisfying individual claims in algorithmic decision-making. Specifically, we propose different methods for randomization when claims are known or uncertain and also show how these methods can help alleviate the structural concerns of systemic exclusion and patterned inequality. We also discuss the potential tradeoff between randomization and accuracy and how to interpolate between them when the tradeoff exists.

4.1 When Claims Are Known

Consider a setting in which there are $n$ individuals and each individual $i$ is assigned a score $c_i \in [0,1]$ in perfect accordance with their claim. We say that individual $i$ has a stronger claim than individual $j$ if $c_i > c_j$. A decision-maker allocates outcomes $o_i \in \{0,1\}$ to each individual $i$. Importantly, there is scarcity in that not all claims can be satisfied: only $k$ out of $n$ individuals can receive positive outcomes with $k \ll n$, for a selection rate of $k/n$.

Definition 4.1.

An iterative weighted selection chooses one individual in each round $t$ without replacement until $k$ individuals are selected. Specifically, an individual $i$ in round $t$ has probability $w_{i,t}$ of being selected. For all rounds $t \in \{1,\ldots,k\}$, we require $\sum_{j=1}^{n-t+1} w_{j,t} = 1$ so that exactly one individual is selected per round.

Note that the formulation above encapsulates many kinds of selections. For example, deterministically selecting the top $k$ claims[2] would take $w_{i,t} = \mathds{1}[c_i = \max_{j \in \{1,\ldots,n-t+1\}} c_j]$. Recall that Broome’s notion of fairness calls for weights that are chosen in proportion to a claim’s strength.

[2] This is equivalent to selecting all claims greater than or equal to the threshold $T = c_{(k)}$, where $c_{(k)}$ is the $k$-th largest claim.

Definition 4.2.

An allocation $A$ involves the assignment of outcomes $o_i$ through an iterative weighted selection based on weights $w_{i,t}$ for each claim $c_i$ and each round $t$. It satisfies Broome-Fairness (BF) if for all rounds $t$ and all individuals $i, j$ not yet selected:

  1. $c_i > c_j \implies w_{i,t} > w_{j,t}$ (cf. BF.1)

  2. $c_i > 0 \implies w_{i,t} > 0$ (cf. BF.2)

Example 4.3 (BF Lottery).

An allocation with $w_{i,t} = \frac{c_i}{C_t}$ satisfies BF, where $w_{i,t}$ is calculated among the remaining individuals $i$ in round $t$ and $C_t = \sum_{j=1}^{n-t+1} c_j$ represents the sum over claims not selected in previous rounds.

Importantly, deterministic allocations do not satisfy BF because some weights are zero (violating BF.2) and no distinction is made among rejected claims of varying strengths (violating BF.1). An unweighted lottery also violates BF.1 by assigning the same weight to all individuals: $w_{i,t} = \frac{1}{n-t+1}$.
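The iterative weighted selection of Definition 4.1 can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function names (`bf_weights`, `top_k_weights`, `iterative_weighted_selection`) and the example claims are hypothetical, and the sketch assumes that at least one remaining claim is positive in every round.

```python
import random

def bf_weights(claims, remaining):
    """Weights w_{i,t} = c_i / C_t over the individuals not yet selected.

    Assumes at least one remaining claim is positive (otherwise C_t = 0).
    """
    total = sum(claims[i] for i in remaining)
    return {i: claims[i] / total for i in remaining}

def top_k_weights(claims, remaining):
    """Deterministic top-k as a degenerate weighting: all mass on the largest remaining claim."""
    best = max(remaining, key=lambda i: claims[i])  # ties go to the first occurrence
    return {i: 1.0 if i == best else 0.0 for i in remaining}

def iterative_weighted_selection(claims, k, weight_fn, rng):
    """Select k individuals without replacement, one per round."""
    remaining = list(range(len(claims)))
    selected = []
    for _ in range(k):
        w = weight_fn(claims, remaining)
        pick = rng.choices(remaining, weights=[w[i] for i in remaining], k=1)[0]
        selected.append(pick)
        remaining.remove(pick)
    return selected

claims = [0.9, 0.8, 0.5, 0.2]  # illustrative claims, not from the paper
rng = random.Random(0)
print(iterative_weighted_selection(claims, 2, top_k_weights, rng))  # always [0, 1]
print(iterative_weighted_selection(claims, 2, bf_weights, rng))     # varies with the seed
```

Under `top_k_weights` the two weakest claimants have weight zero in every round, which is the BF.2 violation described above; under `bf_weights` every positive claim keeps a positive, claim-ordered chance.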

4.1.1 Systemic Harms

A lottery weighted by the strength of claims can help to alleviate the structural concerns of systemic exclusion and patterned inequality. Suppose there are $m > 1$ decision-makers conducting allocations either concurrently or across time. Let $o_i^{(j)}$ denote the outcome for individual $i$ by decision-maker $j$. Our concern is with the proportion of individuals (or groups) who exclusively receive negative outcomes from all $m$ decision-makers.

Definition 4.4.

The systemic exclusion rate (SER) (Bommasani et al., 2022) across $m > 1$ decision-makers is:

$$\mathbb{E}_i\left[\prod_{j=1}^{m} \mathds{1}[o_i^{(j)} = 0]\right]$$
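As a small illustration of Definition 4.4, the SER can be computed directly from a table of outcomes. The sketch below assumes the expectation over $i$ is a uniform average over the $n$ individuals; the function name and the example outcomes are hypothetical.

```python
def systemic_exclusion_rate(outcomes):
    """Fraction of individuals who receive a negative outcome from every decision-maker.

    outcomes[j][i] is o_i^(j) in {0, 1} for decision-maker j and individual i.
    The expectation over i is taken as a uniform average (an assumption of this sketch).
    """
    m = len(outcomes)
    n = len(outcomes[0])
    excluded = sum(
        1 for i in range(n) if all(outcomes[j][i] == 0 for j in range(m))
    )
    return excluded / n

# Three decision-makers, four individuals (illustrative values).
outcomes = [
    [1, 0, 0, 0],
    [1, 0, 0, 0],
    [0, 1, 0, 0],
]
print(systemic_exclusion_rate(outcomes))  # 0.5: individuals 2 and 3 are never selected
```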

To illustrate why randomization can help reduce SER, we begin by considering two stylized models of allocations. As our stylized models will suggest, if the existing SER is sufficiently high across the $m$ decision-makers or $m$ is sufficiently large, then in expectation, randomization will help both systemic exclusion and patterned inequality.

Example 4.5 (Systemic Exclusion).

Suppose there are $m$ allocations at the same time and that there are many more individuals with similarly strong claims than available positive outcomes. If the SER is sufficiently high for the existing set of allocations, then any allocation satisfying BF will decrease the SER in expectation.

[Figure, panel (a): Many Possible Distributions of Claims]
[Figure, panel (b): Reduction in SER using BF Lottery ($k/n = 0.25$)]

Example 4.6 (Patterned Inequality).

Suppose there are $m$ sequential allocations across time, and that receiving a positive outcome increases an individual’s claim in the next allocation. Also assume that in the first allocation, there are many more individuals with similarly strong claims than available positive outcomes. Then any set of sequential allocations satisfying BF will decrease the SER in expectation when compared to a set of deterministic allocations if the benefit from a positive outcome is sufficiently high.

In general, the ability of allocations satisfying BF to reduce the SER will depend on the distribution of claims, correlation between allocations across decision-makers, and selection rate ($k/n$). We simulate how much randomization can reduce SER for various distributions of claims, when each decision-maker has a noisy estimation of these claims ($\pm\, N(0,\sigma^2)$). As Figure (a) illustrates, we consider the following distributions:

  • Uniform: all claims equally likely

  • Normal: more average claims

  • Inverted Normal: more strong and weak claims

  • Pareto: more weak claims

  • Inverted Pareto: more strong claims

For all these distributions and many different selection rates and noise amounts, we observe a substantial reduction in SER if each decision-maker uses the weighted lottery satisfying BF in Example 4.3 rather than deterministically selecting their top $k$ claims. Figure (b) provides a snapshot of our results in the setting where $k/n = 0.25$ and $\sigma = 0.025$ (Appendix Figure 6 shows other cases are similar).
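A simulation in this spirit can be sketched as follows. It is not the authors' experimental code: it covers only the uniform distribution, and the population size `n`, the number of decision-makers `m`, the clipping of noisy claims to a small positive floor, and all function names are assumptions made for illustration. Only the selection rate $k/n = 0.25$ and noise level $\sigma = 0.025$ come from the text.

```python
import random

def top_k(scores, k):
    """Deterministically select the k highest scores."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return set(order[:k])

def bf_lottery(scores, k, rng):
    """Iterative lottery with round weights proportional to the remaining scores."""
    remaining = list(range(len(scores)))
    chosen = set()
    for _ in range(k):
        pick = rng.choices(remaining, weights=[scores[i] for i in remaining], k=1)[0]
        chosen.add(pick)
        remaining.remove(pick)
    return chosen

def simulate_ser(n=200, m=3, rate=0.25, sigma=0.025, use_lottery=False, seed=0):
    """SER across m decision-makers who each see the true claims plus Gaussian noise."""
    rng = random.Random(seed)
    k = int(rate * n)
    claims = [rng.random() for _ in range(n)]  # uniform claims in [0, 1)
    ever_selected = set()
    for _ in range(m):
        # Each decision-maker's noisy estimate; clipping keeps lottery weights positive
        # (the clipping rule is an assumption of this sketch).
        noisy = [min(1.0, max(1e-9, c + rng.gauss(0.0, sigma))) for c in claims]
        picks = bf_lottery(noisy, k, rng) if use_lottery else top_k(noisy, k)
        ever_selected |= picks
    return 1 - len(ever_selected) / n

print("top-k SER:     ", simulate_ser(use_lottery=False))
print("BF lottery SER:", simulate_ser(use_lottery=True))
```

Because the decision-makers' noisy rankings are highly correlated, top-$k$ selection keeps choosing nearly the same individuals, whereas the lottery spreads positive outcomes across more of the population.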

4.1.2 Utility and Randomization

Why might one want to allocate deterministically? From the viewpoint of risk-averse decision-makers, the objective is to maximize their own utility. In hiring, for instance, a company might allocate job interview slots based on a candidate’s likelihood of being hired. We simplify to the case where each individual has some utility $o^*_i \in \{0,1\}$ (e.g. whether or not the candidate would be hired).

Definition 4.7.

The utility of an allocation is $\frac{1}{k}\sum_{i=1}^{n} o^*_i \cdot \mathds{1}\{o_i = 1\}$, which is simply the precision: i.e. the proportion of the $k$ selected individuals who provide utility.

Note that $o^*_i$ can only be observed if the individual receives the resource being allocated ($o_i = 1$), which motivates the idea of expected utility.

Definition 4.8.

Suppose an individual’s chance of providing utility is $p_i = \mathbb{P}(o^*_i = 1)$. Accordingly, the expected utility of an allocation is $\frac{1}{k}\sum_{i=1}^{n} p_i \cdot \mathds{1}\{o_i = 1\}$.
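Definitions 4.7 and 4.8 translate directly into code. The sketch below uses hypothetical function names and illustrative numbers, and takes $k$ to be the number of individuals with $o_i = 1$.

```python
def utility(o_star, outcomes):
    """Definition 4.7: share of the k selected individuals with o*_i = 1 (precision)."""
    k = sum(outcomes)
    return sum(u for u, o in zip(o_star, outcomes) if o == 1) / k

def expected_utility(p, outcomes):
    """Definition 4.8: mean chance of providing utility among the k selected individuals."""
    k = sum(outcomes)
    return sum(pi for pi, o in zip(p, outcomes) if o == 1) / k

p = [0.9, 0.8, 0.5, 0.2]     # illustrative chances of providing utility
top_two = [1, 1, 0, 0]       # deterministic top-k outcome for k = 2
one_draw = [1, 0, 1, 0]      # one possible outcome of a weighted lottery

print(expected_utility(p, top_two))   # about 0.85
print(expected_utility(p, one_draw))  # about 0.70
```

The gap between the two values is the kind of expected-utility cost that randomization can incur when claims coincide with the chance of providing utility.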

Utility aligns with the strength of claims in some merit-based allocations. The individuals with the strongest claims are those with the most “merit” or closest fit between their skills and the needs of the role. These candidates therefore are also the most likely to be hired by the company. Individuals may have other claims besides merit, such as claims based in desert or entitlement. However, we adopt the decision-maker’s perspective because it is the least favorable viewpoint to motivate randomization and results in the worst-possible tradeoffs between the decision-maker’s notion of utility and desirable properties of randomization.

If claims are exactly the chance of providing utility, then deterministically selecting the $k$ strongest claims will maximize the expected utility. But as we discussed above, this overrides the most claims and violates both BF.1 and BF.2. On the other hand, any amount of randomization will lead to some probability of allocating resources to those who do not have the strongest claims, resulting in some sacrifice of expected utility in favor of respecting more claims. Figure 1c illustrates the difference in expected utility between the top $k$ allocation and BF lottery in Example 4.3 for the normal and inverted Pareto distributions (other distributions are similar). Note that this tradeoff increases with scarcity (i.e. low selection rates).

[Figure 1(c): Expected Utility for Top $k$ v. BF Lotteries (Partial BF: $k' = 0.5 \cdot k$, $n' = k$)]
[Figure 1(d): Expected Utility for Varying Partial BF Randomization Rates (Normal Dist; $k/n = 0.25$)]
[Figure 1(e): SER v. Expected Utility Tradeoff for Varying Partial BF Randomization Rates ($k/n = 0.25$)]

Balancing this tradeoff requires a consideration of the fairness arguments on each side. Consider two patients who both have a claim to a single kidney. Patient $a$’s claim is based in their survival probability of 0.51, while patient $b$ has a survival probability of 0.49. If we deterministically chose to give the kidney to patient $a$, Broome would argue that this is unfair to patient $b$ since they have no chance at all to receive the kidney despite only having a slightly weaker claim (Broome, 1990, p.99). But what if we instead compare patient $c$, who has a survival rate of 0.99, to patient $d$, who only has a survival rate of 0.01? Brad Hooker points out in an objection to Broome that “a great unfairness would occur” if we held a weighted lottery and patient $d$ won given the comparative strength of patient $c$’s claim in this case (Hooker, 2005). [Footnote 3: Hooker’s objection considers one patient (here, $c$) who would die without the medicine and another ($d$) who would only lose a finger (Hooker, 2005).] This motivates the idea of not randomizing some very strong or weak claims while still conducting a weighted lottery for the remaining claims.

Definition 4.9.

An allocation satisfies partial Broome-Fairness when the criteria BF.1 and BF.2 are met for:

  1. a subset of resources: $k' \in (0, k]$

  2. a subset of claims: $n' \in (k', n-k+k']$

Example 4.10 (Partial BF Lottery).

The following allocation satisfies partial BF: Give $k - k'$ resources to the top claims. Then conduct the iterative weighted selection in Example 4.3 for the remaining $k'$ resources over the $n'$ claims closest to the $k$-th largest claim.
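The procedure in Example 4.10 can be sketched in Python as follows. This is a minimal illustration under stated assumptions rather than the authors' implementation: the iterative weighted selection of Example 4.3 is assumed here to draw one individual at a time, without replacement, with probability proportional to their (strictly positive) claim, and the lottery pool is taken to be the $n'$ claims ranked immediately after the $k - k'$ directly selected ones. The names `partial_bf_lottery` and `weighted_draw_without_replacement` are hypothetical.

```python
import random
from typing import List, Optional, Sequence


def weighted_draw_without_replacement(
    indices: Sequence[int], weights: Sequence[float], num_draws: int, rng: random.Random
) -> List[int]:
    """Assumed form of iterative weighted selection: repeatedly pick one
    remaining index with probability proportional to its weight."""
    remaining = list(indices)
    chosen = []
    for _ in range(num_draws):
        pick = rng.choices(remaining, weights=[weights[i] for i in remaining], k=1)[0]
        chosen.append(pick)
        remaining.remove(pick)
    return chosen


def partial_bf_lottery(
    claims: Sequence[float], k: int, k_prime: int, n_prime: int, seed: Optional[int] = None
) -> List[int]:
    """Return the indices of the k individuals who receive the resource."""
    n = len(claims)
    # Parameter ranges from Definition 4.9.
    if not (0 < k_prime <= k and k_prime < n_prime <= n - k + k_prime):
        raise ValueError("need k' in (0, k] and n' in (k', n - k + k']")
    rng = random.Random(seed)
    ranked = sorted(range(n), key=lambda i: claims[i], reverse=True)
    direct = ranked[: k - k_prime]                       # strongest claims, no lottery
    pool = ranked[k - k_prime : k - k_prime + n_prime]   # claims around the k-th largest
    winners = weighted_draw_without_replacement(pool, claims, k_prime, rng)
    return sorted(direct + winners)


# Hypothetical example: 8 claims, k = 4 resources, half of them randomized.
claims = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]
print(partial_bf_lottery(claims, k=4, k_prime=2, n_prime=4, seed=0))
```

With $k' = 0.5 \cdot k$ and $n' = k$, the two strongest claims are always selected and the remaining two resources are drawn from the next four claims, so the weakest two individuals are never selected in this configuration.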

We discuss how a lottery satisfying partial BF can result in a much smaller tradeoff with expected utility, yet still substantially reduce SER. Figure 1c shows that the expected utility difference is $< 0.05$ across different selection rates for a partial BF lottery that uses $k' = 0.5 \cdot k$ and $n' = k$. In other words, we first allocate half the available resources to the top $0.5k$ claims, then randomize over the next $k$ strongest claims for the other half of the available resources. Note that this has the effect of randomizing near the so-called “decision-boundary,” which represents the $k$-th largest claim in our framework. Figure 1d explores how varying partial BF randomization rates (i.e. different combinations of $k'$ and $n'$) change the difference in expected utility in the setting where claims are normally distributed and $k/n = 0.25$. [Footnote 4: Appendix A.1 replicates Figure 1d for different distributions and $k/n$.] We find that larger randomization rates have a disproportionately larger decrease in expected utility.

Figure 1e shows the lowest SER that we can achieve for a given tradeoff with expected utility. Specifically, we compare varying partial BF randomization rates for the same setting as the earlier SER figure (fig:claims_ser), where $k/n = 0.25$ and the noise is $\sigma = 0.025$. Consider, for instance, the 2% difference in expected utility from the partial BF lottery in Figure 1c when claims are normally distributed. This yields greater than a 20% reduction in SER when there are $m > 2$ decision-makers. See Appendix A.1 for many other examples across different distributions of claims, selection rates, and amounts of noise added for each decision-maker.

4.2 When Claims Are Uncertain

A fundamental assumption in machine learning is that the targets of interest (in our case, claims) are predictable from a set of measurable features in some domain $\mathcal{X}$. While $p_{i} = \mathbb{P}(o^{*}_{i}=1)$ might be unknowable, the conditional probability $p(x_{i}) = \mathbb{P}(o^{*}_{i}=1 \mid x_{i})$ can be estimated from data. The validity of taking $p(x_{i})$ to be an estimate of $p_{i}$ depends on the choice of features that are measured and predictability of the outcomes from those features.
Putting these concerns aside, a machine learning model $\hat{p}: \mathcal{X} \rightarrow [0,1]$ maps an individual’s features $x_{i} \in \mathcal{X}$ to a prediction $\hat{p}(x_{i})$, which estimates the conditional probability $p(x_{i}) = \mathbb{P}(o^{*}_{i}=1 \mid x_{i})$. In a healthcare allocation, $\hat{p}(x_{i})$ might represent a model’s estimate based on prior patients in a hospital, whereas $p(x_{i})$ represents the conditional probability if we could measure all possible patients represented in feature space $\mathcal{X}$.

Standard practice in machine learning would be to deterministically assign $o_{i}=1$ to the individuals with the $k$ highest values of $\hat{p}(x_{i})$. While this doesn’t satisfy BF, it is unclear what implementable allocation does due to the distinction between $\hat{p}(x_{i})$ and $p_{i}$. For example, we could use the weighted lotteries in Example 4.3 or 4.10 by replacing $c_{i}$ with $\hat{p}(x_{i})$. However, the estimation error in $\hat{p}(x_{i})$ could be higher for certain individuals than others, potentially violating BF.1.

[Figure panel (f): Distribution of Swiss Unemployment Predictions]

We take our working example to be the 2003 Swiss Unemployment dataset (Lechner et al., 2020). [Footnote 5: Appendix A.2 includes an additional example of income prediction using the New Adult Census dataset (Ding et al., 2021).] The goal is to allocate scarce unemployment assistance resources such as job search and training programs. Suppose an individual’s true claim to these benefits is how long they would remain unemployed without them, and that the programs want to target those who would have remained unemployed for at least 1 year. In this example, individual claims align with the decision-makers’ notion of utility as long-term unemployment. We can only estimate an individual’s probability of being long-term unemployed based on features such as their age, place of residence, education, previous job, prior income, etc. Panel (f) shows that predictions appear to follow a normal distribution for 3 different model classes: logistic regression, random forests, and decision trees. [Footnote 6: We subset to individuals that did not receive an unemployment benefit ($n$ = 78,294) and use an 80-20 train-test split (with 5 repetitions). Randomization results are averaged over 100 iterations.]

For our main analysis, we use a selection rate of $k/n = 0.25$ and explore other selection rates in the Appendix (which yield similar results). 22% of individuals in the dataset received some form of unemployment assistance, although the most effective programs only had capacity for <5% of individuals. Among those that did not receive assistance, 44% of individuals remained long-term unemployed (at least 1 year). For our selection rate of $k/n = 0.25$, standard practice would choose the top 25% of predictions. This yields an (observed) utility of just 63.3% on average across all 3 models.

In what follows, we first explore using the partial weighted lottery in Example 4.10 to randomize over predictions near the decision-boundary. We then propose two other randomization methods that quantify and incorporate the varying levels of uncertainty in predictions across individuals. We discuss how each method changes how many resources ($k'$) and what kinds of people ($n'$) are randomized, while maintaining a minimal loss in utility.

Table 1: Randomizing using variance compared to randomizing near the decision-boundary & the top $k$ allocation.

Model | Random Rate $k'/k$ | Random Rate $n'/n$ | Utility (Variance) | Utility (Decision-Boundary) | Utility (Top $k$)
LR    | 14.0%              | 6.8%               | 62.9%              | 62.8%                       | 63.1%
RF    | 32.2%              | 15.0%              | 64.1%              | 63.7%                       | 64.3%
DT    | 73.7%              | 39.0%              | 61.5%              | 58.9%                       | 62.9%

Randomizing Near Decision-Boundary.  We first consider using the partial weighted lottery in Example 4.10 by replacing $c_{i}$ with $\hat{p}(x_{i})$. Recall that this has the effect of randomizing near the decision-boundary or $k$-th largest prediction. We find small tradeoffs with utility that are very similar to those for expected utility that we saw when claims are known and normally distributed (cf. Figure 1d). For example, we observe just a 0.8% drop in utility for partial randomization with $k' = 0.5 \cdot k$ and $n' = k$, which randomizes half the available resources across the $k$ closest predictions to the decision-boundary on either side. [Footnote 7: In our working example with $k/n = 0.25$, choosing $k' = 0.5 \cdot k$ and $n' = k$ would first select the predictions above the 87.5th percentile, and then randomize the remaining resources across people with predictions between the 62.5th and 87.5th percentiles.] Figure 7 in the Appendix shows how utility is affected by different partial randomization rates (i.e. different $k'$ and $n'$).

Randomizing Using Variance.  A variety of methods exist to estimate the variance of predictions (Black & Fredrikson, 2021; Cooper et al., 2023; Ganesh et al., 2023). For example, Cooper et al. (2023) propose re-training on bootstrapped sub-samples of the training data. Consider the set of predictions $(\hat{p}_{(1)}, \ldots, \hat{p}_{(m)})$ across $m$ bootstrapped models. [Footnote 8: Ganesh et al. (2023) show how to efficiently estimate the variance in predictions by changing the data order across epochs in a single training run.] We contend that if any of these models placed an individual among the top $k$ claims, then they should have a chance to receive $o_{i}=1$. Specifically, we propose directly assigning $o_{i}=1$ to individuals placed in the top $k$ by all models, and then conducting an iterative weighted selection among the remaining individuals, where the weights represent the proportion of models that placed them in the top $k$.
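The variance-based proposal above can be sketched in Python as follows. This is a hedged illustration, not the authors' code: it takes the per-model predictions as given (training the bootstrapped models is out of scope), assumes the iterative weighted selection draws without replacement with probability proportional to the weights, and breaks ties in a model's ranking by index order. The function names `top_k_set` and `variance_lottery` are hypothetical.

```python
import random
from typing import List, Optional, Sequence, Set, Tuple


def top_k_set(preds: Sequence[float], k: int) -> Set[int]:
    """Indices of the k highest predictions of one model (ties broken by index)."""
    ranked = sorted(range(len(preds)), key=lambda i: preds[i], reverse=True)
    return set(ranked[:k])


def variance_lottery(
    model_preds: Sequence[Sequence[float]], k: int, seed: Optional[int] = None
) -> Tuple[List[int], List[float]]:
    """model_preds[j][i] is the prediction of bootstrapped model j for individual i.

    Returns the selected indices and each individual's top-k share across models.
    """
    rng = random.Random(seed)
    m, n = len(model_preds), len(model_preds[0])
    counts = [0] * n
    for preds in model_preds:
        for i in top_k_set(preds, k):
            counts[i] += 1
    share = [c / m for c in counts]

    # Individuals placed in the top k by every model receive the resource directly.
    selected = [i for i in range(n) if counts[i] == m]
    # Everyone placed in the top k by some, but not all, models enters the lottery.
    candidates = [i for i in range(n) if 0 < counts[i] < m]
    for _ in range(k - len(selected)):
        weights = [share[i] for i in candidates]
        pick = rng.choices(candidates, weights=weights, k=1)[0]
        selected.append(pick)
        candidates.remove(pick)
    return sorted(selected), share


# Hypothetical predictions from m = 3 models for n = 4 individuals, with k = 2.
model_preds = [
    [0.90, 0.80, 0.30, 0.10],
    [0.90, 0.20, 0.70, 0.10],
    [0.95, 0.60, 0.50, 0.00],
]
print(variance_lottery(model_preds, k=2, seed=0))
```

Here individual 0 is in every model's top 2 and is selected directly, individuals 1 and 2 compete for the remaining resource with weights 2/3 and 1/3, and individual 3 has no chance because no model placed them in the top 2.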

When compared to randomizing near the decision-boundary, we observe that randomizing using this estimation of variance results in a smaller utility loss for all model classes. Table 1 shows the randomization rates and utility that result from randomizing according to 11 bootstrapped models trained on 50% of the available training data. For the same randomization rates, we compare the utility that results from randomizing near the decision-boundary, and also report the utility from no randomization (top $k$). Consider the random forest model as an example: randomizing using variance results in just a 0.2% utility loss while randomizing 32% of resources over 15% of people. These randomization rates yield a 0.6% utility loss for randomizing near the decision-boundary.

Table 2: Randomizing outliers ($\alpha = 0.2$) compared to randomizing near the decision-boundary & the top $k$ allocation.

Model | Random Rate $k'/k$ | Random Rate $n'/n$ | Utility (Outliers) | Utility (Decision-Boundary) | Utility (Top $k$)
LR    | 1.2%               | 20.1%              | 62.7%              | 63.0%                       | 63.1%
RF    | 1.0%               | 20.1%              | 64.0%              | 64.3%                       | 64.4%
DT    | 3.0%               | 20.1%              | 62.2%              | 62.8%                       | 62.9%

Randomizing Outliers.  Many out-of-the-box methods exist for outlier detection, which quantify the uncertainty in a prediction that stems from a lack of similar individuals in the training data (Pimentel et al., 2014). For example, in the Swiss unemployment dataset there exists an individual $i$ (and $i'$) with very high (and very low) predicted value across all bootstrapped models, but with $o_{i}=0$ (and $o_{i'}=1$). Conformal prediction offers a way to assign a confidence measure to outlier detection methods, and produces low p-values for both individuals (<0.10). This motivates the use of conformal prediction to flag outliers (Angelopoulos & Bates, 2021) and then deploy a lottery for the resources that would have gone to “outlier” individuals based on a top $k$ allocation.

Specifically, consider the pool of individuals that we believe are outliers with high confidence ($\text{p-value} \leq \alpha$) for some small $\alpha$. If some of these individuals fall in the top $k$, then we propose to randomize those resources over the entire pool of “outlier” individuals using an unweighted lottery. Note that the pool of individuals that we believe are outliers is model-agnostic, since it is computed based on the features. How many of these “outlier” individuals would have ended up in the top $k$ depends on the model.
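A minimal Python sketch of this outlier lottery follows. It assumes the conformal p-values have already been computed by some outlier detection method and are passed in as a list, so the conformal step itself is not shown; the function name `outlier_lottery` and the toy inputs are hypothetical.

```python
import random
from typing import List, Optional, Sequence


def outlier_lottery(
    preds: Sequence[float],
    p_values: Sequence[float],
    k: int,
    alpha: float,
    seed: Optional[int] = None,
) -> List[int]:
    """Reassign the top-k slots held by flagged outliers via an unweighted lottery.

    preds[i] is the model prediction for individual i; p_values[i] is an
    (assumed, precomputed) outlier p-value, where low values flag outliers.
    """
    rng = random.Random(seed)
    n = len(preds)
    ranked = sorted(range(n), key=lambda i: preds[i], reverse=True)
    top_k = ranked[:k]

    # Model-agnostic pool: everyone flagged as an outlier at level alpha.
    pool = [i for i in range(n) if p_values[i] <= alpha]
    pool_set = set(pool)

    # Non-outliers in the top k keep their resource.
    kept = [i for i in top_k if i not in pool_set]
    # The slots that would have gone to outliers are drawn uniformly,
    # without replacement, from the entire outlier pool.
    winners = rng.sample(pool, k - len(kept))
    return sorted(kept + winners)


# Hypothetical example: individuals 1 and 3 are flagged as outliers (alpha = 0.2).
preds = [0.9, 0.8, 0.7, 0.2, 0.1]
p_values = [0.5, 0.05, 0.6, 0.1, 0.9]
print(outlier_lottery(preds, p_values, k=2, alpha=0.2, seed=0))
```

In this toy case the top 2 are individuals 0 and 1; individual 0 keeps their resource, while the slot of flagged individual 1 is drawn uniformly between the two flagged individuals 1 and 3. The number of randomized resources depends on the model through the top $k$, whereas the pool depends only on the features.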

Table 2 shows the randomization rates and tradeoff with utility for $\alpha = 0.2$, which is slightly more than the utility loss for randomizing near the decision-boundary. We end up randomizing just 2% of the available resources over 20% of the total people (note this directly corresponds to our choice of $\alpha = 0.2$). This suggests that the individuals being randomized based on outlier detection are different from those near the decision-boundary or with high variance in predictions. Figure 9 in the Appendix visualizes how the predictions with high uncertainty are different for each method.

Reduction in SER.  Lastly, we turn to how much our randomization proposals could reduce the systemic exclusion rate (SER). Similar to the experiments when claims are known, we find that small tradeoffs with utility yield much larger reductions in SER. Panel (k) demonstrates our results for each randomization method using the decision tree model class. In this case, randomizing using variance has the best tradeoff, though results vary across model classes and selection rates (see Appendix A.2 for other cases).

[Figure panel (k): SER v. Utility Tradeoff for each Randomization Method (Model Class: Decision Tree)]

5 Discussion

We argued in section 3 that sometimes fairness requires randomizing allocations of scarce resources or opportunities, and in section 4 we provide randomization techniques that respect many claims while not losing significant predictive performance. We now extend the argument and explore implications of these findings.

Utility.  When claims are known, randomization sometimes trades off against expected utility or predictive success. Although some may find this tradeoff hard to endorse, we suggest two things. First, a claims-based moral framework holds that people’s claims must be satisfied (or acknowledged by the surrogate satisfaction of a lottery). Some reject claims and take utility to be the only currency of moral concern. However, anyone who agrees that it is more fair for a qualified candidate to have a chance than to never have had a chance can consider claims as an objective within a broader utility-maximization framework, such that overall utility can be improved by satisfying more claims. Second, our exploration of uncertainty suggests that what appears to be a tradeoff is, at times, a movement to a different point within the same bounds of uncertainty. Over-optimizing for apparent utility ignores our true uncertainty about the facts and moral claims of the case. Thus the utility we appear to give up in order to honor applicants’ valid claims may be illusory: there may be no tradeoff at all.

Human Randomness.  How does the intentional randomness we propose in this work compare to the natural variance of human decision-making? For example, despite being extensively trained in decision-making, judges who evaluate the same case (Ludwig & Mullainathan, 2021) often disagree, and judges disagree with their past selves evaluating similar cases over time (Collins, 2008). This property should make human decision-making less homogeneous than algorithmic decision-making (Creel & Hellman, 2022). However, we do not find human randomness to be a satisfactory substitute for intentional randomization. Although human decision-making is not consistent, its outcomes are not guaranteed to be distributed across people in accordance with their claims, as social biases concentrate bad outcomes on individuals from marginalized groups in many situations. Furthermore, we have shown above that fairness requires selecting the most appropriate form of randomness given the problem description and the underlying distribution of data. Human inconsistency is not subject to these matching constraints.

Scope.  We do not think that randomization is fair in all settings. For example, criminal justice is served by respecting the procedural rights of defendants and attempting to determine whether the accusations they face are true. Criminal justice is not a matter of comparative claims: each defendant must be evaluated separately, not in comparison to others. To randomize the outcomes would be unfair. However, we affirm the value of randomization in settings in which scarce resources must be fairly allocated on the basis of uncertain information. Since this encompasses many algorithmic decision-making contexts, we encourage the field of fair machine learning to consider randomization as an important element of fairness.

Impact Statement

This paper contributes to the literature on algorithmic fairness in two ways. (1) As a position paper, it encourages others to reconsider whether deterministic algorithms are always the right choice for fairness, arguing that randomization techniques are to be preferred in some settings. (2) In support of these arguments, the paper also presents concrete techniques that can be used to randomize and shows how they reduce systemic exclusion and patterned inequality. As such, we hope that it will have a positive impact in reducing bias and unfairness.

However, it is also possible that a decision-maker might use the tools presented here to randomize outcomes in a domain that the authors warn would be unjust or inappropriate, such as the domain of criminal justice, and in doing so wrong decision subjects. Since all of the randomization techniques that form the basis of the paper’s experiments are well established and easily implementable, the paper does not make improper use of these tools easier than it would have been before. But it is possible that its existence will suggest the idea to someone who might not have otherwise had it.

The authors have attempted to prevent this outcome by making it clear which uses of randomization they believe are appropriate or inappropriate.

References

  • Agarwal & Deshpande (2022) Agarwal, S. and Deshpande, A. On the power of randomization in fair classification and representation. In Proceedings of the 2022 ACM Conference on Fairness, Accountability, and Transparency, pp.  1542–1551, 2022.
  • Agrawal & Goyal (2012) Agrawal, S. and Goyal, N. Analysis of thompson sampling for the multi-armed bandit problem. In Conference on learning theory, pp.  39–1. JMLR Workshop and Conference Proceedings, 2012.
  • Ajunwa (2021) Ajunwa, I. An auditing imperative for automated hiring systems. Harvard Journal of Law & Technology, 34(2), 2021.
  • Angelopoulos & Bates (2021) Angelopoulos, A. N. and Bates, S. A gentle introduction to conformal prediction and distribution-free uncertainty quantification. arXiv preprint arXiv:2107.07511, 2021.
  • Black & Fredrikson (2021) Black, E. and Fredrikson, M. Leave-one-out unfairness. In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, pp.  285–295, 2021.
  • Black et al. (2022) Black, E., Raghavan, M., and Barocas, S. Model multiplicity: Opportunities, concerns, and solutions. In Proceedings of the 2022 ACM Conference on Fairness, Accountability, and Transparency, pp.  850–863, 2022.
  • Bommasani et al. (2021) Bommasani, R., Hudson, D. A., Adeli, E., Altman, R., Arora, S., von Arx, S., Bernstein, M. S., Bohg, J., Bosselut, A., Brunskill, E., et al. On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258, 2021.
  • Bommasani et al. (2022) Bommasani, R., Creel, K. A., Kumar, A., Jurafsky, D., and Liang, P. S. Picking on the same person: Does algorithmic monoculture lead to outcome homogenization? In Advances in Neural Information Processing Systems, volume 35, pp.  3663–3678, 2022.
  • Broderick et al. (2020) Broderick, T., Giordano, R., and Meager, R. An automatic finite-sample robustness metric: when can dropping a little data make a big difference? arXiv preprint arXiv:2011.14999, 2020.
  • Broome (1990) Broome, J. Fairness. In Proceedings of the Aristotelian Society, volume 91, pp. 87–101, 1990.
  • Chin et al. (2023) Chin, M. H., Afsar-Manesh, N., Bierman, A. S., Chang, C., Colón-Rodríguez, C. J., Dullabh, P., Duran, D. G., Fair, M., Hernandez-Boussard, T., Hightower, M., et al. Guiding principles to address the impact of algorithm bias on racial and ethnic disparities in health and health care. Journal of the American Medical Association, 6(12), 2023.
  • Cohen-Schotanus et al. (2006) Cohen-Schotanus, J., Muijtjens, A. M. M., Reinders, J. J., Agsteribbe, J., van Rossum, H. J. M., and van der Vleuten, C. P. M. The predictive validity of grade point average scores in a partial lottery medical school admission system. Medical Education, 40(10):1012–1019, October 2006. ISSN 1365-2923.
  • Collins (2008) Collins, P. M. The consistency of judicial choice. The Journal of Politics, 70(3):861–873, July 2008. ISSN 1468-2508.
  • Cooper et al. (2023) Cooper, A. F., Barocas, S., De Sa, C., and Sen, S. Variance, self-consistency, and arbitrariness in fair classification. arXiv preprint arXiv:2301.11562, 2023.
  • Creel & Hellman (2022) Creel, K. and Hellman, D. The algorithmic leviathan: Arbitrariness, fairness, and opportunity in algorithmic decision-making systems. Canadian Journal of Philosophy, 52(1):26–43, 2022. doi: 10.1017/can.2022.3.
  • Dawid (2017) Dawid, P. On individual risk. Synthese, 194(9):3445–3474, 2017.
  • Ding et al. (2021) Ding, F., Hardt, M., Miller, J., and Schmidt, L. Retiring adult: New datasets for fair machine learning. Advances in neural information processing systems, 34:6478–6490, 2021.
  • Dwork et al. (2021) Dwork, C., Kim, M. P., Reingold, O., Rothblum, G. N., and Yona, G. Outcome indistinguishability. In Proceedings of the 53rd Annual ACM SIGACT Symposium on Theory of Computing, pp.  1095–1108, 2021.
  • Eidelson (2021) Eidelson, B. Patterned inequality, compounding injustice, and algorithmic prediction. American Journal of Law and Equality, 1:252–276, 2021.
  • Ganesh et al. (2023) Ganesh, P., Chang, H., Strobel, M., and Shokri, R. On the impact of machine learning randomness on group fairness. In Proceedings of the 2023 ACM Conference on Fairness, Accountability, and Transparency, pp.  1789–1800, 2023.
  • Grgić-Hlača et al. (2017) Grgić-Hlača, N., Zafar, M. B., Gummadi, K. P., and Weller, A. On fairness, diversity and randomness in algorithmic decision making. arXiv preprint arXiv:1706.10208, 2017.
  • Hastings et al. (2006) Hastings, J., Kane, T., and Staiger, D. Preferences and heterogeneous treatment effects in a public school choice lottery, 2006. URL http://dx.doi.org/10.3386/w12145.
  • Hellman (2018) Hellman, D. Indirect discrimination and the duty to avoid compounding injustice. Foundations of Indirect Discrimination Law, Hart Publishing Company, pp.  2017–53, 2018.
  • Hooker (2005) Hooker, B. Fairness. Ethical theory and moral practice, 8:329–352, 2005.
  • Hsu & Calmon (2022) Hsu, H. and Calmon, F. Rashomon capacity: A metric for predictive multiplicity in classification. In Koyejo, S., Mohamed, S., Agarwal, A., Belgrave, D., Cho, K., and Oh, A. (eds.), Advances in Neural Information Processing Systems, volume 35, pp.  28988–29000. Curran Associates, Inc., 2022.
  • Jain et al. (2024) Jain, S., Suriyakumar, V., Creel, K., and Wilson, A. Algorithmic pluralism: A structural approach to equal opportunity. In Proceedings of the 2024 ACM Conference on Fairness, Accountability, and Transparency, FAccT ’24, pp.  197–206, 2024. URL https://doi.org/10.1145/3630106.3658899.
  • Joseph et al. (2016a) Joseph, M., Kearns, M., Morgenstern, J. H., and Roth, A. Fairness in learning: Classic and contextual bandits. In Lee, D., Sugiyama, M., Luxburg, U., Guyon, I., and Garnett, R. (eds.), Advances in Neural Information Processing Systems, volume 29. Curran Associates, Inc., 2016a. URL https://proceedings.neurips.cc/paper_files/paper/2016/file/eb163727917cbba1eea208541a643e74-Paper.pdf.
  • Joseph et al. (2016b) Joseph, M., Kearns, M., Morgenstern, J. H., and Roth, A. Fairness in learning: Classic and contextual bandits. Advances in neural information processing systems, 29, 2016b.
  • Kirkpatrick & Eastwood (2015) Kirkpatrick, J. R. and Eastwood, N. Broome’s theory of fairness and the problem of quantifying the strengths of claims. Utilitas, 27(1):82–91, 2015.
  • Kleinberg & Raghavan (2021) Kleinberg, J. and Raghavan, M. Algorithmic monoculture and social welfare. Proceedings of the National Academy of Sciences, 118(22):e2018340118, 2021.
  • Lechner et al. (2020) Lechner, M., Knaus, M., Huber, M., Frölich, M., Behncke, S., Mellace, G., and Strittmatter, A. Swiss active labor market policy evaluation [dataset]. Distributed by FORS, Lausanne, 2020.
  • Li et al. (2020) Li, D., Raymond, L. R., and Bergman, P. Hiring as exploration. Technical report, National Bureau of Economic Research, 2020.
  • Ludwig & Mullainathan (2021) Ludwig, J. and Mullainathan, S. Fragile algorithms and fallible decision-makers: lessons from the justice system. Journal of Economic Perspectives, 35(4):71–96, 2021.
  • Marx et al. (2020) Marx, C., Calmon, F., and Ustun, B. Predictive multiplicity in classification. In International Conference on Machine Learning, pp. 6765–6774. PMLR, 2020.
  • McCreary et al. (2023) McCreary, E. K., Essien, U. R., Chang, C.-C. H., Butler, R. A., Pathak, P., Sönmez, T., Ünver, M. U., Steiner, A., Chrisman, M., Angus, D. C., and White, D. B. Weighted lottery to equitably allocate scarce supply of covid-19 monoclonal antibody. JAMA Health Forum, 4(9):e232774, September 2023. ISSN 2689-0186. doi: 10.1001/jamahealthforum.2023.2774. URL http://dx.doi.org/10.1001/jamahealthforum.2023.2774.
  • Mitchell et al. (2021) Mitchell, S., Potash, E., Barocas, S., D’Amour, A., and Lum, K. Algorithmic fairness: Choices, assumptions, and definitions. Annual Review of Statistics and Its Application, 8:141–163, 2021.
  • Nawrat (2023) Nawrat, A. Inside hirevue’s acquisition of modern hire. https://www.unleash.ai/hr-technology/inside-hirevues-acquisition-of-modern-hire/, 2023.
  • Passi & Barocas (2019) Passi, S. and Barocas, S. Problem formulation and fairness. In Proceedings of the conference on fairness, accountability, and transparency, pp.  39–48, 2019.
  • Peng & Garg (2023) Peng, K. and Garg, N. Monoculture in matching markets. arXiv preprint arXiv:2312.09841, 2023.
  • Pimentel et al. (2014) Pimentel, M. A., Clifton, D. A., Clifton, L., and Tarassenko, L. A review of novelty detection. Signal processing, 99:215–249, 2014.
  • Raghavan et al. (2020) Raghavan, M., Barocas, S., Kleinberg, J., and Levy, K. Mitigating bias in algorithmic hiring: Evaluating claims and practices. In Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency, pp.  469–481, 2020.
  • Roth et al. (2023) Roth, A., Tolbert, A., and Weinstein, S. Reconciling individual probability forecasts. In Proceedings of the 2023 ACM Conference on Fairness, Accountability, and Transparency, pp.  101–110, 2023.
  • Schmidt et al. (2021) Schmidt, H., Roberts, D. E., and Eneanya, N. D. Rationing, racism and justice: advancing the debate around “colourblind” COVID-19 ventilator allocation. Journal of Medical Ethics, 2021.
  • Sharifi-Malvajerdi et al. (2019) Sharifi-Malvajerdi, S., Kearns, M., and Roth, A. Average individual fairness: Algorithms, generalization and experiments. In Wallach, H., Larochelle, H., Beygelzimer, A., d'Alché-Buc, F., Fox, E., and Garnett, R. (eds.), Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019. URL https://proceedings.neurips.cc/paper_files/paper/2019/file/0e1feae55e360ff05fef58199b3fa521-Paper.pdf.
  • Singh et al. (2021) Singh, A., Kempe, D., and Joachims, T. Fairness in ranking under uncertainty. In Advances in Neural Information Processing Systems, volume 34, pp.  11896–11908, 2021.
  • Toups et al. (2023) Toups, C., Bommasani, R., Creel, K., Bana, S., Jurafsky, D., and Liang, P. Ecosystem-level analysis of deployed machine learning reveals homogeneous outcomes. In Advances in Neural Information Processing Systems, 2023.

Appendix A Appendix

A.1 When Claims Are Known Experiments

We simulate different distributions of claims and compare 3 different allocation types: (1) deterministic selection of the top $k$ claims, (2) the BF lottery in Example 4.3, and (3) the partial BF lottery in Example 4.10. Specifically, we analyze the tradeoff between utility and reduction in SER under various selection rates and levels of noise. We simulate 1000 individuals and average the results over 1000 iterations for each experiment. We consider the following distributions:

  • Normal: more average claims (Figure 1b)

  • Inverted Normal: more strong and weak claims (Figure 2b)

  • Pareto: more weak claims (Figure 3b)

  • Inverted Pareto: more strong claims (Figure 4b)

  • Uniform: all claims equally likely (Figure 5b)

For each distribution, we illustrate the following (letters correspond to sub-figures for each distribution):

  (a) Distribution of claims: density under 3 different parameter choices for all distribution types except uniform (i.e., different $\sigma$ for normal or $\alpha$ for Pareto).

  (b) Expected utility for the top $k$ selection and the BF lottery in Example 4.3. We consider each parameter choice in (a).

  (c) - (e) Expected utility for varying partial BF randomization rates in Example 4.10. We consider different choices of $k'/k$ and $n'/n$ (in increments of 0.1). We now use a fixed parameter choice ($\sigma = 0.15$ for normal and $\alpha = 2$ for Pareto) in this sub-figure and all subsequent sub-figures. (c) uses $k/n = 0.1$, (d) uses $k/n = 0.25$, and (e) uses $k/n = 0.5$.

  (f) - (h) Systemic exclusion rate v. expected utility across varying partial BF randomization rates. For each of the choices of $k'/k$ and $n'/n$ in (c) - (e), we calculate the systemic exclusion rate for a given number of decision-makers and a given amount of noise added to each decision-maker's claims ($\pm\,N(0, \sigma^2)$). We then plot the tradeoff across randomization rates by showing the lowest SER possible for each percentage decrease in expected utility. We show this tradeoff for 2, 3, and 4 decision-makers, as well as for no noise, $\sigma = 0.025$, and $\sigma = 0.05$. (f) uses $k/n = 0.1$, (g) uses $k/n = 0.25$, and (h) uses $k/n = 0.5$.

In Figure 6, we show the reduction in systemic exclusion rate using the full BF lottery in Example 4.3 for all distributions of claims together. This replicates the corresponding systemic exclusion rate figure in the main text, but for different amounts of noise added to each decision-maker's claims, as well as different selection rates.
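The simulation described in this section can be sketched roughly as follows. This is not the authors' code. The claim distribution (normal with mean 0.5 and $\sigma = 0.15$, clipped to [0.01, 1]), the clipping of noisy claims, and the reading of SER as the share of individuals selected by none of the decision-makers are all assumptions made for illustration, and the paper's exact definitions may differ. All function names are hypothetical.

```python
import random


def top_k(claims, k):
    """Deterministically select the indices of the k strongest claims."""
    order = sorted(range(len(claims)), key=lambda i: claims[i], reverse=True)
    return set(order[:k])


def weighted_lottery(claims, k, rng):
    """Draw k distinct indices; each draw is proportional to the remaining claims."""
    remaining = list(range(len(claims)))
    chosen = set()
    for _ in range(k):
        weights = [claims[i] for i in remaining]
        pick = rng.choices(range(len(remaining)), weights=weights, k=1)[0]
        chosen.add(remaining.pop(pick))
    return chosen


def systemic_exclusion_rate(selections, n):
    """Assumed definition: share of the n individuals selected by no decision-maker."""
    selected_by_any = set().union(*selections)
    return 1 - len(selected_by_any) / n


def clip(value, low=0.01, high=1.0):
    """Keep claims strictly positive so they can be used as lottery weights."""
    return min(high, max(low, value))


def simulate(n=200, k=20, makers=3, noise_sd=0.025, seed=0):
    """Return (SER under top-k, SER under the weighted lottery) for one draw."""
    rng = random.Random(seed)
    claims = [clip(rng.gauss(0.5, 0.15)) for _ in range(n)]
    deterministic, lottery = [], []
    for _ in range(makers):
        # Each decision-maker sees the true claims plus independent noise.
        noisy = [clip(c + rng.gauss(0, noise_sd)) for c in claims]
        deterministic.append(top_k(noisy, k))
        lottery.append(weighted_lottery(noisy, k, rng))
    return (systemic_exclusion_rate(deterministic, n),
            systemic_exclusion_rate(lottery, n))


if __name__ == "__main__":
    det_ser, lot_ser = simulate()
    print(f"top-k SER: {det_ser:.3f}, lottery SER: {lot_ser:.3f}")
```

Averaging `simulate` over many seeds would approximate the repeated-iteration setup described above.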

A.2 When Claims Are Unknown Experiments

We test our randomization proposals on 2 datasets: (1) Swiss Unemployment Data (Lechner et al., 2020), and (2) Census Income Data (Ding et al., 2021). For each dataset, we test our 3 randomization methods: randomizing near the decision-boundary, randomizing using variance, and randomizing outliers. We report results for 3 different model classes (logistic regression, random forests, and decision trees) and 3 different selection rates (0.1, 0.25, and 0.5). Specifically, we analyze how each randomization method changes how many resources ($k'$) and what kinds of people ($n'$) are randomized, for some (often minimal) loss in utility. We also compare how much each method reduces the systemic exclusion rate. All our experiments involve an 80-20 train-test split (with 5 repetitions), and we average randomization results over 100 iterations. For each dataset, we provide the following:

  • Visualization of the distribution of predictions: Figure 7(a) for Swiss data and Figure 8(a) for Census data

  • Utility and randomization rates for randomizing near the decision boundary: Figure 7(b)-(d) for Swiss data and Figure 8(b)-(d) for Census data

  • Utility and randomization rates for randomizing using variance: Table 3 for Swiss data and Table 5 for Census data

  • Utility and randomization rates for randomizing outliers: Table 4 for Swiss data and Table 6 for Census data

  • Visualization of the tradeoff between utility and systemic exclusion rate for all randomization methods: Figure 7(e)-(g) for Swiss data and Figure 8(e)-(g) for Census data

  • Visualization of the density of predictions by point estimate and uncertainty metric: Figure 9

A.3 Pseudocode for Randomization Proposals

Algorithm 1 Partial BF Lottery
Output: selected people
Input: people, claims, $k$, $n$, $k'$, $n'$
Require: $k' \in (0, k]$, $n' \in (k', n - k + k']$
1:  people, claims ← Sort(people, claims)
2:  deterministic selections ← people[ : $k - k'$]
3:  random selections ← Iterative Weighted Selection($k'$, people[$k - k'$ : $k - k' + n'$], claims[$k - k'$ : $k - k' + n'$])
4:  return deterministic selections + random selections
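A minimal Python sketch of Algorithm 1 follows, under stated assumptions. It uses only the standard library (`random.choices`) for the Iterative Weighted Selection step rather than numpy, assumes all claims in the randomized pool are strictly positive, and takes $n$ to be the length of `people`. The function names are illustrative and not taken from the authors' implementation.

```python
import random


def iterative_weighted_selection(k, people, claims, rng):
    """Pick k distinct people; each draw is proportional to the remaining claims."""
    people, claims = list(people), list(claims)
    selected = []
    for _ in range(k):
        idx = rng.choices(range(len(people)), weights=claims, k=1)[0]
        selected.append(people.pop(idx))
        claims.pop(idx)
    return selected


def partial_bf_lottery(people, claims, k, k_prime, n_prime, rng=None):
    """Give k - k' resources to the top claims and randomize k' over the next n' people."""
    rng = rng or random.Random()
    n = len(people)
    if not (0 < k_prime <= k and k_prime < n_prime <= n - k + k_prime):
        raise ValueError("need k' in (0, k] and n' in (k', n - k + k']")

    # Sort from strongest to weakest claim (zero-based indexing).
    order = sorted(range(n), key=lambda i: claims[i], reverse=True)
    people = [people[i] for i in order]
    claims = [claims[i] for i in order]

    cut = k - k_prime
    deterministic = people[:cut]
    pool = slice(cut, cut + n_prime)
    randomized = iterative_weighted_selection(
        k_prime, people[pool], claims[pool], rng
    )
    return deterministic + randomized


if __name__ == "__main__":
    names = list("ABCDEFGH")
    strengths = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]
    print(partial_bf_lottery(names, strengths, k=4, k_prime=2, n_prime=4,
                             rng=random.Random(0)))
```

In the usage example, A and B are always selected, and the remaining two resources are drawn from C, D, E, and F in proportion to their claims. Setting `k_prime = k` and `n_prime = n` randomizes every resource over everyone.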
Algorithm 2 Randomization Using Variance
Output: selected people
Input: people, claims, $k$, $n$, $B$
1:  people, claims ← Sort(people, claims)
2:  $\hat{y}_B$ := list()
3:  for i ∈ 1…$n$ do
4:     vote := 0
5:     for 1…$B$ do
6:        if Bootstrapped Claim(people[i]) > claims[$k$] then
7:           vote ← vote + 1
8:        end if
9:     end for
10:     $\hat{y}_B$.append(vote / $B$)
11:  end for
12:  
13:  deterministic selections ← people[i] for i ∈ 1…$n$ if $\hat{y}_B$[i] is 1
14:  uncertain people ← people[i] for i ∈ 1…$n$ if $\hat{y}_B$[i] ∈ (0,1)
15:  uncertain claims ← claims[i] for i ∈ 1…$n$ if $\hat{y}_B$[i] ∈ (0,1)
16:  $k'$ ← $k$ − Length(deterministic selections)
17:  random selections ← Iterative Weighted Selection($k'$, uncertain people, uncertain claims)
18:  return deterministic selections + random selections
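Algorithm 2 can be sketched in Python as below. The sketch assumes the bootstrapped claims are already available as a list of $B$ lists (`boot_claims[b][i]` is the claim that bootstrapped model `b` assigns to person `i`), so model training is out of scope. It also assumes strictly positive claims, uses the standard library instead of numpy, and raises an error in edge cases the pseudocode leaves open, such as too few uncertain people. Names are illustrative.

```python
import random


def iterative_weighted_selection(k, people, claims, rng):
    """Pick k distinct people; each draw is proportional to the remaining claims."""
    people, claims = list(people), list(claims)
    selected = []
    for _ in range(k):
        idx = rng.choices(range(len(people)), weights=claims, k=1)[0]
        selected.append(people.pop(idx))
        claims.pop(idx)
    return selected


def randomize_using_variance(people, claims, boot_claims, k, rng=None):
    """Select people whose bootstrapped claims always clear the threshold; randomize the rest."""
    rng = rng or random.Random()
    order = sorted(range(len(people)), key=lambda i: claims[i], reverse=True)
    threshold = claims[order[k]]  # claims[k] after sorting, zero-based
    num_models = len(boot_claims)

    # Share of bootstrapped models that place each person above the threshold.
    vote_share = {
        i: sum(model[i] > threshold for model in boot_claims) / num_models
        for i in order
    }
    deterministic = [people[i] for i in order if vote_share[i] == 1]
    uncertain = [i for i in order if 0 < vote_share[i] < 1]

    k_prime = k - len(deterministic)
    if not 0 <= k_prime <= len(uncertain):
        raise ValueError("cannot fill k' resources from the uncertain pool")
    randomized = iterative_weighted_selection(
        k_prime,
        [people[i] for i in uncertain],
        [claims[i] for i in uncertain],
        rng,
    )
    return deterministic + randomized


if __name__ == "__main__":
    names = list("ABCDEF")
    point = [0.9, 0.8, 0.6, 0.5, 0.4, 0.1]
    boots = [
        [0.90, 0.80, 0.60, 0.55, 0.40, 0.10],
        [0.85, 0.70, 0.45, 0.60, 0.45, 0.10],
        [0.95, 0.75, 0.55, 0.40, 0.30, 0.20],
    ]
    print(randomize_using_variance(names, point, boots, k=3, rng=random.Random(0)))
```

In the usage example the threshold is 0.5, so A and B clear it under every bootstrapped model and are selected outright, while C and D clear it under two of three models and compete for the last resource.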
Algorithm 3 Randomization Using Outliers
Output: selected people
Input: people, claims, $k$, $n$, $\alpha$
1:  people, claims ← Sort(people, claims)
2:  deterministic selections ← people[i] for i ∈ 1…$n$ if Outlier P-Value(people[i]) $> \alpha$
3:  uncertain people ← people[i] for i ∈ 1…$n$ if Outlier P-Value(people[i]) $\leq \alpha$
4:  uncertain claims ← claims[i] for i ∈ 1…$n$ if Outlier P-Value(people[i]) $\leq \alpha$
5:  $k'$ ← $k$ − Length(deterministic selections)
6:  random selections ← Iterative Weighted Selection($k'$, uncertain people, uncertain claims)
7:  return deterministic selections + random selections
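The sketch below illustrates one reading of Algorithm 3 in Python. It makes a notable assumption: the deterministic selections are the non-outliers among the $k$ strongest claims, rather than among all $n$ people, so that $k'$ equals the number of outliers in the top $k$. This interpretation is ours and may not match the authors' implementation. Outlier p-values are passed in precomputed, claims are assumed strictly positive, and the names are illustrative.

```python
import random


def iterative_weighted_selection(k, people, claims, rng):
    """Pick k distinct people; each draw is proportional to the remaining claims."""
    people, claims = list(people), list(claims)
    selected = []
    for _ in range(k):
        idx = rng.choices(range(len(people)), weights=claims, k=1)[0]
        selected.append(people.pop(idx))
        claims.pop(idx)
    return selected


def randomize_outliers(people, claims, p_values, k, alpha, rng=None):
    """Keep non-outliers in the top k; re-allocate the other slots among all outliers."""
    rng = rng or random.Random()
    order = sorted(range(len(people)), key=lambda i: claims[i], reverse=True)

    # Assumption: only the top k are eligible for deterministic selection.
    deterministic = [people[i] for i in order[:k] if p_values[i] > alpha]
    # Outliers (p-value <= alpha) anywhere in the ranking form the lottery pool.
    outliers = [i for i in order if p_values[i] <= alpha]

    k_prime = k - len(deterministic)
    randomized = iterative_weighted_selection(
        k_prime,
        [people[i] for i in outliers],
        [claims[i] for i in outliers],
        rng,
    )
    return deterministic + randomized


if __name__ == "__main__":
    names = list("ABCDEF")
    strengths = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]
    p_vals = [0.50, 0.05, 0.60, 0.70, 0.02, 0.90]
    print(randomize_outliers(names, strengths, p_vals, k=3, alpha=0.1,
                             rng=random.Random(0)))
```

In the usage example, B is an outlier inside the top 3, so A and C are selected outright and the third resource is drawn between the two outliers B and E in proportion to their claims.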

Notes:

  • The Iterative Weighted Selection refers to Ex 4.3 and can be performed using numpy’s random.choice method.

  • $k'$ denotes how many resources (out of $k$) are randomized, where the randomization occurs over $n'$ people (out of $n$).

  • Array indexing is zero-based. Sort orders the arrays in descending order from the strongest to weakest claim.

  • We use $B = 11$ for randomization using variance, and train bootstrapped models on a 50% subset of the training data.

  • We provide additional details on how we compute the Outlier P-Value in the next section.

A.4 Conformal Prediction Methodology

We use conformal prediction to assign a confidence measure to outlier detection methods (Angelopoulos & Bates, 2021). Specifically, we use the following procedure:

  1. Suppose we have a novelty score $s: \mathcal{X} \rightarrow \mathbb{R}$, where larger values indicate more abnormality from the training data. For example, we use average Euclidean distance to the training data.

  2. We want to find $q$ such that $\mathbb{P}(s(x) > q) \leq \alpha$ if $x \sim \mathcal{X}_{\text{train}}$, where $\alpha$ represents the bound on the false positive rate.

  3. Reserve a calibration dataset $\mathcal{X}_{\text{cal}}$. For each $x_{\text{cal}}^{j} \in \mathcal{X}_{\text{cal}}$, compute the novelty score $s(x_{\text{cal}}^{j})$ with respect to $\mathcal{X}_{\text{train}}$.

  4. Compute $\hat{q} = \text{quantile}\left(s_{\text{cal}}^{1}, \ldots, s_{\text{cal}}^{n_{\text{cal}}};\ \frac{\lceil (n_{\text{cal}}+1)(1-\alpha) \rceil}{n_{\text{cal}}}\right)$.

  5. If $s(x_i) > \hat{q}$, then we consider individual $i$ to be an outlier.

  6. Specifically, we take the p-value associated with outlier detection to be: $\frac{1}{n_{\text{cal}}+1} \cdot \left(1 + \sum_{j=1}^{n_{\text{cal}}} \mathds{1}\{s(x_i) \leq s_{\text{cal}}^{j}\}\right)$, where $s_{\text{cal}}^{j} = s(x_{\text{cal}}^{j})$.
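The procedure above can be sketched in Python as follows. This is an illustrative sketch, not the authors' code. It assumes feature vectors are plain tuples of floats and reads the quantile in step 4 as the $\lceil (n_{\text{cal}}+1)(1-\alpha) \rceil$-th smallest calibration score, returning infinity when that rank exceeds $n_{\text{cal}}$. The second choice is a common conformal-prediction convention that the text does not state. Function names are hypothetical.

```python
import math


def novelty_score(x, train):
    """Average Euclidean distance from x to every training point (step 1)."""
    return sum(math.dist(x, t) for t in train) / len(train)


def conformal_quantile(cal_scores, alpha):
    """Threshold q-hat from the calibration scores (step 4)."""
    n_cal = len(cal_scores)
    rank = math.ceil((n_cal + 1) * (1 - alpha))  # 1-based order statistic
    if rank > n_cal:
        return math.inf  # assumed convention: too few calibration points to flag anything
    return sorted(cal_scores)[rank - 1]


def outlier_p_value(score, cal_scores):
    """Conformal p-value (step 6): small values indicate an outlier."""
    n_cal = len(cal_scores)
    at_least_as_novel = sum(score <= s for s in cal_scores)
    return (1 + at_least_as_novel) / (n_cal + 1)


if __name__ == "__main__":
    train = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
    calibration = [(0.5, 0.5), (0.2, 0.8), (0.9, 0.1), (0.4, 0.6)]
    cal_scores = [novelty_score(x, train) for x in calibration]

    far_point = (5.0, 5.0)
    p = outlier_p_value(novelty_score(far_point, train), cal_scores)
    print(f"p-value for {far_point}: {p:.2f}")
```

An individual would then be treated as an outlier in Algorithm 3 when this p-value is at most $\alpha$.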

Figure 1: Normal Distribution of Claims
(a) Distribution of Claims
(b) Expected Utility for Top $k$ v. BF Lottery in Ex 4.3
(c) - (e) Expected Utility for Varying Partial BF Randomization Rates in Ex 4.10, at Selection Rates 0.1, 0.25, and 0.5
(f) - (h) Systemic Exclusion Rate v. Expected Utility Across Varying Partial BF Randomization Rates, at Selection Rates 0.1, 0.25, and 0.5

Figure 2: Inverse Normal Distribution of Claims
(a) Distribution of Claims
(b) Expected Utility for Top $k$ v. BF Lottery in Ex 4.3
(c) - (e) Expected Utility for Varying Partial BF Randomization Rates in Ex 4.10, at Selection Rates 0.1, 0.25, and 0.5
(f) - (h) Systemic Exclusion Rate v. Expected Utility Across Varying Partial BF Randomization Rates, at Selection Rates 0.1, 0.25, and 0.5

Figure 3: Pareto Distribution of Claims
(a) Distribution of Claims
(b) Expected Utility for Top $k$ v. BF Lottery in Ex 4.3
(c) - (e) Expected Utility for Varying Partial BF Randomization Rates in Ex 4.10, at Selection Rates 0.1, 0.25, and 0.5
(f) - (h) Systemic Exclusion Rate v. Expected Utility Across Varying Partial BF Randomization Rates, at Selection Rates 0.1, 0.25, and 0.5

Figure 4: Inverse Pareto Distribution of Claims
(a) Distribution of Claims
(b) Expected Utility for Top $k$ v. BF Lottery in Ex 4.3
(c) - (e) Expected Utility for Varying Partial BF Randomization Rates in Ex 4.10, at Selection Rates 0.1, 0.25, and 0.5
(f) - (h) Systemic Exclusion Rate v. Expected Utility Across Varying Partial BF Randomization Rates, at Selection Rates 0.1, 0.25, and 0.5

Figure 5: Uniform Distribution of Claims
(a) Distribution of Claims
(b) Expected Utility for Top $k$ v. BF Lottery in Ex 4.3
(c) - (e) Expected Utility for Varying Partial BF Randomization Rates in Ex 4.10, at Selection Rates 0.1, 0.25, and 0.5
(f) - (h) Systemic Exclusion Rate v. Expected Utility Across Varying Partial BF Randomization Rates, at Selection Rates 0.1, 0.25, and 0.5

Figure 6: Reduction in SER using the BF Lottery in Example 4.3 (cf. the corresponding systemic exclusion rate figure in the main text). Each decision-maker has a noisy estimation of claims ($\pm\,N(0, \sigma^2)$); (a) - (f) show different $\sigma$ and selection rates $k/n$.
(a) $k/n = 0.1$, $\sigma = 0.025$
(b) $k/n = 0.1$, $\sigma = 0.05$
(c) $k/n = 0.25$, $\sigma = 0.025$
(d) $k/n = 0.25$, $\sigma = 0.05$
(e) $k/n = 0.5$, $\sigma = 0.025$
(f) $k/n = 0.5$, $\sigma = 0.05$

Figure 7: Swiss Unemployment Data Experiments
(a) Distribution of Claims
(b) - (d) Utility From Randomizing Near the Decision-Boundary, at Selection Rates 0.1, 0.25, and 0.5
(e) - (g) Systemic Exclusion Rate v. Utility Tradeoff for Each Randomization Method, at Selection Rates 0.1, 0.25, and 0.5

Table 3: Swiss Unemployment Data – Randomizing Using Variance

| $k/n$ | Model | Random Rate $k'/k$ | Random Rate $n'/n$ | Utility: Variance | Utility: Decision-Boundary | Utility: Top $k$ |
|---|---|---|---|---|---|---|
| 0.10 | Log. Regression | 25.7% | 5.0% | 69.2% | 69.2% | 69.3% |
| 0.10 | Random Forest | 57.1% | 9.2% | 70.8% | 69.9% | 71.3% |
| 0.10 | Decision Tree | 97.4% | 12.3% | 64.6% | 68.3% | 69.5% |
| 0.25 | Log. Regression | 14.0% | 6.8% | 62.9% | 62.8% | 63.1% |
| 0.25 | Random Forest | 32.2% | 15.0% | 64.1% | 63.7% | 64.3% |
| 0.25 | Decision Tree | 73.7% | 39.0% | 61.5% | 58.9% | 62.9% |
| 0.50 | Log. Regression | 7.2% | 7.0% | 55.7% | 55.7% | 55.8% |
| 0.50 | Random Forest | 19.7% | 20.1% | 56.3% | 56.0% | 56.5% |
| 0.50 | Decision Tree | 50.3% | 57.6% | 54.2% | 52.5% | 55.5% |

Table 4: Swiss Unemployment Data – Randomizing Outliers

| $\alpha$ | $k/n$ | Model | Random Rate $k'/k$ | Random Rate $n'/n$ | Utility: Outliers | Utility: Decision-Boundary | Utility: Top $k$ |
|---|---|---|---|---|---|---|---|
| 0.20 | 0.10 | Log. Regression | 0.5% | 20.1% | 69.1% | 69.3% | 69.3% |
| 0.20 | 0.10 | Random Forest | 0.4% | 20.1% | 71.1% | 71.2% | 71.3% |
| 0.20 | 0.10 | Decision Tree | 1.6% | 20.1% | 69.1% | 69.4% | 69.5% |
| 0.20 | 0.25 | Log. Regression | 1.2% | 20.1% | 62.7% | 63.0% | 63.1% |
| 0.20 | 0.25 | Random Forest | 1.0% | 20.1% | 64.0% | 64.3% | 64.4% |
| 0.20 | 0.25 | Decision Tree | 3.0% | 20.1% | 62.2% | 62.8% | 62.9% |
| 0.20 | 0.50 | Log. Regression | 3.3% | 20.1% | 55.2% | 55.7% | 55.8% |
| 0.20 | 0.50 | Random Forest | 2.8% | 20.1% | 55.9% | 56.4% | 56.5% |
| 0.20 | 0.50 | Decision Tree | 5.9% | 20.1% | 54.8% | 55.3% | 55.5% |
| 0.10 | 0.25 | Log. Regression | 0.3% | 10.1% | 62.9% | 63.1% | 63.1% |
| 0.10 | 0.25 | Random Forest | 0.3% | 10.1% | 64.3% | 64.4% | 64.4% |
| 0.10 | 0.25 | Decision Tree | 1.2% | 10.1% | 62.6% | 62.9% | 62.9% |
| 0.30 | 0.25 | Log. Regression | 4.1% | 30.0% | 61.8% | 62.8% | 63.1% |
| 0.30 | 0.25 | Random Forest | 3.7% | 30.0% | 63.1% | 64.1% | 64.4% |
| 0.30 | 0.25 | Decision Tree | 7.2% | 30.0% | 61.1% | 62.5% | 62.9% |

Figure 8: Census Income Data Experiments
(a) Distribution of Claims
(b) - (d) Utility From Randomizing Near the Decision-Boundary, at Selection Rates 0.1, 0.25, and 0.5
(e) - (g) Systemic Exclusion Rate v. Utility Tradeoff for Each Randomization Method, at Selection Rates 0.1, 0.25, and 0.5

Table 5: Census Income Data – Randomizing Using Variance

| $k/n$ | Model | Random Rate $k'/k$ | Random Rate $n'/n$ | Utility: Variance | Utility: Decision-Boundary | Utility: Top $k$ |
|---|---|---|---|---|---|---|
| 0.10 | Log. Regression | 7.2% | 1.5% | 91.5% | 91.5% | 91.5% |
| 0.10 | Random Forest | 66.9% | 16.6% | 90.7% | 88.6% | 90.9% |
| 0.10 | Decision Tree | 0.0% | 0.0% | - | - | 83.3% |
| 0.25 | Log. Regression | 3.9% | 1.9% | 86.1% | 86.1% | 86.1% |
| 0.25 | Random Forest | 48.7% | 26.4% | 84.5% | 81.6% | 85.2% |
| 0.25 | Decision Tree | 70.6% | 39.9% | 81.0% | 74.2% | 82.8% |
| 0.50 | Log. Regression | 2.3% | 2.2% | 72.2% | 72.1% | 72.2% |
| 0.50 | Random Forest | 30.0% | 29.2% | 70.8% | 69.4% | 71.6% |
| 0.50 | Decision Tree | 45.9% | 46.0% | 67.7% | 64.5% | 69.4% |

Table 6: Census Income Data – Randomizing Outliers

| $\alpha$ | $k/n$ | Model | Random Rate $k'/k$ | Random Rate $n'/n$ | Utility: Outliers | Utility: Decision-Boundary | Utility: Top $k$ |
|---|---|---|---|---|---|---|---|
| 0.10 | 0.10 | Log. Regression | 10.7% | 9.9% | 86.5% | 91.1% | 91.5% |
| 0.10 | 0.10 | Random Forest | 3.9% | 9.9% | 88.9% | 90.7% | 90.9% |
| 0.10 | 0.10 | Decision Tree | 8.5% | 9.9% | 79.8% | 83.2% | 83.3% |
| 0.10 | 0.25 | Log. Regression | 9.6% | 9.9% | 82.3% | 85.7% | 86.1% |
| 0.10 | 0.25 | Random Forest | 7.7% | 9.9% | 81.8% | 84.9% | 85.2% |
| 0.10 | 0.25 | Decision Tree | 7.5% | 9.9% | 79.8% | 82.0% | 82.8% |
| 0.10 | 0.50 | Log. Regression | 9.4% | 9.9% | 69.6% | 71.8% | 72.2% |
| 0.10 | 0.50 | Random Forest | 9.7% | 9.9% | 68.9% | 71.2% | 71.6% |
| 0.10 | 0.50 | Decision Tree | 10.3% | 9.9% | 67.2% | 69.1% | 69.4% |
| 0.05 | 0.25 | Log. Regression | 5.1% | 5.0% | 84.1% | 86.0% | 86.1% |
| 0.05 | 0.25 | Random Forest | 3.7% | 5.0% | 83.5% | 85.2% | 85.2% |
| 0.05 | 0.25 | Decision Tree | 3.6% | 5.0% | 81.3% | 82.4% | 82.8% |
| 0.20 | 0.25 | Log. Regression | 18.6% | 20.0% | 78.5% | 84.4% | 86.1% |
| 0.20 | 0.25 | Random Forest | 15.6% | 20.0% | 78.3% | 83.8% | 85.2% |
| 0.20 | 0.25 | Decision Tree | 15.1% | 20.0% | 76.7% | 80.8% | 82.8% |

Figure 9: Density of Predictions by Point Estimate $\hat{p}(x_i)$ & Uncertainty Metric (Darker Colors = Higher Density)
(a) - (c) Swiss Unemployment Data – Predictions x Std Dev Across Bootstrapped Predictions: (a) Logistic Regression, (b) Random Forest, (c) Decision Tree
(d) - (f) Swiss Unemployment Data – Predictions x Conformal P-Values for Outlier Detection: (d) Logistic Regression, (e) Random Forest, (f) Decision Tree
(g) - (i) Census Income Data – Predictions x Std Dev Across Bootstrapped Predictions: (g) Logistic Regression, (h) Random Forest, (i) Decision Tree
(j) - (l) Census Income Data – Predictions x Conformal P-Values for Outlier Detection: (j) Logistic Regression, (k) Random Forest, (l) Decision Tree