
8 | BIAS AND FAIRNESS

Science and everyday life cannot and should not be separated. – Rosalind Franklin

At the end of Chapter 1, you saw the “Russian Tank” example of a biased data set leading to a classifier that seemed like it was doing well, but was really relying on some artifact of the data collection process. As machine learning algorithms have a greater and greater impact on the real world, it is crucially important to ensure that they are making decisions based on the “right” aspects of the input, rather than exploiting arbitrary idiosyncrasies of a particular training set.

For the rest of this chapter, we will consider two real world examples of bias issues that have had significant impact: the effect of gender in speech recognition systems and the effect of race in predicting criminal recidivism (i.e., will a convicted criminal commit further crimes if released?).^1 The gender issue is that early speech recognition systems in cars failed to recognize the voices of many people who were not men. The race issue is that a specific recidivism predictor based on standard learning algorithms was biased against minorities.

Dependencies: Chapter 1, Chapter 3, Chapter 4, Chapter 5

1 See Autoblog and ProPublica for press coverage of these two issues.

8.1 Train/Test Mismatch

One of the most common issues in bias is a mismatch between the training distribution and the testing distribution. In the running example of speech recognition failing to work for many speakers who are not men, a large part of this happened because most of the training
data on which the speech recognition system was trained was spoken
by men. The learning algorithm learned—very well—how to recog-
nize men’s speech, but its accuracy dropped significantly when faced
with a different distribution of test data.
To understand why this happens, recall the Bayes Optimal classifier from Chapter 2. This was the classifier that magically knows the data distribution D and, when presented with an example x, predicts argmax_y D(x, y). This was optimal, but only because D was the correct distribution. Even if one has access to the true distribution for male speakers, say D^(male), the Bayes Optimal classifier under D^(male) will generally not be optimal under the distribution for any other gender.
Another example occurs in sentiment analysis. It is common to
train sentiment analysis systems on data collected from reviews:
product reviews, restaurant reviews, movie reviews, etc. This data is
convenient because it includes both text (the review itself) and also
a rating (typically one to five stars). This yields a “free” dataset for
training a model to predict rating given text. However, one should
be wary when running such a sentiment classifier on text other than
the types of reviews it was trained on. One can easily imagine that
a sentiment analyzer trained on movie reviews may not work so
well when trying to detect sentiment in a politician’s speech. Even
moving between one type of product and another can fail wildly:
“very small” typically expresses positive sentiment for USB drives,
but not for hotel rooms.
The issue of train/test mismatch has been widely studied under
many different names: covariate shift (in statistics, the input features
are called “covariates”), sample selection bias and domain adapta-
tion are the most common. We will refer to this problem simply as
adaptation. The adaptation challenge is: given training data from one
“old” distribution, learn a classifier that does a good job on another
related, but different, “new” distribution.
It’s important to recognize that in general, adaptation is im-
possible. For example, even if the task remains the same (posi-
tive/negative sentiment), if the old distribution is text reviews and
the new distribution is images of text reviews, it’s hard to imagine
that doing well on the old distribution says anything about the new.
As a less extreme example, if the old distribution is movie reviews in
English and the new distribution is movie reviews in Mandarin, it’s
unlikely that adaptation will be easy.
These examples give rise to the tension in adaptation problems.

1. What does it mean for two distributions to be related? We might believe that “reviews of DVDs” and “reviews of movies” are “highly related” (though somewhat different: DVD reviews often discuss bonus features, quality, etc.); while “reviews of hotels” and “reviews of political policies” are “less related.” But how can we formalize this?

2. When two distributions are related, how can we build models that
effectively share information between them?

8.2 Unsupervised Adaptation

The first type of adaptation we will cover is unsupervised adaptation. The setting is the following. There are two distributions, D^old and D^new. We have labeled training data from D^old, say (x_1, y_1), ..., (x_N, y_N), totaling N examples. We also have M unlabeled examples from D^new: z_1, ..., z_M. We assume that the examples live in the same space, R^D. This is called unsupervised adaptation because we do not have access to any labels in the new distribution.^2

2 Sometimes this is called semi-supervised adaptation in the literature.

Our goal is to learn a classifier f that achieves low expected loss under the new distribution, D^new. The challenge is that we do not have access to any labeled data from D^new. As a warm-up, let's suppose that we have a black box machine learning algorithm A that takes in weighted examples and produces a classifier. At the very least, this can be achieved using either undersampling or oversampling (see Section 6.1). We're going to attempt to reweight the (old distribution) labeled examples based on how similar they are to the new distribution. This is justified using the importance sampling trick for switching expectations:

\begin{align}
&\text{test loss} \tag{8.1}\\
&= \mathbb{E}_{(x,y) \sim D^{new}}\big[\ell(y, f(x))\big] && \text{definition} \tag{8.2}\\
&= \sum_{(x,y)} D^{new}(x,y)\, \ell(y, f(x)) && \text{expand expectation} \tag{8.3}\\
&= \sum_{(x,y)} D^{new}(x,y)\, \frac{D^{old}(x,y)}{D^{old}(x,y)}\, \ell(y, f(x)) && \text{times one} \tag{8.4}\\
&= \sum_{(x,y)} D^{old}(x,y)\, \frac{D^{new}(x,y)}{D^{old}(x,y)}\, \ell(y, f(x)) && \text{rearrange} \tag{8.5}\\
&= \mathbb{E}_{(x,y) \sim D^{old}}\left[\frac{D^{new}(x,y)}{D^{old}(x,y)}\, \ell(y, f(x))\right] && \text{definition} \tag{8.6}
\end{align}

What we have achieved here is rewriting the test loss, which is an expectation over D^new, as an expectation over D^old instead.^3 This is useful because we have access to labeled examples from D^old but not from D^new. The algorithm implicitly suggested by this analysis is to train a classifier using our learning algorithm A, but where each training example (x_n, y_n) is weighted according to the ratio D^new(x_n, y_n) / D^old(x_n, y_n). Intuitively, this makes sense: the classifier is being told to pay more attention to training examples that have high probability under the new distribution, and less attention to training examples that have low probability under the new distribution.

3 In this example, we assumed a discrete distribution; if the distributions are continuous, the sums are simply replaced with integrals.
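To see the identity in Eq (8.6) in action, here is a small numerical sketch in Python; the distributions, the classifier, and the loss are all made up purely for illustration. Reweighting examples from the old distribution by D^new/D^old recovers exactly the expected loss under the new distribution.

import numpy as np

# Toy discrete setup: three inputs x in {0, 1, 2} with deterministic labels y = x mod 2.
xs = np.array([0, 1, 2])
ys = xs % 2

# Made-up input distributions for the old and new domains.
p_old = np.array([0.6, 0.3, 0.1])
p_new = np.array([0.1, 0.3, 0.6])

# A deliberately bad classifier that always predicts 0, scored with zero/one loss.
losses = (ys != 0).astype(float)

# Expected loss computed directly under the new distribution.
direct = np.sum(p_new * losses)

# The same expectation, written under the old distribution with
# importance weights D_new(x, y) / D_old(x, y).
weights = p_new / p_old
reweighted = np.sum(p_old * weights * losses)

print(direct, reweighted)   # both print 0.3

Of course, this toy check cheats: it assumes we know both distributions exactly, which is precisely what we lack in practice.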
The problem with this approach is that we do not have access to D^new or D^old, so we cannot compute this ratio and therefore cannot run this algorithm. One approach to this problem is to try to explicitly estimate these distributions, a task known as density estimation. This is an incredibly difficult problem, far harder than the original adaptation problem.

A solution to this problem is to try to estimate the ratio directly, rather than separately estimating the two probability distributions.^4

4 Bickel et al. 2007

The key idea is to think of the adaptation as follows. All examples are drawn according to some fixed base distribution D^base. Some of these are selected to go into the new distribution, and some of them are selected to go into the old distribution. The mechanism for deciding which ones are kept and which are thrown out is governed by a selection variable, which we call s. The choice of selection-or-not, s, is based only on the input example x and not on its label. (Question to ponder: what could go wrong if s got to look at the label, too?) In particular, we define:

\begin{align}
D^{old}(x, y) &\propto D^{base}(x, y)\, p(s = 1 \mid x) \tag{8.7}\\
D^{new}(x, y) &\propto D^{base}(x, y)\, p(s = 0 \mid x) \tag{8.8}
\end{align}
That is, the probability of drawing some pair (x, y) in the old distribution is proportional to the probability of first drawing that example according to the base distribution, and then the probability of selecting that particular example into the old distribution. If we can successfully estimate the selection probability p(s = 1 | x), then we can compute the importance ratio that we sought as:
\begin{align}
\frac{D^{new}(x, y)}{D^{old}(x, y)}
&= \frac{\frac{1}{Z^{new}}\, D^{base}(x, y)\, p(s = 0 \mid x)}{\frac{1}{Z^{old}}\, D^{base}(x, y)\, p(s = 1 \mid x)} && \text{definition} \tag{8.9}\\
&= \frac{\frac{1}{Z^{new}}\, p(s = 0 \mid x)}{\frac{1}{Z^{old}}\, p(s = 1 \mid x)} && \text{cancel base} \tag{8.10}\\
&= Z\, \frac{p(s = 0 \mid x)}{p(s = 1 \mid x)} && \text{consolidate} \tag{8.11}\\
&= Z\, \frac{1 - p(s = 1 \mid x)}{p(s = 1 \mid x)} && \text{binary selection} \tag{8.12}\\
&= Z\left[\frac{1}{p(s = 1 \mid x)} - 1\right] && \text{rearrange} \tag{8.13}
\end{align}
This means that if we can estimate the selection probability p(s = 1 | x), we're done. We can therefore use 1/p(s = 1 | x_n) − 1 as an example weight on example (x_n, y_n) when feeding these examples into our learning algorithm A. (As a check: make sure that these weights are always non-negative. Furthermore, why is it okay to ignore the Z factor?)

The remaining question is how to estimate p(s = 1 | x_n). Recall that s = 1 denotes the case that x is selected into the old distribution and s = 0 denotes the case that x is selected into the new distribution. This means that predicting s is exactly a binary classification problem, where the “positive” class is the set of N examples from the old distribution and the “negative” class is the set of M examples from the new distribution.
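For intuition, consider two made-up selection probabilities (with the constant Z dropped, since it rescales all weights equally):

\[
p(s = 1 \mid x_n) = 0.9 \;\Rightarrow\; \tfrac{1}{0.9} - 1 \approx 0.11,
\qquad
p(s = 1 \mid x_n) = 0.2 \;\Rightarrow\; \tfrac{1}{0.2} - 1 = 4.
\]

An example that looks strongly like the old distribution is nearly ignored, while one that looks like the new distribution is weighted up heavily.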

Algorithm 23 SelectionAdaptation(⟨(x_n, y_n)⟩_{n=1}^N, ⟨z_m⟩_{m=1}^M, A)
1: D^dist ← ⟨(x_n, +1)⟩_{n=1}^N ∪ ⟨(z_m, −1)⟩_{m=1}^M        // assemble data for distinguishing between old and new distributions
2: p̂ ← train logistic regression on D^dist
3: D^weighted ← ⟨(x_n, y_n, 1/p̂(x_n) − 1)⟩_{n=1}^N          // assemble weighted classification data using selector
4: return A(D^weighted)                                       // train classifier

This analysis gives rise to Algorithm 8.2, which consists of essentially two steps. The first is to train a logistic regression classifier^5 to distinguish between old and new distributions. The second is to use that classifier to produce weights on the labeled examples from the old distribution and then train whatever learning algorithm you wish on that.

5 The use of logistic regression is arbitrary: it need only be a classification algorithm that can produce probabilities.
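Below is a minimal Python sketch of the procedure above using scikit-learn; the particular estimators (LogisticRegression as the selector, SVC as a stand-in for the black box A) and the sample_weight interface are illustrative choices, not part of the algorithm itself.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

def selection_adaptation(X_old, y_old, Z_new, learner=None):
    """Weight old-distribution examples by 1/p(s=1|x) - 1, then train on them.

    X_old, y_old: labeled examples from the old distribution.
    Z_new: unlabeled inputs from the new distribution.
    learner: any classifier whose fit() accepts sample_weight.
    """
    # Step 1: train a selector to distinguish old (s=1) from new (s=0) inputs.
    X_dist = np.vstack([X_old, Z_new])
    s = np.concatenate([np.ones(len(X_old)), np.zeros(len(Z_new))])
    selector = LogisticRegression(max_iter=1000).fit(X_dist, s)

    # Step 2: convert selection probabilities into importance weights.
    p_s1 = selector.predict_proba(X_old)[:, 1]        # estimated p(s = 1 | x_n)
    weights = 1.0 / np.clip(p_s1, 1e-6, None) - 1.0   # always non-negative

    # Step 3: train the black-box learner A on the weighted old-distribution data.
    clf = learner if learner is not None else SVC()
    clf.fit(X_old, y_old, sample_weight=weights)
    return clf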
In terms of the questions posed at the beginning of this chapter, this approach to adaptation measures nearness of the two distributions by the degree to which the selection probability is constant. In particular, if the selection probability is independent of x, then the two distributions are identical. If the selection probabilities vary significantly as x changes, then the two distributions are considered very different. More generally, if it is easy to train a classifier to distinguish between the old and new distributions, then they are very different.

In the case of speech recognition failing as a function of gender, a core issue is that speech from men was massively over-represented in the training data but not the test data. When the selection logistic regression is trained, it is likely to say “old” on speech from men and “new” on other speakers, thereby downweighting the significance of male speakers and upweighting the significance of speakers of other genders on the final learned model. This would (hopefully) address many of the issues confounding that system. (Question to ponder: make up percentages for the fraction of speakers who are male in the old and new distributions; estimate (you'll have to make some assumptions) what the importance weights would look like in this case.)

8.3 Supervised Adaptation

Unsupervised adaptation is very challenging because we never get to see true labels in the new distribution. In many ways, unsupervised adaptation attempts to guard against bad things happening. That is, if an old distribution training example looks very unlike the new distribution, it (and consequently its features) is downweighted so much as to be ignored. In supervised adaptation, we can hope for more: we can hope to actually do better on the new distribution than the old because we have labeled data there.

The typical setup is similar to the unsupervised case. There are two distributions, D^old and D^new, and our goal is to learn a classifier that does well in expectation on D^new. However, now we have labeled data from both distributions: N labeled examples from D^old and M labeled examples from D^new. Call them ⟨x_n^(old), y_n^(old)⟩_{n=1}^N from D^old and ⟨x_m^(new), y_m^(new)⟩_{m=1}^M from D^new. Again, suppose that both x_n^(old) and x_m^(new) live in R^D.

Algorithm 24 EasyAdapt(⟨(x_n^(old), y_n^(old))⟩_{n=1}^N, ⟨(x_m^(new), y_m^(new))⟩_{m=1}^M, A)
1: D ← ⟨(⟨x_n^(old), x_n^(old), 0⟩, y_n^(old))⟩_{n=1}^N ∪ ⟨(⟨x_m^(new), 0, x_m^(new)⟩, y_m^(new))⟩_{m=1}^M        // union of transformed data
2: return A(D)        // train classifier

One way of answering the question of “how do we share information between the two distributions” is to say: when the distributions agree on the value of a feature, let them share it, but when they disagree, allow them to learn separately. For instance, in a sentiment analysis task, D^old might be reviews of electronics and D^new might be reviews of hotel rooms. In both cases, if the review contains the word “awesome” then it's probably a positive review, regardless of which distribution we're in. We would want to share this information across distributions. On the other hand, “small” might be positive in electronics and negative in hotels, and we would like the learning algorithm to be able to learn separate information for that feature.

A very straightforward way to accomplish this is the feature augmentation approach.^6 This is a simple preprocessing step after which one can apply any learning algorithm. The idea is to create three versions of every feature: one that's shared (for words like “awesome”), one that's old-distribution-specific and one that's new-distribution-specific. The mapping is:

\begin{align}
x_n^{(old)} &\mapsto \Big\langle\, \underbrace{x_n^{(old)}}_{\text{shared}},\; \underbrace{x_n^{(old)}}_{\text{old-only}},\; \underbrace{0, 0, \ldots, 0}_{\text{new-only ($D$-many)}} \,\Big\rangle \tag{8.14}\\
x_m^{(new)} &\mapsto \Big\langle\, \underbrace{x_m^{(new)}}_{\text{shared}},\; \underbrace{0, 0, \ldots, 0}_{\text{old-only ($D$-many)}},\; \underbrace{x_m^{(new)}}_{\text{new-only}} \,\Big\rangle \tag{8.15}
\end{align}

6 Daumé III 2007

Once you’ve applied this transformation, you can take the union
of the (transformed) old and new labeled examples and feed the en-
tire set into your favorite classification algorithm. That classification
algorithm can then choose to share strength between the two distri-
butions by using the “shared” features, if possible; or, if not, it can
learn distribution-specific properties on the old-only or new-only
parts. This is summarized in Algorithm 8.3.
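A minimal numpy sketch of the feature augmentation in Eqs (8.14) and (8.15) follows; the function and variable names are mine, and the returned arrays can be fed to any standard classifier.

import numpy as np

def easy_adapt(X_old, y_old, X_new, y_new):
    """Map x to <shared, old-only, new-only> copies and take the union."""
    D = X_old.shape[1]
    zeros_old = np.zeros((len(X_old), D))
    zeros_new = np.zeros((len(X_new), D))

    # Old examples become <x, x, 0>; new examples become <x, 0, x>.
    X_old_aug = np.hstack([X_old, X_old, zeros_old])
    X_new_aug = np.hstack([X_new, zeros_new, X_new])

    X = np.vstack([X_old_aug, X_new_aug])
    y = np.concatenate([y_old, y_new])
    return X, y   # feed these into your favorite learning algorithm A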

Note that this algorithm can be combined with the instance weighting (unsupervised) learning algorithm. In this case, the logistic regression separator should be trained on the untransformed data, and then weights can be used exactly as in Algorithm 8.2. This is particularly useful if you have access to way more old distribution data than new distribution data, and you don't want the old distribution data to “wash out” the new distribution data. (Question to ponder: why is it crucial that the separator be trained on the untransformed data?)
Although this approach is general, it is most effective when the
two distributions are “not too close but not too far”:

• If the distributions are too far, and there's little information to share, you're probably better off throwing out the old distribution data and training just on the (untransformed) new distribution data.

• If the distributions are too close, then you might as well just take the union of the (untransformed) old and new distribution data, and train on that.

In general, the interplay between how far the distributions are and
how much new distribution data you have is complex, and you
should always try “old only” and “new only” and “simple union”
as baselines.
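The following sketch shows one way to run those baselines; LogisticRegression is just an illustrative stand-in for whatever learning algorithm you are actually using, and a held-out test set from the new distribution is assumed to exist.

import numpy as np
from sklearn.linear_model import LogisticRegression

def adaptation_baselines(X_old, y_old, X_new_tr, y_new_tr, X_new_te, y_new_te):
    """Compare 'old only', 'new only', and 'simple union' on new-distribution test data."""
    runs = {
        "old only": (X_old, y_old),
        "new only": (X_new_tr, y_new_tr),
        "simple union": (np.vstack([X_old, X_new_tr]),
                         np.concatenate([y_old, y_new_tr])),
    }
    for name, (X, y) in runs.items():
        clf = LogisticRegression(max_iter=1000).fit(X, y)
        print(name, clf.score(X_new_te, y_new_te))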

8.4 Fairness and Data Bias

Almost any data set in existence is biased in some way, in the sense
that it captures an imperfect view of the world. The degree to which
this bias is obvious can vary greatly:

• There might be obvious bias in the labeling. For instance, in criminology, learning to predict sentence lengths by predicting the sentences assigned by judges will simply learn to reproduce whatever bias already exists in judicial sentencing.

• There might be sample selection bias, as discussed earlier. In the same criminology example, the only people for which we have training data are those that have already been arrested, charged with and convicted of a crime; these processes are inherently biased, and so any model learned on this data may exhibit similar biases.

• The task itself might be biased because the designers had blindspots. An intelligent billboard that predicts the gender of the person walking toward it so as to show “better” advertisements may be trained as a binary classifier between male/female and thereby excludes anyone who does not fall in the gender binary. A similar example holds for a classifier that predicts political affiliation in the US as Democrat/Republican, when in fact there are far more possibilities.

• There might be bias in the features or model structure. Machine translation systems often exhibit incorrect gender stereotypes^7 because the relevant context, which would tell them the correct gender, is not encoded in the feature representation. Alone, or coupled with biased data, this leads to errors.

7 For instance, many languages have verbal marking that agrees with the gender of the subject. In such cases, “the doctor treats . . . ” puts masculine markers on the translation of “treats” but “the nurse treats . . . ” uses feminine markers.

• The loss function may favor certain types of errors over others. You've already seen such an example: using zero/one loss on a highly imbalanced class problem will often lead to a learned model that ignores the minority class entirely.

• A deployed system creates feedback loops when it begins to consume its own output as new input. For instance, once a spam filter is in place, spammers will adjust their strategies, leading to distribution shift in the inputs. Or a car guidance system for predicting which roads are likely to be unoccupied may find those roads now occupied by other cars that it directed there.

These are all difficult questions, none of which has easy answers. Nonetheless, it's important to be aware of how our systems may fail, especially as they take over more and more of our lives. Most computing, engineering and mathematical societies (like the ACM, IEEE, BCS, etc.) have codes of ethics, all of which include a statement about avoiding harm; the following is taken from the ACM code:^8

    To minimize the possibility of indirectly harming others, computing professionals must minimize malfunctions by following generally accepted standards for system design and testing. Furthermore, it is often necessary to assess the social consequences of systems to project the likelihood of any serious harm to others.

8 ACM Code of Ethics

In addition to ethical questions, there are often related legal questions. For example, US law prohibits discrimination by race and gender (among other “protected attributes”) in processes like hiring and housing. The current legal mechanism for measuring discrimination is disparate impact and the 80% rule. Informally, the 80% rule says that your rate of hiring women (for instance) must be at least 80% of your rate of hiring men. Formally, the rule states:

\[
\Pr(y = +1 \mid G \neq \text{male}) \;\geq\; 0.8 \times \Pr(y = +1 \mid G = \text{male}) \tag{8.16}
\]

Of course, gender/male can be replaced with any other protected attribute.
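As a sketch, checking the 80% rule on a model's predictions might look like the following; the group encoding and the example numbers are made up for illustration.

import numpy as np

def passes_80_percent_rule(y_pred, group, protected, reference):
    """Check Eq (8.16): the positive-prediction rate for the protected group
    must be at least 80% of the rate for the reference group."""
    rate_protected = np.mean(y_pred[group == protected] == 1)
    rate_reference = np.mean(y_pred[group == reference] == 1)
    return rate_protected >= 0.8 * rate_reference

# Illustrative usage with made-up predictions and gender labels.
y_pred = np.array([1, 0, 1, 1, 0, 1, 0, 0])
gender = np.array(["f", "f", "f", "f", "m", "m", "m", "m"])
print(passes_80_percent_rule(y_pred, gender, "f", "m"))   # True: 0.75 >= 0.8 * 0.25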

One non-solution to the disparate impact problem is to simply throw out protected attributes from your dataset. Importantly, yet unfortunately, this is not sufficient. Many other features often correlate strongly with gender, race, and other demographic information, and just because you've removed explicit encodings of these factors does not imply that a model trained as such would satisfy the 80% rule. A natural question to ask is: can we build machine learning algorithms that have high accuracy but simultaneously avoid disparate impact?
Disparate impact is an imperfect measure of (un)fairness, and there are alternatives, each with its own pros and cons.^9 All of these rely on predefined categories of protected attributes, and a natural question is where these come from if not governmental regulation. Regardless of the measure you choose, the most important thing to keep in mind is that just because something comes from data, or is algorithmically implemented, does not mean it's fair.

9 Friedler et al. 2016; Hardt et al. 2016

8.5 How Badly can it Go?

To help better understand how badly things can go when the distri-
bution over inputs changes, it’s worth thinking about how to analyze
this situation formally. Suppose we have two distributions D old and
D new , and let’s assume that the only thing that’s different about these
is the distribution they put on the inputs X and not the outputs Y .
(We will return later to the usefulness of this assumption.) That is:
D old ( x, y) = D old ( x)D(y | x) and D new ( x, y) = D new ( x)D(y | x),
where D(y | x) is shared between them.
Let’s say that you’ve learned a classifier f that does really well on
D old and achieves some test error of e(old) . That is:

\[
e^{(old)} = \mathbb{E}_{\substack{x \sim D^{old}\\ y \sim D(\cdot \mid x)}}\big[\mathbf{1}[f(x) \neq y]\big] \tag{8.17}
\]

The question is: how badly can f do on the new distribution? We can calculate this directly.

\begin{align}
e^{(new)} &= \mathbb{E}_{\substack{x \sim D^{new}\\ y \sim D(\cdot \mid x)}}\big[\mathbf{1}[f(x) \neq y]\big] \tag{8.18}\\
&= \int_X \int_Y D^{new}(x)\, D(y \mid x)\, \mathbf{1}[f(x) \neq y]\, \mathrm{d}y\, \mathrm{d}x && \text{def. of } \mathbb{E} \tag{8.19}\\
&= \int_X \Big( D^{new}(x) - D^{old}(x) + D^{old}(x) \Big) \int_Y D(y \mid x)\, \mathbf{1}[f(x) \neq y]\, \mathrm{d}y\, \mathrm{d}x && \text{add zero} \tag{8.20}\\
&= e^{(old)} + \int_X \Big( D^{new}(x) - D^{old}(x) \Big) \int_Y D(y \mid x)\, \mathbf{1}[f(x) \neq y]\, \mathrm{d}y\, \mathrm{d}x && \text{def. } e^{(old)} \tag{8.21}\\
&\leq e^{(old)} + \int_X \Big| D^{new}(x) - D^{old}(x) \Big|\, \mathrm{d}x && \text{worst case} \tag{8.22}\\
&= e^{(old)} + 2\, \big| D^{new} - D^{old} \big|_{\text{Var}} && \text{def. } |\cdot|_{\text{Var}} \tag{8.23}
\end{align}

Here, |·|_Var is the total variation distance (or variational distance) between two probability distributions, defined as:

\[
|P - Q|_{\text{Var}} = \sup_e \big| P(e) - Q(e) \big| = \frac{1}{2} \int_X \big| P(x) - Q(x) \big|\, \mathrm{d}x \tag{8.24}
\]

This is a standard measure of dissimilarity between probability distributions. In particular, the variational distance is the largest difference in probability that P and Q assign to the same event (in this case, the event is an example x).
The bound tells us that we might not be able to hope for very
good error on the new distribution, even if we have very good error
on the old distribution, when D old and D new are very different (as-
sign very different probabilities to the same event). Of course, this is
an upper bound, and possibly a very loose one.
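For discrete distributions the variational distance in Eq (8.24) is straightforward to compute; the sketch below uses made-up distributions and a hypothetical old-distribution error, then plugs them into the bound of Eq (8.23).

import numpy as np

def tv_distance(p, q):
    """Total variation distance between two discrete distributions."""
    return 0.5 * np.sum(np.abs(p - q))

p_old = np.array([0.6, 0.3, 0.1])
p_new = np.array([0.1, 0.3, 0.6])

d = tv_distance(p_old, p_new)    # 0.5
e_old = 0.05                     # hypothetical error on the old distribution
print(e_old + 2 * d)             # the bound on e_new is 1.05

Here the bound exceeds 1, so it tells us nothing: with distributions this far apart, good old-distribution error gives no guarantee at all on the new distribution.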
The second observation is that we barely used the assumption that
D new and D old share the same label distribution D(y | x) in the above
analysis. (In fact, as an exercise, repeat the analysis without this as-
sumption. It will still go through.) In general, this assumption buys
us very little. In an extreme case, D new and D old can essentially “en-
code” which distribution a given x came from in one of the features.
Once the origin distribution is completely encoded in a feature, then
D(y | x) could look at the encoding, and completely flip the label
based on which distribution it’s from. How could D new and D old
encode the distribution? One way would be to set the 29th decimal
digit of feature 1 to an odd value in the old distribution and an even
value in the new distribution. This tiny change will be essentially im-
perceptible if one looks at the data, but would give D(y | x) enough
power to make our lives miserable.
If we want a way out of this dilemma, we need more technology. The core idea is that if we're learning a function f from some hypothesis class F, and this hypothesis class isn't rich enough to peek at the 29th decimal digit of feature 1, then perhaps things are not as bad as they could be. This motivates the idea of looking at a measure of distance between probability distributions that depends on the hypothesis class. A popular measure is the dA-distance or the discrepancy. The discrepancy measures distances between probability distributions based on how much two functions f and f′ in the hypothesis class can disagree on their labels. Let:

\[
e_P(f, f') = \mathbb{E}_{x \sim P}\big[\mathbf{1}[f(x) \neq f'(x)]\big] \tag{8.25}
\]

You can think of e_P(f, f′) as the error of f′ when the ground truth is given by f, where the error is taken with respect to examples drawn from P. Given a hypothesis class F, the discrepancy between P and Q is defined as:

\[
d_{\mathcal{A}}(P, Q) = \max_{f, f' \in \mathcal{F}} \Big| e_P(f, f') - e_Q(f, f') \Big| \tag{8.26}
\]

The discrepancy very much has the flavor of a classifier: if you think of f as providing the ground truth labeling, then the discrepancy is large whenever there exists a function f′ that agrees strongly with f on distribution P but disagrees strongly with f on Q. This feels natural: if all functions behave similarly on P and Q, then the discrepancy will be small, and we also expect to be able to generalize across these two distributions easily.
One very attractive property of the discrepancy is that you can
estimate it from finite unlabeled samples from D old and D new . Al-
though not obvious at first, the discrepancy is very closely related
to a quantity we saw earlier in unsupervised adaptation: a classifier
that distinguishes between D^old and D^new. In fact, the discrepancy is precisely 2(acc − 0.5), where acc is the accuracy of the best classifier from F at separating D^old from D^new.
How does this work in practice? Exactly as in the section on un-
supervised adaptation, we train a classifier to distinguish between
D old and D new . It needn’t be a probabilistic classifier; any binary
classifier is fine. This classifier, the “domain separator,” will have
some (heldout) accuracy, call it acc. The discrepancy is then exactly
dA = 2(acc − 0.5).
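In code, this estimate might look as follows; the choice of LogisticRegression as the domain separator and the particular held-out split are illustrative assumptions.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def estimate_discrepancy(X_old, X_new):
    """Estimate d_A = 2 * (acc - 0.5) via a domain-separator classifier."""
    X = np.vstack([X_old, X_new])
    d = np.concatenate([np.ones(len(X_old)), np.zeros(len(X_new))])

    X_tr, X_te, d_tr, d_te = train_test_split(X, d, test_size=0.3, random_state=0)
    separator = LogisticRegression(max_iter=1000).fit(X_tr, d_tr)
    acc = separator.score(X_te, d_te)   # held-out accuracy of the separator
    return 2.0 * (acc - 0.5)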
Intuitively the accuracy of the domain separator is a natural mea-
sure of how different the two distributions are. If the two distribu-
tions are identical, you shouldn’t expect to get very good accuracy at
separating them. In particular, you expect the accuracy to be around
0.5, which puts the discrepancy at zero. On the other hand, if the two
distributions are really far apart, separation is easy, and you expect
an accuracy of about 1, yielding a discrepancy also of about 1.
One can, in fact, prove a generalization bound—generalizing from finite samples from D^old to expected loss on D^new—based on the discrepancy.^10

10 Ben-David et al. 2007

Theorem 9 (Unsupervised Adaptation Bound). Given a fixed representation and a fixed hypothesis space F, let f ∈ F and let

\[
e^{(best)} = \min_{f^* \in \mathcal{F}} \frac{1}{2}\Big[ e^{(old)}(f^*) + e^{(new)}(f^*) \Big];
\]

then, for all f ∈ F:

\[
\underbrace{e^{(new)}(f)}_{\text{error on } D^{new}}
\;\leq\;
\underbrace{e^{(old)}(f)}_{\text{error on } D^{old}}
\;+\;
\underbrace{e^{(best)}}_{\text{minimal avg error}}
\;+\;
\underbrace{d_{\mathcal{A}}(D^{old}, D^{new})}_{\text{distance}}
\tag{8.27}
\]

In this bound, e(best) denotes the error rate of the best possible
classifier from F , where the error rate is measured as the average
error rate on D new and D old ; this term ensures that at least some
good classifier exists that does well on both distributions.
The main practical use of this result is that it suggests a way to
look for representations that are good for adaptation: on the one
hand, we should try to get our training error (on D old ) as low as
possible; on the other hand, we want to make sure that it is hard to
distinguish between D old and D new in this representation.

8.6 Further Reading

TODO further reading
