Bias and Fairness: Train/Test Mismatch
D^(male) will generally not be optimal under the distribution for any other gender.
Another example occurs in sentiment analysis. It is common to
train sentiment analysis systems on data collected from reviews:
product reviews, restaurant reviews, movie reviews, etc. This data is
convenient because it includes both text (the review itself) and also
a rating (typically one to five stars). This yields a “free” dataset for
training a model to predict rating given text. However, one should
be wary when running such a sentiment classifier on text other than
the types of reviews it was trained on. One can easily imagine that
a sentiment analyzer trained on movie reviews may not work so
well when trying to detect sentiment in a politician’s speech. Even
moving between one type of product and another can fail wildly:
“very small” typically expresses positive sentiment for USB drives,
but not for hotel rooms.
The issue of train/test mismatch has been widely studied under
many different names: covariate shift (in statistics, the input features
are called “covariates”), sample selection bias and domain adapta-
tion are the most common. We will refer to this problem simply as
adaptation. The adaptation challenge is: given training data from one
“old” distribution, learn a classifier that does a good job on another
related, but different, “new” distribution.
It’s important to recognize that in general, adaptation is im-
possible. For example, even if the task remains the same (posi-
tive/negative sentiment), if the old distribution is text reviews and
the new distribution is images of text reviews, it’s hard to imagine
that doing well on the old distribution says anything about the new.
As a less extreme example, if the old distribution is movie reviews in
English and the new distribution is movie reviews in Mandarin, it’s
unlikely that adaptation will be easy.
These examples give rise to the tension in adaptation problems.
1. What does it mean for two distributions to be “related, but different”?
2. When two distributions are related, how can we build models that effectively share information between them?
$$
\begin{aligned}
\mathbb{E}_{(x,y)\sim\mathcal{D}^{\text{new}}}\big[\ell(y, f(x))\big]
&= \sum_{(x,y)} \mathcal{D}^{\text{new}}(x,y)\,\frac{\mathcal{D}^{\text{old}}(x,y)}{\mathcal{D}^{\text{old}}(x,y)}\,\ell(y, f(x))
  && \text{times one} \quad (8.4)\\
&= \sum_{(x,y)} \mathcal{D}^{\text{old}}(x,y)\,\frac{\mathcal{D}^{\text{new}}(x,y)}{\mathcal{D}^{\text{old}}(x,y)}\,\ell(y, f(x))
  && \text{rearrange} \quad (8.5)\\
&= \mathbb{E}_{(x,y)\sim\mathcal{D}^{\text{old}}}\!\left[\frac{\mathcal{D}^{\text{new}}(x,y)}{\mathcal{D}^{\text{old}}(x,y)}\,\ell(y, f(x))\right]
  && \text{definition} \quad (8.6)
\end{aligned}
$$
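As a concrete illustration of (8.6), here is a minimal sketch of an importance-weighted loss estimate, assuming the density ratios D^new/D^old have already been estimated by some means (for instance, from a classifier that separates old inputs from new inputs). The function and variable names are illustrative, not from the text.

```python
import numpy as np

def importance_weighted_loss(losses_old, density_ratio_old):
    """Estimate E_{(x,y)~D^new}[loss] using only old-distribution samples.

    losses_old[n]        : loss(y_n, f(x_n)) on the n-th old-distribution example
    density_ratio_old[n] : estimated D^new(x_n, y_n) / D^old(x_n, y_n) for that example
    """
    # Equation (8.6): average the losses, weighted by the density ratio.
    return np.mean(density_ratio_old * losses_old)

# Toy usage: pretend we already have per-example losses and precomputed ratios.
rng = np.random.default_rng(0)
losses = rng.uniform(0.0, 1.0, size=100)   # loss(y_n, f(x_n))
ratios = rng.uniform(0.5, 2.0, size=100)   # estimated D^new / D^old at each example
print(importance_weighted_loss(losses, ratios))
```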
two distributions, D^old and D^new, and our goal is to learn a classifier that does well in expectation on D^new. However, now we have labeled data from both distributions: N labeled examples from D^old and M labeled examples from D^new. Call them $\langle x_n^{(\text{old})}, y_n^{(\text{old})} \rangle_{n=1}^{N}$ from D^old and $\langle x_m^{(\text{new})}, y_m^{(\text{new})} \rangle_{m=1}^{M}$ from D^new. Again, suppose that both $x_n^{(\text{old})}$ and $x_m^{(\text{new})}$ live in $\mathbb{R}^D$.
One way of answering the question of “how do we share information between the two distributions” is to say: when the distributions
agree on the value of a feature, let them share it, but when they dis-
agree, allow them to learn separately. For instance, in a sentiment
analysis task, D old might be reviews of electronics and D new might
be reviews of hotel rooms. In both cases, if the review contains the
word “awesome” then it’s probably a positive review, regardless of
which distribution we’re in. We would want to share this information
across distributions. On the other hand, “small” might be positive
in electronics and negative in hotels, and we would like the learning
algorithm to be able to learn separate information for that feature.
A very straightforward way to accomplish this is the feature augmentation approach (Daumé III, 2007). This is a simple preprocessing step after which one can apply any learning algorithm. The idea is to create three versions of every feature: one that’s shared (for words like “awesome”), one that’s old-distribution-specific and one that’s new-distribution-specific. The mapping is:

$$
x^{(\text{old})} \mapsto \big\langle\, x^{(\text{old})},\; x^{(\text{old})},\; \mathbf{0} \,\big\rangle
\qquad\qquad
x^{(\text{new})} \mapsto \big\langle\, x^{(\text{new})},\; \mathbf{0},\; x^{(\text{new})} \,\big\rangle
$$

where the first block holds the shared copy, the second the old-distribution-specific copy, and the third the new-distribution-specific copy.
Once you’ve applied this transformation, you can take the union
of the (transformed) old and new labeled examples and feed the en-
tire set into your favorite classification algorithm. That classification
algorithm can then choose to share strength between the two distri-
butions by using the “shared” features, if possible; or, if not, it can
learn distribution-specific properties on the old-only or new-only
parts. This is summarized in Algorithm 8.3.
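Here is a minimal sketch of that preprocessing step in NumPy (the block ordering ⟨shared, old-specific, new-specific⟩ is an assumed convention; any consistent ordering works, and the toy data below is purely illustrative):

```python
import numpy as np

def augment_old(X):
    """Map old-distribution features x -> <x, x, 0>: shared copy + old-specific copy."""
    return np.hstack([X, X, np.zeros_like(X)])

def augment_new(X):
    """Map new-distribution features x -> <x, 0, x>: shared copy + new-specific copy."""
    return np.hstack([X, np.zeros_like(X), X])

# Build one training set from both distributions, then use any classifier.
X_old, y_old = np.random.randn(50, 4), np.random.randint(0, 2, 50)   # stand-in data
X_new, y_new = np.random.randn(10, 4), np.random.randint(0, 2, 10)
X_train = np.vstack([augment_old(X_old), augment_new(X_new)])
y_train = np.concatenate([y_old, y_new])
# X_train now has 3*D columns; feed (X_train, y_train) to your favorite classifier.
```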
Note that this algorithm can be combined with the instance weighting approach described earlier.
• If the distributions are too close, then you might as well just take the union of the (untransformed) old and new distribution data, and train on that.
In general, the interplay between how far the distributions are and
how much new distribution data you have is complex, and you
should always try “old only” and “new only” and “simple union”
as baselines.
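A minimal sketch of those three baselines, assuming scikit-learn and an arbitrary linear classifier (both choices are illustrative; any learner works):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

def evaluate_baselines(X_old, y_old, X_new_train, y_new_train, X_new_test, y_new_test):
    """Compare 'old only', 'new only', and 'simple union' on heldout new-distribution data."""
    settings = {
        "old only":     (X_old, y_old),
        "new only":     (X_new_train, y_new_train),
        "simple union": (np.vstack([X_old, X_new_train]),
                         np.concatenate([y_old, y_new_train])),
    }
    for name, (X, y) in settings.items():
        clf = LogisticRegression(max_iter=1000).fit(X, y)
        acc = accuracy_score(y_new_test, clf.predict(X_new_test))
        print(f"{name:12s} accuracy on heldout new-distribution data: {acc:.3f}")
```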
Almost any data set in existence is biased in some way, in the sense
that it captures an imperfect view of the world. The degree to which
this bias is obvious can vary greatly:
• The task itself might be biased because the designers had blind spots. An intelligent billboard that predicts the gender of the person walking toward it so as to show “better” advertisements may be trained as a binary classifier between male/female, thereby excluding anyone who does not fall in the gender binary. A similar
example holds for a classifier that predicts political affiliation in
These are all difficult questions, none of which has easy answers.
Nonetheless, it’s important to be aware of how our systems may
fail, especially as they take over more and more of our lives. Most
computing, engineering and mathematical societies (like the ACM,
IEEE, BCS, etc.) have codes of ethics, all of which include a statement
about avoiding harm; the following is taken from the ACM Code of Ethics:
To help better understand how badly things can go when the distri-
bution over inputs changes, it’s worth thinking about how to analyze
this situation formally. Suppose we have two distributions D old and
D new , and let’s assume that the only thing that’s different about these
is the distribution they put on the inputs X and not the outputs Y .
(We will return later to the usefulness of this assumption.) That is:
D old ( x, y) = D old ( x)D(y | x) and D new ( x, y) = D new ( x)D(y | x),
where D(y | x) is shared between them.
Let’s say that you’ve learned a classifier f that does really well on D^old and achieves some test error of ε^(old). That is:

$$
\varepsilon^{(\text{old})} \;=\; \mathbb{E}_{\substack{x \sim \mathcal{D}^{\text{old}} \\ y \sim \mathcal{D}(\cdot \mid x)}} \Big[ \mathbb{1}[f(x) \neq y] \Big] \qquad (8.17)
$$

How badly can f do on the new distribution? Writing out the corresponding error and manipulating it:

$$
\begin{aligned}
\varepsilon^{(\text{new})}
&= \mathbb{E}_{\substack{x \sim \mathcal{D}^{\text{new}} \\ y \sim \mathcal{D}(\cdot \mid x)}} \Big[ \mathbb{1}[f(x) \neq y] \Big] && (8.18)\\
&= \int_{\mathcal{X}} \int_{\mathcal{Y}} \mathcal{D}^{\text{new}}(x)\, \mathcal{D}(y \mid x)\, \mathbb{1}[f(x) \neq y] \;\mathrm{d}y\, \mathrm{d}x
  && \text{def. of } \mathbb{E} \quad (8.19)\\
&= \int_{\mathcal{X}} \Big( \mathcal{D}^{\text{new}}(x) - \mathcal{D}^{\text{old}}(x) + \mathcal{D}^{\text{old}}(x) \Big) \int_{\mathcal{Y}} \mathcal{D}(y \mid x)\, \mathbb{1}[f(x) \neq y] \;\mathrm{d}y\, \mathrm{d}x
  && \text{add zero} \quad (8.20)\\
&= \varepsilon^{(\text{old})} + \int_{\mathcal{X}} \Big( \mathcal{D}^{\text{new}}(x) - \mathcal{D}^{\text{old}}(x) \Big) \int_{\mathcal{Y}} \mathcal{D}(y \mid x)\, \mathbb{1}[f(x) \neq y] \;\mathrm{d}y\, \mathrm{d}x
  && \text{def. } \varepsilon^{(\text{old})} \quad (8.21)
\end{aligned}
$$
$$
\begin{aligned}
&\leq \varepsilon^{(\text{old})} + \int_{\mathcal{X}} \Big|\, \mathcal{D}^{\text{new}}(x) - \mathcal{D}^{\text{old}}(x) \,\Big| \cdot 1 \;\mathrm{d}x
  && \text{worst case} \quad (8.22)\\
&= \varepsilon^{(\text{old})} + 2\, \Big|\, \mathcal{D}^{\text{new}} - \mathcal{D}^{\text{old}} \,\Big|_{\text{Var}}
  && \text{def. } |\cdot|_{\text{Var}} \quad (8.23)
\end{aligned}
$$

Here $|\cdot|_{\text{Var}}$ denotes the total variation distance between two distributions:

$$
|P - Q|_{\text{Var}} \;=\; \sup_{e} \big| P(e) - Q(e) \big| \;=\; \frac{1}{2} \int_{\mathcal{X}} \big| P(x) - Q(x) \big| \;\mathrm{d}x \qquad (8.24)
$$
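To see the bound (8.22)–(8.23) in action, here is a small numerical sketch (a randomly chosen discrete example, illustrative only) that computes ε^(old), ε^(new), and the total variation distance for a fixed classifier under a shared D(y | x), and checks that ε^(new) never exceeds ε^(old) + 2 |D^new − D^old|_Var:

```python
import numpy as np

rng = np.random.default_rng(0)

# A small discrete input space with two different input distributions
# and a shared conditional D(y = 1 | x): the covariate-shift assumption.
num_x = 10
d_old = rng.dirichlet(np.ones(num_x))        # D^old(x)
d_new = rng.dirichlet(np.ones(num_x))        # D^new(x)
p_y1  = rng.uniform(0.0, 1.0, size=num_x)    # D(y = 1 | x), shared
f     = rng.integers(0, 2, size=num_x)       # an arbitrary fixed classifier f(x)

# Per-x probability that f(x) != y under the shared conditional.
err_per_x = np.where(f == 1, 1.0 - p_y1, p_y1)

eps_old = np.dot(d_old, err_per_x)            # equation (8.17)
eps_new = np.dot(d_new, err_per_x)            # equation (8.18)
tv      = 0.5 * np.abs(d_new - d_old).sum()   # equation (8.24)

print(f"eps_new = {eps_new:.3f}")
print(f"bound   = {eps_old + 2 * tv:.3f}")    # equations (8.22)-(8.23)
assert eps_new <= eps_old + 2 * tv + 1e-12
```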
The discrepancy very much has the flavor of a classifier: if you think
of f as providing the ground truth labeling, then the discrepancy is
large whenever there exists a function f 0 that agrees strongly with f
on distribution P but disagrees strongly with f on Q. This feels natu-
ral: if all functions behave similarly on P and Q, then the discrepancy
will be small, and we also expect to be able to generalize across these
two distributions easily.
One very attractive property of the discrepancy is that you can
estimate it from finite unlabeled samples from D old and D new . Al-
though not obvious at first, the discrepancy is very closely related
to a quantity we saw earlier in unsupervised adaptation: a classifier
that distinguishes between D^old and D^new. In fact, the discrepancy is precisely twice the amount by which the best classifier from H exceeds chance accuracy at separating D^old from D^new.
How does this work in practice? Exactly as in the section on un-
supervised adaptation, we train a classifier to distinguish between
D old and D new . It needn’t be a probabilistic classifier; any binary
classifier is fine. This classifier, the “domain separator,” will have
some (heldout) accuracy, call it acc. The discrepancy is then exactly
dA = 2(acc − 0.5).
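Here is a minimal sketch of that estimate, assuming scikit-learn; logistic regression as the domain separator and the particular heldout split are arbitrary, illustrative choices:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def estimate_discrepancy(X_old, X_new, seed=0):
    """Estimate d_A = 2 * (acc - 0.5) with a heldout domain-separation classifier."""
    # Label each example by which distribution it came from (0 = old, 1 = new).
    X = np.vstack([X_old, X_new])
    d = np.concatenate([np.zeros(len(X_old)), np.ones(len(X_new))])
    X_tr, X_te, d_tr, d_te = train_test_split(
        X, d, test_size=0.3, random_state=seed, stratify=d)
    separator = LogisticRegression(max_iter=1000).fit(X_tr, d_tr)
    acc = separator.score(X_te, d_te)   # heldout accuracy of the domain separator
    return 2.0 * (acc - 0.5)

# Identical input distributions -> accuracy near 0.5 -> discrepancy near 0.
rng = np.random.default_rng(0)
same = estimate_discrepancy(rng.normal(0, 1, (500, 5)), rng.normal(0, 1, (500, 5)))
# Well-separated input distributions -> accuracy near 1 -> discrepancy near 1.
far = estimate_discrepancy(rng.normal(0, 1, (500, 5)), rng.normal(5, 1, (500, 5)))
print(f"similar inputs:   {same:.2f}")
print(f"different inputs: {far:.2f}")
```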
Intuitively, the accuracy of the domain separator is a natural mea-
sure of how different the two distributions are. If the two distribu-
tions are identical, you shouldn’t expect to get very good accuracy at
separating them. In particular, you expect the accuracy to be around
0.5, which puts the discrepancy at zero. On the other hand, if the two
distributions are really far apart, separation is easy, and you expect
an accuracy of about 1, yielding a discrepancy also of about 1.
One can, in fact, prove a generalization bound (generalizing from finite samples from D^old to expected loss on D^new) based on the discrepancy (Ben-David et al., 2007).
Roughly, the bound says that the expected error on D^new is at most the error on D^old, plus the discrepancy, plus a term involving ε^(best). Here, ε^(best) denotes the error rate of the best possible classifier from F, where the error rate is measured as the average error rate on D^new and D^old; this term ensures that at least some good classifier exists that does well on both distributions.
The main practical use of this result is that it suggests a way to
look for representations that are good for adaptation: on the one
hand, we should try to get our training error (on D old ) as low as
possible; on the other hand, we want to make sure that it is hard to
distinguish between D old and D new in this representation.
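One illustrative way to operationalize this suggestion, assuming scikit-learn and reusing the estimate_discrepancy function from the earlier sketch, is to score a candidate feature map by its cross-validated error on the old distribution plus its estimated discrepancy, and prefer representations where the sum is small:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def representation_score(phi, X_old, y_old, X_new_unlabeled):
    """Lower is better: old-distribution error plus estimated discrepancy under phi."""
    Z_old, Z_new = phi(X_old), phi(X_new_unlabeled)
    # (a) How well can we fit the old-distribution task in this representation?
    old_error = 1.0 - cross_val_score(
        LogisticRegression(max_iter=1000), Z_old, y_old, cv=5).mean()
    # (b) How easily can the two input distributions be told apart in it?
    discrepancy = estimate_discrepancy(Z_old, Z_new)   # from the earlier sketch
    return old_error + discrepancy
```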