The document summarizes a lecture on fairness, accountability, transparency, and ethics (FAT*) in AI. It opens with introductions for two lecturers:
- Adam Obeng is a research scientist studying experimental meta-analysis methods with a PhD in Sociology from Columbia University.
- Toby Farmer is a product manager at Facebook AI working on machine translation, with a law degree and a background in politics and tech entrepreneurship.
The remaining sections discuss fairness, accountability, transparency, and ethics (FAT*) in AI, giving an overview of why these issues are important and examples of problems that can arise.
2. Adam Obeng
Adam Obeng is a Research Scientist on the Adaptive Experimentation team (Core Data Science), where he works on research and development for experimental meta-analysis methods. Adam is a computational social scientist with a PhD in Sociology from Columbia University and master's degrees from Columbia and Oxford. His research interests include natural language processing, social network analysis, analytical philosophy, and the sociology of science and knowledge.
Lecturer Introduction
3. Toby Farmer
Toby Farmer is a product manager at Facebook AI working on NLP (machine translation). He holds a bachelor's degree in political science and a Juris Doctor. He has written and researched public policy in the United States Senate and the Arizona Legislature. He has been a tech entrepreneur for the past 12 years, and prior to joining Facebook in 2019 he spent a year at LinkedIn as a product lead.
Lecturer Introduction
5. FAT*
Fairness
Accountability
Transparency
And
more!
6. Expectations
Motivation for why these issues come up and matter
A couple of specific examples
Not an exhaustive listing of all the FAT* problems which have come up
Not a definitive solution to any of them
Guidelines for how to identify and address this type of problem
7. What are we even talking about?
Why should we care?
What are the problems?
What should we do about them?
Overview
13. View 0: We shouldn’t.
OK, but other people care
These considerations matter to “pure science”
Why care about FAT*?
14. Example: ethics as an organic banana sticker
Tony Webster, CC BY-SA https://www.flickr.com/photos/diversey/47811235621
Why care about FAT*?
View 1: We need to do FAT after we do science
15. Why care about FAT*?
Tony Webster, CC BY-SA https://www.flickr.com/photos/diversey/47811235621
View 2: FAT* concerns are inextricable from ML
Technology affords and constrains
Technology is political
Science and engineering construct abstractions
Knowledge and techne are social facts
17. An Unconscionably Brief Overview of FAT* Problems
https://www.forbes.com/sites/kashmirhill/2014/06/28/facebook-manipulated-689003-users-emotions-for-science/#1064079197c5
https://www.forbes.com/sites/bradtempleton/2020/02/13/ntsb-releases-report-on-2018-fatal-silicon-valley-tesla-autopilot-crash/#6258bae842a8
https://tech.fb.com/building-inclusive-ai-at-facebook/
https://www.washingtonpost.com/technology/2019/12/19/federal-study-confirms-racial-bias-many-facial-recognition-systems-casts-doubt-their-expanding-use/
18. Why is this domain so fraught?
Once you open up the abstraction, almost any issue can be at stake in fairness
Reasonable people disagree
There are inherent tradeoffs
19. Goodhart's Law
“Any observed statistical regularity will tend to collapse once pressure is placed upon it for control purposes.”
In other words, if you incentivize optimization, you'll get overfitting.
Goodhart, Charles AE. "Problems of monetary management: the UK experience." Monetary Theory and Practice. Palgrave, London, 1984. 91-121.
20. Goodhart's Law in ML
Overfitting to MNIST and distribution shift in CIFAR-10
Yadav, Chhavi, and Léon Bottou. "Cold case: The lost MNIST digits." Advances in Neural Information Processing Systems. 2019.
Recht, Benjamin, et al. "Do CIFAR-10 classifiers generalize to CIFAR-10?" arXiv preprint arXiv:1806.00451 (2018).
21. Goodhart's Law in ML
Incompatible and incommensurable fairness measures
24. A classifier is well-calibrated if, among the observations given a particular probability score, the proportion actually having the label equals that score:
$\Pr(Y = 1 \mid \hat{P} = p) = p$ for all $p$,
where $\hat{Y}$ is the predicted label and $\hat{P}$ is the predicted probability (or score) for class $Y$.
Example: if a binary classifier gives a score of 0.8 to 100 observations, then 80 of them should be in the positive class.
Calibration: Definition
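To make the definition concrete, here is a minimal Python sketch of a reliability check: bin predictions by score and compare each bin's mean score to its observed positive rate. The function name and binning choices are illustrative, not from the lecture.

```python
import numpy as np

def reliability_table(scores, labels, n_bins=10):
    """Bin predictions by score and compare each bin's mean score to
    the bin's observed positive rate; the two match when calibrated."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)[1:-1]  # interior bin edges
    which = np.digitize(scores, edges)               # bin index per score
    for b in range(n_bins):
        in_bin = which == b
        if in_bin.any():
            print(f"bin {b}: mean score {scores[in_bin].mean():.3f}, "
                  f"positive rate {labels[in_bin].mean():.3f}")

# e.g. observations scored ~0.8 should land in a bin whose observed
# positive rate is ~0.8 for a well-calibrated classifier
```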
28. Some models (e.g. logistic regression) tend to have well-calibrated predictions
Some DL models (e.g. ResNet) tend to be overconfident (https://arxiv.org/pdf/1706.04599.pdf)
Logistic calibration/Platt scaling
Calibrating a DL Model
29. Post-processing approach requiring an additional validation dataset
Platt scaling (binary classifier):
Learn parameters $a, b$ so that the calibrated probability is $\hat{q}_i = \sigma(a z_i + b)$ (where $z_i$ is the network's logit output)
Temperature scaling extends this to multi-class classification:
Learn a temperature $T$, and produce calibrated probabilities $\hat{q}_i = \max_k \sigma_{\mathrm{SoftMax}}(\mathbf{z}_i / T)^{(k)}$
Platt/Temperature Scaling
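As a rough sketch of temperature scaling (assuming held-out validation logits and integer labels; the grid search is a simplification of the usual NLL optimization in Guo et al.):

```python
import numpy as np

def softmax(logits, T=1.0):
    z = logits / T
    z = z - z.max(axis=1, keepdims=True)  # subtract max for stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def fit_temperature(val_logits, val_labels, grid=np.linspace(0.5, 5.0, 91)):
    """Choose T minimizing negative log-likelihood on held-out data.
    Dividing logits by T > 1 softens overconfident predictions while
    leaving the argmax (and hence accuracy) unchanged."""
    val_logits = np.asarray(val_logits, dtype=float)
    val_labels = np.asarray(val_labels)
    best_T, best_nll = 1.0, np.inf
    for T in grid:
        p = softmax(val_logits, T)
        nll = -np.mean(np.log(p[np.arange(len(val_labels)), val_labels] + 1e-12))
        if nll < best_nll:
            best_T, best_nll = T, nll
    return best_T

# calibrated confidence for a new example: q = max_k softmax(z / T)[k]
```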
31. It is impossible for a classifier to achieve both equal calibration and equal error rates between groups (if there is a difference in prevalence between the groups and the classifier is not perfect)
Kleinberg, Jon, Sendhil Mullainathan, and Manish Raghavan. "Inherent trade-offs in the fair determination of risk scores." arXiv preprint arXiv:1609.05807 (2016).
Chouldechova, Alexandra. "Fair prediction with disparate impact: A study of bias in recidivism prediction instruments." Big Data 5, no. 2 (2017): 153-163.
The Fairness Impossibility Theorems
32. Classifier confusion matrix:
                Ground truth +    Ground truth −
Prediction +    TP                FP
Prediction −    FN                TN
Derived quantities:
False Positive Rate (FPR): FP / (FP + TN)
False Negative Rate (FNR): FN / (FN + TP)
Positive Predictive Value (PPV): TP / (TP + FP)
Equal PPV across groups measures “test fairness” for a binary classifier
The Fairness Impossibility Theorems
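A small Python sketch of computing these quantities per group (the function name is illustrative; it assumes binary labels/predictions and enough examples per group for each denominator to be nonzero):

```python
import numpy as np

def group_rates(y_true, y_pred, group):
    """Per-group FPR, FNR, and PPV from the confusion matrix."""
    y_true, y_pred, group = map(np.asarray, (y_true, y_pred, group))
    rates = {}
    for g in np.unique(group):
        t, p = y_true[group == g], y_pred[group == g]
        tp = int(np.sum((p == 1) & (t == 1)))
        fp = int(np.sum((p == 1) & (t == 0)))
        fn = int(np.sum((p == 0) & (t == 1)))
        tn = int(np.sum((p == 0) & (t == 0)))
        rates[g] = {"FPR": fp / (fp + tn),
                    "FNR": fn / (fn + tp),
                    "PPV": tp / (tp + fp)}
    return rates
```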
33. Result:
$\mathrm{FPR} = \frac{p}{1 - p} \cdot \frac{1 - \mathrm{PPV}}{\mathrm{PPV}} \cdot (1 - \mathrm{FNR})$
(where p is the prevalence of the label in a given group)
The Fairness Impossibility Theorems
34. More generally, we can state many fairness theorems based on any three quantities derived from the confusion matrix (slide 32)
The Fairness Impossibility Theorems
https://en.wikipedia.org/wiki/Confusion_matrix
Narayanan, Arvind. "21 Fairness Definitions and Their Politics." Tutorial at FAT* 2018.
35. An impossibility theorem obtains for any three (or more) measures of model performance derived (non-degenerately) from the confusion matrix.
In all cases, each measure can be written in terms of the others and the prevalence p.
In a system of three or more such equations, p is determined uniquely: if groups have different prevalences, these quantities cannot all be equal (see the sketch below).
The Fairness Impossibility Theorems
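As an illustration (not from the slides), the relation on slide 33 can be inverted to show that fixing FPR, FNR, and PPV pins down the prevalence:

```python
def implied_prevalence(fpr, fnr, ppv):
    """Invert FPR = (p/(1-p)) * ((1-PPV)/PPV) * (1-FNR) for p.
    k below is the odds p/(1-p), so p = k/(1+k)."""
    k = (fpr * ppv) / ((1.0 - ppv) * (1.0 - fnr))
    return k / (1.0 + k)

# Two groups forced to share FPR, FNR, and PPV must also share this
# implied prevalence, which is impossible when base rates differ.
print(implied_prevalence(fpr=0.2, fnr=0.3, ppv=0.6))  # ~0.30
```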
40. Central insights:
There are different ethical perspectives
Process matters: do the moral math
The Markkula Center Framework for Ethical Decision-Making
https://www.scu.edu/ethics/ethics-resources/ethical-decision-making/a-framework-for-ethical-decision-making/
41. 1. Recognize an Ethical Issue
Could it harm people?
Is it about ethics (contra law, efficiency, aesthetics, etc.)?
2. Get the facts
Is there enough information?
Stakeholders
Options
3. Evaluate options following different approaches
Deontology, Consequentialism, Virtue Ethics, Confucian, Buddhist, Hindu ethics
4. Make a decision and test it
The Moral Math: explain how the decision is derived from the facts and evaluations
5. Act and reflect on the outcome
Did it work?
What did we learn?
The Markkula Center Framework for Ethical Decision-Making
https://www.scu.edu/ethics/ethics-resources/ethical-decision-making/a-framework-for-ethical-decision-making/
42. Approaches are academic philosophical schools, but more broadly different perspectives which focus on different aspects of the problem:
Deontology: e.g. Kant's Categorical Imperative
Consequentialism: e.g. Utilitarianism
Virtue Ethics: character and habits
Non-Western frameworks: e.g. Confucian, Buddhist, Hindu
The Markkula Center Approaches
43. The Most General Advice: Reflective Equilibrium
[Diagram: mutual adjustment among Moral Judgments, Moral Principles, and Moral Theories]
See https://plato.stanford.edu/entries/reflective-equilibrium/
46. This problem is not unique to ML
Knowledge covers its tracks
The Tetrachoric Correlation Coefficient
47. Correlation for continuous variables was well defined. How to define correlation for discrete variables?
Yule's Q: $Q = \frac{ad - bc}{ad + bc}$ (for a 2×2 table with cells a, b, c, d)
Pearson's tetrachoric coefficient of correlation:
assume an underlying zero-mean bivariate normal distribution
estimate cutoffs, sigma, and correlation coefficient r
The Tetrachoric Correlation Coefficient
MacKenzie, Donald. "Statistical theory and social interests: A case-study." Social Studies of Science 8, no. 1 (1978): 35-83.
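A worked example of Yule's Q, as a minimal Python sketch (the table values are made up for illustration):

```python
def yules_q(a, b, c, d):
    """Yule's Q for the 2x2 table [[a, b], [c, d]]:
    Q = (ad - bc) / (ad + bc), ranging from -1 to 1."""
    return (a * d - b * c) / (a * d + b * c)

# toy table: 30 of 40 with the trait have the outcome, 10 of 40 without
print(yules_q(30, 10, 10, 30))  # 0.8: strong positive association
```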
48. The debate:
Yule: assuming underlying continuous normal variables is bogus
Pearson:
If there actually is a bivariate normal distribution, Q ≠ r (depending on cutoffs)
Q is not unique
The Tetrachoric Correlation Coefficient
49. No obvious reason to favor one approach, so why do they differ?
Pearson was a social Darwinist, committed to eugenics
Regression was created to measure heritability
The measure of correlation must be such that the effects of natural (or unnatural) selection can be predicted
The Tetrachoric Correlation Coefficient
“If the theory of correlation can be extended [to categorical characteristics] we shall have much widened the field within which we can make numerical investigations into the intensity of heredity” — Pearson
Pearson, Karl. "Mathematical contributions to the theory of evolution. VII. On the correlation of characters not quantitatively measurable." Philosophical Transactions of the Royal Society of London. Series A 195, no. 262-273 (1900): 1-47.
50. The choice of measures — even those as basic as a correlation coefficient — can be motivated by concerns and have effects which are profoundly ethical
The Tetrachoric Correlation Coefficient
52. Gender Bias in Word Embeddings
Word embeddings represent words as vectors derived from their co-occurrence matrix (e.g. word2vec, later GloVe)
Similar words have similar vectors, and we can do algebra with vectors
Example: King − Man + Woman = Queen
More specifically, 3CosAdd: for an analogy a : b :: c : d, choose $d = \arg\max_{d'} \cos(d', b - a + c)$, excluding a, b, and c from the candidates
Mikolov, Tomas, Kai Chen, Greg Corrado, and Jeffrey Dean. "Efficient estimation of word representations in vector space." arXiv preprint arXiv:1301.3781 (2013).
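Since the analogy mechanics matter for the critique below, here is a small illustrative numpy sketch of 3CosAdd; the toy vectors are made-up values, not real embeddings:

```python
import numpy as np

def cos(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

def three_cos_add(a, b, c, vocab):
    """Answer a : b :: c : ? by maximizing cos(d, b - a + c),
    excluding the query words themselves, as word2vec does."""
    target = vocab[b] - vocab[a] + vocab[c]
    candidates = {w: v for w, v in vocab.items() if w not in (a, b, c)}
    return max(candidates, key=lambda w: cos(candidates[w], target))

# toy vocabulary of 4-d vectors (illustrative values only)
vocab = {
    "king":  np.array([0.9, 0.8, 0.1, 0.2]),
    "man":   np.array([0.1, 0.9, 0.1, 0.1]),
    "woman": np.array([0.1, 0.1, 0.9, 0.1]),
    "queen": np.array([0.9, 0.1, 0.8, 0.2]),
    "apple": np.array([0.1, 0.1, 0.1, 0.9]),
}
print(three_cos_add("man", "king", "woman", vocab))  # "queen"
```

Note the exclusion of the query words in the candidate set: this implementation detail is exactly what slide 55's critique turns on.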
54. Generate analogies for he::she, get crowdsourced workers to rank how stereotypical they are
Examples: surgeon::nurse, karate::gymnastics, carpentry::sewing
Suggestions for debiasing already-trained embeddings
Gender Bias in Word Embeddings
Bolukbasi, Tolga, Kai-Wei Chang, James Zou, Venkatesh Saligrama, and Adam Kalai. "Quantifying and reducing stereotypes in word embeddings." arXiv preprint arXiv:1606.06121 (2016).
55. But:
3CosAdd is broken
For an analogy A : B :: C : D, the word2vec implementation excludes the query words, so it can never return D = B
This also applies to Bolukbasi's direction-based formulation
People choose which analogies to report: Manzini et al. found biased examples even with a mistakenly reversed query
(Example: caucasian is to criminal as black is to X)
Gender Bias in Word Embeddings
Nissim, Malvina, Rik van Noord, and Rob van der Goot. "Fair is better than sensational: Man is to doctor as woman is to doctor." arXiv preprint arXiv:1905.09866 (2019).
Manzini, Thomas, Lim Yao Chong, Alan W. Black, and Yulia Tsvetkov. "Black is to criminal as caucasian is to police: Detecting and removing multiclass bias in word embeddings." In Proceedings of NAACL-HLT 2019, Volume 1 (Long and Short Papers), pages 615-621. Association for Computational Linguistics, Minneapolis, Minnesota.
56. A mixed conclusion:
Of course there is gender bias in society
And there's probably bias of some sort in word embeddings
But analogy tasks aren't the right way to capture them
More than that, analogy tasks are tricky to use for evaluating algorithms
Gender Bias in Word Embeddings
Gladkova, Anna, Aleksandr Drozd, and Satoshi Matsuoka. "Analogy-based detection of morphological and semantic relations with word embeddings: what works and what doesn't." In Proceedings of the NAACL Student Research Workshop, pp. 8-15. 2016.
58. ⬣ GDPR (General Data Protection Regulation)
⬣ CCPA (California Consumer Privacy Act)
⬣ In 2019, bills regulating user privacy on the internet were filed or introduced in at least 25 states
⬣ EU Commission White Paper on AI Regulation, “Artificial Intelligence – A European approach to excellence and trust”
⬣ Facebook FTC case: face recognition privacy violation
Privacy
Current Consumer Privacy/AI Regulation