DOI: 10.1145/3375627.3375865
Research Article

Data Augmentation for Discrimination Prevention and Bias Disambiguation

Published: 07 February 2020

Abstract

Machine learning models are prone to biased decisions due to biases in the datasets they are trained on. In this paper, we introduce a novel data augmentation technique that creates a fairer dataset for model training and can also help identify the type of bias present in the dataset, i.e., whether the bias arises from a lack of representation for a particular group (sampling bias) or from human bias reflected in the labels (prejudice-based bias). Given a dataset involving a protected attribute with a privileged and an unprivileged group, we create an "ideal world" dataset: for every data sample, we create a new sample with the same features (except the protected attribute(s)) and the same label as the original, but with the opposite protected-attribute value. The synthetic data points are sorted by their proximity to the original training distribution and added successively to the real dataset to create intermediate datasets. We theoretically show that two different notions of fairness, statistical parity difference (independence) and average odds difference (separation), always change in the same direction under such augmentation. We also show that the proposed fairness-aware augmentation objective is submodular, which enables an efficient greedy algorithm. We empirically study the effect of training models on the intermediate datasets and show, on three datasets, that this technique reduces both bias measures while keeping accuracy nearly constant. We then discuss the implications of this study for disambiguating sampling bias from prejudice-based bias, and for how pre-processing techniques should be evaluated in general. The proposed method can be used by policy makers who want to train machine learning models on unbiased datasets: they can add a subset of synthetic points, to an extent they are comfortable with, to mitigate unwanted bias.
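The core augmentation step described above can be sketched in a few lines. The following is a minimal illustration, not the authors' implementation: it assumes a binary 0/1-encoded protected attribute and pandas DataFrames, and the function and column names (`ideal_world_augment`, `statistical_parity_difference`, `A`, `x`, `y`) are placeholders chosen for this example. The proximity-based ordering and greedy selection from the paper are omitted for brevity.

```python
import numpy as np
import pandas as pd

def ideal_world_augment(df, protected):
    """Create the 'ideal world' counterpart of each sample: flip the
    binary protected attribute while keeping all other features and
    the label unchanged."""
    synthetic = df.copy()
    synthetic[protected] = 1 - synthetic[protected]  # assumes 0/1 encoding
    return synthetic

def statistical_parity_difference(y_pred, protected):
    """P(yhat=1 | unprivileged) - P(yhat=1 | privileged); 0 means parity."""
    y_pred = np.asarray(y_pred)
    protected = np.asarray(protected)
    return y_pred[protected == 0].mean() - y_pred[protected == 1].mean()

# Toy dataset: protected attribute 'A', one feature 'x', label 'y'.
df = pd.DataFrame({"A": [0, 1], "x": [0.5, 0.7], "y": [1, 0]})

# Union of real and synthetic points: every (x, y) pair now appears
# once with A=0 and once with A=1, so labels no longer depend on A.
aug = pd.concat([df, ideal_world_augment(df, "A")], ignore_index=True)
```

In the paper's full procedure, the synthetic rows are not all added at once; they are ranked by proximity to the training distribution and added greedily, producing the intermediate datasets whose fairness/accuracy trade-offs are then studied.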



Published In

AIES '20: Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society
February 2020
439 pages
ISBN:9781450371100
DOI:10.1145/3375627

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. discrimination prevention
  2. fairness in machine learning
  3. responsible artificial intelligence

Qualifiers

  • Research-article

Conference

AIES '20

Acceptance Rates

Overall Acceptance Rate 61 of 162 submissions, 38%


Cited By

  • (2024) Comprehensive Validation on Reweighting Samples for Bias Mitigation via AIF360. Applied Sciences 14:9 (3826). DOI: 10.3390/app14093826. Published: 30-Apr-2024.
  • (2024) Toward Fair Ultrasound Computing Tomography: Challenges, Solutions and Outlook. Proceedings of the Great Lakes Symposium on VLSI 2024, 748-753. DOI: 10.1145/3649476.3660387. Published: 12-Jun-2024.
  • (2024) Enabling An Informed Contextual Multi-Armed Bandit Framework For Stock Trading With Neuroevolution. Proceedings of the Genetic and Evolutionary Computation Conference Companion, 1924-1933. DOI: 10.1145/3638530.3664145. Published: 14-Jul-2024.
  • (2024) Representation Debiasing of Generated Data Involving Domain Experts. Adjunct Proceedings of the 32nd ACM Conference on User Modeling, Adaptation and Personalization, 516-522. DOI: 10.1145/3631700.3664910. Published: 27-Jun-2024.
  • (2024) On the relation of causality- versus correlation-based feature selection on model fairness. Proceedings of the 39th ACM/SIGAPP Symposium on Applied Computing, 56-64. DOI: 10.1145/3605098.3636018. Published: 8-Apr-2024.
  • (2024) FairBalance: How to Achieve Equalized Odds With Data Pre-Processing. IEEE Transactions on Software Engineering 50:9, 2294-2312. DOI: 10.1109/TSE.2024.3431445. Published: Sep-2024.
  • (2024) Data Augmentation via Subgroup Mixup for Improving Fairness. ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 7350-7354. DOI: 10.1109/ICASSP48485.2024.10446564. Published: 14-Apr-2024.
  • (2024) Detecting and Mitigating Algorithmic Bias in Binary Classification using Causal Modeling. 2024 4th International Conference on Computer Communication and Information Systems (CCCIS), 47-51. DOI: 10.1109/CCCIS63483.2024.00016. Published: 27-Feb-2024.
  • (2024) Big data and deep learning for RNA biology. Experimental & Molecular Medicine 56:6, 1293-1321. DOI: 10.1038/s12276-024-01243-w. Published: 14-Jun-2024.
  • (2024) Mitigating bias in artificial intelligence. Future Generation Computer Systems 155:C, 384-401. DOI: 10.1016/j.future.2024.02.023. Published: 1-Jun-2024.
