
Divide and Imitate: Multi-cluster Identification and Mitigation of Selection Bias

  • Conference paper
Advances in Knowledge Discovery and Data Mining (PAKDD 2022)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 13281)

Included in the following conference series: PAKDD (Pacific-Asia Conference on Knowledge Discovery and Data Mining)

Abstract

Machine Learning can help overcome human biases in decision making by focusing on purely logical conclusions drawn from the training data. If the training data is biased, however, that bias is transferred to the model and remains undetected, because performance is validated on a test set drawn from the same biased distribution. Existing strategies for identifying and mitigating selection bias generally rely on some knowledge of the bias or of the ground truth. An exception is the Imitate algorithm, which assumes no such knowledge but comes with a strong limitation: it can only model datasets with one normally distributed cluster per class. In this paper, we introduce a novel algorithm, Mimic, which uses Imitate as a building block but relaxes this limitation. By allowing mixtures of multivariate Gaussians, our technique can model multi-cluster datasets and thus provides solutions for a substantially wider set of problems. Experiments confirm that Mimic not only identifies potential biases in multi-cluster datasets, which can then be corrected early on, but also improves classifier performance.
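As a rough sketch of the idea described above (not the authors' implementation, which is linked in the Notes below), each class can first be decomposed into approximately Gaussian clusters via a mixture model, after which a single-cluster bias-identification routine such as Imitate [5] is applied per cluster. The snippet below illustrates only the decomposition step in Python, using scikit-learn [16]; imitate_cluster is a hypothetical placeholder, not an API of the published code.

# Illustration only: the authors' implementation of Mimic is available at
# https://github.com/KatDost/Mimic (see Notes). This sketch shows just the
# per-class Gaussian-mixture decomposition step.
import numpy as np
from sklearn.mixture import GaussianMixture

def split_classes_into_gaussian_clusters(X, y, max_components=5, seed=0):
    """Fit a Gaussian mixture per class (model size chosen by BIC) and
    return a list of (class_label, cluster_points) arrays."""
    clusters = []
    for label in np.unique(y):
        X_c = X[y == label]
        # Try 1..max_components components and keep the model with the lowest BIC.
        candidates = [GaussianMixture(n_components=k, random_state=seed).fit(X_c)
                      for k in range(1, max_components + 1)]
        best = min(candidates, key=lambda gmm: gmm.bic(X_c))
        assignment = best.predict(X_c)
        clusters.extend((label, X_c[assignment == k]) for k in range(best.n_components))
    return clusters

# Each per-class cluster could then be handed to a single-cluster routine, e.g.:
#   for label, X_cluster in split_classes_into_gaussian_clusters(X, y):
#       synthetic_points = imitate_cluster(X_cluster)  # hypothetical placeholder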


Notes

  1. Implementation and Supplementary Material: https://github.com/KatDost/Mimic.

  2. The Central Limit Theorem states that the normalized sum of independent and identically distributed (i.i.d.) random variables converges in distribution to a Gaussian, with variants covering dependent variables [10]. Since real-world measurements are typically combinations of many such effects rather than single i.i.d. draws, approximately Gaussian clusters arise frequently in practice; the short sketch after these notes illustrates this numerically.
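A small numerical check (illustrative only, not part of the paper) makes the point concrete: summing many independent, non-Gaussian effects yields measurements that are close to normally distributed, which is what makes per-cluster Gaussian models a reasonable default.

# Illustrative only: sums of many independent, non-Gaussian effects look
# approximately Gaussian, motivating the Gaussian-mixture assumption.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_samples, n_effects = 10_000, 30

# Each simulated "measurement" is the sum of 30 independent uniform effects.
measurements = rng.uniform(0.0, 1.0, size=(n_samples, n_effects)).sum(axis=1)

# Standardise and compare against a standard normal distribution; a large
# p-value means the sample is statistically indistinguishable from a Gaussian.
z = (measurements - measurements.mean()) / measurements.std()
print(stats.kstest(z, "norm"))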

References

  1. Abreu, N.: Análise do perfil do cliente Recheio e desenvolvimento de um sistema promocional [Analysis of the Recheio customer profile and development of a promotional system]. Master's thesis in Marketing, ISCTE-IUL, Lisbon (2011)

  2. Bareinboim, E., Tian, J., Pearl, J.: Recovering from selection bias in causal and statistical inference. In: Proceedings of the 28th AAAI Conference on Artificial Intelligence, June 2014 (2014)

  3. Bellamy, R.K.E., et al.: AI fairness 360: an extensible toolkit for detecting and mitigating algorithmic bias. IBM J. Res. Develop. 63(4/5), 4:1–4:15 (2019). https://doi.org/10.1147/JRD.2019.2942287

  4. Breunig, M.M., Kriegel, H.P., Ng, R.T., Sander, J.: LOF: identifying density-based local outliers. In: Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, pp. 93–104 (2000). https://doi.org/10.1145/342009.335388

  5. Dost, K., Taskova, K., Riddle, P., Wicker, J.: Your best guess when you know nothing: identification and mitigation of selection bias. In: 2020 IEEE International Conference on Data Mining (ICDM), pp. 996–1001. IEEE (2020). https://doi.org/10.1109/ICDM50108.2020.00115

  6. Dua, D., Graff, C.: UCI Machine Learning Repository (2017). http://archive.ics.uci.edu/ml

  7. Goel, N., Yaghini, M., Faltings, B.: Non-discriminatory machine learning through convex fairness criteria. In: Proceedings of the 32nd AAAI Conference on Artificial Intelligence, April 2018 (2018)

  8. Granichin, O., Volkovich, Z.V., Toledano-Kitai, D.: Cluster validation. In: Randomized Algorithms in Automatic Control and Data Mining. ISRL, vol. 67, pp. 163–228. Springer, Heidelberg (2015). https://doi.org/10.1007/978-3-642-54786-7_7

  9. Hassani, B.K.: Societal bias reinforcement through machine learning: a credit scoring perspective. AI Ethics 1(3), 239–247 (2020). https://doi.org/10.1007/s43681-020-00026-z

  10. Hoeffding, W., Robbins, H.: The central limit theorem for dependent random variables. Duke Math. J. 15(3), 773–780 (1948). https://doi.org/10.1215/S0012-7094-48-01568-3

  11. Hyvärinen, A., Oja, E.: Independent component analysis: algorithms and applications. Neural Netw. 13(4), 411–430 (2000). https://doi.org/10.1016/S0893-6080(00)00026-5

  12. Lavalle, A., Maté, A., Trujillo, J.: An approach to automatically detect and visualize bias in data analytics. In: CEUR Workshop Proceedings of the 22nd International Workshop on Design, Optimization, Languages and Analytical Processing of Big Data, vol. 2572. CEUR (2020)

  13. Lyon, A.: Why are normal distributions normal? Br. J. Philos. Sci. 65(3), 621–649 (2014). https://doi.org/10.1093/bjps/axs046

  14. Mehrabi, N., Morstatter, F., Saxena, N., Lerman, K., Galstyan, A.: A survey on bias and fairness in machine learning. ACM Comput. Surv. 54(6), 1–35 (2021). https://doi.org/10.1145/3457607

  15. Panch, T., Mattie, H., Atun, R.: Artificial intelligence and algorithmic bias: implications for health systems. J. Glob. Health 9(2), 010318 (2019). https://doi.org/10.7189/jogh.09.020318

  16. Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., et al.: Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011)

  17. Poulos, J., Valle, R.: Missing data imputation for supervised learning. Appl. Artif. Intell. 32(2), 186–196 (2018). https://doi.org/10.1080/08839514.2018.1448143

  18. Rabanser, S., Günnemann, S., Lipton, Z.: Failing loudly: an empirical study of methods for detecting dataset shift. Adv. Neural Info. Process. Syst. 32, 1396–1408 (2019)

  19. Rezaei, A., Liu, A., Memarrast, O., Ziebart, B.D.: Robust fairness under covariate shift. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 9419–9427 (2021)

  20. Smith, A.T., Elkan, C.: Making generative classifiers robust to selection bias. In: Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 657–666 (2007). https://doi.org/10.1145/1281192.1281263

  21. Stojanov, P., Gong, M., Carbonell, J., Zhang, K.: Low-dimensional density ratio estimation for covariate shift correction. Proc. Mach. Learn. Res. 89, 3449–3458 (2019)

  22. Strack, B., Deshazo, J., Gennings, C., Olmo Ortiz, J.L., Ventura, S., et al.: Impact of HbA1c measurement on hospital readmission rates: analysis of 70,000 clinical database patient records. BioMed Res. Int. 2014, 781670 (2014). https://doi.org/10.1155/2014/781670

Author information

Corresponding author

Correspondence to Katharina Dost.

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Dost, K., Duncanson, H., Ziogas, I., Riddle, P., Wicker, J. (2022). Divide and Imitate: Multi-cluster Identification and Mitigation of Selection Bias. In: Gama, J., Li, T., Yu, Y., Chen, E., Zheng, Y., Teng, F. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2022. Lecture Notes in Computer Science (LNAI), vol 13281. Springer, Cham. https://doi.org/10.1007/978-3-031-05936-0_12

  • DOI: https://doi.org/10.1007/978-3-031-05936-0_12

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-05935-3

  • Online ISBN: 978-3-031-05936-0

  • eBook Packages: Computer Science, Computer Science (R0)
