
Divide and Imitate: Multi-cluster Identification and Mitigation of Selection Bias

  • Conference paper
Advances in Knowledge Discovery and Data Mining (PAKDD 2022)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 13281)

Included in the following conference series: PAKDD (Pacific-Asia Conference on Knowledge Discovery and Data Mining)

Abstract

Machine Learning can help overcome human biases in decision making by focusing on purely logical conclusions drawn from the training data. If the training data is biased, however, that bias is transferred to the model and remains undetected, because performance is validated on a test set drawn from the same biased distribution. Existing strategies for identifying and mitigating selection bias generally rely on some knowledge of the bias or of the ground truth. An exception is the Imitate algorithm, which assumes no such knowledge but comes with a strong limitation: it can only model datasets with one normally distributed cluster per class. In this paper, we introduce a novel algorithm, Mimic, which uses Imitate as a building block but relaxes this limitation. By allowing mixtures of multivariate Gaussians, our technique can model multi-cluster datasets and thus provides solutions for a substantially wider set of problems. Experiments confirm that Mimic not only identifies potential biases in multi-cluster datasets, which can then be corrected early on, but also improves classifier performance.
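As a rough sketch of the idea described above (not the authors' implementation, which is linked in the Notes below), each class can first be decomposed into approximately Gaussian clusters via a mixture model, after which a single-cluster bias-identification routine such as Imitate [5] is applied per cluster. The snippet below illustrates only the decomposition step in Python, using scikit-learn [16]; imitate_cluster is a hypothetical placeholder, not an API of the published code.

# Illustration only: the authors' implementation of Mimic is available at
# https://github.com/KatDost/Mimic (see Notes). This sketch shows just the
# per-class Gaussian-mixture decomposition step.
import numpy as np
from sklearn.mixture import GaussianMixture

def split_classes_into_gaussian_clusters(X, y, max_components=5, seed=0):
    """Fit a Gaussian mixture per class (model size chosen by BIC) and
    return a list of (class_label, cluster_points) arrays."""
    clusters = []
    for label in np.unique(y):
        X_c = X[y == label]
        # Try 1..max_components components and keep the model with the lowest BIC.
        candidates = [GaussianMixture(n_components=k, random_state=seed).fit(X_c)
                      for k in range(1, max_components + 1)]
        best = min(candidates, key=lambda gmm: gmm.bic(X_c))
        assignment = best.predict(X_c)
        clusters.extend((label, X_c[assignment == k]) for k in range(best.n_components))
    return clusters

# Each per-class cluster could then be handed to a single-cluster routine, e.g.:
#   for label, X_cluster in split_classes_into_gaussian_clusters(X, y):
#       synthetic_points = imitate_cluster(X_cluster)  # hypothetical placeholder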


Notes

  1. Implementation and Supplementary Material: https://github.com/KatDost/Mimic.

  2. The Central Limit Theorem states that the normalized sum of independent and identically distributed (i.i.d.) random variables converges in distribution to a Gaussian, with variants covering dependent variables [10]. Since real-world measurements are typically combinations of many such effects rather than single i.i.d. draws, approximately Gaussian clusters arise frequently in practice; the short sketch after these notes illustrates this numerically.
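A small numerical check (illustrative only, not part of the paper) makes the point concrete: summing many independent, non-Gaussian effects yields measurements that are close to normally distributed, which is what makes per-cluster Gaussian models a reasonable default.

# Illustrative only: sums of many independent, non-Gaussian effects look
# approximately Gaussian, motivating the Gaussian-mixture assumption.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_samples, n_effects = 10_000, 30

# Each simulated "measurement" is the sum of 30 independent uniform effects.
measurements = rng.uniform(0.0, 1.0, size=(n_samples, n_effects)).sum(axis=1)

# Standardise and compare against a standard normal distribution; a large
# p-value means the sample is statistically indistinguishable from a Gaussian.
z = (measurements - measurements.mean()) / measurements.std()
print(stats.kstest(z, "norm"))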

References

  1. Abreu, N.: Análise do perfil do cliente Recheio e desenvolvimento de um sistema promocional [Analysis of the Recheio customer profile and development of a promotional system]. Master's thesis in Marketing, ISCTE-IUL, Lisbon (2011)

  2. Bareinboim, E., Tian, J., Pearl, J.: Recovering from selection bias in causal and statistical inference. In: Proceedings of the 28th AAAI Conference on Artificial Intelligence, June 2014 (2014)

  3. Bellamy, R.K.E., et al.: AI fairness 360: an extensible toolkit for detecting and mitigating algorithmic bias. IBM J. Res. Develop. 63(4/5), 4:1–4:15 (2019). https://doi.org/10.1147/JRD.2019.2942287

  4. Breunig, M.M., Kriegel, H.P., Ng, R.T., Sander, J.: LOF: identifying density-based local outliers. In: Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, pp. 93–104 (2000). https://doi.org/10.1145/342009.335388

  5. Dost, K., Taskova, K., Riddle, P., Wicker, J.: Your best guess when you know nothing: identification and mitigation of selection bias. In: 2020 IEEE International Conference on Data Mining (ICDM), pp. 996–1001. IEEE (2020). https://doi.org/10.1109/ICDM50108.2020.00115

  6. Dua, D., Graff, C.: UCI Machine Learning Repository (2017). http://archive.ics.uci.edu/ml

  7. Goel, N., Yaghini, M., Faltings, B.: Non-discriminatory machine learning through convex fairness criteria. In: Proceedings of the 32nd AAAI Conference on Artificial Intelligence, April 2018 (2018)

  8. Granichin, O., Volkovich, Z.V., Toledano-Kitai, D.: Cluster validation. In: Randomized Algorithms in Automatic Control and Data Mining. ISRL, vol. 67, pp. 163–228. Springer, Heidelberg (2015). https://doi.org/10.1007/978-3-642-54786-7_7

  9. Hassani, B.K.: Societal bias reinforcement through machine learning: a credit scoring perspective. AI Ethics 1(3), 239–247 (2020). https://doi.org/10.1007/s43681-020-00026-z

  10. Hoeffding, W., Robbins, H.: The central limit theorem for dependent random variables. Duke Math. J. 15(3), 773–780 (1948). https://doi.org/10.1215/S0012-7094-48-01568-3

  11. Hyvärinen, A., Oja, E.: Independent component analysis: algorithms and applications. Neural Netw. 13(4), 411–430 (2000). https://doi.org/10.1016/S0893-6080(00)00026-5

  12. Lavalle, A., Maté, A., Trujillo, J.: An approach to automatically detect and visualize bias in data analytics. In: CEUR Workshop Proceedings of the 22nd International Workshop on Design, Optimization, Languages and Analytical Processing of Big Data, vol. 2572. CEUR (2020)

  13. Lyon, A.: Why are normal distributions normal? Br. J. Philos. Sci. 65(3), 621–649 (2014). https://doi.org/10.1093/bjps/axs046

  14. Mehrabi, N., Morstatter, F., Saxena, N., Lerman, K., Galstyan, A.: A survey on bias and fairness in machine learning. ACM Comput. Surv. 54(6), 1–35 (2021). https://doi.org/10.1145/3457607

  15. Panch, T., Mattie, H., Atun, R.: Artificial intelligence and algorithmic bias: implications for health systems. J. Glob. Health 9(2), 010318 (2019). https://doi.org/10.7189/jogh.09.020318

  16. Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., et al.: Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011)

  17. Poulos, J., Valle, R.: Missing data imputation for supervised learning. Appl. Artif. Intell. 32(2), 186–196 (2018). https://doi.org/10.1080/08839514.2018.1448143

  18. Rabanser, S., Günnemann, S., Lipton, Z.: Failing loudly: an empirical study of methods for detecting dataset shift. Adv. Neural Info. Process. Syst. 32, 1396–1408 (2019)

  19. Rezaei, A., Liu, A., Memarrast, O., Ziebart, B.D.: Robust fairness under covariate shift. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 9419–9427 (2021)

  20. Smith, A.T., Elkan, C.: Making generative classifiers robust to selection bias. In: Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 657–666 (2007). https://doi.org/10.1145/1281192.1281263

  21. Stojanov, P., Gong, M., Carbonell, J., Zhang, K.: Low-dimensional density ratio estimation for covariate shift correction. Proc. Mach. Learn. Res. 89, 3449–3458 (2019)

  22. Strack, B., Deshazo, J., Gennings, C., Olmo Ortiz, J.L., Ventura, S., et al.: Impact of HbA1c measurement on hospital readmission rates: analysis of 70,000 clinical database patient records. BioMed Res. Int. 2014, 781670 (2014). https://doi.org/10.1155/2014/781670

Author information

Corresponding author

Correspondence to Katharina Dost.

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Dost, K., Duncanson, H., Ziogas, I., Riddle, P., Wicker, J. (2022). Divide and Imitate: Multi-cluster Identification and Mitigation of Selection Bias. In: Gama, J., Li, T., Yu, Y., Chen, E., Zheng, Y., Teng, F. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2022. Lecture Notes in Computer Science (LNAI), vol 13281. Springer, Cham. https://doi.org/10.1007/978-3-031-05936-0_12

  • DOI: https://doi.org/10.1007/978-3-031-05936-0_12

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-05935-3

  • Online ISBN: 978-3-031-05936-0

  • eBook Packages: Computer Science, Computer Science (R0)
