Abstract
Evaluating the performance of data imputation and data augmentation is a critical issue in data science. In statistics, methods such as the Kolmogorov-Smirnov (K-S) test, Cramér-von Mises \(W^2\), Anderson-Darling \(A^2\), Pearson’s \(\chi ^2\), and Watson’s \(U^2\) have existed for decades to compare the distributions of two datasets. In the context of data generation, typical evaluation metrics share the same flaw: they compute each feature’s error and the global error on the generated data without weighting the errors by feature importance. In most cases, the importances of the features are imbalanced, which can bias both the per-feature and the global errors. This paper proposes a novel metric named “Explainable Global Error Weighted on Feature Importance” (xGEWFI). The new metric is tested in a complete preprocessing pipeline that (1) processes the outliers, (2) imputes the missing data, and (3) augments the data. At the end of the process, the xGEWFI error is calculated. The distribution error between the original and generated data is calculated with a Kolmogorov-Smirnov (K-S) test for each feature. These results are multiplied by the importance of the respective features, computed with a Random Forest (RF) algorithm. The metric result is expressed in an explainable format, aiming for an ethical AI. This novel method provides a more precise evaluation of a data generation process than a K-S test alone.
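The core idea of the abstract — per-feature K-S statistics between original and generated data, weighted by Random Forest feature importances — can be sketched as follows. This is a minimal illustration of the concept, not the authors' implementation; the function name `xgewfi`, the model choice (`RandomForestRegressor`), and the synthetic "generated" data are all assumptions made for the example.

```python
# Sketch of an xGEWFI-style score: weighted sum of per-feature
# Kolmogorov-Smirnov statistics, with weights taken from Random
# Forest feature importances (which sum to 1).
import numpy as np
from scipy.stats import ks_2samp
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

def xgewfi(original, generated, target):
    """Return (weighted error, per-feature K-S stats, importances)."""
    rf = RandomForestRegressor(n_estimators=100, random_state=0)
    rf.fit(original, target)
    weights = rf.feature_importances_          # non-negative, sums to 1.0
    ks = np.array([ks_2samp(original[:, j], generated[:, j]).statistic
                   for j in range(original.shape[1])])
    return float(np.dot(weights, ks)), ks, weights

# Illustrative data: the "generated" set is the original plus small noise.
X, y = make_regression(n_samples=300, n_features=5, random_state=0)
X_gen = X + np.random.default_rng(0).normal(0.0, 0.1, X.shape)
score, ks, w = xgewfi(X, X_gen, y)
print(f"xGEWFI-style error: {score:.4f}")
```

Because each K-S statistic lies in [0, 1] and the importances sum to 1, the weighted score also lies in [0, 1], and a feature with negligible importance contributes little even if its distribution error is large — which is exactly the imbalance the metric is designed to correct. Reporting the per-feature `ks` and `w` vectors alongside the scalar score is what makes the result explainable.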
Data availability
We used only datasets that are publicly available.
Code availability
The code is not yet published. It can be provided on request.
Acknowledgements
This work has been supported by the “Cellule d’expertise en robotique et intelligence artificielle” of the Cégep de Trois-Rivières and the Natural Sciences and Engineering Research Council.
Funding
This work has been supported by the Natural Sciences and Engineering Research Council.
Author information
Contributions
J.S.D.: Conceptualization, Methodology, Software, Writing - Original Draft. D.M.: Conceptualization, Methodology, Validation, Resources, Writing - Review & Editing, Supervision, Project administration, Funding acquisition.
Ethics declarations
Ethical approval
The work uses publicly available and non-identifiable information. No ethical approval was needed.
Consent to participate
Not applicable, since no human participant was involved in the evaluation of our study.
Consent for publication
Not applicable, since all datasets used in this study are released by third parties.
Conflict of interest
The authors confirm there are no conflicts of interest.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Dessureault, JS., Massicotte, D. Explainable global error weighted on feature importance: The xGEWFI metric to evaluate the error of data imputation and data augmentation. Appl Intell 53, 21532–21542 (2023). https://doi.org/10.1007/s10489-023-04661-x