Explainable global error weighted on feature importance: The xGEWFI metric to evaluate the error of data imputation and data augmentation

Applied Intelligence

Abstract

Evaluating data imputation and augmentation performance is a critical issue in data science. In statistics, methods such as the Kolmogorov-Smirnov (K-S) test, Cramér-von Mises \(W^2\), Anderson-Darling \(A^2\), Pearson’s \(\chi ^2\), and Watson’s \(U^2\) have existed for decades to compare the distributions of two datasets. In the context of data generation, typical evaluation metrics share the same flaw: they calculate the per-feature error and the global error on the generated data without weighting the error by the feature’s importance. In most cases, feature importance is imbalanced, which can bias both the per-feature and global errors. This paper proposes a novel metric named “Explainable Global Error Weighted on Feature Importance” (xGEWFI). The new metric is tested in a complete preprocessing method that (1) processes the outliers, (2) imputes the missing data, and (3) augments the data. At the end of the process, the xGEWFI error is calculated. The distribution error between the original and generated data is calculated for each feature using a K-S test. Those results are multiplied by the importance of the respective features, which is calculated using a Random Forest (RF) algorithm. The metric result is expressed in an explainable format, aiming for an ethical AI. This novel method provides a more precise evaluation of a data generation process than a K-S test alone.
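The weighting idea described in the abstract can be sketched in a few lines of Python. This is a minimal illustration and not the authors' implementation, which is not published. The helper name `weighted_ks_error`, the use of a Random Forest regressor rather than a classifier, the choice to fit the forest on the original data, and the aggregation of the weighted errors by a plain sum are all assumptions made for the example.

```python
import numpy as np
from scipy.stats import ks_2samp
from sklearn.ensemble import RandomForestRegressor


def weighted_ks_error(original, generated, target, n_estimators=50, seed=0):
    """Hypothetical helper: per-feature K-S statistic weighted by RF importance."""
    original = np.asarray(original, dtype=float)
    generated = np.asarray(generated, dtype=float)

    # Feature importance from a Random Forest fitted on the original data
    # (assumption: the target is numeric, so a regressor is used).
    forest = RandomForestRegressor(n_estimators=n_estimators, random_state=seed)
    forest.fit(original, target)
    importance = forest.feature_importances_

    # Two-sample K-S statistic between original and generated values,
    # computed independently for each feature (column).
    ks = np.array([
        ks_2samp(original[:, j], generated[:, j]).statistic
        for j in range(original.shape[1])
    ])

    # Weight each feature's distribution error by its importance.
    per_feature = ks * importance
    # Assumption: the global error is the sum of the weighted feature errors.
    return per_feature, per_feature.sum()


# Synthetic example: only the first feature drives the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
y = 3.0 * X[:, 0] + 0.1 * rng.normal(size=300)
X_gen = X + rng.normal(scale=0.05, size=X.shape)  # stand-in for generated data

per_feature, global_error = weighted_ks_error(X, X_gen, y)
print("weighted error per feature:", per_feature)
print("global weighted error:", global_error)
```

Under these assumptions, a distribution mismatch on an important feature contributes more to the global error than the same mismatch on a nearly irrelevant one, which is the bias the unweighted K-S test cannot express.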

Data availability

We used only datasets that are publicly available.

Code availability

The code is not published yet. It can be provided on demand.

Acknowledgements

This work has been supported by the “Cellule d’expertise en robotique et intelligence artificielle” of the Cégep de Trois-Rivières and the Natural Sciences and Engineering Research Council.

Funding

This work has been supported by the Natural Sciences and Engineering Research Council.

Author information

Contributions

J.S.D.: Conceptualization, Methodology, Software, Writing - Original Draft. D.M.: Conceptualization, Methodology, Validation, Resources, Writing - Review & Editing, Supervision, Project administration, Funding acquisition.

Corresponding author

Correspondence to Jean-Sébastien Dessureault.

Ethics declarations

Ethical approval

The work uses publicly available and non-identifiable information. No ethical approval was needed.

Consent to participate

Not applicable, since no human participant was involved in the evaluation of our study.

Consent for publication

Not applicable, since all datasets used in this study are released by third parties.

Conflict of interest

The authors confirm there are no conflicts of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Dessureault, JS., Massicotte, D. Explainable global error weighted on feature importance: The xGEWFI metric to evaluate the error of data imputation and data augmentation. Appl Intell 53, 21532–21542 (2023). https://doi.org/10.1007/s10489-023-04661-x

  • DOI: https://doi.org/10.1007/s10489-023-04661-x
