DOI: 10.1145/3603719.3604285
poster

Four Factors Affecting Missing Data Imputation

Published: 27 August 2023

Abstract

Missing data is a common problem in datasets and impacts the reliability of data analysis. Numerous methods to impute (i.e., predict and replace) missing values have been proposed. The quality of these imputed values depends on factors like correlation, percentage of missingness, or the mechanism behind the missing value. Despite comparative studies on imputation methods, conditions for their effectiveness and safe application lack dedicated investigation.
This research aims to systematically investigate the impact of four factors on imputation quality. We specifically investigate the extent to which (1) missing data mechanism, (2) variable distribution, (3) correlation, and (4) percentage of missingness affect the imputation quality of eight different machine-learning-based imputation methods. The evaluation will be done on both a synthetic dataset and a real-world dataset from voestalpine Stahl GmbH.
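
The interplay of the four factors can be illustrated with a small simulation. The following is a minimal sketch and not the authors' experimental setup: it assumes a two-variable standard normal dataset, stands in two simple imputers (mean and single linear regression) for the eight machine-learning-based methods, and uses deterministic "largest values are missing" rules as simplified MAR and MNAR mechanisms. All function names are hypothetical.

```python
import math
import random


def make_data(n, rho, rng):
    """Draw n (x, y) pairs from a standard bivariate normal with correlation rho."""
    data = []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        y = rho * x + math.sqrt(1.0 - rho * rho) * rng.gauss(0.0, 1.0)
        data.append((x, y))
    return data


def missing_mask(data, mechanism, frac, rng):
    """Mark a fraction of y values as missing under a simplified mechanism."""
    n = len(data)
    k = int(round(frac * n))
    if mechanism == "MCAR":    # missingness independent of all values
        idx = set(rng.sample(range(n), k))
    elif mechanism == "MAR":   # missingness depends only on the observed x
        order = sorted(range(n), key=lambda i: data[i][0], reverse=True)
        idx = set(order[:k])
    elif mechanism == "MNAR":  # missingness depends on the unobserved y itself
        order = sorted(range(n), key=lambda i: data[i][1], reverse=True)
        idx = set(order[:k])
    else:
        raise ValueError("unknown mechanism: " + mechanism)
    return [i in idx for i in range(n)]


def mean_impute(data, mask):
    """Replace missing y values with the mean of the observed y values."""
    observed = [y for (_, y), m in zip(data, mask) if not m]
    mu = sum(observed) / len(observed)
    return [mu if m else y for (_, y), m in zip(data, mask)]


def regression_impute(data, mask):
    """Replace missing y values with predictions from a least-squares line on x."""
    observed = [(x, y) for (x, y), m in zip(data, mask) if not m]
    mx = sum(x for x, _ in observed) / len(observed)
    my = sum(y for _, y in observed) / len(observed)
    sxx = sum((x - mx) ** 2 for x, _ in observed)
    sxy = sum((x - mx) * (y - my) for x, y in observed)
    slope = sxy / sxx if sxx > 0 else 0.0
    intercept = my - slope * mx
    return [intercept + slope * x if m else y for (x, y), m in zip(data, mask)]


def rmse(data, imputed, mask):
    """Root mean squared error over the imputed (originally masked) entries only."""
    errors = [(y - y_hat) ** 2 for (_, y), y_hat, m in zip(data, imputed, mask) if m]
    return math.sqrt(sum(errors) / len(errors))


def run_experiment(mechanism, rho, frac, n=2000, seed=0):
    """Return the RMSE of both imputers for one combination of factors."""
    rng = random.Random(seed)
    data = make_data(n, rho, rng)
    mask = missing_mask(data, mechanism, frac, rng)
    return {
        "mean": rmse(data, mean_impute(data, mask), mask),
        "regression": rmse(data, regression_impute(data, mask), mask),
    }


if __name__ == "__main__":
    for mechanism in ("MCAR", "MAR", "MNAR"):
        for rho in (0.0, 0.5, 0.9):
            for frac in (0.1, 0.3, 0.5):
                result = run_experiment(mechanism, rho, frac)
                print(mechanism, rho, frac, result)
```

Varying `mechanism`, `rho`, and `frac` covers three of the four factors; the fourth, variable distribution, could be explored by swapping the Gaussian draws in `make_data` for a skewed distribution.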

Cited By

  • (2023) An evolutionary computation classification method for high-dimensional mixed missing variables data. Electronics Letters 59:24. DOI: 10.1049/ell2.13058. Online publication date: 12-Dec-2023

Published In

SSDBM '23: Proceedings of the 35th International Conference on Scientific and Statistical Database Management
July 2023
232 pages
ISBN: 9798400707469
DOI: 10.1145/3603719
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. Missing data
  2. correlation
  3. data quality
  4. distribution
  5. imputation
  6. missing data mechanisms
  7. missing values
  8. missingness

Qualifiers

  • Poster
  • Research
  • Refereed limited

Funding Sources

  • Austrian Research Promotion Agency (FFG)

Conference

SSDBM 2023

Acceptance Rates

Overall Acceptance Rate 56 of 146 submissions, 38%
