Data cleaning and machine learning: a systematic literature review

Côté, Pierre-Olivier; Nikanjam, Amin; Ahmed, Nafisa; Humeniuk, Dmytro; Khomh, Foutse

doi:10.1007/s10515-024-00453-w


Published: 11 June 2024

Volume 31, article number 54 (2024)

Automated Software Engineering



Abstract

Machine Learning (ML) is integrated into a growing number of systems for various applications. Because the performance of an ML model is highly dependent on the quality of the data it has been trained on, there is a growing interest in approaches to detect and repair data errors (i.e., data cleaning). Researchers are also exploring how ML can be used for data cleaning, creating a dual relationship between ML and data cleaning. To the best of our knowledge, there is no study that comprehensively reviews this relationship. This paper’s objectives are twofold. First, it aims to summarize the latest approaches for data cleaning for ML and ML for data cleaning. Second, it provides future work recommendations. We conduct a systematic literature review of the papers published between 2016 and 2022 inclusive. We identify different types of data cleaning activities with and for ML: feature cleaning, label cleaning, entity matching, outlier detection, imputation, and holistic data cleaning. We summarize the content of 101 papers covering various data cleaning activities and provide 24 future work recommendations. Our review highlights many promising data cleaning techniques that can be further extended. We believe that our review of the literature will help the community develop better approaches to clean data.



Data availability

All data generated or analyzed during this study are available in the GitHub repository to help reproduce our results (Côté et al. 2023).

Notes

For example, "A modular edge-/cloud-solution for automated error detection of industrial hairpin weldings using convolutional neural networks" is a paper that was included in the results but not relevant to our study since it is not a data cleaning approach.
https://scholar.google.com/.
https://www.engineeringvillage.com.
https://webofknowledge.com.
https://www.sciencedirect.com.
https://dl.acm.org.
https://ieeexplore.ieee.org.
Harzing, A.W. (2007) Publish or Perish, available from https://harzing.com/resources/publish-or-perish.
https://www.zotero.org/.
https://openai.com/blog/chatgpt.

References

(2022) Common problems. https://developers.google.com/machine-learning/gan/problems
(2023) https://www.cnet.com/tech/chatgpt-can-pass-the-bar-exam-does-that-actually-matter/
Abedjan, Z., Chu, X., Deng, D., Fernandez, R.C., Ilyas, I.F., Ouzzani, M., Papotti, P., Stonebraker, M., Tang, N.: Detecting data errors: Where are we and what needs to be done? Proc. VLDB Endow. 9(12), 993–1004 (2016)
Abidin, N.Z., Ismail, A.R., Emran, N.A.: Performance analysis of machine learning algorithms for missing value imputation. Int. J. Adv. Comput. Sci. Appl. 9(6), (2018)
Abiteboul, S., Hull, R., Vianu, V.: Foundations of Databases, vol. 8. Addison-Wesley Reading, Delhi (1995)
Adhikari, D., Jiang, W., Zhan, J., He, Z., Rawat, D.B., Aickelin, U., Khorshidi, H.A.: A comprehensive survey on imputation of missing data in internet of things. ACM Comput. Surv. 55(7), 1–38 (2022)
Aggarwal, C.C., Reddy, C.K.: Data clustering: algorithms and applications (2013)
Agrawal, A., Chatterjee, R., Curino, C., Floratou, A., Gowdal, N., Interlandi, M., Jindal, A., Karanasos, K., Krishnan, S., Kroth, B., et al.: Cloudy with high chance of dbms: A 10-year prediction for enterprise-grade ml. (2019), arXiv preprint arXiv:1909.00084
Akouemo, H.N., Povinelli, R.J.: Data improving in time series using ARX and ANN models. IEEE Trans. Power Syst. 32(5), 3352–3359 (2017)
Alimohammadi, H., Chen, S.N.: Performance evaluation of outlier detection techniques in production timeseries: A systematic review and meta-analysis. Expert Syst. Appl. 191, 116371 (2022)
Alsolai, H., Roper, M.: A systematic literature review of machine learning techniques for software maintainability prediction. Inf. Softw. Technol. 119, 106214 (2020). https://doi.org/10.1016/j.infsof.2019.106214
Dempster, A.P.: Upper and lower probabilities induced by a multivalued mapping. Ann. Math. Stat. 38(2), 325–339 (1967)
Araci, D.: Finbert: financial sentiment analysis with pre-trained language models. arXiv preprint arXiv:1908.10063 (2019)
Ataeyan, M., Daneshpour, N.: A novel data repairing approach based on constraints and ensemble learning. Expert Syst. Appl. 159, 113511 (2020)
Atkinson, G., Metsis, V.: Identifying label noise in time-series datasets. In: Adjunct Proceedings of the 2020 ACM International Joint Conference on Pervasive and Ubiquitous Computing and Proceedings of the 2020 ACM International Symposium on Wearable Computers, pp. 238–243 (2020)
Atkinson, G., Metsis, V.: Tsar: a time series assisted relabeling tool for reducing label noise. In: The 14th PErvasive Technologies Related to Assistive Environments Conference, pp 203–209. (2021)
Azeem, M.I., Palomba, F., Shi, L., Wang, Q.: Machine learning techniques for code smell detection: a systematic literature review and meta-analysis. Inf. Softw. Technol. 108, 115–138 (2019). https://doi.org/10.1016/j.infsof.2018.12.009
Bach, S.H., Broecheler, M., Huang, B., Getoor, L.: Hinge-loss Markov random fields and probabilistic soft logic. J. Mach. Learn. Res. 18, 1–67 (2017)
Badue, C., Guidolini, R., Carneiro, R.V., Azevedo, P., Cardoso, V.B., Forechi, A., Jesus, L., Berriel, R., Paixao, T.M., Mutz, F., et al.: Self-driving cars: a survey. Expert Syst. Appl. 165, 113816 (2021)
Bagherzadeh, P., Sadoghi Yazdi, H.: Label denoising based on Bayesian aggregation. Int. J. Mach. Learn. Cybern. 8, 903–914 (2017)
Bank, D., Koenigstein, N., Giryes, R.: Autoencoders. arXiv preprint arXiv:2003.05991 (2020)
Barlaug, N., Gulla, J.A.: Neural networks for entity matching: a survey. ACM Trans. Knowl. Discov. Data (TKDD) 15(3), 1–37 (2021)
Beltagy, I., Lo, K., Cohan, A.: Scibert: a pretrained language model for scientific text. arXiv preprint arXiv:1903.10676 (2019)
Ben-Gal, I.: Outlier detection in: data mining and knowledge discovery handbook: A complete guide for practitioners and researchers (2005)
Bergstra, J., Yamins, D., Cox, D.D., et al.: Hyperopt: a python library for optimizing the hyperparameters of machine learning algorithms. In: Proceedings of the 12th Python in science conference, Citeseer, vol. 13, p. 20 (2013)
Bernhardt, M., Castro, D.C., Tanno, R., Schwaighofer, A., Tezcan, K.C., Monteiro, M., Bannur, S., Lungren, M.P., Nori, A., Glocker, B., et al.: Active label cleaning for improved dataset quality under resource constraints. Nat. Commun. 13(1), 1161 (2022)
Berti-Equille, L.: Learn2clean: Optimizing the sequence of tasks for web data preparation. In: The World Wide Web Conference, pp. 2580–2586 (2019)
Bhandari, K., Kumar, K., Sangal, A.L.: Data quality issues in software fault prediction: a systematic literature review. Artif. Intell. Rev. 56(8), 7839–7908 (2023)
Bishop, C.M., Nasrabadi, N.M.: Pattern Recognition and Machine Learning, vol. 4. Springer, New York (2006)
Bogatu, A., Paton, N.W., Douthwaite, M., Davie, S., Freitas, A.: Cost–effective variational active entity resolution. In: 2021 IEEE 37th International Conference on Data Engineering (ICDE), IEEE, pp. 1272–1283 (2021)
Bojanowski, P., Grave, E., Joulin, A., Mikolov, T.: Enriching word vectors with subword information. Trans. Assoc. Comput. Linguist. 5, 135–146 (2017)
Bosu, M.F., MacDonell, S.G.: A taxonomy of data quality challenges in empirical software engineering. In: 2013 22nd Australian Software Engineering Conference, IEEE, pp. 97–106 (2013)
Boukerche, A., Zheng, L., Alfandi, O.: Outlier detection: methods, models, and classification. ACM Comput. Surv. (CSUR) 53(3), 1–37 (2020)
Braiek, H.B., Khomh, F.: On testing machine learning programs. J. Syst. Softw. 164, 110542 (2020). https://doi.org/10.1016/j.jss.2020.110542
Brunner, U., Stockinger, K.: Entity matching with transformer architectures-a step forward in data integration. In: 23rd International Conference on Extending Database Technology, Copenhagen, OpenProceedings (2020)
Carpenter, G.A., Grossberg, S., Rosen, D.B.: Fuzzy art: fast stable learning and categorization of analog patterns by an adaptive resonance system. Neural Netw. 4(6), 759–771 (1991)
Cer, D., Yang, Y., Kong, Sy., Hua, N., Limtiaco, N., John, R.S., Constant, N., Guajardo-Cespedes, M., Yuan, S., Tar, C., et al.: Universal sentence encoder. arXiv preprint arXiv:1803.11175 (2018)
Chai, C., Wang, J., Luo, Y., Niu, Z., Li, G.: Data management for machine learning: a survey. IEEE Trans. Knowl. Data Eng. 35(5), 4646–4667 (2022)
Chasmai, M.E.: Cubetr: learning to solve the rubiks cube using transformers. arXiv preprint arXiv:2111.06036 (2021)
Chen, M., Tworek, J., Jun, H., Yuan, Q., Pinto, H.P.d.O., Kaplan, J., Edwards, H., Burda, Y., Joseph, N., Brockman, G., et al.: Evaluating large language models trained on code (2021). arXiv preprint arXiv:2107.03374
Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International conference on machine learning, PMLR, pp. 1597–1607. (2020)
Cheng, K., Li, X., Xu, Y.E., Dong, X.L., Sun, Y.: Pge: Robust Product Graph Embedding Learning for Error Detection. https://doi.org/10.48550/ARXIV.2202.09747. arXiv:2202.09747 (2022)
Cholewiak, S.A., Ipeirotis, P., Silva, V., Kannawadi, A.: SCHOLARLY: Simple Access to Google Scholar Authors and Citation Using Python. https://doi.org/10.5281/zenodo.5764801, https://github.com/scholarly-python-package/scholarly (2021)
Christophides, V., Efthymiou, V., Palpanas, T., Papadakis, G., Stefanidis, K.: An overview of end-to-end entity resolution for big data. ACM Comput. Surv. (CSUR) 53(6), 1–42 (2020)
Chu, X., Ilyas, I.F., Krishnan, S., Wang, J.: Data cleaning: Overview and emerging challenges. In: Proceedings of the 2016 International Conference on Management of Data, Association for Computing Machinery, New York, NY, USA, SIGMOD ’16, pp. 2201–2206. https://doi.org/10.1145/2882903.2912574 (2016a)
Chu, X., Ilyas, I.F., Krishnan, S., Wang, J.: Data cleaning: overview and emerging challenges. In: Proceedings of the 2016 International Conference on Management of Data, pp. 2201–2206. (2016b)
Côté, P.O., Nikanjam, A., Bouchoucha, R., Basta, I., Abidi, M., Khomh, F.: Quality Issues in Machine Learning Software Systems. arXiv preprint arXiv:2306.15007 (2023)
Croft, R., Xie, Y., Babar, M.A.: Data preparation for software vulnerability prediction: a systematic literature review. IEEE Trans. Softw. Eng. 49(3), 1044–1063 (2022)
Croft, R., Babar, M.A., Kholoosi, M.M.: Data quality for software vulnerability datasets. In: 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE), IEEE, pp. 121–133 (2023)
Cubuk, E.D., Zoph, B., Mane, D., Vasudevan, V., Le, Q.V.: Autoaugment: learning augmentation strategies from data. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 113–123. (2019)
Côté, P.O., Nikanjam, A., Ahmed, N., Humeniuk, D., Khomh, F.: The replication package. https://github.com/poclecoqq/SLR-datacleaning (2023)
Das, S., Doan, A., G C PS., Gokhale, C., Konda, P., Govind, Y., Paulsen, D.: The Magellan Data Repository. https://sites.google.com/site/anhaidgroup/projects/data (2016)
Dempster, A.P., et al.: Upper and lower probabilities induced by a multivalued mapping. In: Classic Works of the Dempster-Shafer Theory of Belief Functions, pp. 57–72. Springer, Berlin (2008)
Deng, D., Fernandez, R.C., Abedjan, Z., Wang, S., Stonebraker, M., Elmagarmid, A.K., Ilyas, I.F., Madden, S., Ouzzani, M., Tang, N.: The data civilizer system. In: CIDR (2017)
Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, IEEE, pp. 248–255. (2009)
Dolatshah, M., Teoh, M., Wang, J., Pei, J.: Cleaning crowdsourced labels using oracles for supervised learning. PVLDB 12(4), 376–389 (2018)
Domingues, R., Filippone, M., Michiardi, P., Zouaoui, J.: A comparative evaluation of outlier detection algorithms: experiments and analyses. Pattern Recognit. 74, 406–421 (2018)
Dong, X.L., Rekatsinas, T.: Data integration and machine learning: a natural synergy. In: Proceedings of the 2018 International Conference on Management of Data, pp. 1645–1650. (2018)
Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: transformers for image recognition at scale (2020). arXiv preprint arXiv:2010.11929
Ebraheem, M., Thirumuruganathan, S., Joty, S., Ouzzani, M., Tang, N.: Distributed representations of tuples for entity resolution. Proc. VLDB Endow. 11(11), 1454–1467 (2018)
Ekambaram, R., Fefilatyev, S., Shreve, M., Kramer, K., Hall, L.O., Goldgof, D.B., Kasturi, R.: Active cleaning of label noise. Pattern Recognit. 51, 463–480 (2016)
Felderer, M., Russo, B., Auer, F.: On testing data-intensive software systems. In: Security and Quality in Cyber-Physical Systems Engineering: With Forewords by Robert M Lee and Tom Gilb, pp. 129–148. (2019)
Feldt, R., Magazinius, A.: Validity threats in empirical software engineering research-an initial survey. In: SEKE, pp. 374–379 (2010)
Feng, W., Long, Y., Wang, S., Quan, Y.: A review of addressing class noise problems of remote sensing classification. J. Syst. Eng. Electron. 34(1), 36–46 (2023). https://doi.org/10.23919/JSEE.2023.000034
Filippone, M., Sanguinetti, G.: Information theoretic novelty detection. Pattern Recognit. 43(3), 805–814 (2010)
Flokas, L., Wu, W., Liu, Y., Wang, J., Verma, N., Wu, E.: Complaint-driven training data debugging at interactive speeds. In: Proceedings of the 2022 International Conference on Management of Data, pp 369–383. (2022)
Foidl, H., Felderer, M.: Risk-based data validation in machine learning-based software systems. In: Proceedings of the 3rd ACM SIGSOFT International Workshop on Machine Learning Techniques for Software Quality Evaluation, pp. 13–18 (2019)
Fox, T.L., Guynes, C.S., Prybutok, V.R., Windsor, J.: Maintaining quality in information systems. J. Comput. Inf. Syst. 40(1), 76–80 (1999)
Freund, Y., Seung, H.S., Shamir, E., Tishby, N.: Selective sampling using the query by committee algorithm. Mach. Learn. 28(2–3), 133 (1997)
Fu, C., Han, X., He, J., Sun, L.: Hierarchical matching network for heterogeneous entity resolution. In: Proceedings of the Twenty-Ninth International Conference on International Joint Conferences on Artificial Intelligence, pp. 3665–3671. (2021)
Gal, Y.: Uncertainty in Deep Learning (2016)
Gal, Y., Ghahramani, Z.: Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning. https://doi.org/10.48550/ARXIV.1506.02142, arXiv:1506.02142 (2015)
Gauen, K., Dailey, R., Laiman, J., Zi, Y., Asokan, N., Lu, Y.H., Thiruvathukal, G.K., Shyu, M.L., Chen, S.C.: Comparison of visual datasets for machine learning. In: 2017 IEEE International Conference on Information Reuse and Integration (IRI), IEEE, pp. 346–355. (2017)
Ge, C., Gao, Y., Miao, X., Yao, B., Wang, H.: A hybrid data cleaning framework using Markov logic networks. IEEE Trans. Knowl. Data Eng. 34(5), 2048–2062 (2020)
Gemp, I., Theocharous, G., Ghavamzadeh, M.: Automated Data Cleansing Through Meta-learning. In: Twenty-Ninth IAAI Conference (2017)
Gezici, B., Tarhan, A.K.: Systematic literature review on software quality for AI-based software. Empir. Softw. Eng. 27(3), 66 (2022)
Gitnux, A.: Self driving cars safety statistics and trends in 2023 • gitnux. https://blog.gitnux.com/self-driving-cars-safety-statistics/ (2023)
Goodfellow, I., Bengio, Y., Courville, A.: Deep Learning. MIT Press. http://www.deeplearningbook.org (2016)
Goodfellow, I.J., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks (2014). arXiv:1406.2661
Gottapu, R.D., Dagli, C., Ali, B.: Entity resolution using convolutional neural network. Procedia Comput. Sci. 95, 153–158 (2016)
Grill, J.B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., Piot, B., Kavukcuoglu, K., Munos, R., Valko, M.: Bootstrap your own latent—a new approach to self-supervised learning. In: Larochelle H., Ranzato M., Hadsell R., Balcan M., Lin H. (eds.) Advances in Neural Information Processing Systems, Curran Associates, Inc., vol. 33, pp. 21271–21284. https://proceedings.neurips.cc/paper/2020/file/f3ada80d5c4ee70142b17b8192b2958e-Paper.pdf (2020)
Guan, H., Zhang, Y., Xian, M., Cheng, H.D., Tang, X.: Wenn for individualized cleaning in imbalanced data. In: 2016 23rd International Conference on Pattern Recognition (ICPR), IEEE, pp. 456–461. (2016)
Guo, G., Adjeroh, D., Li, X.: Automated cleaning of identity label noise in a large-scale face dataset using a face image quality control (2018)
Guo, Y., Bettaieb, S.: An investigation of quality issues in vulnerability detection datasets. In: 2023 IEEE European Symposium on Security and Privacy Workshops (EuroS&PW), IEEE, pp. 29–33. (2023)
Guo, Z., Rekatsinas, T.: Learning functional dependencies with sparse regression. arXiv:1905.01425 (2019)
Han, B., Yao, Q., Yu, X., Niu, G., Xu, M., Hu, W., Tsang, I., Sugiyama, M.: Co-teaching: Robust training of deep neural networks with extremely noisy labels. https://doi.org/10.48550/ARXIV.1804.06872, arXiv:1804.06872 (2018)
Hara, S., Nitanda, A., Maehara, T.: Data cleansing for models trained with sgd. Adv. Neural Inf. Process. Syst. 32, (2019)
Hawkins, D.M.: Identification of Outliers, vol. 11. Springer (1980)
He, X., Zhao, K., Chu, X.: Automl: a survey of the state-of-the-art. Knowl. Based Syst. 212, 106622 (2021a)
He, Y. et al.: Automatic detection of grammatical errors in english verbs based on rnn algorithm: auxiliary objectives for neural error detection models. Comput. Intell. Neurosci. (2021b)
Heidari, A., McGrath, J., Ilyas, I.F., Rekatsinas, T.: Holodetect: few-shot learning for error detection. In: Proceedings of the 2019 International Conference on Management of Data, pp. 829–846 (2019)
Hendrycks, D., Gimpel, K.: A baseline for detecting misclassified and out-of-distribution examples in neural networks. https://doi.org/10.48550/ARXIV.1610.02136, arXiv:1610.02136 (2016)
Hernández-García, A., König, P.: Data augmentation instead of explicit regularization. arXiv preprint arXiv:1806.03852 (2018)
Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Comput. 9(8), 1735–1780 (1997)
Huang, J., Qu, L., Jia, R., Zhao, B.: O2u-net: A simple noisy label detection approach for deep neural networks. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 3326–3334. (2019)
Huang, J., Hu, W., Bao, Z., Chen, Q., Qu, Y.: Deep entity matching with adversarial active learning. VLDB J. 32(1), 229–255 (2023)
Huang, Z., Li, X., Deng, L., Wei, K., Sui, Y.: Mislabeled samples adjustment based on self-paced learning framework. In: 2021 7th International Conference on Computer and Communications (ICCC), IEEE, pp. 1659–1659. (2021)
Hurakadli, V., Kulkarni, S., Patil, U., Tabib, R., Mudengudi, U.: Deep learning based radial blur estimation and image enhancement. In: 2019 IEEE International Conference on Electronics, Computing and Communication Technologies (CONECCT), IEEE, pp. 1–5 (2019)
Hwang, P., Kim, Y.: Data cleaning of sound data with label noise using self organizing map. In: 2022 16th International Conference on Ubiquitous Information Management and Communication (IMCOM), pp 1–5. https://doi.org/10.1109/IMCOM53663.2022.9721724 (2022)
Ilyas, I., Chu, X.: Data Cleaning. Association for Computing Machinery and Morgan & Claypool Publishers. https://books.google.ca/books?id=RxieDwAAQBAJ (2019).
Ilyas, I.F., Rekatsinas, T.: Machine learning and data cleaning: Which serves the other? J. Data Inf. Qual. 14(3), 1–11 (2022). https://doi.org/10.1145/3506712
Indyk, P., Motwani, R.: Approximate nearest neighbors: Towards removing the curse of dimensionality. In: Proceedings of the Thirtieth Annual ACM Symposium on Theory of Computing, Association for Computing Machinery, New York, NY, USA, STOC ’98, pp. 604–613. https://doi.org/10.1145/276698.276876 (1998)
Jiang, W., Ge, Y., Cheng, H., Chen, M., Feng, S., Wang, C.: Read: aggregating reconstruction error into out-of-distribution detection. Proc. AAAI Conf. Artif. Intell. 37, 14910–14918 (2023)
Jin, D., Sisman, B., Wei, H., Dong, X.L., Koutra, D.: Deep transfer learning for multi-source entity linkage via domain adaptation. arXiv preprint arXiv:2110.14509 (2021)
Johnson, J., Douze, M., Jégou, H.: Billion-scale similarity search with GPUs. IEEE Trans. Big Data 7(3), 535–547 (2019)
Johnson, J.M., Khoshgoftaar, T.M.: A survey on classifying big data with label noise. ACM J. Data Inf. Qual. 14(4), 1–43 (2022)
Kang, Z., Catal, C., Tekinerdogan, B.: Machine learning applications in production lines: a systematic literature review. Comput. Ind. Eng. 149, 106773 (2020). https://doi.org/10.1016/j.cie.2020.106773
Karimi, D., Dou, H., Warfield, S.K., Gholipour, A.: Deep learning with noisy labels: exploring techniques and remedies in medical image analysis. Med. Image Anal. 65, 101759 (2020)
Karlaš, B., Li, P., Wu, R., Gürel, N.M., Chu, X., Wu, W., Zhang, C.: Nearest neighbor classifiers over incomplete information: from certain answers to certain predictions. arXiv preprint arXiv:2005.05117 (2020)
Kasai, J., Qian, K., Gurajada, S., Li, Y., Popa, L.: Low-resource deep entity resolution with transfer and active learning. arXiv preprint arXiv:1906.08042 (2019)
Ke, X., Bai, J., Wen, L., Cao, B.: Multi-index dialogue data cleaning model. In: 2019 IEEE 8th Joint International Information Technology and Artificial Intelligence Conference (ITAIC), IEEE, pp. 672–676. (2019)
Kim, J., Scott, C.D.: Robust kernel density estimation. J. Mach. Learn. Res. 13(1), 2529–2565 (2012)
Kitchenham, B.: Procedures for performing systematic reviews. Keele UK Keele Univ. 33(2004), 1–26 (2004)
Klie, J.C., Webber, B., Gurevych, I.: Annotation error detection: Analyzing the past and present for a more coherent future. Comput. Linguist. pp. 1–42 (2022)
Knill, K.M., Gales, M.J., Manakul, P., Caines, A.: Automatic grammatical error detection of non-native spoken learner english. In: ICASSP 2019–2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, pp. 8127–8131. (2019)
Koh, P.W., Liang, P.: Understanding black-box predictions via influence functions. In: International Conference on Machine Learning, PMLR, pp. 1885–1894 (2017)
Köhler, J.M., Autenrieth, M., Beluch, W.H.: Uncertainty based detection and relabeling of noisy image labels. In: CVPR Workshops, pp. 33–37. (2019)
Kriegel, H.P., Kröger, P., Schubert, E., Zimek, A.: Outlier detection in axis-parallel subspaces of high dimensional data. In: Advances in Knowledge Discovery and Data Mining: 13th Pacific-Asia Conference, PAKDD 2009 Bangkok, Thailand, 2009 Proceedings 13, Springer, pp. 831–838. (2009)
Krishnan, S., Wu, E.: Alphaclean: Automatic generation of data cleaning pipelines. arXiv preprint arXiv:1904.11827 (2019)
Krishnan, S., Wang, J., Wu, E., Franklin, M.J., Goldberg, K.: Activeclean: interactive data cleaning for statistical modeling. Proc. VLDB Endow. 9(12), 948–959 (2016)
Krishnan, S., Franklin, M.J., Goldberg, K., Wu, E.: Boostclean: Automated error detection and repair for machine learning. arXiv preprint arXiv:1711.01299 (2017)
Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classification with deep convolutional neural networks. Commun. ACM 60(6), 84–90 (2017)
Lakshminarayan, K., Harp, S.A., Samad, T.: Imputation of missing data in industrial databases. Appl. Intell. 11(3), 259–275 (1999)
Lakshminarayanan, B., Pritzel, A., Blundell, C.: Simple and scalable predictive uncertainty estimation using deep ensembles. https://doi.org/10.48550/ARXIV.1612.01474, arXiv:1612.01474 (2016)
Lattar, H., Salem, A.B., Ghezala, H.H.B.: Does data cleaning improve heart disease prediction? Proc. Comput. Sci. 176, 1131–1140 (2020)
Berti-Equille, L., Bonifati, A., Milo, T.: Machine learning to data management: A round trip. In: 2018 IEEE 34th International Conference on Data Engineering (ICDE), IEEE, pp. 1735–1738. (2018)
Lee, K.H., He, X., Zhang, L., Yang, L.: Cleannet: Transfer learning for scalable image classifier training with label noise. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5447–5456. (2018)
Lew, A., Agrawal, M., Sontag, D., Mansinghka, V.: Pclean: Bayesian data cleaning at scale with domain-specific probabilistic programming. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp. 1927–1935. (2021)
Li, B., Wang, W., Sun, Y., Zhang, L., Ali, M.A., Wang, Y.: Grapher: token-centric entity resolution with graph convolutional neural networks. Proc. AAAI Conf. Artif. Intell. 34, 8172–8179 (2020)
Li, P., Rao, X., Blase, J., Zhang, Y., Chu, X., Zhang, C.: Cleanml: A benchmark for joint data cleaning and machine learning [experiments and analysis], p 75. arXiv preprint arXiv:1904.09483 (2019)
Li, Y., Li, J., Suhara, Y., Doan, A., Tan, W.C.: Deep entity matching with pre-trained language models. arXiv preprint arXiv:2004.00584 (2020b)
Li, Z., Du, W., Rao, N.: Research on error label screening method based on convolutional neural network. In: 2021 IEEE 6th International Conference on Signal and Image Processing (ICSIP), IEEE, pp 1020–1024. (2021)
Liang, Q., Sun, Z., Zhu, Q., Hu, J., Zhao, Y., Zhang, L.: Cupcleaner: A data cleaning approach for comment updating. arXiv preprint arXiv:2308.06898 (2023)
Liebchen, G., Shepperd, M.: Data sets and data quality in software engineering: Eight years on. In: Proceedings of the The 12th International Conference on Predictive Models and Data Analytics in Software Engineering, Association for Computing Machinery, New York, NY, USA, PROMISE 2016. https://doi.org/10.1145/2972958.2972967 (2016)
Liebchen, G.A., Shepperd, M.: Data sets and data quality in software engineering. In: Proceedings of the 4th International Workshop on Predictor Models in Software Engineering, pp 39–44. (2008)
Lim, S., Kim, I., Kim, T., Kim, C., Kim, S.: Fast autoaugment. Adv. Neural Inf. Process. Syst. 32, (2019)
Lin, W.C., Tsai, C.F.: Missing value imputation: a review and analysis of the literature (2006–2017). Artif. Intell. Rev. 53, 1487–1509 (2020)
Liu, D., Meng, Y., Wang, L.: Data cleaning of irrelevant images based on transfer learning. In: 2020 International Conference on Intelligent Computing, Automation and Systems (ICICAS), pp. 450–456. IEEE, (2020)
Liu, F.T., Ting, K.M., Zhou, Z.H.: Isolation forest. In: 2008 Eighth IEEE International Conference on Data Mining, IEEE, pp. 413–422 (2008)
Liu, Y., Li, Z., Zhou, C., Jiang, Y., Sun, J., Wang, M., He, X.: Generative adversarial active learning for unsupervised outlier detection. IEEE Trans. Knowl. Data Eng. 32(8), 1517–1528 (2019)
Liu, Z., Zhou, Z., Rekatsinas, T.: Picket: guarding against corrupted data in tabular data during learning and inference. VLDB J. pp. 1–29 (2022)
Mahdavi, M., Abedjan, Z.: Baran: effective error correction via a unified context representation and transfer learning. Proc. VLDB Endow. 13(12), 1948–1961 (2020)
Mahdavi, M., Abedjan, Z.: Semi-supervised data cleaning with raha and baran. In: CIDR, (2021)
Mahdavi, M., Abedjan, Z., Castro Fernandez, R., Madden, S., Ouzzani, M., Stonebraker, M., Tang, N.: Raha: A configuration-free error detection system. In: Proceedings of the 2019 International Conference on Management of Data, pp. 865–882. (2019)
Marsland, S., Shapiro, J., Nehmzow, U.: A self-organising network that grows when required. Neural Netw. 15(8–9), 1041–1058 (2002)
Martínez-Fernández, S., Bogner, J., Franch, X., Oriol, M., Siebert, J., Trendowicz, A., Vollmer, A.M., Wagner, S.: Software engineering for AI-based systems: a survey. ACM Trans. Softw. Eng. Methodol. 31(2), 1–59 (2022). https://doi.org/10.1145/3487043
Mauritz, R., Nijweide, F., Goseling, J., van Keulen, M.: A probabilistic database approach to autoencoder-based data cleaning. arXiv preprint arXiv:2106.09764 (2021)
Mayfield, C., Neville, J., Prabhakar, S.: Eracer: a database approach for statistical inference and data cleaning. In: Proceedings of the 2010 ACM SIGMOD International Conference on Management of data, pp 75–86. (2010)
Mazumder, M., Banbury, C., Yao, X., Karlaš, B., Rojas, W.G., Diamos, S., Diamos, G., He, L., Parrish, A., Kirk, H.R., et al.: Dataperf: Benchmarks for data-centric AI development. arXiv preprint arXiv:2207.10062 (2022)
Meduri, V.V., Popa, L., Sen, P., Sarwat, M.: A comprehensive benchmark framework for active learning methods in entity matching. In: Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data, pp. 1133–1147. (2020)
Miao, Z., Li, Y., Wang, X.: Rotom: A meta-learned data augmentation framework for entity matching, data cleaning, text classification, and beyond. In: Proceedings of the 2021 International Conference on Management of Data, pp. 1303–1316. (2021)
Motulsky, H.J., Brown, R.E.: Detecting outliers when fitting data with nonlinear regression-a new method based on robust nonlinear regression and the false discovery rate. BMC Bioinform. 7(1), 1–20 (2006)
Mudgal, S., Li, H., Rekatsinas, T., Doan, A., Park, Y., Krishnan, G., Deep, R., Arcaute, E., Raghavendra, V.: Deep learning for entity matching: A design space exploration. In: Proceedings of the 2018 International Conference on Management of Data, pp 19–34, (2018)
Müller, H., Castelo, S., Qazi, M., Freire, J.: From papers to practice: the openclean open-source data cleaning library. Proc. VLDB Endow. 14(12), 2763–2766 (2021)
Narayan, A., Chami, I., Orr, L., Ré, C.: Can foundation models wrangle your data? (2022). arXiv preprint arXiv:2205.09911
Nashaat, M., Ghosh, A., Miller, J., Quader, S.: Tabreformer: unsupervised representation learning for erroneous data detection. ACM/IMS Trans. Data Sci. 2(3), 1–29 (2021)
Nassif, A.B., Talib, M.A., Nasir, Q., Dakalbab, F.M.: Machine learning for anomaly detection: a systematic review. IEEE Access 9, 78658–78700 (2021)
Neutatz, F., Mahdavi, M., Abedjan, Z.: Ed2: two-stage active learning for error detection–technical report. arXiv preprint arXiv:1908.06309 (2019)
Neutatz, F., Chen, B., Abedjan, Z., Wu, E.: From cleaning before ML to cleaning for ML. IEEE Data Eng. Bull. 44(1), 24–41 (2021)
Ng, A.: A chat with Andrew on MLOps: from model-centric to data-centric AI. https://www.youtube.com/watch?v=06-AZXmwHjo&ab_channel=DeepLearningAI (2021)
Ng, A., He, L., Laird, D.: Data-centric AI competition. https://https-deeplearning-ai.github.io/data-centric-comp/ (2021)
Nie, H., Han, X., He, B., Sun, L., Chen, B., Zhang, W., Wu, S., Kong, H.: Deep sequence-to-sequence entity matching for heterogeneous entity resolution. In: Proceedings of the 28th ACM International Conference on Information and Knowledge Management, pp. 629–638. (2019)
Northcutt, C.G., Jiang, L., Chuang, I.L.: Confident learning: Estimating uncertainty in dataset labels. https://doi.org/10.48550/ARXIV.1911.00068, arXiv:1911.00068 (2019)
Oliveira, P.H., Kaster, D.S., Ilyas, I.F., et al.: Batchwise probabilistic incremental data cleaning. arXiv preprint arXiv:2011.04730 (2020)
OpenAI (2023) https://openai.com/research/gpt-4
Pang, G., Shen, C., Cao, L., Hengel, A.V.D.: Deep learning for anomaly detection: a review. ACM Comput. Surv. (CSUR) 54(2), 1–38 (2021)
Papastefanopoulos, V., Linardatos, P., Kotsiantis, S.: Unsupervised outlier detection: a meta-learning algorithm based on feature selection. Electronics 10(18), 2236 (2021)
Patel, H., Gupta, N., Panwar, N., Sharma Mittal, R., Mehta, S., Guttula, S., Mujumdar, S., Afzal, S., Bedathur, S., Munigala, V.: Automatic assessment of quality of your data for AI. In: Proceedings of the 5th Joint International Conference on Data Science & Management of Data (9th ACM IKDD CODS and 27th COMAD), Association for Computing Machinery, New York, NY, USA, CODS-COMAD ’22, pp. 354–357. (2022). https://doi.org/10.1145/3493700.3493774
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., Duchesnay, E.: Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011)
Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pp. 1532–1543. (2014)
Pham, M., Knoblock, C.A., Chen, M., Vu, B., Pujara, J.: Spade: a semi-supervised probabilistic approach for detecting errors in tables. In: IJCAI, pp. 3543–3551. (2021)
Pise, N.N., Kulkarni, P.: A survey of semi-supervised learning methods. In: 2008 International Conference on Computational Intelligence and Security, IEEE, vol. 2, pp. 30–34. (2008)
Pit-Claudel, C., Mariet, Z., Harding, R., Madden, S.: Outlier detection in heterogeneous datasets using automatic tuple expansion. Tech. rep., MIT—Computer Science and Artificial Intelligence Laboratory (MIT-CSAIL-TR-2016-002). (2016)
Ponzio, F., Macii, E., Ficarra, E., Di Cataldo, S.: W2wnet: a two-module probabilistic convolutional neural network with embedded data cleansing functionality. arXiv preprint arXiv:2103.13107 (2021)
Pouyanfar, S., Sadiq, S., Yan, Y., Tian, H., Tao, Y., Reyes, M.P., Shyu, M.L., Chen, S.C., Iyengar, S.S.: A survey on deep learning: algorithms, techniques, and applications. ACM Comput. Surv. (CSUR) 51(5), 1–36 (2018)
Press, G.: Cleaning big data: Most time-consuming, least enjoyable data science task, survey says. https://www.forbes.com/sites/gilpress/2016/03/23/data-preparation-most-time-consuming-least-enjoyable-data-science-task-survey-says/?sh=4c577cb46f63 (2022)
Qian, K., Popa, L., Sen, P.: Active learning for large-scale entity resolution. In: Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, pp. 1379–1388. (2017)
Rahm, E., Do, H.H., et al.: Data cleaning: problems and current approaches. IEEE Data Eng. Bull. 23(4), 3–13 (2000)
Razavi-Far, R., Cheng, B., Saif, M., Ahmadi, M.: Similarity-learning information-fusion schemes for missing data imputation. Knowl. Based Syst. 187, 104805 (2020)
Rehbein, I., Ruppenhofer, J.: Detecting annotation noise in automatically labelled data. In: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Vol. 1: Long Papers), pp. 1160–1170. (2017)
Rei, M., Yannakoudakis, H.: Compositional sequence labeling models for error detection in learner writing. In: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Vol. 1: Long Papers), Association for Computational Linguistics, Berlin, Germany, pp. 1181–1191. https://doi.org/10.18653/v1/P16-1112, https://aclanthology.org/P16-1112 (2016)
Rekatsinas, T., Chu, X., Ilyas, I.F., Ré, C.: Holoclean: Holistic data repairs with probabilistic inference. arXiv preprint arXiv:1702.00820 (2017)
Ren, M., Zeng, W., Yang, B., Urtasun, R.: Learning to reweight examples for robust deep learning. In: International Conference on Machine Learning, PMLR, pp. 4334–4343. (2018)
Roh, Y., Heo, G., Whang, S.E.: A survey on data collection for machine learning: a big data-AI integration perspective. IEEE Trans. Knowl. Data Eng. 33(4), 1328–1347 (2019)
Rosner, B.: Percentage points for a generalized ESD many-outlier procedure. Technometrics 25(2), 165–172 (1983)
Rottmann, M., Reese, M.: Automated detection of label errors in semantic segmentation datasets via deep learning and uncertainty quantification. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 3214–3223. (2023)
Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., et al.: Imagenet large scale visual recognition challenge. Int. J. Comput. Vis. 115, 211–252 (2015)
Salekshahrezaee, Z., Leevy, J.L., Khoshgoftaar, T.M.: A reconstruction error-based framework for label noise detection. J. Big Data 8, 1–16 (2021)
Sambasivan, N., Kapania, S., Highfill, H., Akrong, D., Paritosh, P., Aroyo, L.M.: “Everyone wants to do the model work, not the data work”: data cascades in high-stakes AI. In: Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems, pp. 1–15. (2021)
Santos, E.A., Campbell, J.C., Hindle, A., Amaral, J.N.: Finding and correcting syntax errors using recurrent neural networks. PeerJ PrePrints 5, e3123v1 (2017)
Sarker, I.H.: Deep learning: a comprehensive overview on techniques, taxonomy, applications and research directions. SN Comput. Sci. 2(6), 1–20 (2021)
Schölkopf, B., Williamson, R.C., Smola, A., Shawe-Taylor, J., Platt, J.: Support vector method for novelty detection. Adv. Neural Inf. Process. Syst. 12, (1999)
Shi, J., Wu, J.: Distilling effective supervision for robust medical image segmentation with noisy labels. In: Medical Image Computing and Computer Assisted Intervention–MICCAI 2021: 24th International Conference, Strasbourg, France, 2021, Proceedings, Part I 24, Springer, pp. 668–677. (2021)
Shi, L., Mu, F., Chen, X., Wang, S., Wang, J., Yang, Y., Li, G., Xia, X., Wang, Q.: Are we building on the rock? On the importance of data preprocessing for code summarization. In: Proceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, pp. 107–119. (2022)
Silva-Ramírez, E.L., Cabrera-Sánchez, J.F.: Co-active neuro-fuzzy inference system model as single imputation approach for non-monotone pattern of missing data. Neural Comput. Appl. 33, 8981–9004 (2021)
Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014)
Smyth, L.: Training-Valuenet: A New Approach for Label Cleaning on Weakly-Supervised Datasets. University of Exeter, (2020)
Song, H., Kim, M., Park, D., Shin, Y., Lee, J.G.: Learning from noisy labels with deep neural networks: a survey. IEEE Trans. Neural Netw. Learn. Syst. 34(11), 8135–8153 (2023). https://doi.org/10.1109/TNNLS.2022.3152527
Spithourakis, G.P., Augenstein, I., Riedel, S.: Numerically grounded language models for semantic error correction. arXiv preprint arXiv:1608.04147 (2016)
Studer, S., Bui, T.B., Drescher, C., Hanuschkin, A., Winkler, L., Peters, S., Müller, K.R.: Towards crisp-ml (q): a machine learning process model with quality assurance methodology. Mach. Learn. Knowl. Extr. 3(2), 392–413 (2021)
Su, J., Gao, X., Qin, Y., Guo, S.: Correcting corrupted labels using mode dropping of ACGAN. In: 2021 15th International Symposium on Medical Information and Communication Technology (ISMICT), IEEE, pp. 98–103. (2021)
Surameery, N.M.S., Shakor, M.Y.: Use chat gpt to solve programming bugs. Int. J. Inf. Technol. Comput. Eng. (IJITC) 3(01), 17–22 (2023)
Suzuki, K., Kobayashi, Y., Narihira, T.: Data cleansing for deep neural networks with storage-efficient approximation of influence functions. arXiv preprint arXiv:2103.11807 (2021)
Tae, K.H., Roh, Y., Oh, Y.H., Kim, H., Whang, S.E.: Data cleaning for accurate, fair, and robust models: A big data-AI integration approach. In: Proceedings of the 3rd International Workshop on Data Management for End-to-End Machine Learning, pp. 1–4. (2019)
Tambon, F., Laberge, G., An, L., Nikanjam, A., Mindom, P.S.N., Pequignot, Y., Khomh, F., Antoniol, G., Merlo, E., Laviolette, F.: How to certify machine learning based safety-critical systems? A systematic literature review. Autom. Softw. Eng. 29(2), 1–74 (2022)
Tang, N., Fan, J., Li, F., Tu, J., Du, X., Li, G., Madden, S., Ouzzani, M.: Relational pretrained transformers towards democratizing data preparation [vision]. arXiv preprint arXiv:2012.02469 (2020)
Tawfik, N.S., Spruit, M.R.: Evaluating sentence representations for biomedical text: methods and experimental results. J. Biomed. Inform. 104, 103396 (2020)
Team, S.: Data-centric AI for the enterprise (2024). https://snorkel.ai/#
Terrades, O.R., Berenguel, A., Gil, D.: A flexible outlier detector based on a topology given by graph communities. Big Data Res. 29, 100332 (2022)
Teso, S., Bontempelli, A., Giunchiglia, F., Passerini, A.: Interactive label cleaning with example-based explanations. Adv. Neural Inf. Process. Syst. 34, 12966–12977 (2021)
Tfwala, S.S., Wang, Y.M., Lin, Y.C., et al.: Prediction of missing flow records using multilayer perceptron and coactive neurofuzzy inference system. Sci. World J. (2013)
Thekumparampil, K.K., Khetan, A., Lin, Z., Oh, S.: Robustness of conditional GANs to noisy labels. Adv. Neural Inf. Process. Syst. 31 (2018)
Thirumuruganathan, S., Tang, N., Ouzzani, M., Doan, A.: Data curation with deep learning. In: EDBT, pp. 277–286. (2020)
Tonolini, F., Moreno, P.G., Damianou, A., Murray-Smith, R.: Tomographic auto-encoder: unsupervised Bayesian recovery of corrupted data. arXiv preprint arXiv:2006.16938 (2020)
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Adv. Neural Inf. Process. Syst. 30, (2017a)
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., Polosukhin, I.: Attention is all you need. CoRR abs/1706.03762, arXiv:1706.03762 (2017b)
Veit, A., Alldrin, N., Chechik, G., Krasin, I., Gupta, A., Belongie, S.: Learning from noisy large-scale datasets with minimal supervision. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 839–847. (2017)
Visengeriyeva, L., Abedjan, Z.: Metadata-driven error detection. In: Proceedings of the 30th International Conference on Scientific and Statistical Database Management, pp. 1–12. (2018)
Visengeriyeva, L., Akbik, A., Kaul, M., Rabl, T., Markl, V.: Improving data quality by leveraging statistical relational learning. In: ICIQ, pp. 220–236. (2016)
Wang, H., Bah, M.J., Hammad, M.: Progress in outlier detection techniques: a survey. IEEE Access 7, 107964–108000 (2019). https://doi.org/10.1109/ACCESS.2019.2932769
Wang, Q., Tan, Y.: Grammatical error detection with self attention by pairwise training. In: 2020 International Joint Conference on Neural Networks (IJCNN), IEEE, pp. 1–7. (2020)
Wang, R., Li, Y., Wang, J.: Sudowoodo: Contrastive self-supervised learning for multi-purpose data integration and preparation. arXiv preprint arXiv:2207.04122 (2022)
Wang, X., Wang, C.: Time series data cleaning: a survey. IEEE Access 8, 1866–1881 (2019)
Wang, Z., Sisman, B., Wei, H., Dong, X.L., Ji, S.: Cordel: a contrastive deep learning approach for entity linkage. In: 2020 IEEE International Conference on Data Mining (ICDM), IEEE, pp. 1322–1327. (2020)
Wei, J., Zou, K.: Eda: Easy data augmentation techniques for boosting performance on text classification tasks. arXiv preprint arXiv:1901.11196 (2019)
Whang, S.E., Roh, Y., Song, H., Lee, J.G.: Data collection and quality challenges in deep learning: a data-centric AI perspective. arXiv preprint arXiv:2112.06409 (2021)
Whang, S.E., Roh, Y., Song, H., Lee, J.G.: Data collection and quality challenges in deep learning: a data-centric AI perspective. VLDB J. 32(4), 791–813 (2023)
White, J., Fu, Q., Hays, S., Sandborn, M., Olea, C., Gilbert, H., Elnashar, A., Spencer-Smith, J., Schmidt, D.C.: A prompt pattern catalog to enhance prompt engineering with chatgpt. arXiv preprint arXiv:2302.11382 (2023)
Wikipedia (2023a) https://en.wikipedia.org/wiki/Machine_learning
Wikipedia (2023b) https://en.wikipedia.org/wiki/Imputation_(statistics)
Wikipedia (2023c) Active learning (machine learning). https://en.wikipedia.org/wiki/Active_learning_(machine_learning)
Wikipedia (2023d) Boosting (machine learning). https://en.wikipedia.org/wiki/Boosting_(machine_learning)
Wikipedia (2023e) Transfer learning. https://en.wikipedia.org/wiki/Transfer_learning
Wohlin, C.: Guidelines for snowballing in systematic literature studies and a replication in software engineering. In: Proceedings of the 18th International Conference on Evaluation and Assessment in Software Engineering, pp. 1–10. (2014)
Wong, W.E., Gao, R., Li, Y., Abreu, R., Wotawa, F.: A survey on software fault localization. IEEE Trans. Softw. Eng. 42(8), 707–740 (2016). https://doi.org/10.1109/TSE.2016.2521368
Wu, R., Chaba, S., Sawlani, S., Chu, X., Thirumuruganathan, S.: Zeroer: Entity resolution using zero labeled examples. In: Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data, pp. 1149–1164. (2020)
Wu, Y., Weimer, J., Davidson, S.B.: Chef: a cheap and fast pipeline for iteratively cleaning label uncertainties (technical report). arXiv preprint arXiv:2107.08588 (2021)
Xiang, S., Ye, X., Xia, J., Wu, J., Chen, Y., Liu, S.: Interactive correction of mislabeled training data. In: 2019 IEEE Conference on Visual Analytics Science and Technology (VAST), IEEE, pp. 57–68. (2019)
Yu, Q., Aizawa, K.: Unknown class label cleaning for learning with open-set noisy labels. In: 2020 IEEE International Conference on Image Processing (ICIP), IEEE, pp. 1731–1735. (2020)
Zha, D., Bhat, Z.P., Lai, K.H., Yang, F., Jiang, Z., Zhong, S., Hu, X.: Data-centric artificial intelligence: a survey. arXiv preprint arXiv:2303.10158 (2023)
Zhang, A., Song, S., Wang, J., Yu, P.S.: Time series data cleaning: From anomaly detection to anomaly repairing (technical report). arXiv preprint arXiv:2003.12396 (2020a)
Zhang, H., Cisse, M., Dauphin, Y.N., Lopez-Paz, D.: mixup: Beyond empirical risk minimization. arXiv preprint arXiv:1710.09412 (2017)
Zhang, Q., Fang, C., Ma, Y., Sun, W., Chen, Z.: A survey of learning-based automated program repair. ACM Trans. Softw. Eng. Methodol. 33(2), 1–69 (2023). https://doi.org/10.1145/3631974
Zhang, W., Tan, X.: Combining outlier detection and reconstruction error minimization for label noise reduction. In: 2019 IEEE International Conference on Big Data and Smart Computing (BigComp), IEEE, pp. 1–4. (2019)
Zhang, W., Wang, D., Tan, X.: Data cleaning and classification in the presence of label noise with class-specific autoencoder. In: International Symposium on Neural Networks, Springer, pp. 256–264. (2018a)
Zhang, W., Wei, H., Sisman, B., Dong, X.L., Faloutsos, C., Page, D.: Autoblock: A hands-off blocking framework for entity matching. In: Proceedings of the 13th International Conference on Web Search and Data Mining, pp. 744–752. (2020b)
Zhang, X., Ji, Y., Nguyen, C., Wang, T.: Deepclean: data cleaning via question asking. In: 2018 IEEE 5th International Conference on Data Science and Advanced Analytics (DSAA), IEEE, pp. 283–292. (2018b)
Zhang, X., Zhu, X., Wright, S.: Training set debugging using trusted items. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32. (2018c)
Zhang, Y., Zheng, S., Dalirrooyfard, M., Wu, P., Schneider, A., Raj, A., Nevmyvaka, Y., Chen, C.: Learning to abstain from uninformative data. arXiv preprint arXiv:2309.14240 (2023b)
Zhao, C., He, Y.: Auto-em: end-to-end fuzzy entity-matching using pre-trained deep models and transfer learning. In: The World Wide Web Conference, pp. 2413–2424. (2019)
Zhao, Z., Liu, H.: Spectral feature selection for supervised and unsupervised learning. In: Proceedings of the 24th International Conference on Machine Learning, pp. 1151–1157. (2007)
Zhou, X., Jin, Y., Zhang, H., Li, S., Huang, X.: A map of threats to validity of systematic literature reviews in software engineering. In: 2016 23rd Asia-Pacific Software Engineering Conference (APSEC), IEEE, pp. 153–160. (2016)
Zhou, X., Liu, X., Wang, C., Zhai, D., Jiang, J., Ji, X.: Learning with noisy labels via sparse regularization. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 72–81. (2021)
Zhu, X., Ghahramani, Z.: Learning from Labeled and Unlabeled Data with Label Propagation. (2002)

Acknowledgements

This work is funded by the Fonds de Recherche du Quebec (FRQ), the Canadian Institute for Advanced Research (CIFAR), and the Natural Sciences and Engineering Research Council of Canada (NSERC). We would like to thank Dr. Hyacinth Ali for contributing to improving this SLR with his valuable comments.

Author information

Authors and Affiliations

Polytechnique Montréal, Québec, Canada
Pierre-Olivier Côté, Amin Nikanjam, Nafisa Ahmed, Dmytro Humeniuk & Foutse Khomh

Contributions

Conceptualization: P-O.C., A.N., F.K.; Paper collection: P-O.C.; Paper selection: P-O.C., A.N.; Data Extraction: P-O.C. (74 papers), N.A. (17 papers), D.H. (6 papers), A.N. (2 papers); Writing - original draft preparation: P-O.C., N.A., D.H., A.N.; Writing - review and editing: A.N., P-O.C., F.K., D.H.; Funding acquisition: A.N., F.K.; Project administration: P-O.C.; Supervision: A.N., F.K.; Validation: A.N., P-O.C.; Software: P-O.C.; Resource: P-O.C.

Corresponding author

Correspondence to Pierre-Olivier Côté.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Côté, PO., Nikanjam, A., Ahmed, N. et al. Data cleaning and machine learning: a systematic literature review. Autom Softw Eng 31, 54 (2024). https://doi.org/10.1007/s10515-024-00453-w

Received: 03 October 2023
Accepted: 29 May 2024
Published: 11 June 2024
DOI: https://doi.org/10.1007/s10515-024-00453-w
