
Data cleaning and machine learning: a systematic literature review

Automated Software Engineering

Abstract

Machine Learning (ML) is integrated into a growing number of systems for various applications. Because the performance of an ML model depends heavily on the quality of the data it was trained on, there is growing interest in approaches that detect and repair data errors (i.e., data cleaning). Researchers are also exploring how ML can be used for data cleaning, which creates a dual relationship between ML and data cleaning. To the best of our knowledge, no study comprehensively reviews this relationship. This paper's objectives are twofold. First, it summarizes the latest approaches for data cleaning for ML and ML for data cleaning. Second, it provides recommendations for future work. We conduct a systematic literature review of papers published between 2016 and 2022, inclusive. We identify different types of data cleaning activities with and for ML: feature cleaning, label cleaning, entity matching, outlier detection, imputation, and holistic data cleaning. We summarize the content of 101 papers covering various data cleaning activities and provide 24 recommendations for future work. Our review highlights many promising data cleaning techniques that can be further extended. We believe that our review of the literature will help the community develop better approaches to clean data.



Data availability

All data generated or analyzed during this study are available in the GitHub repository to help reproduce our results (Côté et al. 2023).

Notes

  1. For example, "A modular edge-/cloud-solution for automated error detection of industrial hairpin weldings using convolutional neural networks" is a paper that was included in the results but not relevant to our study since it is not a data cleaning approach.

  2. https://scholar.google.com/.

  3. https://www.engineeringvillage.com.

  4. https://webofknowledge.com.

  5. https://www.sciencedirect.com.

  6. https://dl.acm.org.

  7. https://ieeexplore.ieee.org.

  8. Harzing, A.W. (2007) Publish or Perish, available from https://harzing.com/resources/publish-or-perish.

  9. https://www.zotero.org/.

  10. https://openai.com/blog/chatgpt.

References

  • (2022) Common problems. https://developers.google.com/machine-learning/gan/problems

  • (2023) https://www.cnet.com/tech/chatgpt-can-pass-the-bar-exam-does-that-actually-matter/

  • Abedjan, Z., Chu, X., Deng, D., Fernandez, R.C., Ilyas, I.F., Ouzzani, M., Papotti, P., Stonebraker, M., Tang, N.: Detecting data errors: Where are we and what needs to be done? Proc. VLDB Endow. 9(12), 993–1004 (2016)

  • Abidin, N.Z., Ismail, A.R., Emran, N.A.: Performance analysis of machine learning algorithms for missing value imputation. Int. J. Adv. Comput. Sci. Appl. 9(6) (2018)

  • Abiteboul, S., Hull, R., Vianu, V.: Foundations of Databases, vol. 8. Addison-Wesley Reading, Delhi (1995)

  • Adhikari, D., Jiang, W., Zhan, J., He, Z., Rawat, D.B., Aickelin, U., Khorshidi, H.A.: A comprehensive survey on imputation of missing data in internet of things. ACM Comput. Surv. 55(7), 1–38 (2022)

  • Aggarwal, C.C., Reddy, C.K.: Data clustering: algorithms and applications (2013)

  • Agrawal, A., Chatterjee, R., Curino, C., Floratou, A., Gowdal, N., Interlandi, M., Jindal, A., Karanasos, K., Krishnan, S., Kroth, B., et al.: Cloudy with high chance of DBMS: a 10-year prediction for enterprise-grade ML. arXiv preprint arXiv:1909.00084 (2019)

  • Akouemo, H.N., Povinelli, R.J.: Data improving in time series using ARX and ANN models. IEEE Trans. Power Syst. 32(5), 3352–3359 (2017)

  • Alimohammadi, H., Chen, S.N.: Performance evaluation of outlier detection techniques in production timeseries: A systematic review and meta-analysis. Expert Syst. Appl. 191, 116371 (2022)

  • Alsolai, H., Roper, M.: A systematic literature review of machine learning techniques for software maintainability prediction. Inf. Softw. Technol. 119, 106214 (2020). https://doi.org/10.1016/j.infsof.2019.106214

  • Dempster, A.P.: Upper and lower probabilities induced by a multivalued mapping. Ann. Math. Stat. 38(2), 325–339 (1967)

  • Araci, D.: Finbert: financial sentiment analysis with pre-trained language models. arXiv preprint arXiv:1908.10063 (2019)

  • Ataeyan, M., Daneshpour, N.: A novel data repairing approach based on constraints and ensemble learning. Expert Syst. Appl. 159, 113511 (2020)

  • Atkinson, G., Metsis, V.: Identifying label noise in time-series datasets. In: Adjunct Proceedings of the 2020 ACM International Joint Conference on Pervasive and Ubiquitous Computing and Proceedings of the 2020 ACM International Symposium on Wearable Computers, pp. 238–243 (2020)

  • Atkinson, G., Metsis, V.: Tsar: a time series assisted relabeling tool for reducing label noise. In: The 14th PErvasive Technologies Related to Assistive Environments Conference, pp. 203–209 (2021)

  • Azeem, M.I., Palomba, F., Shi, L., Wang, Q.: Machine learning techniques for code smell detection: a systematic literature review and meta-analysis. Inf. Softw. Technol. 108, 115–138 (2019). https://doi.org/10.1016/j.infsof.2018.12.009

  • Bach, S.H., Broecheler, M., Huang, B., Getoor, L.: Hinge-loss Markov random fields and probabilistic soft logic. J. Mach. Learn. Res. 18, 1–67 (2017)

  • Badue, C., Guidolini, R., Carneiro, R.V., Azevedo, P., Cardoso, V.B., Forechi, A., Jesus, L., Berriel, R., Paixao, T.M., Mutz, F., et al.: Self-driving cars: a survey. Expert Syst. Appl. 165, 113816 (2021)

  • Bagherzadeh, P., Sadoghi Yazdi, H.: Label denoising based on Bayesian aggregation. Int. J. Mach. Learn. Cybern. 8, 903–914 (2017)

  • Bank, D., Koenigstein, N., Giryes, R.: Autoencoders. arXiv preprint arXiv:2003.05991 (2020)

  • Barlaug, N., Gulla, J.A.: Neural networks for entity matching: a survey. ACM Trans. Knowl. Discov. Data (TKDD) 15(3), 1–37 (2021)

  • Beltagy, I., Lo, K., Cohan, A.: Scibert: a pretrained language model for scientific text. arXiv preprint arXiv:1903.10676 (2019)

  • Ben-Gal, I.: Outlier detection in: data mining and knowledge discovery handbook: A complete guide for practitioners and researchers (2005)

  • Bergstra, J., Yamins, D., Cox, D.D., et al.: Hyperopt: a python library for optimizing the hyperparameters of machine learning algorithms. In: Proceedings of the 12th Python in science conference, Citeseer, vol. 13, p. 20 (2013)

  • Bernhardt, M., Castro, D.C., Tanno, R., Schwaighofer, A., Tezcan, K.C., Monteiro, M., Bannur, S., Lungren, M.P., Nori, A., Glocker, B., et al.: Active label cleaning for improved dataset quality under resource constraints. Nat. Commun. 13(1), 1161 (2022)

  • Berti-Equille, L.: Learn2clean: Optimizing the sequence of tasks for web data preparation. In: The World Wide Web Conference, pp. 2580–2586 (2019)

  • Bhandari, K., Kumar, K., Sangal, A.L.: Data quality issues in software fault prediction: a systematic literature review. Artif. Intell. Rev. 56(8), 7839–7908 (2023)

  • Bishop, C.M., Nasrabadi, N.M.: Pattern Recognition and Machine Learning, vol. 4. Springer, New York (2006)

  • Bogatu, A., Paton, N.W., Douthwaite, M., Davie, S., Freitas, A.: Cost–effective variational active entity resolution. In: 2021 IEEE 37th International Conference on Data Engineering (ICDE), IEEE, pp. 1272–1283 (2021)

  • Bojanowski, P., Grave, E., Joulin, A., Mikolov, T.: Enriching word vectors with subword information. Trans. Assoc. Comput. Linguist. 5, 135–146 (2017)

  • Bosu, M.F., MacDonell, S.G.: A taxonomy of data quality challenges in empirical software engineering. In: 2013 22nd Australian Software Engineering Conference, IEEE, pp. 97–106 (2013)

  • Boukerche, A., Zheng, L., Alfandi, O.: Outlier detection: methods, models, and classification. ACM Comput. Surv. (CSUR) 53(3), 1–37 (2020)

  • Braiek, H.B., Khomh, F.: On testing machine learning programs. J. Syst. Softw. 164, 110542 (2020). https://doi.org/10.1016/j.jss.2020.110542

  • Brunner, U., Stockinger, K.: Entity matching with transformer architectures-a step forward in data integration. In: 23rd International Conference on Extending Database Technology, Copenhagen, OpenProceedings (2020)

  • Carpenter, G.A., Grossberg, S., Rosen, D.B.: Fuzzy art: fast stable learning and categorization of analog patterns by an adaptive resonance system. Neural Netw. 4(6), 759–771 (1991)

  • Cer, D., Yang, Y., Kong, Sy., Hua, N., Limtiaco, N., John, R.S., Constant, N., Guajardo-Cespedes, M., Yuan, S., Tar, C., et al.: Universal sentence encoder. arXiv preprint arXiv:1803.11175 (2018)

  • Chai, C., Wang, J., Luo, Y., Niu, Z., Li, G.: Data management for machine learning: a survey. IEEE Trans. Knowl. Data Eng. 35(5), 4646–4667 (2022)

  • Chasmai, M.E.: Cubetr: learning to solve the rubiks cube using transformers. arXiv preprint arXiv:2111.06036 (2021)

  • Chen, M., Tworek, J., Jun, H., Yuan, Q., Pinto, H.P.d.O., Kaplan, J., Edwards, H., Burda, Y., Joseph, N., Brockman, G., et al.: Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374 (2021)

  • Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International conference on machine learning, PMLR, pp. 1597–1607. (2020)

  • Cheng, K., Li, X., Xu, Y.E., Dong, X.L., Sun, Y.: Pge: Robust Product Graph Embedding Learning for Error Detection. https://doi.org/10.48550/ARXIV.2202.09747. arXiv:2202.09747 (2022)

  • Cholewiak, S.A., Ipeirotis, P., Silva, V., Kannawadi, A.: SCHOLARLY: Simple Access to Google Scholar Authors and Citation Using Python. https://doi.org/10.5281/zenodo.5764801, https://github.com/scholarly-python-package/scholarly (2021)

  • Christophides, V., Efthymiou, V., Palpanas, T., Papadakis, G., Stefanidis, K.: An overview of end-to-end entity resolution for big data. ACM Comput. Surv. (CSUR) 53(6), 1–42 (2020)

  • Chu, X., Ilyas, I.F., Krishnan, S., Wang, J.: Data cleaning: Overview and emerging challenges. In: Proceedings of the 2016 International Conference on Management of Data, Association for Computing Machinery, New York, NY, USA, SIGMOD ’16, pp. 2201–2206. https://doi.org/10.1145/2882903.2912574 (2016a)

  • Chu, X., Ilyas, I.F., Krishnan, S., Wang, J.: Data cleaning: overview and emerging challenges. In: Proceedings of the 2016 International Conference on Management of Data, pp. 2201–2206. (2016b)

  • Côté, P.O., Nikanjam, A., Bouchoucha, R., Basta, I., Abidi, M., Khomh, F.: Quality Issues in Machine Learning Software Systems. arXiv preprint arXiv:2306.15007 (2023)

  • Croft, R., Xie, Y., Babar, M.A.: Data preparation for software vulnerability prediction: a systematic literature review. IEEE Trans. Softw. Eng. 49(3), 1044–1063 (2022)

  • Croft, R., Babar, M.A., Kholoosi, M.M.: Data quality for software vulnerability datasets. In: 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE), IEEE, pp. 121–133 (2023)

  • Cubuk, E.D., Zoph, B., Mane, D., Vasudevan, V., Le, Q.V.: Autoaugment: learning augmentation strategies from data. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 113–123. (2019)

  • Côté, P.O., Nikanjam, A., Ahmed, N., Humeniuk, D., Khomh, F.: The replication package. https://github.com/poclecoqq/SLR-datacleaning (2023)

  • Das, S., Doan, A., G.C., P.S., Gokhale, C., Konda, P., Govind, Y., Paulsen, D.: The Magellan Data Repository. https://sites.google.com/site/anhaidgroup/projects/data (2016)

  • Dempster, A.P., et al.: Upper and lower probabilities induced by a multivalued mapping. In: Classic Works of the Dempster-Shafer Theory of Belief Functions, pp. 57–72. Springer, Berlin (2008)

  • Deng, D., Fernandez, R.C., Abedjan, Z., Wang, S., Stonebraker, M., Elmagarmid, A.K., Ilyas, I.F., Madden, S., Ouzzani, M., Tang, N.: The data civilizer system. In: CIDR (2017)

  • Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, IEEE, pp. 248–255. (2009)

  • Dolatshah, M., Teoh, M., Wang, J., Pei, J.: Cleaning crowdsourced labels using oracles for supervised learning. PVLDB 12(4), 376–389 (2018)

  • Domingues, R., Filippone, M., Michiardi, P., Zouaoui, J.: A comparative evaluation of outlier detection algorithms: experiments and analyses. Pattern Recognit. 74, 406–421 (2018)

  • Dong, X.L., Rekatsinas, T.: Data integration and machine learning: a natural synergy. In: Proceedings of the 2018 International Conference on Management of Data, pp. 1645–1650. (2018)

  • Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: transformers for image recognition at scale (2020). arXiv preprint arXiv:2010.11929

  • Ebraheem, M., Thirumuruganathan, S., Joty, S., Ouzzani, M., Tang, N.: Distributed representations of tuples for entity resolution. Proc. VLDB Endow 11(11), 1454–1467 (2018)

  • Ekambaram, R., Fefilatyev, S., Shreve, M., Kramer, K., Hall, L.O., Goldgof, D.B., Kasturi, R.: Active cleaning of label noise. Pattern Recognit. 51, 463–480 (2016)

  • Felderer, M., Russo, B., Auer, F.: On testing data-intensive software systems. In: Security and Quality in Cyber-Physical Systems Engineering: With Forewords by Robert M Lee and Tom Gilb, pp. 129–148. (2019)

  • Feldt, R., Magazinius, A.: Validity threats in empirical software engineering research-an initial survey. In: SEKE, pp. 374–379 (2010)

  • Feng, W., Long, Y., Wang, S., Quan, Y.: A review of addressing class noise problems of remote sensing classification. J. Syst. Eng. Electron. 34(1), 36–46 (2023). https://doi.org/10.23919/JSEE.2023.000034

  • Filippone, M., Sanguinetti, G.: Information theoretic novelty detection. Pattern Recognit. 43(3), 805–814 (2010)

  • Flokas, L., Wu, W., Liu, Y., Wang, J., Verma, N., Wu, E.: Complaint-driven training data debugging at interactive speeds. In: Proceedings of the 2022 International Conference on Management of Data, pp. 369–383 (2022)

  • Foidl, H., Felderer, M.: Risk-based data validation in machine learning-based software systems. In: Proceedings of the 3rd ACM SIGSOFT International Workshop on Machine Learning Techniques for Software Quality Evaluation, pp. 13–18 (2019)

  • Fox, T.L., Guynes, C.S., Prybutok, V.R., Windsor, J.: Maintaining quality in information systems. J. Comput. Inf. Syst. 40(1), 76–80 (1999)

  • Freund, Y., Seung, H.S., Shamir, E., Tishby, N.: Selective sampling using the query by committee algorithm. Mach. Learn. 28(2–3), 133 (1997)

  • Fu, C., Han, X., He, J., Sun, L.: Hierarchical matching network for heterogeneous entity resolution. In: Proceedings of the Twenty-Ninth International Conference on International Joint Conferences on Artificial Intelligence, pp. 3665–3671. (2021)

  • Gal, Y.: Uncertainty in Deep Learning (2016)

  • Gal, Y., Ghahramani, Z.: Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning. https://doi.org/10.48550/ARXIV.1506.02142, arXiv:1506.02142 (2015)

  • Gauen, K., Dailey, R., Laiman, J., Zi, Y., Asokan, N., Lu, Y.H., Thiruvathukal, G.K., Shyu, M.L., Chen, S.C.: Comparison of visual datasets for machine learning. In: 2017 IEEE International Conference on Information Reuse and Integration (IRI), IEEE, pp. 346–355. (2017)

  • Ge, C., Gao, Y., Miao, X., Yao, B., Wang, H.: A hybrid data cleaning framework using Markov logic networks. IEEE Trans. Knowl. Data Eng. 34(5), 2048–2062 (2020)

  • Gemp, I., Theocharous, G., Ghavamzadeh, M.: Automated Data Cleansing Through Meta-learning. In: Twenty-Ninth IAAI Conference (2017)

  • Gezici, B., Tarhan, A.K.: Systematic literature review on software quality for AI-based software. Empir. Softw. Eng. 27(3), 66 (2022)

  • Gitnux, A.: Self driving cars safety statistics and trends in 2023 • gitnux. https://blog.gitnux.com/self-driving-cars-safety-statistics/ (2023)

  • Goodfellow, I., Bengio, Y., Courville, A.: Deep Learning. MIT Press. http://www.deeplearningbook.org (2016)

  • Goodfellow, I.J., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks (2014). arXiv:1406.2661

  • Gottapu, R.D., Dagli, C., Ali, B.: Entity resolution using convolutional neural network. Procedia Comput. Sci. 95, 153–158 (2016)

  • Grill, J.B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., Piot, B., Kavukcuoglu, K., Munos, R., Valko, M.: Bootstrap your own latent: a new approach to self-supervised learning. In: Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., Lin, H. (eds.) Advances in Neural Information Processing Systems, Curran Associates, Inc., vol. 33, pp. 21271–21284. https://proceedings.neurips.cc/paper/2020/file/f3ada80d5c4ee70142b17b8192b2958e-Paper.pdf (2020)

  • Guan, H., Zhang, Y., Xian, M., Cheng, H.D., Tang, X.: Wenn for individualized cleaning in imbalanced data. In: 2016 23rd International Conference on Pattern Recognition (ICPR), IEEE, pp. 456–461. (2016)

  • Guo, G., Adjeroh, D., Li, X.: Automated cleaning of identity label noise in a large-scale face dataset using a face image quality control (2018)

  • Guo, Y., Bettaieb, S.: An investigation of quality issues in vulnerability detection datasets. In: 2023 IEEE European Symposium on Security and Privacy Workshops (EuroS&PW), IEEE, pp. 29–33 (2023)

  • Guo, Z., Rekatsinas, T.: Learning functional dependencies with sparse regression. arXiv:1905.01425 (2019)

  • Han, B., Yao, Q., Yu, X., Niu, G., Xu, M., Hu, W., Tsang, I., Sugiyama, M.: Co-teaching: Robust training of deep neural networks with extremely noisy labels. https://doi.org/10.48550/ARXIV.1804.06872, arXiv:1804.06872 (2018)

  • Hara, S., Nitanda, A., Maehara, T.: Data cleansing for models trained with SGD. Adv. Neural Inf. Process. Syst. 32 (2019)

  • Hawkins, D.M.: Identification of Outliers, vol. 11. Springer (1980)

  • He, X., Zhao, K., Chu, X.: Automl: a survey of the state-of-the-art. Knowl. Based Syst. 212, 106622 (2021a)

  • He, Y. et al.: Automatic detection of grammatical errors in english verbs based on rnn algorithm: auxiliary objectives for neural error detection models. Comput. Intell. Neurosci. (2021b)

  • Heidari, A., McGrath, J., Ilyas, I.F., Rekatsinas, T.: Holodetect: few-shot learning for error detection. In: Proceedings of the 2019 International Conference on Management of Data, pp. 829–846 (2019)

  • Hendrycks, D., Gimpel, K.: A baseline for detecting misclassified and out-of-distribution examples in neural networks. https://doi.org/10.48550/ARXIV.1610.02136, arXiv:1610.02136 (2016)

  • Hernández-García, A., König, P.: Data augmentation instead of explicit regularization. arXiv preprint arXiv:1806.03852 (2018)

  • Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Comput. 9(8), 1735–1780 (1997)

  • Huang, J., Qu, L., Jia, R., Zhao, B.: O2u-net: A simple noisy label detection approach for deep neural networks. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 3326–3334. (2019)

  • Huang, J., Hu, W., Bao, Z., Chen, Q., Qu, Y.: Deep entity matching with adversarial active learning. VLDB J. 32(1), 229–255 (2023)

  • Huang, Z., Li, X., Deng, L., Wei, K., Sui, Y.: Mislabeled samples adjustment based on self-paced learning framework. In: 2021 7th International Conference on Computer and Communications (ICCC), IEEE, pp. 1659–1659. (2021)

  • Hurakadli, V., Kulkarni, S., Patil, U., Tabib, R., Mudengudi, U.: Deep learning based radial blur estimation and image enhancement. In: 2019 IEEE International Conference on Electronics, Computing and Communication Technologies (CONECCT), IEEE, pp. 1–5 (2019)

  • Hwang, P., Kim, Y.: Data cleaning of sound data with label noise using self organizing map. In: 2022 16th International Conference on Ubiquitous Information Management and Communication (IMCOM), pp. 1–5. https://doi.org/10.1109/IMCOM53663.2022.9721724 (2022)

  • Ilyas, I., Chu, X.: Data Cleaning. Association for Computing Machinery and Morgan & Claypool Publishers. https://books.google.ca/books?id=RxieDwAAQBAJ (2019)

  • Ilyas, I.F., Rekatsinas, T.: Machine learning and data cleaning: Which serves the other? J. Data Inf. Qual. 14(3), 1–11 (2022). https://doi.org/10.1145/3506712

  • Indyk, P., Motwani, R.: Approximate nearest neighbors: Towards removing the curse of dimensionality. In: Proceedings of the Thirtieth Annual ACM Symposium on Theory of Computing, Association for Computing Machinery, New York, NY, USA, STOC ’98, pp. 604–613. https://doi.org/10.1145/276698.276876 (1998)

  • Jiang, W., Ge, Y., Cheng, H., Chen, M., Feng, S., Wang, C.: Read: aggregating reconstruction error into out-of-distribution detection. Proc. AAAI Conf. Artif. Intell. 37, 14910–14918 (2023)

  • Jin, D., Sisman, B., Wei, H., Dong, X.L., Koutra, D.: Deep transfer learning for multi-source entity linkage via domain adaptation. arXiv preprint arXiv:2110.14509 (2021)

  • Johnson, J., Douze, M., Jégou, H.: Billion-scale similarity search with GPUs. IEEE Trans. Big Data 7(3), 535–547 (2019)

  • Johnson, J.M., Khoshgoftaar, T.M.: A survey on classifying big data with label noise. ACM J. Data Inf. Qual. 14(4), 1–43 (2022)

  • Kang, Z., Catal, C., Tekinerdogan, B.: Machine learning applications in production lines: a systematic literature review. Comput. Ind. Eng. 149, 106773 (2020). https://doi.org/10.1016/j.cie.2020.106773

  • Karimi, D., Dou, H., Warfield, S.K., Gholipour, A.: Deep learning with noisy labels: exploring techniques and remedies in medical image analysis. Med. Image Anal. 65, 101759 (2020)

  • Karlaš, B., Li, P., Wu, R., Gürel, N.M., Chu, X., Wu, W., Zhang, C.: Nearest neighbor classifiers over incomplete information: from certain answers to certain predictions. arXiv preprint arXiv:2005.05117 (2020)

  • Kasai, J., Qian, K., Gurajada, S., Li, Y., Popa, L.: Low-resource deep entity resolution with transfer and active learning. arXiv preprint arXiv:1906.08042 (2019)

  • Ke, X., Bai, J., Wen, L., Cao, B.: Multi-index dialogue data cleaning model. In: 2019 IEEE 8th Joint International Information Technology and Artificial Intelligence Conference (ITAIC), IEEE, pp. 672–676. (2019)

  • Kim, J., Scott, C.D.: Robust kernel density estimation. J. Mach. Learn. Res. 13(1), 2529–2565 (2012)

  • Kitchenham, B.: Procedures for performing systematic reviews. Keele UK Keele Univ. 33(2004), 1–26 (2004)

  • Klie, J.C., Webber, B., Gurevych, I.: Annotation error detection: Analyzing the past and present for a more coherent future. Comput. Linguist. pp. 1–42 (2022)

  • Knill, K.M., Gales, M.J., Manakul, P., Caines, A.: Automatic grammatical error detection of non-native spoken learner english. In: ICASSP 2019–2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, pp. 8127–8131. (2019)

  • Koh, P.W., Liang, P.: Understanding black-box predictions via influence functions. In: International Conference on Machine Learning, PMLR, pp. 1885–1894 (2017)

  • Köhler, J.M., Autenrieth, M., Beluch, W.H.: Uncertainty based detection and relabeling of noisy image labels. In: CVPR Workshops, pp. 33–37. (2019)

  • Kriegel, H.P., Kröger, P., Schubert, E., Zimek, A.: Outlier detection in axis-parallel subspaces of high dimensional data. In: Advances in Knowledge Discovery and Data Mining: 13th Pacific-Asia Conference, PAKDD 2009 Bangkok, Thailand, 2009 Proceedings 13, Springer, pp. 831–838. (2009)

  • Krishnan, S., Wu, E.: Alphaclean: Automatic generation of data cleaning pipelines. arXiv preprint arXiv:1904.11827 (2019)

  • Krishnan, S., Wang, J., Wu, E., Franklin, M.J., Goldberg, K.: Activeclean: interactive data cleaning for statistical modeling. Proc. VLDB Endow. 9(12), 948–959 (2016)

  • Krishnan, S., Franklin, M.J., Goldberg, K., Wu, E.: Boostclean: Automated error detection and repair for machine learning. arXiv preprint arXiv:1711.01299 (2017)

  • Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classification with deep convolutional neural networks. Commun. ACM 60(6), 84–90 (2017)

  • Lakshminarayan, K., Harp, S.A., Samad, T.: Imputation of missing data in industrial databases. Appl. Intell. 11(3), 259–275 (1999)

  • Lakshminarayanan, B., Pritzel, A., Blundell, C.: Simple and scalable predictive uncertainty estimation using deep ensembles. https://doi.org/10.48550/ARXIV.1612.01474, arXiv:1612.01474 (2016)

  • Lattar, H., Salem, A.B., Ghezala, H.H.B.: Does data cleaning improve heart disease prediction? Procedia Comput. Sci. 176, 1131–1140 (2020)

  • Laure, B.E., Angela, B., Tova, M.: Machine learning to data management: A round trip. In: 2018 IEEE 34th International Conference on Data Engineering (ICDE), IEEE, pp. 1735–1738. (2018)

  • Lee, K.H., He, X., Zhang, L., Yang, L.: Cleannet: Transfer learning for scalable image classifier training with label noise. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5447–5456. (2018)

  • Lew, A., Agrawal, M., Sontag, D., Mansinghka, V.: Pclean: Bayesian data cleaning at scale with domain-specific probabilistic programming. In: International Conference on Artificial Intelligence and Statistics, PMLR, pp. 1927–1935. (2021)

  • Li, B., Wang, W., Sun, Y., Zhang, L., Ali, M.A., Wang, Y.: Grapher: token-centric entity resolution with graph convolutional neural networks. Proc. AAAI Conf. Artif. Intell. 34, 8172–8179 (2020)

  • Li, P., Rao, X., Blase, J., Zhang, Y., Chu, X., Zhang, C.: Cleanml: A benchmark for joint data cleaning and machine learning [experiments and analysis], p 75. arXiv preprint arXiv:1904.09483 (2019)

  • Li, Y., Li, J., Suhara, Y., Doan, A., Tan, W.C.: Deep entity matching with pre-trained language models. arXiv preprint arXiv:2004.00584 (2020b)

  • Li, Z., Du, W., Rao, N.: Research on error label screening method based on convolutional neural network. In: 2021 IEEE 6th International Conference on Signal and Image Processing (ICSIP), IEEE, pp. 1020–1024 (2021)

  • Liang, Q., Sun, Z., Zhu, Q., Hu, J., Zhao, Y., Zhang, L.: Cupcleaner: A data cleaning approach for comment updating. arXiv preprint arXiv:2308.06898 (2023)

  • Liebchen, G., Shepperd, M.: Data sets and data quality in software engineering: Eight years on. In: Proceedings of the The 12th International Conference on Predictive Models and Data Analytics in Software Engineering, Association for Computing Machinery, New York, NY, USA, PROMISE 2016. https://doi.org/10.1145/2972958.2972967 (2016)

  • Liebchen, G.A., Shepperd, M.: Data sets and data quality in software engineering. In: Proceedings of the 4th International Workshop on Predictor Models in Software Engineering, pp. 39–44 (2008)

  • Lim, S., Kim, I., Kim, T., Kim, C., Kim, S.: Fast autoaugment. Adv. Neural Inf. Process. Syst. 32 (2019)

  • Lin, W.C., Tsai, C.F.: Missing value imputation: a review and analysis of the literature (2006–2017). Artif. Intell. Rev. 53, 1487–1509 (2020)

  • Liu, D., Meng, Y., Wang, L.: Data cleaning of irrelevant images based on transfer learning. In: 2020 International Conference on Intelligent Computing, Automation and Systems (ICICAS), IEEE, pp. 450–456 (2020)

  • Liu, F.T., Ting, K.M., Zhou, Z.H.: Isolation forest. In: 2008 Eighth IEEE International Conference on Data Mining, IEEE, pp. 413–422 (2008)

  • Liu, Y., Li, Z., Zhou, C., Jiang, Y., Sun, J., Wang, M., He, X.: Generative adversarial active learning for unsupervised outlier detection. IEEE Trans. Knowl. Data Eng. 32(8), 1517–1528 (2019)

  • Liu, Z., Zhou, Z., Rekatsinas, T.: Picket: guarding against corrupted data in tabular data during learning and inference. VLDB J. pp. 1–29 (2022)

  • Mahdavi, M., Abedjan, Z.: Baran: effective error correction via a unified context representation and transfer learning. Proc. VLDB Endow. 13(12), 1948–1961 (2020)

  • Mahdavi, M., Abedjan, Z.: Semi-supervised data cleaning with Raha and Baran. In: CIDR (2021)

  • Mahdavi, M., Abedjan, Z., Castro Fernandez, R., Madden, S., Ouzzani, M., Stonebraker, M., Tang, N.: Raha: A configuration-free error detection system. In: Proceedings of the 2019 International Conference on Management of Data, pp. 865–882. (2019)

  • Marsland, S., Shapiro, J., Nehmzow, U.: A self-organising network that grows when required. Neural Netw. 15(8–9), 1041–1058 (2002)

  • Martínez-Fernández, S., Bogner, J., Franch, X., Oriol, M., Siebert, J., Trendowicz, A., Vollmer, A.M., Wagner, S.: Software engineering for AI-based systems: a survey. ACM Trans. Softw. Eng. Methodol. 31(2), 1–59 (2022). https://doi.org/10.1145/3487043

  • Mauritz, R., Nijweide, F., Goseling, J., van Keulen, M.: A probabilistic database approach to autoencoder-based data cleaning. arXiv preprint arXiv:2106.09764 (2021)

  • Mayfield, C., Neville, J., Prabhakar, S.: Eracer: a database approach for statistical inference and data cleaning. In: Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data, pp. 75–86 (2010)

  • Mazumder, M., Banbury, C., Yao, X., Karlaš, B., Rojas, W.G., Diamos, S., Diamos, G., He, L., Parrish, A., Kirk, H.R., et al.: Dataperf: Benchmarks for data-centric AI development. arXiv preprint arXiv:2207.10062 (2022)

  • Meduri, V.V., Popa, L., Sen, P., Sarwat, M.: A comprehensive benchmark framework for active learning methods in entity matching. In: Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data, pp. 1133–1147. (2020)

  • Miao, Z., Li, Y., Wang, X.: Rotom: A meta-learned data augmentation framework for entity matching, data cleaning, text classification, and beyond. In: Proceedings of the 2021 International Conference on Management of Data, pp. 1303–1316. (2021)

  • Motulsky, H.J., Brown, R.E.: Detecting outliers when fitting data with nonlinear regression-a new method based on robust nonlinear regression and the false discovery rate. BMC Bioinform. 7(1), 1–20 (2006)

  • Mudgal, S., Li, H., Rekatsinas, T., Doan, A., Park, Y., Krishnan, G., Deep, R., Arcaute, E., Raghavendra, V.: Deep learning for entity matching: A design space exploration. In: Proceedings of the 2018 International Conference on Management of Data, pp. 19–34. (2018)

  • Müller, H., Castelo, S., Qazi, M., Freire, J.: From papers to practice: the openclean open-source data cleaning library. Proc. VLDB Endow 14(12), 2763–2766 (2021)

  • Narayan, A., Chami, I., Orr, L., Ré, C.: Can foundation models wrangle your data? arXiv preprint arXiv:2205.09911 (2022)

  • Nashaat, M., Ghosh, A., Miller, J., Quader, S.: Tabreformer: unsupervised representation learning for erroneous data detection. ACM/IMS Trans. Data Sci. 2(3), 1–29 (2021)

  • Nassif, A.B., Talib, M.A., Nasir, Q., Dakalbab, F.M.: Machine learning for anomaly detection: a systematic review. IEEE Access 9, 78658–78700 (2021)

  • Neutatz, F., Mahdavi, M., Abedjan, Z.: Ed2: two-stage active learning for error detection–technical report. arXiv preprint arXiv:1908.06309 (2019)

  • Neutatz, F., Chen, B., Abedjan, Z., Wu, E.: From cleaning before ml to cleaning for ml. IEEE Data Eng. Bull. 44(1), 24–41 (2021)

  • Ng, A.: A chat with andrew on mlops: from model-centric to data-centric AI. https://www.youtube.com/watch?v=06-AZXmwHjo&ab_channel=DeepLearningAI (2021)

  • Ng, A., He, L., Laird, D.: Data-centric AI competition. https://https-deeplearning-ai.github.io/data-centric-comp/ (2021)

  • Nie, H., Han, X., He, B., Sun, L., Chen, B., Zhang, W., Wu, S., Kong, H.: Deep sequence-to-sequence entity matching for heterogeneous entity resolution. In: Proceedings of the 28th ACM International Conference on Information and Knowledge Management, pp. 629–638. (2019)

  • Northcutt, C.G., Jiang, L., Chuang, I.L.: Confident learning: Estimating uncertainty in dataset labels. https://doi.org/10.48550/ARXIV.1911.00068, arXiv:1911.00068 (2019)

  • Oliveira, P.H., Kaster, D.S., Ilyas, I.F., et al.: Batchwise probabilistic incremental data cleaning. arXiv preprint arXiv:2011.04730 (2020)

  • OpenAI (2023) https://openai.com/research/gpt-4

  • Pang, G., Shen, C., Cao, L., Hengel, A.V.D.: Deep learning for anomaly detection: a review. ACM Comput. Surv. (CSUR) 54(2), 1–38 (2021)

  • Papastefanopoulos, V., Linardatos, P., Kotsiantis, S.: Unsupervised outlier detection: a meta-learning algorithm based on feature selection. Electronics 10(18), 2236 (2021)

  • Patel, H., Gupta, N., Panwar, N., Sharma Mittal, R., Mehta, S., Guttula, S., Mujumdar, S., Afzal, S., Bedathur, S., Munigala, V.: Automatic assessment of quality of your data for AI. In: Proceedings of the 5th Joint International Conference on Data Science & Management of Data (9th ACM IKDD CODS and 27th COMAD), Association for Computing Machinery, New York, NY, USA, CODS-COMAD ’22, pp. 354–357. (2022). https://doi.org/10.1145/3493700.3493774

  • Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., Duchesnay, E.: Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011)

  • Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pp. 1532–1543. (2014)

  • Pham, M., Knoblock, C.A., Chen, M., Vu, B., Pujara, J.: Spade: a semi-supervised probabilistic approach for detecting errors in tables. In: IJCAI, pp. 3543–3551. (2021)

  • Pise, N.N., Kulkarni, P.: A survey of semi-supervised learning methods. In: 2008 International Conference on Computational Intelligence and Security, IEEE, vol. 2, pp. 30–34. (2008)

  • Pit-Claudel, C., Mariet, Z., Harding, R., Madden, S.: Outlier detection in heterogeneous datasets using automatic tuple expansion. Tech. rep., MIT—Computer Science and Artificial Intelligence Laboratory (MIT-CSAIL-TR-2016-002). (2016)

  • Ponzio, F., Macii, E., Ficarra, E., Di Cataldo, S.: W2wnet: a two-module probabilistic convolutional neural network with embedded data cleansing functionality. arXiv preprint arXiv:2103.13107 (2021)

  • Pouyanfar, S., Sadiq, S., Yan, Y., Tian, H., Tao, Y., Reyes, M.P., Shyu, M.L., Chen, S.C., Iyengar, S.S.: A survey on deep learning: algorithms, techniques, and applications. ACM Comput. Surv. (CSUR) 51(5), 1–36 (2018)

  • Press, G.: Cleaning big data: Most time-consuming, least enjoyable data science task, survey says. https://www.forbes.com/sites/gilpress/2016/03/23/data-preparation-most-time-consuming-least-enjoyable-data-science-task-survey-says/?sh=4c577cb46f63 (2022)

  • Qian, K., Popa, L., Sen, P.: Active learning for large-scale entity resolution. In: Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, pp. 1379–1388. (2017)

  • Rahm, E., Do, H.H., et al.: Data cleaning: problems and current approaches. IEEE Data Eng. Bull. 23(4), 3–13 (2000)

  • Razavi-Far, R., Cheng, B., Saif, M., Ahmadi, M.: Similarity-learning information-fusion schemes for missing data imputation. Knowl. Based Syst. 187, 104805 (2020)

  • Rehbein, I., Ruppenhofer, J.: Detecting annotation noise in automatically labelled data. In: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Vol. 1: Long Papers), pp. 1160–1170. (2017)

  • Rei, M., Yannakoudakis, H.: Compositional sequence labeling models for error detection in learner writing. In: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Vol. 1: Long Papers), Association for Computational Linguistics, Berlin, Germany, pp. 1181–1191. https://doi.org/10.18653/v1/P16-1112, https://aclanthology.org/P16-1112 (2016)

  • Rekatsinas, T., Chu, X., Ilyas, I.F., Ré, C.: Holoclean: Holistic data repairs with probabilistic inference. arXiv preprint arXiv:1702.00820 (2017)

  • Ren, M., Zeng, W., Yang, B., Urtasun, R.: Learning to reweight examples for robust deep learning. In: International Conference on Machine Learning, PMLR, pp. 4334–4343. (2018)

  • Roh, Y., Heo, G., Whang, S.E.: A survey on data collection for machine learning: a big data-AI integration perspective. IEEE Trans. Knowl. Data Eng. 33(4), 1328–1347 (2019)

  • Rosner, B.: Percentage points for a generalized esd many-outlier procedure. Technometrics 25(2), 165–172 (1983)

  • Rottmann, M., Reese, M.: Automated detection of label errors in semantic segmentation datasets via deep learning and uncertainty quantification. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 3214–3223. (2023)

  • Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., et al.: Imagenet large scale visual recognition challenge. Int. J. Comput. Vis. 115, 211–252 (2015)

  • Salekshahrezaee, Z., Leevy, J.L., Khoshgoftaar, T.M.: A reconstruction error-based framework for label noise detection. J. Big Data 8, 1–16 (2021)

  • Sambasivan, N., Kapania, S., Highfill, H., Akrong, D., Paritosh, P., Aroyo, L.M.: “Everyone wants to do the model work, not the data work”: data cascades in high-stakes AI. In: Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems, pp. 1–15. (2021)

  • Santos, E.A., Campbell, J.C., Hindle, A., Amaral, J.N.: Finding and correcting syntax errors using recurrent neural networks. PeerJ PrePrints 5, e3123v1 (2017)

  • Sarker, I.H.: Deep learning: a comprehensive overview on techniques, taxonomy, applications and research directions. SN Comput. Sci. 2(6), 1–20 (2021)

  • Schölkopf, B., Williamson, R.C., Smola, A., Shawe-Taylor, J., Platt, J.: Support vector method for novelty detection. Adv. Neural Inf. Process. Syst. 12 (1999)

  • Shi, J., Wu, J.: Distilling effective supervision for robust medical image segmentation with noisy labels. In: Medical Image Computing and Computer Assisted Intervention–MICCAI 2021: 24th International Conference, Strasbourg, France, 2021, Proceedings, Part I 24, Springer, pp. 668–677. (2021)

  • Shi, L., Mu, F., Chen, X., Wang, S., Wang, J., Yang, Y., Li, G., Xia, X., Wang, Q.: Are we building on the rock? On the importance of data preprocessing for code summarization. In: Proceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, pp. 107–119. (2022)

  • Silva-Ramírez, E.L., Cabrera-Sánchez, J.F.: Co-active neuro-fuzzy inference system model as single imputation approach for non-monotone pattern of missing data. Neural Comput. Appl. 33, 8981–9004 (2021)

  • Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014)

  • Smyth, L.: Training-Valuenet: A New Approach for Label Cleaning on Weakly-Supervised Datasets. University of Exeter (2020)

  • Song, H., Kim, M., Park, D., Shin, Y., Lee, J.G.: Learning from noisy labels with deep neural networks: a survey. IEEE Trans. Neural Netw. Learn. Syst. 34(11), 8135–8153 (2023). https://doi.org/10.1109/TNNLS.2022.3152527

  • Spithourakis, G.P., Augenstein, I., Riedel, S.: Numerically grounded language models for semantic error correction. arXiv preprint arXiv:1608.04147 (2016)

  • Studer, S., Bui, T.B., Drescher, C., Hanuschkin, A., Winkler, L., Peters, S., Müller, K.R.: Towards crisp-ml (q): a machine learning process model with quality assurance methodology. Mach. Learn. Knowl. Extr. 3(2), 392–413 (2021)

  • Su, J., Gao, X., Qin, Y., Guo, S.: Correcting corrupted labels using mode dropping of acgan. In: 2021 15th International Symposium on Medical Information and Communication Technology (ISMICT), IEEE, pp. 98–103. (2021)

  • Surameery, N.M.S., Shakor, M.Y.: Use chat gpt to solve programming bugs. Int. J. Inf. Technol. Comput. Eng. (IJITC) 3(01), 17–22 (2023)

  • Suzuki, K., Kobayashi, Y., Narihira, T.: Data cleansing for deep neural networks with storage-efficient approximation of influence functions. arXiv preprint arXiv:2103.11807 (2021)

  • Tae, K.H., Roh, Y., Oh, Y.H., Kim, H., Whang, S.E.: Data cleaning for accurate, fair, and robust models: A big data-AI integration approach. In: Proceedings of the 3rd International Workshop on Data Management for End-to-End Machine Learning, pp. 1–4. (2019)

  • Tambon, F., Laberge, G., An, L., Nikanjam, A., Mindom, P.S.N., Pequignot, Y., Khomh, F., Antoniol, G., Merlo, E., Laviolette, F.: How to certify machine learning based safety-critical systems? A systematic literature review. Autom. Softw. Eng. 29(2), 1–74 (2022)

  • Tang, N., Fan, J., Li, F., Tu, J., Du, X., Li, G., Madden, S., Ouzzani, M.: Relational pretrained transformers towards democratizing data preparation [vision]. arXiv preprint arXiv:2012.02469 (2020)

  • Tawfik, N.S., Spruit, M.R.: Evaluating sentence representations for biomedical text: methods and experimental results. J. Biomed. Inform. 104, 103396 (2020)

  • Team, S.: Data-centric AI for the enterprise (2024). https://snorkel.ai/#

  • Terrades, O.R., Berenguel, A., Gil, D.: A flexible outlier detector based on a topology given by graph communities. Big Data Res. 29, 100332 (2022)

  • Teso, S., Bontempelli, A., Giunchiglia, F., Passerini, A.: Interactive label cleaning with example-based explanations. Adv. Neural Inf. Process. Syst. 34, 12966–12977 (2021)

  • Tfwala, S.S., Wang, Y.M., Lin, Y.C., et al.: Prediction of missing flow records using multilayer perceptron and coactive neurofuzzy inference system. Sci. World J. (2013)

  • Thekumparampil, K.K., Khetan, A., Lin, Z., Oh, S.: Robustness of conditional gans to noisy labels. Adv. Neural Inf. Process. Syst. 31 (2018)

  • Thirumuruganathan, S., Tang, N., Ouzzani, M., Doan, A.: Data curation with deep learning. In: EDBT, pp. 277–286. (2020)

  • Tonolini, F., Moreno, P.G., Damianou, A., Murray-Smith, R.: Tomographic auto-encoder: unsupervised bayesian recovery of corrupted data. arXiv preprint arXiv:2006.16938 (2020)

  • Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Adv. Neural Inf. Process. Syst. 30 (2017a)

  • Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., Polosukhin, I.: Attention is all you need. CoRR abs/1706.03762, arXiv:1706.03762 (2017b)

  • Veit, A., Alldrin, N., Chechik, G., Krasin, I., Gupta, A., Belongie, S.: Learning from noisy large-scale datasets with minimal supervision. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 839–847. (2017)

  • Visengeriyeva, L., Abedjan, Z.: Metadata-driven error detection. In: Proceedings of the 30th International Conference on Scientific and Statistical Database Management, pp. 1–12. (2018)

  • Visengeriyeva, L., Akbik, A., Kaul, M., Rabl, T., Markl, V.: Improving data quality by leveraging statistical relational learning. In: ICIQ, pp. 220–236. (2016)

  • Wang, H., Bah, M.J., Hammad, M.: Progress in outlier detection techniques: a survey. IEEE Access 7, 107964–108000 (2019). https://doi.org/10.1109/ACCESS.2019.2932769

  • Wang, Q., Tan, Y.: Grammatical error detection with self attention by pairwise training. In: 2020 International Joint Conference on Neural Networks (IJCNN), IEEE, pp. 1–7. (2020)

  • Wang, R., Li, Y., Wang, J.: Sudowoodo: Contrastive self-supervised learning for multi-purpose data integration and preparation. arXiv preprint arXiv:2207.04122 (2022)

  • Wang, X., Wang, C.: Time series data cleaning: a survey. IEEE Access 8, 1866–1881 (2019)

  • Wang, Z., Sisman, B., Wei, H., Dong, X.L., Ji, S.: Cordel: a contrastive deep learning approach for entity linkage. In: 2020 IEEE International Conference on Data Mining (ICDM), IEEE, pp. 1322–1327. (2020)

  • Wei, J., Zou, K.: Eda: Easy data augmentation techniques for boosting performance on text classification tasks. arXiv preprint arXiv:1901.11196 (2019)

  • Whang, S.E., Roh, Y., Song, H., Lee, J.G.: Data collection and quality challenges in deep learning: a data-centric AI perspective. arXiv preprint arXiv:2112.06409 (2021)

  • Whang, S.E., Roh, Y., Song, H., Lee, J.G.: Data collection and quality challenges in deep learning: a data-centric AI perspective. VLDB J. 32(4), 791–813 (2023)

  • White, J., Fu, Q., Hays, S., Sandborn, M., Olea, C., Gilbert, H., Elnashar, A., Spencer-Smith, J., Schmidt, D.C.: A prompt pattern catalog to enhance prompt engineering with chatgpt. arXiv preprint arXiv:2302.11382 (2023)

  • Wikipedia (2023a) https://en.wikipedia.org/wiki/Machine_learning

  • Wikipedia (2023b) https://en.wikipedia.org/wiki/Imputation_(statistics)

  • Wikipedia (2023c) Active learning (machine learning). https://en.wikipedia.org/wiki/Active_learning_(machine_learning)

  • Wikipedia (2023d) Boosting (machine learning). https://en.wikipedia.org/wiki/Boosting_(machine_learning)

  • Wikipedia (2023e) Transfer learning. https://en.wikipedia.org/wiki/Transfer_learning

  • Wohlin, C.: Guidelines for snowballing in systematic literature studies and a replication in software engineering. In: Proceedings of the 18th International Conference on Evaluation and Assessment in Software Engineering, pp. 1–10. (2014)

  • Wong, W.E., Gao, R., Li, Y., Abreu, R., Wotawa, F.: A survey on software fault localization. IEEE Trans. Softw. Eng. 42(8), 707–740 (2016). https://doi.org/10.1109/TSE.2016.2521368

  • Wu, R., Chaba, S., Sawlani, S., Chu, X., Thirumuruganathan, S.: Zeroer: Entity resolution using zero labeled examples. In: Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data, pp. 1149–1164. (2020)

  • Wu, Y., Weimer, J., Davidson, S.B.: Chef: a cheap and fast pipeline for iteratively cleaning label uncertainties (technical report). arXiv preprint arXiv:2107.08588 (2021)

  • Xiang, S., Ye, X., Xia, J., Wu, J., Chen, Y., Liu, S.: Interactive correction of mislabeled training data. In: 2019 IEEE Conference on Visual Analytics Science and Technology (VAST), IEEE, pp. 57–68. (2019)

  • Yu, Q., Aizawa, K.: Unknown class label cleaning for learning with open-set noisy labels. In: 2020 IEEE International Conference on Image Processing (ICIP), IEEE, pp. 1731–1735. (2020)

  • Zha, D., Bhat, Z.P., Lai, K.H., Yang, F., Jiang, Z., Zhong, S., Hu, X.: Data-centric artificial intelligence: a survey. arXiv preprint arXiv:2303.10158 (2023)

  • Zhang, A., Song, S., Wang, J., Yu, P.S.: Time series data cleaning: From anomaly detection to anomaly repairing (technical report). arXiv preprint arXiv:2003.12396 (2020a)

  • Zhang, H., Cisse, M., Dauphin, Y.N., Lopez-Paz, D.: mixup: Beyond empirical risk minimization. arXiv preprint arXiv:1710.09412 (2017)

  • Zhang, Q., Fang, C., Ma, Y., Sun, W., Chen, Z.: A survey of learning-based automated program repair. ACM Trans. Softw. Eng. Methodol. 33(2), 1–69 (2023). https://doi.org/10.1145/3631974

  • Zhang, W., Tan, X.: Combining outlier detection and reconstruction error minimization for label noise reduction. In: 2019 IEEE International Conference on Big Data and Smart Computing (BigComp), IEEE, pp. 1–4. (2019)

  • Zhang, W., Wang, D., Tan, X.: Data cleaning and classification in the presence of label noise with class-specific autoencoder. In: International Symposium on Neural Networks, Springer, pp. 256–264. (2018a)

  • Zhang, W., Wei, H., Sisman, B., Dong, X.L., Faloutsos, C., Page, D.: Autoblock: A hands-off blocking framework for entity matching. In: Proceedings of the 13th International Conference on Web Search and Data Mining, pp. 744–752. (2020b)

  • Zhang, X., Ji, Y., Nguyen, C., Wang, T.: Deepclean: data cleaning via question asking. In: 2018 IEEE 5th International Conference on Data Science and Advanced Analytics (DSAA), IEEE, pp. 283–292. (2018b)

  • Zhang, X., Zhu, X., Wright, S.: Training set debugging using trusted items. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018c)

  • Zhang, Y., Zheng, S., Dalirrooyfard, M., Wu, P., Schneider, A., Raj, A., Nevmyvaka, Y., Chen, C.: Learning to abstain from uninformative data. arXiv preprint arXiv:2309.14240 (2023b)

  • Zhao, C., He, Y.: Auto-em: end-to-end fuzzy entity-matching using pre-trained deep models and transfer learning. In: The World Wide Web Conference, pp. 2413–2424. (2019)

  • Zhao, Z., Liu, H.: Spectral feature selection for supervised and unsupervised learning. In: Proceedings of the 24th International Conference on Machine Learning, pp. 1151–1157. (2007)

  • Zhou, X., Jin, Y., Zhang, H., Li, S., Huang, X.: A map of threats to validity of systematic literature reviews in software engineering. In: 2016 23rd Asia-Pacific Software Engineering Conference (APSEC), IEEE, pp. 153–160. (2016)

  • Zhou, X., Liu, X., Wang, C., Zhai, D., Jiang, J., Ji, X.: Learning with noisy labels via sparse regularization. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 72–81. (2021)

  • Zhu, X., Ghahramani, Z.: Learning from Labeled and Unlabeled Data with Label Propagation. (2002)

Acknowledgements

This work is funded by the Fonds de Recherche du Quebec (FRQ), the Canadian Institute for Advanced Research (CIFAR), and the Natural Sciences and Engineering Research Council of Canada (NSERC). We would like to thank Dr. Hyacinth Ali for contributing to improving this SLR with his valuable comments.

Author information

Contributions

Conceptualization: P-O.C., A.N., F.K.; Paper collection: P-O.C.; Paper selection: P-O.C., A.N.; Data Extraction: P-O.C. (74 papers), N.A. (17 papers), D.H. (6 papers), A.N. (2 papers); Writing - original draft preparation: P-O.C., N.A., D.H., A.N.; Writing - review and editing: A.N., P-O.C., F.K., D.H.; Funding acquisition: A.N., F.K.; Project administration: P-O.C.; Supervision: A.N., F.K.; Validation: A.N., P-O.C.; Software: P-O.C.; Resource: P-O.C.

Corresponding author

Correspondence to Pierre-Olivier Côté.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Côté, PO., Nikanjam, A., Ahmed, N. et al. Data cleaning and machine learning: a systematic literature review. Autom Softw Eng 31, 54 (2024). https://doi.org/10.1007/s10515-024-00453-w

  • DOI: https://doi.org/10.1007/s10515-024-00453-w

Keywords