SAGA: A Scalable Framework for Optimizing Data Cleaning Pipelines for Machine Learning Applications

Authors:

Shafaq Siddiqi,

Matthias Boehm

Proceedings of the ACM on Management of Data, Volume 1, Issue 3

Article No.: 218, Pages 1 - 26

https://doi.org/10.1145/3617338

Published: 13 November 2023

Abstract

In the exploratory data science lifecycle, data scientists often spend the majority of their time finding, integrating, validating, and cleaning relevant datasets. Despite recent work on data validation and numerous error detection and correction algorithms, data cleaning for ML remains in practice a largely manual, unpleasant, and labor-intensive trial-and-error process, especially in large-scale, distributed computation. However, the target ML application (such as a classification or regression model) can serve as a valuable feedback signal for selecting effective data cleaning strategies. In this paper, we introduce SAGA, a framework for automatically generating the top-K most effective data cleaning pipelines. SAGA adopts ideas from Auto-ML, feature selection, and hyper-parameter tuning. Our framework is extensible for user-provided constraints, new data cleaning primitives, and ML applications; automatically generates hybrid runtime plans of local and distributed operations; and performs pruning by interesting properties (e.g., monotonicity). Instead of full automation, which is rather unrealistic, SAGA simplifies the mechanical aspects of data cleaning. Our experiments show that SAGA yields robust accuracy improvements over the state of the art, and good scalability with respect to increasing data sizes and numbers of evaluated pipelines.
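The idea of using the downstream model as a feedback signal for ranking cleaning pipelines can be illustrated with a small sketch. This is not SAGA's algorithm or API: it swaps in a plain exhaustive enumeration over short pipelines, and every name in it (`impute_mean`, `drop_missing`, `clip_outliers`, `top_k_pipelines`, the toy data) is a hypothetical stand-in chosen for illustration. It assumes rows of `(x, y)` pairs, a least-squares line as the "ML application", and negative validation error as the score.

```python
import heapq
import itertools
import statistics

# Hypothetical cleaning primitives over rows of (x, y); x may be None.
def impute_mean(rows):
    known = [x for x, _ in rows if x is not None]
    if not known:
        return rows
    mean = statistics.fmean(known)
    return [(mean if x is None else x, y) for x, y in rows]

def drop_missing(rows):
    return [(x, y) for x, y in rows if x is not None]

def clip_outliers(rows):
    # Clip x into the usual 1.5 * IQR fences; missing values pass through.
    known = [x for x, _ in rows if x is not None]
    if len(known) < 2:
        return rows
    q1, _, q3 = statistics.quantiles(known, n=4)
    lo, hi = q1 - 1.5 * (q3 - q1), q3 + 1.5 * (q3 - q1)
    return [(x if x is None else min(max(x, lo), hi), y) for x, y in rows]

PRIMITIVES = {
    "impute_mean": impute_mean,
    "drop_missing": drop_missing,
    "clip_outliers": clip_outliers,
}

def score(train_rows, valid_rows):
    """Negative validation MSE of a least-squares line; -inf if unusable."""
    if not train_rows or any(x is None for x, _ in train_rows):
        return float("-inf")
    xs = [x for x, _ in train_rows]
    ys = [y for _, y in train_rows]
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    if sxx == 0:
        return float("-inf")
    slope = sum((x - mx) * (y - my) for x, y in train_rows) / sxx
    intercept = my - slope * mx
    errors = [(slope * x + intercept - y) ** 2 for x, y in valid_rows]
    return -statistics.fmean(errors)

def top_k_pipelines(train_rows, valid_rows, k=3, max_len=3):
    """Score every ordered pipeline of up to max_len distinct primitives."""
    results = []
    for length in range(1, max_len + 1):
        for combo in itertools.permutations(PRIMITIVES.items(), length):
            cleaned = train_rows
            for _, primitive in combo:
                cleaned = primitive(cleaned)
            names = tuple(name for name, _ in combo)
            results.append((score(cleaned, valid_rows), names))
    return heapq.nlargest(k, results, key=lambda r: r[0])

# Toy data: y = 2x, plus one missing x and one extreme x in the training set.
TRAIN = [(float(i), 2.0 * i) for i in range(1, 9)] + [(None, 6.0), (1000.0, 10.0)]
VALID = [(float(i), 2.0 * i) for i in range(1, 9)]

top = top_k_pipelines(TRAIN, VALID, k=3)
for s, names in top:
    print(f"{s:.4f}  {' -> '.join(names)}")
```

Exhaustive enumeration grows factorially with the number of primitives, which is why a framework in this space would need search strategies and pruning of the kind the abstract mentions; the sketch only shows the scoring loop itself.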

Index Terms

1. Computing methodologies
  1. Distributed computing methodologies
2. Information systems
  1. Data management systems
    1. Information integration
      1. Data cleaning
  2. Information systems applications
    1. Data mining
      1. Data cleaning

Published In

Proceedings of the ACM on Management of Data Volume 1, Issue 3

PACMMOD

September 2023

472 pages

EISSN:2836-6573

DOI:10.1145/3632968

Editor:
Divyakant Agrawal
UC Santa Barbara, United States

Copyright © 2023 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States
