research-article
Open access

SAGA: A Scalable Framework for Optimizing Data Cleaning Pipelines for Machine Learning Applications

Published: 13 November 2023 Publication History

Abstract

In the exploratory data science lifecycle, data scientists often spend the majority of their time finding, integrating, validating, and cleaning relevant datasets. Despite recent work on data validation and numerous error detection and correction algorithms, data cleaning for ML remains in practice a largely manual, unpleasant, and labor-intensive trial-and-error process, especially in large-scale, distributed computation. The target ML application (such as a classification or regression model) can, however, serve as a valuable feedback signal for selecting effective data cleaning strategies. In this paper, we introduce SAGA, a framework for automatically generating the top-K most effective data cleaning pipelines. SAGA adopts ideas from Auto-ML, feature selection, and hyper-parameter tuning. Our framework is extensible with respect to user-provided constraints, new data cleaning primitives, and ML applications; automatically generates hybrid runtime plans of local and distributed operations; and performs pruning by interesting properties (e.g., monotonicity). Instead of pursuing full automation, which is rather unrealistic, SAGA simplifies the mechanical aspects of data cleaning. Our experiments show that SAGA yields robust accuracy improvements over the state of the art, and good scalability with increasing data sizes and numbers of evaluated pipelines.
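To make the idea of using the downstream ML task as a feedback signal concrete, the following is a minimal sketch, not SAGA's actual algorithm or API. It exhaustively enumerates short pipelines over three toy cleaning primitives, scores each pipeline by the in-sample error of a one-feature least-squares fit, and keeps the top-K. The primitive names, the scoring function, the toy data, and the exhaustive enumeration are all illustrative assumptions; the abstract does not specify how SAGA searches the pipeline space.

```python
import heapq
import itertools
import statistics

# Hypothetical cleaning primitives (assumptions for illustration only).
# Each maps a column of floats, with None marking a missing value, to a new column.
def impute_mean(col):
    present = [v for v in col if v is not None]
    mean = statistics.fmean(present)
    return [mean if v is None else v for v in col]

def impute_median(col):
    present = [v for v in col if v is not None]
    med = statistics.median(present)
    return [med if v is None else v for v in col]

def clip_outliers(col):
    # Clip present values to median +/- 3 * MAD; missing values stay missing.
    present = [v for v in col if v is not None]
    med = statistics.median(present)
    mad = statistics.median([abs(v - med) for v in present])
    lo, hi = med - 3 * mad, med + 3 * mad
    return [v if v is None else min(max(v, lo), hi) for v in col]

PRIMITIVES = {
    "impute_mean": impute_mean,
    "impute_median": impute_median,
    "clip_outliers": clip_outliers,
}

def score(x, y):
    """Stand-in for the target ML application: negative MSE of a linear fit.

    Pipelines that leave missing values, or a constant column, are invalid.
    """
    if any(v is None for v in x):
        return float("-inf")
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    if sxx == 0:
        return float("-inf")
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx
    intercept = my - slope * mx
    return -statistics.fmean([(slope * a + intercept - b) ** 2 for a, b in zip(x, y)])

def top_k_pipelines(x, y, primitives, k=3, max_len=2):
    """Enumerate ordered pipelines of up to max_len distinct primitives; keep the k best."""
    scored = []
    for length in range(1, max_len + 1):
        for names in itertools.permutations(primitives, length):
            cleaned = x
            for name in names:
                cleaned = primitives[name](cleaned)
            scored.append((score(cleaned, y), names))
    return heapq.nlargest(k, scored)

# Toy data: one missing value and one gross outlier in the feature column.
X = [1.0, 2.0, None, 4.0, 5.0, 100.0, 7.0, 8.0]
Y = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0, 14.0, 16.0]

top = top_k_pipelines(X, Y, PRIMITIVES, k=3)
for s, names in top:
    print(f"{s:10.4f}  {' -> '.join(names)}")
```

Exhaustive enumeration is only feasible for a handful of primitives. The abstract indicates that SAGA instead borrows from Auto-ML, feature selection, and hyper-parameter tuning, and prunes the search using properties such as monotonicity.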

Published In

Proceedings of the ACM on Management of Data  Volume 1, Issue 3
PACMMOD
September 2023
472 pages
EISSN:2836-6573
DOI:10.1145/3632968
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. data cleaning for ML
  2. data cleaning pipelines
  3. data- and task-parallel execution
  4. evolutionary algorithms
  5. hyper-parameter tuning
  6. linear-algebra-based primitives
