Research Article | Open Access

DiffPrep: Differentiable Data Preprocessing Pipeline Search for Learning over Tabular Data

Published: 20 June 2023

Abstract

Data preprocessing is a crucial step in the machine learning process that transforms raw data into a more usable format for downstream ML models. However, it can be costly and time-consuming, often requiring the expertise of domain experts. Existing automated machine learning (AutoML) frameworks claim to automate data preprocessing; however, they often use a restricted search space of data preprocessing pipelines, which limits the potential performance gains, and they are often too slow because they require training the ML model multiple times. In this paper, we propose DiffPrep, a method that can automatically and efficiently search for a data preprocessing pipeline for a given tabular dataset and a differentiable ML model such that the performance of the ML model is maximized. We formalize the problem of data preprocessing pipeline search as a bi-level optimization problem. To solve this problem efficiently, we transform and relax the discrete, non-differentiable search space into a continuous and differentiable one, which allows us to perform the pipeline search using gradient descent while training the ML model only once. Our experiments show that DiffPrep achieves the best test accuracy on 15 out of the 18 real-world datasets evaluated and improves the model's test accuracy by up to 6.6 percentage points.
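The core idea described in the abstract, relaxing the discrete choice among preprocessing operators into a differentiable mixture whose weights are optimized on a validation objective while the model is trained on the training objective, can be illustrated with a short sketch. The code below is not the authors' implementation: the candidate operators, the toy data, and the one-step alternating update that approximates the bi-level problem are all simplifying assumptions made for illustration (PyTorch).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DifferentiablePrepStep(nn.Module):
    """One pipeline step: a softmax-weighted mixture over candidate transformations."""
    def __init__(self, candidates):
        super().__init__()
        self.candidates = candidates                              # list of callables: Tensor -> Tensor
        self.alpha = nn.Parameter(torch.zeros(len(candidates)))   # pipeline (search) parameters

    def forward(self, x):
        weights = F.softmax(self.alpha, dim=0)
        return sum(w * f(x) for w, f in zip(weights, self.candidates))

# Hypothetical candidate operators (applied per batch for simplicity).
candidates = [
    lambda x: x,                                                  # identity (no preprocessing)
    lambda x: (x - x.mean(0)) / (x.std(0) + 1e-8),                # standardization
    lambda x: (x - x.min(0).values)
              / (x.max(0).values - x.min(0).values + 1e-8),       # min-max scaling
]

torch.manual_seed(0)
X_train, y_train = torch.randn(256, 10), torch.randint(0, 2, (256,))  # toy tabular data
X_val,   y_val   = torch.randn(64, 10),  torch.randint(0, 2, (64,))

prep  = DifferentiablePrepStep(candidates)
model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 2))
opt_model = torch.optim.SGD(model.parameters(), lr=0.1)
opt_prep  = torch.optim.Adam([prep.alpha], lr=0.01)

for epoch in range(50):
    # Lower level: update the model weights on the training loss.
    opt_model.zero_grad()
    F.cross_entropy(model(prep(X_train)), y_train).backward()
    opt_model.step()

    # Upper level: update the pipeline parameters on the validation loss
    # (a one-step alternating approximation of the bi-level problem).
    opt_prep.zero_grad()
    F.cross_entropy(model(prep(X_val)), y_val).backward()
    opt_prep.step()

print("learned operator weights:", F.softmax(prep.alpha, dim=0).tolist())
```

Because both the model and the mixture weights are updated by gradient descent in a single training loop, the pipeline search requires training the model only once, in contrast to AutoML approaches that retrain the model for every candidate pipeline.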

Supplemental Material

MP4 File: Presentation video for SIGMOD 2023.




Information & Contributors

Published In

Proceedings of the ACM on Management of Data (PACMMOD), Volume 1, Issue 2
June 2023
2310 pages
EISSN: 2836-6573
DOI: 10.1145/3605748
      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 20 June 2023
      Published in PACMMOD Volume 1, Issue 2


      Author Tags

      1. automated machine learning
      2. data cleaning
      3. data preprocessing

      Qualifiers

      • Research-article

Bibliometrics & Citations

Article Metrics

• Downloads (last 12 months): 743
• Downloads (last 6 weeks): 90
Reflects downloads up to 25 Jan 2025.

Cited By
• (2024) Fight Fire with Fire: Towards Robust Graph Neural Networks on Dynamic Graphs via Actively Defense. Proceedings of the VLDB Endowment 17(8), 2050-2063. DOI: 10.14778/3659437.3659457. Online publication date: 31-May-2024.
• (2024) ETC: Efficient Training of Temporal Graph Neural Networks over Large-Scale Dynamic Graphs. Proceedings of the VLDB Endowment 17(5), 1060-1072. DOI: 10.14778/3641204.3641215. Online publication date: 2-May-2024.
• (2024) CtxPipe: Context-aware Data Preparation Pipeline Construction for Machine Learning. Proceedings of the ACM on Management of Data 2(6), 1-27. DOI: 10.1145/3698831. Online publication date: 20-Dec-2024.
• (2024) SIMPLE: Efficient Temporal Graph Neural Network Training at Scale with Dynamic Data Placement. Proceedings of the ACM on Management of Data 2(3), 1-25. DOI: 10.1145/3654977. Online publication date: 30-May-2024.
• (2024) ROME: Robust Query Optimization via Parallel Multi-Plan Execution. Proceedings of the ACM on Management of Data 2(3), 1-25. DOI: 10.1145/3654973. Online publication date: 30-May-2024.
• (2024) Toward Structure Fairness in Dynamic Graph Embedding: A Trend-aware Dual Debiasing Approach. Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 1701-1712. DOI: 10.1145/3637528.3671848. Online publication date: 25-Aug-2024.
• (2024) An Empirical Study on Noisy Label Learning for Program Understanding. Proceedings of the IEEE/ACM 46th International Conference on Software Engineering, 1-12. DOI: 10.1145/3597503.3639217. Online publication date: 20-May-2024.
• (2024) ReClean: Reinforcement Learning for Automated Data Cleaning in ML Pipelines. 2024 IEEE 40th International Conference on Data Engineering Workshops (ICDEW), 324-330. DOI: 10.1109/ICDEW61823.2024.00048. Online publication date: 13-May-2024.
• (2024) WavingSketch: An Unbiased and Generic Sketch for Finding Top-k Items in Data Streams. The VLDB Journal 33(5), 1697-1722. DOI: 10.1007/s00778-024-00869-6. Online publication date: 29-Jul-2024.
• (2023) ChainedFilter: Combining Membership Filters by Chain Rule. Proceedings of the ACM on Management of Data 1(4), 1-27. DOI: 10.1145/3626721. Online publication date: 12-Dec-2023.