Research Article | Open Access

DiffPrep: Differentiable Data Preprocessing Pipeline Search for Learning over Tabular Data

Published: 20 June 2023

Abstract

Data preprocessing is a crucial step in the machine learning process that transforms raw data into a more usable format for downstream ML models. However, it can be costly and time-consuming, often requiring the expertise of domain experts. Existing automated machine learning (AutoML) frameworks claim to automate data preprocessing; however, they often use a restricted search space of data preprocessing pipelines, which limits the potential performance gains, and they are often too slow because they require training the ML model multiple times. In this paper, we propose DiffPrep, a method that can automatically and efficiently search for a data preprocessing pipeline for a given tabular dataset and a differentiable ML model such that the performance of the ML model is maximized. We formalize the problem of data preprocessing pipeline search as a bi-level optimization problem. To solve this problem efficiently, we transform and relax the discrete, non-differentiable search space into a continuous and differentiable one, which allows us to perform the pipeline search using gradient descent while training the ML model only once. Our experiments show that DiffPrep achieves the best test accuracy on 15 out of the 18 real-world datasets evaluated and improves the model's test accuracy by up to 6.6 percentage points.
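The core idea described in the abstract, relaxing the discrete choice among preprocessing operators into a differentiable mixture whose weights are optimized on a validation objective while the model is trained on the training objective, can be illustrated with a short sketch. The code below is not the authors' implementation: the candidate operators, the toy data, and the one-step alternating update that approximates the bi-level problem are all simplifying assumptions made for illustration (PyTorch).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DifferentiablePrepStep(nn.Module):
    """One pipeline step: a softmax-weighted mixture over candidate transformations."""
    def __init__(self, candidates):
        super().__init__()
        self.candidates = candidates                              # list of callables: Tensor -> Tensor
        self.alpha = nn.Parameter(torch.zeros(len(candidates)))   # pipeline (search) parameters

    def forward(self, x):
        weights = F.softmax(self.alpha, dim=0)
        return sum(w * f(x) for w, f in zip(weights, self.candidates))

# Hypothetical candidate operators (applied per batch for simplicity).
candidates = [
    lambda x: x,                                                  # identity (no preprocessing)
    lambda x: (x - x.mean(0)) / (x.std(0) + 1e-8),                # standardization
    lambda x: (x - x.min(0).values)
              / (x.max(0).values - x.min(0).values + 1e-8),       # min-max scaling
]

torch.manual_seed(0)
X_train, y_train = torch.randn(256, 10), torch.randint(0, 2, (256,))  # toy tabular data
X_val,   y_val   = torch.randn(64, 10),  torch.randint(0, 2, (64,))

prep  = DifferentiablePrepStep(candidates)
model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 2))
opt_model = torch.optim.SGD(model.parameters(), lr=0.1)
opt_prep  = torch.optim.Adam([prep.alpha], lr=0.01)

for epoch in range(50):
    # Lower level: update the model weights on the training loss.
    opt_model.zero_grad()
    F.cross_entropy(model(prep(X_train)), y_train).backward()
    opt_model.step()

    # Upper level: update the pipeline parameters on the validation loss
    # (a one-step alternating approximation of the bi-level problem).
    opt_prep.zero_grad()
    F.cross_entropy(model(prep(X_val)), y_val).backward()
    opt_prep.step()

print("learned operator weights:", F.softmax(prep.alpha, dim=0).tolist())
```

Because both the model and the mixture weights are updated by gradient descent in a single training loop, the pipeline search requires training the model only once, in contrast to AutoML approaches that retrain the model for every candidate pipeline.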

Supplemental Material

MP4 File: Presentation video for SIGMOD 2023.




Information & Contributors

Published In

Proceedings of the ACM on Management of Data (PACMMOD), Volume 1, Issue 2
June 2023
2310 pages
EISSN: 2836-6573
DOI: 10.1145/3605748
      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 20 June 2023
      Published in PACMMOD Volume 1, Issue 2


      Author Tags

      1. automated machine learning
      2. data cleaning
      3. data preprocessing

      Qualifiers

      • Research-article

Bibliometrics & Citations

Article Metrics

• Downloads (last 12 months): 743
• Downloads (last 6 weeks): 90
Reflects downloads up to 25 Jan 2025.

Cited By
• (2024) Fight Fire with Fire: Towards Robust Graph Neural Networks on Dynamic Graphs via Actively Defense. Proceedings of the VLDB Endowment 17(8), 2050-2063. DOI: 10.14778/3659437.3659457. Online publication date: 31-May-2024.
• (2024) ETC: Efficient Training of Temporal Graph Neural Networks over Large-Scale Dynamic Graphs. Proceedings of the VLDB Endowment 17(5), 1060-1072. DOI: 10.14778/3641204.3641215. Online publication date: 2-May-2024.
• (2024) CtxPipe: Context-aware Data Preparation Pipeline Construction for Machine Learning. Proceedings of the ACM on Management of Data 2(6), 1-27. DOI: 10.1145/3698831. Online publication date: 20-Dec-2024.
• (2024) SIMPLE: Efficient Temporal Graph Neural Network Training at Scale with Dynamic Data Placement. Proceedings of the ACM on Management of Data 2(3), 1-25. DOI: 10.1145/3654977. Online publication date: 30-May-2024.
• (2024) ROME: Robust Query Optimization via Parallel Multi-Plan Execution. Proceedings of the ACM on Management of Data 2(3), 1-25. DOI: 10.1145/3654973. Online publication date: 30-May-2024.
• (2024) Toward Structure Fairness in Dynamic Graph Embedding: A Trend-aware Dual Debiasing Approach. Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 1701-1712. DOI: 10.1145/3637528.3671848. Online publication date: 25-Aug-2024.
• (2024) An Empirical Study on Noisy Label Learning for Program Understanding. Proceedings of the IEEE/ACM 46th International Conference on Software Engineering, 1-12. DOI: 10.1145/3597503.3639217. Online publication date: 20-May-2024.
• (2024) ReClean: Reinforcement Learning for Automated Data Cleaning in ML Pipelines. 2024 IEEE 40th International Conference on Data Engineering Workshops (ICDEW), 324-330. DOI: 10.1109/ICDEW61823.2024.00048. Online publication date: 13-May-2024.
• (2024) WavingSketch: An Unbiased and Generic Sketch for Finding Top-k Items in Data Streams. The VLDB Journal 33(5), 1697-1722. DOI: 10.1007/s00778-024-00869-6. Online publication date: 29-Jul-2024.
• (2023) ChainedFilter: Combining Membership Filters by Chain Rule. Proceedings of the ACM on Management of Data 1(4), 1-27. DOI: 10.1145/3626721. Online publication date: 12-Dec-2023.