Research article
Open access

Automating and Optimizing Data-Centric What-If Analyses on Native Machine Learning Pipelines

Published: 20 June 2023
  Abstract

    Software systems that learn from data with machine learning (ML) are used in critical decision-making processes. Unfortunately, real-world experience shows that the pipelines for data preparation, feature encoding and model training in ML systems are often brittle with respect to their input data. As a consequence, data scientists have to run different kinds of data-centric what-if analyses to evaluate the robustness and reliability of such pipelines, e.g., with respect to data errors or preprocessing techniques. These what-if analyses follow a common pattern: they take an existing ML pipeline, create a pipeline variant by introducing a small change, and execute this pipeline variant to see how the change impacts the pipeline's output score. The application of existing analysis techniques to ML pipelines is technically challenging, as they are hard to integrate into existing pipeline code and their execution introduces large overheads due to repeated work.
    We propose mlwhatif to address these integration and efficiency challenges for data-centric what-if analyses on ML pipelines. mlwhatif enables data scientists to declaratively specify what-if analyses for an ML pipeline, and to automatically generate, optimize and execute the required pipeline variants. Our approach employs pipeline patches to specify changes to the data, operators and models of a pipeline. Based on these patches, we define a multi-query optimizer for efficiently executing the resulting pipeline variants jointly, with four subsumption-based optimization rules. Subsequently, we detail how to implement the pipeline variant generation and optimizer of mlwhatif. For that, we instrument native ML pipelines written in Python to extract dataflow plans with re-executable operators.
    We experimentally evaluate mlwhatif, and find that its speedup scales linearly with the number of pipeline variants in applicable cases, and is invariant to the input data size. In end-to-end experiments with four analyses on more than 60 pipelines, we show speedups of up to 13x compared to sequential execution, and find that the speedup is invariant to the model and featurization in the pipeline. Furthermore, we confirm the low instrumentation overhead of mlwhatif.
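
    To make the pattern described in the abstract concrete, the following is a minimal sketch of a manual what-if analysis over a native pandas/scikit-learn pipeline. This is illustrative only and is not mlwhatif's API; the dataset path ("customers.csv"), the column names ("age", "income", "job", "label") and the missing-value corruption are hypothetical placeholders. Note how every variant re-executes data preparation, feature encoding and model training from scratch; this is exactly the repeated work that mlwhatif's patch-based multi-query optimizer aims to share when executing the variants jointly.

    # Minimal sketch of a manual, data-centric what-if analysis (not mlwhatif's API).
    # Dataset path, column names and the corruption below are hypothetical placeholders.
    import numpy as np
    import pandas as pd
    from sklearn.compose import ColumnTransformer
    from sklearn.impute import SimpleImputer
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import OneHotEncoder, StandardScaler

    def build_pipeline():
        # Data preparation, feature encoding and model training: the stages that
        # every pipeline variant re-executes end-to-end.
        numeric = Pipeline([("impute", SimpleImputer(strategy="mean")),
                            ("scale", StandardScaler())])
        featurization = ColumnTransformer([
            ("numeric", numeric, ["age", "income"]),
            ("categorical", OneHotEncoder(handle_unknown="ignore"), ["job"]),
        ])
        return Pipeline([("features", featurization),
                         ("model", LogisticRegression(max_iter=1000))])

    def inject_missing(df, column, fraction, seed=0):
        # One data-centric change: overwrite a random fraction of a column with NaN.
        corrupted = df.copy()
        rows = corrupted.sample(frac=fraction, random_state=seed).index
        corrupted.loc[rows, column] = np.nan
        return corrupted

    data = pd.read_csv("customers.csv")
    train, test = train_test_split(data, test_size=0.2, random_state=42)

    baseline = build_pipeline().fit(train, train["label"])
    baseline_score = accuracy_score(test["label"], baseline.predict(test))

    # Sequential what-if execution: each variant repeats featurization and training,
    # which is the redundancy a multi-query optimizer can eliminate across variants.
    for fraction in [0.1, 0.3, 0.5]:
        corrupted_train = inject_missing(train, "income", fraction)
        variant = build_pipeline().fit(corrupted_train, corrupted_train["label"])
        variant_score = accuracy_score(test["label"], variant.predict(test))
        print(f"missing {fraction:.0%} of 'income' in training data: "
              f"score changes by {variant_score - baseline_score:+.3f}")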

    Supplemental Material

    MP4 File: Presentation video for SIGMOD 2023


    Cited By

    • (2024) The Image Calculator: 10x Faster Image-AI Inference by Replacing JPEG with Self-designing Storage Format. Proceedings of the ACM on Management of Data 2(1), 1-31. https://doi.org/10.1145/3639307 (published 26-Mar-2024)
    • (2024) Towards automating microservices orchestration through data-driven evolutionary architectures. Service Oriented Computing and Applications 18(1), 1-12. https://doi.org/10.1007/s11761-024-00387-x (published 27-Feb-2024)
    • (2023) Efficient Approximation of Certain and Possible Answers for Ranking and Window Queries over Uncertain Data. Proceedings of the VLDB Endowment 16(6), 1346-1358. https://doi.org/10.14778/3583140.3583151 (published 1-Feb-2023)


        Published In

        Proceedings of the ACM on Management of Data (PACMMOD), Volume 1, Issue 2
        June 2023, 2310 pages
        EISSN: 2836-6573
        DOI: 10.1145/3605748
        This work is licensed under a Creative Commons Attribution 4.0 International License.

        Publisher

        Association for Computing Machinery, New York, NY, United States

        Publication History

        Published: 20 June 2023
        Published in PACMMOD Volume 1, Issue 2

        Author Tags

        1. data preparation for machine learning
        2. data-centric ai
        3. machine learning pipelines

        Qualifiers

        • Research-article


