DOI: 10.1145/3632410.3632504

TASCA: Tool for Automatic SCalable Acceleration of ML pipelines✱

Published: 04 January 2024

Abstract

Data scientists use Python to build ML pipelines, including data pre-processing steps for cleansing and transformation. Performance anti-patterns in this data-processing code can impose expensive computational overheads, especially at large data sizes. FASCA [10] is a framework that identifies such performance anti-patterns in ML pipelines, along with their corresponding performant versions; however, it requires human intervention to generate the performant code. The recent growth in the maturity of Large Language Models (LLMs) for code generation motivated us to exploit them to automate the replacement of performance anti-patterns with their performant versions, the feasibility of which is discussed in [5].
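For illustration, consider a canonical pandas anti-pattern of the kind such tools flag: row-wise iteration over a DataFrame. The sketch below is not taken from the paper; the DataFrame and column names are hypothetical. It contrasts the anti-pattern with its vectorized, performant version:

    import numpy as np
    import pandas as pd

    # Hypothetical pipeline data: one million rows.
    df = pd.DataFrame({
        "price": np.random.rand(1_000_000),
        "qty": np.random.randint(1, 10, size=1_000_000),
    })

    # Anti-pattern: iterrows() executes interpreted Python per row
    # and dominates runtime as the data grows.
    totals = []
    for _, row in df.iterrows():
        totals.append(row["price"] * row["qty"])
    df["total"] = totals

    # Performant version: a single vectorized expression evaluated
    # in compiled code over whole columns.
    df["total"] = df["price"] * df["qty"]

On large frames the vectorized form is typically orders of magnitude faster, which is the kind of gap FASCA-style detection targets.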
This paper presents TASCA, a tool that automatically detects potential performance anti-patterns for large data sizes in ML pipelines and replaces them with their performant versions using Large Language Models such as GPTNeo3.5/4. The tool has been evaluated empirically on three real-world workloads, showing substantial performance improvements: a 70% speedup for a Netflix series recommendation pipeline, a 50% speedup for a movie recommendation pipeline, and a 40% speedup for an in-house recommendation system training pipeline.
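To make the automated rewriting step concrete, here is a minimal sketch of how an LLM can be asked to produce the performant version, assuming an OpenAI-style chat-completions API; the model name, prompt, and code snippet are illustrative assumptions, not TASCA's actual implementation:

    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    # Hypothetical anti-pattern extracted from a pipeline.
    antipattern = '''
    totals = []
    for _, row in df.iterrows():
        totals.append(row["price"] * row["qty"])
    df["total"] = totals
    '''

    response = client.chat.completions.create(
        model="gpt-4",  # stand-in; the paper names GPTNeo3.5/4
        messages=[
            {"role": "system",
             "content": "Rewrite the given pandas code to eliminate row-wise "
                        "iteration while preserving behavior. Return only code."},
            {"role": "user", "content": antipattern},
        ],
    )
    print(response.choices[0].message.content)

In a TASCA-like workflow, the returned snippet would then be validated, e.g., by comparing its outputs against the original on sample data, before being substituted into the pipeline.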

Supplementary Material

MP4 File (f1.mp4)
This is the TASCA demo video.

References

[1] [n. d.]. AutoML. https://www.automl.org/automl/
[2] Stefan Behnel, Robert Bradshaw, Craig Citro, Lisandro Dalcin, Dag Sverre Seljebotn, and Kurt Smith. 2011. Cython: The best of both worlds. Computing in Science & Engineering 13, 2 (2011), 31–39.
[3] Emery D. Berger. 2020. Scalene: Scripting-Language Aware Profiling for Python. arXiv preprint arXiv:2006.03879 (2020).
[4] Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian, Clemens Winter, Philippe Tillet, Felipe Petroski Such, Dave Cummings, Matthias Plappert, Fotios Chantzis, Elizabeth Barnes, Ariel Herbert-Voss, William Hebgen Guss, Alex Nichol, Alex Paino, Nikolas Tezak, Jie Tang, Igor Babuschkin, Suchir Balaji, Shantanu Jain, William Saunders, Christopher Hesse, Andrew N. Carr, Jan Leike, Josh Achiam, Vedant Misra, Evan Morikawa, Alec Radford, Matthew Knight, Miles Brundage, Mira Murati, Katie Mayer, Peter Welinder, Bob McGrew, Dario Amodei, Sam McCandlish, Ilya Sutskever, and Wojciech Zaremba. 2021. Evaluating Large Language Models Trained on Code. arXiv:2107.03374 [cs.LG]
[5] Sharod Roy Choudhury, Mayank Mishra, Rekha Singhal, and Sirish Karande. [n. d.]. APERFCODE: Auto Conversion to PErformant Code.
[6] Jesse Daniel. 2019. Data Science with Python and Dask. Simon and Schuster.
[7] Siu Kwan Lam, Antoine Pitrou, and Stanley Seibert. 2015. Numba: An LLVM-based Python JIT compiler. In Proceedings of the Second Workshop on the LLVM Compiler Infrastructure in HPC. 1–6.
[8] Sifei Luan, Di Yang, Celeste Barnaby, Koushik Sen, and Satish Chandra. 2019. Aroma: Code Recommendation via Structural Code Search. Proc. ACM Program. Lang. 3, OOPSLA, Article 152 (Oct. 2019), 28 pages. https://doi.org/10.1145/3360578
[9] Yuetian Mao, Shuai Yuan, Nan Cui, Tianjiao Du, Beijun Shen, and Yuting Chen. 2021. DeFiHap: Detecting and Fixing HiveQL Anti-Patterns. In Proceedings of the International Conference on Very Large Data Bases (VLDB).
[10] Mayank Mishra, Archisman Bhowmick, and Rekha Singhal. 2021. FASCA: Framework for Automatic Scalable Acceleration of ML Pipeline. In 2021 IEEE International Conference on Big Data (Big Data). 1867–1876. https://doi.org/10.1109/BigData52589.2021.9671376
[11] Philipp Moritz, Robert Nishihara, Stephanie Wang, Alexey Tumanov, Richard Liaw, Eric Liang, Melih Elibol, Zongheng Yang, William Paul, Michael I. Jordan, and Ion Stoica. 2018. Ray: A Distributed Framework for Emerging AI Applications. In Proceedings of the 13th USENIX Conference on Operating Systems Design and Implementation (Carlsbad, CA, USA) (OSDI '18). USENIX Association, USA, 561–577.
[12] Derek G. Murray, Jiri Simsa, Ana Klimovic, and Ihor Indyk. 2021. tf.data: A Machine Learning Data Processing Framework. arXiv:2101.12127 [cs.LG]
[13] Shoumik Palkar, James Thomas, Deepak Narayanan, Anil Shanbhag, Rahul Palamuttam, Holger Pirk, Malte Schwarzkopf, Saman Amarasinghe, Samuel Madden, and Matei Zaharia. 2017. Weld: Rethinking the interface between data-intensive applications. arXiv preprint arXiv:1709.06416 (2017).
[14] Shoumik Palkar and Matei Zaharia. 2019. Optimizing data-intensive computations in existing libraries with split annotations. In Proceedings of the 27th ACM Symposium on Operating Systems Principles. 291–305.
[15] Ravi Kumar Singh, Mayank Mishra, and Rekha Singhal. 2023. Accelerating Model Training: Performance Antipatterns Eliminator Framework. In Proceedings of the 3rd Workshop on Machine Learning and Systems. 163–170.


      Published In

CODS-COMAD '24: Proceedings of the 7th Joint International Conference on Data Science & Management of Data (11th ACM IKDD CODS and 29th COMAD). January 2024. 627 pages.

      Publisher

Association for Computing Machinery, New York, NY, United States


      Author Tags

      1. Code Acceleration
      2. ML Pipeline
3. Scalability Bottlenecks

      Qualifiers

      • Demonstration
      • Research
      • Refereed limited

      Conference

      CODS-COMAD 2024
