DOI: 10.1145/3632410.3632504

TASCA: Tool for Automatic SCalable Acceleration of ML pipelines✱

Published: 04 January 2024

Abstract

Data scientists use Python to build ML pipelines, including data pre-processing steps for cleansing and transformation. Performance anti-patterns in this data-processing code can impose expensive computational overheads, especially at large data sizes. FASCA [10] is a framework that identifies such performance anti-patterns in ML pipelines, along with their corresponding performant versions; however, it requires human intervention to generate the performant code. The recent growth in the maturity of Large Language Models (LLMs) for code generation motivated us to exploit them to automate the replacement of performance anti-patterns with their performant versions, the feasibility of which is discussed in [5].
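For illustration, consider a canonical pandas anti-pattern of the kind such tools flag: row-wise iteration over a DataFrame. The sketch below is not taken from the paper; the DataFrame and column names are hypothetical. It contrasts the anti-pattern with its vectorized, performant version:

    import numpy as np
    import pandas as pd

    # Hypothetical pipeline data: one million rows.
    df = pd.DataFrame({
        "price": np.random.rand(1_000_000),
        "qty": np.random.randint(1, 10, size=1_000_000),
    })

    # Anti-pattern: iterrows() executes interpreted Python per row
    # and dominates runtime as the data grows.
    totals = []
    for _, row in df.iterrows():
        totals.append(row["price"] * row["qty"])
    df["total"] = totals

    # Performant version: a single vectorized expression evaluated
    # in compiled code over whole columns.
    df["total"] = df["price"] * df["qty"]

On large frames the vectorized form is typically orders of magnitude faster, which is the kind of gap FASCA-style detection targets.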
This paper presents TASCA, a tool that automatically detects potential performance anti-patterns for large data sizes in ML pipelines and replaces them with their performant versions using Large Language Models such as GPTNeo3.5/4. The tool has been evaluated empirically on three real-world workloads, showing substantial performance improvements: a 70% speedup for a Netflix series recommendation pipeline, a 50% speedup for a movie recommendation pipeline, and a 40% speedup for an in-house recommendation system training pipeline.
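To make the automated rewriting step concrete, here is a minimal sketch of how an LLM can be asked to produce the performant version, assuming an OpenAI-style chat-completions API; the model name, prompt, and code snippet are illustrative assumptions, not TASCA's actual implementation:

    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    # Hypothetical anti-pattern extracted from a pipeline.
    antipattern = '''
    totals = []
    for _, row in df.iterrows():
        totals.append(row["price"] * row["qty"])
    df["total"] = totals
    '''

    response = client.chat.completions.create(
        model="gpt-4",  # stand-in; the paper names GPTNeo3.5/4
        messages=[
            {"role": "system",
             "content": "Rewrite the given pandas code to eliminate row-wise "
                        "iteration while preserving behavior. Return only code."},
            {"role": "user", "content": antipattern},
        ],
    )
    print(response.choices[0].message.content)

In a TASCA-like workflow, the returned snippet would then be validated, e.g., by comparing its outputs against the original on sample data, before being substituted into the pipeline.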

Supplementary Material

MP4 File (f1.mp4)
This is the TASCA demo video.

References

[1] [n. d.]. AutoML. https://www.automl.org/automl/
[2] Stefan Behnel, Robert Bradshaw, Craig Citro, Lisandro Dalcin, Dag Sverre Seljebotn, and Kurt Smith. 2011. Cython: The best of both worlds. Computing in Science & Engineering 13, 2 (2011), 31–39.
[3] Emery D. Berger. 2020. Scalene: Scripting-Language Aware Profiling for Python. arXiv preprint arXiv:2006.03879 (2020).
[4] Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian, Clemens Winter, Philippe Tillet, Felipe Petroski Such, Dave Cummings, Matthias Plappert, Fotios Chantzis, Elizabeth Barnes, Ariel Herbert-Voss, William Hebgen Guss, Alex Nichol, Alex Paino, Nikolas Tezak, Jie Tang, Igor Babuschkin, Suchir Balaji, Shantanu Jain, William Saunders, Christopher Hesse, Andrew N. Carr, Jan Leike, Josh Achiam, Vedant Misra, Evan Morikawa, Alec Radford, Matthew Knight, Miles Brundage, Mira Murati, Katie Mayer, Peter Welinder, Bob McGrew, Dario Amodei, Sam McCandlish, Ilya Sutskever, and Wojciech Zaremba. 2021. Evaluating Large Language Models Trained on Code. arXiv:2107.03374 [cs.LG]
[5] Sharod Roy Choudhury, Mayank Mishra, Rekha Singhal, and Sirish Karande. [n. d.]. APERFCODE: Auto Conversion to PErformant Code.
[6] Jesse Daniel. 2019. Data Science with Python and Dask. Simon and Schuster.
[7] Siu Kwan Lam, Antoine Pitrou, and Stanley Seibert. 2015. Numba: An LLVM-based Python JIT compiler. In Proceedings of the Second Workshop on the LLVM Compiler Infrastructure in HPC. 1–6.
[8] Sifei Luan, Di Yang, Celeste Barnaby, Koushik Sen, and Satish Chandra. 2019. Aroma: Code Recommendation via Structural Code Search. Proc. ACM Program. Lang. 3, OOPSLA, Article 152 (Oct. 2019), 28 pages. https://doi.org/10.1145/3360578
[9] Yuetian Mao, Shuai Yuan, Nan Cui, Tianjiao Du, Beijun Shen, and Yuting Chen. 2021. DeFiHap: Detecting and Fixing HiveQL Anti-Patterns. In Proceedings of the International Conference on Very Large Data Bases (VLDB).
[10] Mayank Mishra, Archisman Bhowmick, and Rekha Singhal. 2021. FASCA: Framework for Automatic Scalable Acceleration of ML Pipeline. In 2021 IEEE International Conference on Big Data (Big Data). 1867–1876. https://doi.org/10.1109/BigData52589.2021.9671376
[11] Philipp Moritz, Robert Nishihara, Stephanie Wang, Alexey Tumanov, Richard Liaw, Eric Liang, Melih Elibol, Zongheng Yang, William Paul, Michael I. Jordan, and Ion Stoica. 2018. Ray: A Distributed Framework for Emerging AI Applications. In Proceedings of the 13th USENIX Conference on Operating Systems Design and Implementation (Carlsbad, CA, USA) (OSDI '18). USENIX Association, USA, 561–577.
[12] Derek G. Murray, Jiri Simsa, Ana Klimovic, and Ihor Indyk. 2021. tf.data: A Machine Learning Data Processing Framework. arXiv:2101.12127 [cs.LG]
[13] Shoumik Palkar, James Thomas, Deepak Narayanan, Anil Shanbhag, Rahul Palamuttam, Holger Pirk, Malte Schwarzkopf, Saman Amarasinghe, Samuel Madden, and Matei Zaharia. 2017. Weld: Rethinking the interface between data-intensive applications. arXiv preprint arXiv:1709.06416 (2017).
[14] Shoumik Palkar and Matei Zaharia. 2019. Optimizing data-intensive computations in existing libraries with split annotations. In Proceedings of the 27th ACM Symposium on Operating Systems Principles. 291–305.
[15] Ravi Kumar Singh, Mayank Mishra, and Rekha Singhal. 2023. Accelerating Model Training: Performance Antipatterns Eliminator Framework. In Proceedings of the 3rd Workshop on Machine Learning and Systems. 163–170.


      Published In

CODS-COMAD '24: Proceedings of the 7th Joint International Conference on Data Science & Management of Data (11th ACM IKDD CODS and 29th COMAD). January 2024. 627 pages.

      Publisher

Association for Computing Machinery, New York, NY, United States


      Author Tags

      1. Code Acceleration
      2. ML Pipeline
3. Scalability Bottlenecks

      Qualifiers

      • Demonstration
      • Research
      • Refereed limited

      Conference

      CODS-COMAD 2024
