
Model-Parallel Model Selection for Deep Learning Systems

Published: 18 June 2021 · DOI: 10.1145/3448016.3450571

Abstract

As deep learning becomes more expensive in both time and compute, inefficiencies in machine learning training put practical use of state-of-the-art models out of reach for most users. The newest model architectures are simply too large to fit on a single processor. To address this issue, many ML practitioners have turned to model parallelism, which distributes the computational requirements of a model across several devices. Unfortunately, the sequential nature of neural networks causes very low efficiency and device utilization in model-parallel training jobs. We propose a new form of "shard parallelism" that combines task parallelism and model parallelism, and package it into a framework we name Hydra. Hydra recasts the problem of model parallelism in the multi-model context to produce a fine-grained parallel workload of independent model shards, rather than independent models. This new parallel design promises dramatic speedups relative to the traditional model-parallelism paradigm.
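
The abstract's core idea, treating shards of many models rather than whole models as the schedulable unit, can be made concrete with a small scheduling sketch. The following is a minimal, hypothetical illustration and not Hydra's actual implementation or API: all names (ShardTask, schedule, NUM_DEVICES) are invented for the example, and a real system must additionally handle parameter movement between devices, backward passes, and memory limits.

```python
# Minimal sketch of shard-level scheduling across models in a multi-model
# (model-selection) workload. Hypothetical names; not the Hydra API.
# One task = one shard of one model's sequential forward pass.
from collections import deque
from dataclasses import dataclass
from typing import Deque, List

NUM_DEVICES = 2  # assumed device count for this example


@dataclass(frozen=True)
class ShardTask:
    model_id: int  # which model in the multi-model workload
    shard_id: int  # position of the shard in that model's pipeline


def schedule(num_models: int, shards_per_model: int) -> List[List[ShardTask]]:
    """Greedy round-based scheduler. Within one model, shards are strictly
    ordered (the sequential dependency that starves classic model
    parallelism); across models, the head shards of the queues are mutually
    independent, which yields the fine-grained parallel workload the
    abstract describes."""
    # One FIFO queue of shard tasks per model.
    queues: List[Deque[ShardTask]] = [
        deque(ShardTask(m, s) for s in range(shards_per_model))
        for m in range(num_models)
    ]
    rounds: List[List[ShardTask]] = []
    start = 0  # round-robin cursor so no model monopolizes the devices
    while any(queues):
        placed: List[ShardTask] = []
        m, scanned = start, 0
        # Fill every free device with a runnable shard. At most one shard
        # per model per round, since a model's next shard depends on the
        # completion of its previous one.
        while len(placed) < NUM_DEVICES and scanned < num_models:
            if queues[m]:
                placed.append(queues[m].popleft())
            m = (m + 1) % num_models
            scanned += 1
        start = m
        rounds.append(placed)
    return rounds


if __name__ == "__main__":
    for i, r in enumerate(schedule(num_models=3, shards_per_model=2)):
        print(f"round {i}: " + ", ".join(f"M{t.model_id}/S{t.shard_id}" for t in r))
```

On this toy workload (three 2-shard models, two devices), interleaving shards from different models keeps both devices busy in every round, whereas running the models one at a time under plain model parallelism would leave a device idle while each model's shards execute in sequence.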




Published In

SIGMOD '21: Proceedings of the 2021 International Conference on Management of Data
June 2021
2969 pages
ISBN:9781450383431
DOI:10.1145/3448016
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 18 June 2021

Author Tags

  1. GPU
  2. database systems
  3. deep learning
  4. efficiency
  5. machine learning
  6. machine learning systems
  7. memory
  8. model parallelism
  9. model training
  10. parallelism
  11. scheduling
  12. systems

Qualifiers

  • Abstract

Conference

SIGMOD/PODS '21

Acceptance Rates

Overall acceptance rate: 785 of 4,003 submissions (20%)


Article Metrics

  • Downloads (last 12 months): 27
  • Downloads (last 6 weeks): 5
Reflects downloads up to 08 Feb 2025

Cited By

  • (2024) D3-GNN: Dynamic Distributed Dataflow for Streaming Graph Neural Networks. Proceedings of the VLDB Endowment 17(11), 2764–2777. DOI: 10.14778/3681954.3681961. Published 1 Jul 2024.
  • (2024) Saturn: An Optimized Data System for Multi-Large-Model Deep Learning Workloads. Proceedings of the VLDB Endowment 17(4), 712–725. DOI: 10.14778/3636218.3636227. Published 5 Mar 2024.
  • (2023) InTune: Reinforcement Learning-based Data Pipeline Optimization for Deep Recommendation Models. Proceedings of the 17th ACM Conference on Recommender Systems, 430–442. DOI: 10.1145/3604915.3608778. Published 14 Sep 2023.
  • (2023) A Load-Balancing Strategy Based on Multi-Task Learning in a Distributed Training Environment. 2023 International Conference on Advances in Electrical Engineering and Computer Applications (AEECA), 862–868. DOI: 10.1109/AEECA59734.2023.00158. Published 18 Aug 2023.
  • (2023) 2D-THA-ADMM: communication efficient distributed ADMM algorithm framework based on two-dimensional torus hierarchical AllReduce. International Journal of Machine Learning and Cybernetics 15(2), 207–226. DOI: 10.1007/s13042-023-01903-9. Published 28 Jun 2023.
  • (2022) Online Content Veracity Assessment using Deep Representation Learning. 2022 19th International Bhurban Conference on Applied Sciences and Technology (IBCAST), 325–330. DOI: 10.1109/IBCAST54850.2022.9990148. Published 16 Aug 2022.
  • (2022) Deep Learning in Robotics for Strengthening Industry 4.0: Opportunities, Challenges and Future Directions. Robotics and AI for Cybersecurity and Critical Infrastructure in Smart Cities, 1–19. DOI: 10.1007/978-3-030-96737-6_1. Published 29 Mar 2022.
