DOI: 10.1145/3627703.3629585

DynaPipe: Optimizing Multi-task Training through Dynamic Pipelines

Published: 22 April 2024

Abstract

Multi-task model training has been adopted to enable a single deep neural network model (often a large language model) to handle multiple tasks (e.g., question answering and text summarization). Multi-task training commonly receives input sequences of highly different lengths due to the diverse contexts of different tasks. Padding (to the same sequence length) or packing (short examples into long sequences of the same length) is usually adopted to prepare input samples for model training, which is nonetheless neither space- nor computation-efficient. This paper proposes a dynamic micro-batching approach to tackle sequence length variation and enable efficient multi-task model training. We advocate pipeline-parallel training of the large model with variable-length micro-batches, each of which potentially comprises a different number of samples. We optimize micro-batch construction using a dynamic programming-based approach, and handle micro-batch execution time variation through dynamic pipeline and communication scheduling, enabling highly efficient pipeline training. Extensive evaluation on the FLANv2 dataset demonstrates up to 4.39x higher training throughput when training T5, and 3.25x when training GPT, as compared with packing-based baselines. DynaPipe's source code is publicly available at https://github.com/awslabs/optimizing-multitask-training-through-dynamic-pipelines.
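
To make the dynamic-programming-based micro-batch construction idea concrete, below is a minimal sketch, not DynaPipe's actual algorithm or cost model. It assumes samples sorted by sequence length and a hypothetical cost per micro-batch equal to a fixed overhead plus the padded token count; a split-point recurrence then chooses where to cut the sorted list into micro-batches under a per-micro-batch token budget. The names plan_micro_batches, max_batch_tokens, and batch_overhead are illustrative assumptions, not from the paper.

# Minimal sketch of dynamic-programming-based micro-batch construction.
# Assumptions (not from the paper): samples are sorted ascending by sequence
# length, and a micro-batch of samples seq_lens[j:i] costs
#   batch_overhead + (i - j) * seq_lens[i - 1]
# i.e., a fixed per-micro-batch overhead plus the padded token count.

def plan_micro_batches(seq_lens, max_batch_tokens, batch_overhead):
    """Split `seq_lens` (sorted ascending) into contiguous micro-batches,
    minimizing total (overhead + padded tokens) under a per-batch budget."""
    assert all(l <= max_batch_tokens for l in seq_lens), "sample exceeds budget"
    n = len(seq_lens)
    best = [float("inf")] * (n + 1)  # best[i]: minimal cost of the first i samples
    best[0] = 0.0
    cut = [0] * (n + 1)              # cut[i]: start index of the last micro-batch

    for i in range(1, n + 1):
        # Try every feasible last micro-batch seq_lens[j:i].
        for j in range(i - 1, -1, -1):
            padded = (i - j) * seq_lens[i - 1]  # all samples padded to the max length
            if padded > max_batch_tokens:
                break  # growing the batch further only increases its padded size
            cost = batch_overhead + padded
            if best[j] + cost < best[i]:
                best[i] = best[j] + cost
                cut[i] = j

    # Recover micro-batch boundaries by walking the recorded cut points backwards.
    batches, i = [], n
    while i > 0:
        batches.append(list(range(cut[i], i)))
        i = cut[i]
    return list(reversed(batches))


if __name__ == "__main__":
    lengths = sorted([32, 48, 64, 128, 512, 520, 1024])
    print(plan_micro_batches(lengths, max_batch_tokens=2048, batch_overhead=256))

The overhead term is what keeps the recurrence from degenerating into one sample per micro-batch; it loosely stands in for per-micro-batch launch and pipeline costs. DynaPipe's actual optimization also has to account for execution-time variation across pipeline stages (handled in the paper via dynamic pipeline and communication scheduling), which this sketch does not attempt to capture.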


Cited By

  • (2024) Training and Serving System of Foundation Models: A Comprehensive Survey. IEEE Open Journal of the Computer Society, 5, 107-119. DOI: 10.1109/OJCS.2024.3380828

      Published In

      EuroSys '24: Proceedings of the Nineteenth European Conference on Computer Systems
      April 2024
      1245 pages
      ISBN:9798400704376
      DOI:10.1145/3627703
      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].


Publisher

Association for Computing Machinery, New York, NY, United States


      Author Tags

      1. distributed systems
      2. multi-task learning
      3. pipeline parallelism

      Qualifiers

      • Research-article
      • Research
      • Refereed limited


      Conference

      EuroSys '24

      Acceptance Rates

      Overall Acceptance Rate 241 of 1,308 submissions, 18%


