PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel

Published: 01 August 2023

Abstract

It is widely acknowledged that large models have the potential to deliver superior performance across a broad range of domains. Despite the remarkable progress made in the field of machine learning systems research, which has enabled the development and exploration of large models, such abilities remain confined to a small group of advanced users and industry leaders, resulting in an implicit technical barrier for the wider community to access and leverage these technologies. In this paper, we introduce PyTorch Fully Sharded Data Parallel (FSDP) as an industry-grade solution for large model training. FSDP has been closely co-designed with several key PyTorch core components including Tensor implementation, dispatcher system, and CUDA memory caching allocator, to provide non-intrusive user experiences and high training efficiency. Additionally, FSDP natively incorporates a range of techniques and settings to optimize resource utilization across a variety of hardware configurations. The experimental results demonstrate that FSDP is capable of achieving comparable performance to Distributed Data Parallel while providing support for significantly larger models with near-linear scalability in terms of TFLOPS.
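
The FSDP interface is exposed in PyTorch as torch.distributed.fsdp.FullyShardedDataParallel, which wraps an existing nn.Module in much the same way as DistributedDataParallel. The sketch below illustrates that non-intrusive usage; the toy model, hyperparameters, and launch assumptions (one process per CUDA device, e.g. launched with torchrun) are illustrative only and do not reproduce the paper's experimental configuration.

# Minimal FSDP usage sketch (illustrative; not the paper's setup).
import os
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

def main():
    # One process per GPU, e.g. launched with torchrun; NCCL backend for CUDA.
    dist.init_process_group("nccl")
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    torch.cuda.set_device(local_rank)

    # Placeholder model; any nn.Module can be wrapped.
    model = torch.nn.Sequential(
        torch.nn.Linear(1024, 4096),
        torch.nn.ReLU(),
        torch.nn.Linear(4096, 1024),
    ).cuda()

    # FSDP shards parameters, gradients, and optimizer state across
    # data-parallel ranks and gathers full parameters on demand during
    # forward and backward.
    model = FSDP(model)

    # Construct the optimizer after wrapping so it sees the sharded parameters.
    optim = torch.optim.AdamW(model.parameters(), lr=1e-4)

    for _ in range(10):  # toy training loop
        x = torch.randn(8, 1024, device="cuda")
        loss = model(x).sum()
        loss.backward()
        optim.step()
        optim.zero_grad()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()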

Information & Contributors

Information

Published In

Proceedings of the VLDB Endowment, Volume 16, Issue 12
August 2023
685 pages
ISSN: 2150-8097

Publisher

VLDB Endowment

Publication History

Published: 01 August 2023
Published in PVLDB Volume 16, Issue 12

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)274
  • Downloads (Last 6 weeks)31
Reflects downloads up to 24 Jan 2025

Other Metrics

Citations
