DOI: 10.1145/3580305.3599573

Training Large-scale Foundation Models on Emerging AI Chips

Published: 04 August 2023
Abstract

    Foundation models such as ChatGPT and GPT-4 have garnered significant interest from both academia and industry due to their emergent capabilities, such as few-shot prompting, multi-step reasoning, instruction following, and model calibration. Such capabilities were previously only attainable with specially designed models, such as those using knowledge graphs, but can now be achieved on a much larger scale with foundation models. As the capabilities of foundation models have increased, so too have their sizes, at a rate much faster than Moore's law. For example, the BERT large model was released with 334M parameters in 2018, while by 2023 the largest GPT-4 models are estimated to have between 200B and 300B parameters, an increase of three orders of magnitude in just five years. Training foundation models requires massive computing power. For instance, training a BERT model on a single state-of-the-art GPU machine with multiple A100 chips can take several days, while training GPT-3 models on a large multi-instance GPU cluster can take several months to complete the estimated 3 × 10^23 FLOPs.
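    As a rough sanity check on these figures (a sketch added here, not part of the original abstract), the training time implied by 3 × 10^23 FLOPs can be estimated from an assumed cluster size and sustained utilization; the GPU count and efficiency below are illustrative assumptions only.

```python
# Back-of-envelope estimate of GPT-3-scale training time from total FLOPs.
# Cluster size and sustained utilization are assumed values for illustration,
# not figures from the tutorial.
total_flops = 3e23            # estimated training compute for GPT-3 (from the abstract)
num_gpus = 1024               # assumed cluster size
peak_flops_per_gpu = 312e12   # NVIDIA A100 BF16 peak, in FLOP/s
utilization = 0.3             # assumed fraction of peak actually sustained

cluster_rate = num_gpus * peak_flops_per_gpu * utilization   # effective FLOP/s
days = total_flops / cluster_rate / 86_400
print(f"~{days:.0f} days at these assumed settings")
```

    With a smaller cluster or lower sustained utilization, the same arithmetic stretches to several months, consistent with the abstract's claim.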
    This tutorial provides an overview of the latest progress in supporting foundation model training and inference with new AI chips. It reviews progress on the modeling side, with an emphasis on the transformer architecture, and presents the system architecture supporting the training and serving of foundation models. This includes deep learning frameworks such as PyTorch and TensorFlow, graph compilers, 3D parallelism (data, tensor, and pipeline parallelism), and accelerators such as the NVIDIA H100 GPU, Google TPU, and AWS Trainium. Finally, the tutorial presents our experience of training foundation models on different systems.
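    To make the systems stack concrete, the sketch below (not from the tutorial) shows one building block of 3D parallelism, fully sharded data parallelism in PyTorch via FSDP; the model dimensions and the NCCL/torchrun setup are illustrative assumptions.

```python
# Minimal sketch of fully sharded data parallelism (FSDP) in PyTorch,
# one building block of 3D parallelism. Launch with torchrun so that the
# rank/world-size environment variables are set.
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

dist.init_process_group(backend="nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

# A small transformer encoder stands in for a foundation model (sizes are illustrative).
layer = torch.nn.TransformerEncoderLayer(d_model=1024, nhead=16, batch_first=True)
model = torch.nn.TransformerEncoder(layer, num_layers=12).cuda()
model = FSDP(model)  # shards parameters, gradients, and optimizer state across ranks

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

# One dummy training step on random data.
x = torch.randn(8, 128, 1024, device="cuda")   # (batch, sequence, hidden)
loss = model(x).mean()
loss.backward()
optimizer.step()
optimizer.zero_grad()
```

    In a full 3D-parallel setup, this sharded data parallelism is combined with tensor and pipeline parallelism across the cluster.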

    Published In

    KDD '23: Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining
    August 2023
    5996 pages
    ISBN: 9798400701030
    DOI: 10.1145/3580305
    Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. AI accelerator
    2. foundation models
    3. GPU
    4. TPU
    5. Trainium

    Qualifiers

    • Abstract

    Conference

    KDD '23
    Acceptance Rates

    Overall Acceptance Rate: 1,133 of 8,635 submissions (13%)
