Open access · DOI: 10.1145/3620666.3651362

TCCL: Discovering Better Communication Paths for PCIe GPU Clusters

Published: 27 April 2024
    Abstract

    Exploiting parallelism to train deep learning models requires GPUs to cooperate through collective communication primitives. While systems like DGX, equipped with proprietary interconnects, have been studied extensively, systems in which GPUs communicate mainly through PCIe have received limited attention. This paper introduces TCCL, a collective communication library designed explicitly for such systems. TCCL has three components: a profiler for multi-transfer performance measurement, a pathfinder to discover optimal communication paths, and a modified NCCL runtime that uses the identified paths. The focus is on ring-based collective communication algorithms, which cover popular communication operations in deep learning such as AllReduce and AllGather. Evaluation of TCCL on three different PCIe-dependent GPU clusters shows that it outperforms the state-of-the-art communication libraries NCCL and MSCCL by up to 2.07×. We also evaluate TCCL on DL training workloads with various combinations of parallelism types.
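
    For context, the "ring-based collective communication algorithms" mentioned in the abstract are schemes such as ring AllReduce, in which each GPU repeatedly exchanges data chunks with a fixed neighbor along a chosen ordering of the devices. The sketch below is a minimal host-side NumPy simulation of the classic two-phase ring AllReduce (reduce-scatter, then all-gather); it is illustrative only, not TCCL's implementation, and the device ids, ring order, and buffer sizes are hypothetical.

```python
# Illustrative sketch only: host-side NumPy simulation of a two-phase
# ring AllReduce over a fixed device ordering (the "communication path").
import numpy as np


def ring_allreduce(buffers, ring):
    """Sum the buffers of all devices in `ring` (dict: device id -> 1-D array
    of equal length) and leave the complete result on every device."""
    n = len(ring)
    # Split each device's buffer into n chunks; the split points are identical
    # on every device because all buffers have the same length.
    chunks = {d: np.array_split(np.asarray(buffers[d], dtype=np.float64), n)
              for d in ring}

    # Phase 1: reduce-scatter. At step s, ring position i adds chunk
    # (i - s - 1) mod n received from its left neighbor to its own copy.
    # After n - 1 steps, position i holds the fully reduced chunk (i + 1) mod n.
    for s in range(n - 1):
        for i, d in enumerate(ring):
            left = ring[(i - 1) % n]
            c = (i - s - 1) % n
            chunks[d][c] = chunks[d][c] + chunks[left][c]

    # Phase 2: all-gather. At step s, ring position i overwrites chunk
    # (i - s) mod n with the fully reduced copy from its left neighbor.
    for s in range(n - 1):
        for i, d in enumerate(ring):
            left = ring[(i - 1) % n]
            c = (i - s) % n
            chunks[d][c] = chunks[left][c].copy()

    return {d: np.concatenate(chunks[d]) for d in ring}


if __name__ == "__main__":
    # Four hypothetical GPUs; the ring order stands in for a path chosen by
    # a pathfinder rather than the default device order.
    rng = np.random.default_rng(0)
    data = {d: rng.standard_normal(10) for d in range(4)}
    result = ring_allreduce(data, ring=[0, 2, 1, 3])
    expected = sum(data.values())
    assert all(np.allclose(result[d], expected) for d in range(4))
```

    Intuitively, on PCIe-dependent clusters the effective bandwidth of each hop depends on which links and host bridges the transfers share, which is why discovering a good device ordering for the ring matters.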


    Published In

    ASPLOS '24: Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3
    April 2024
    1106 pages
    ISBN: 9798400703867
    DOI: 10.1145/3620666
    This work is licensed under a Creative Commons Attribution 4.0 International License.

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Received: 30 November 2023
    Accepted: 6 March 2024
    Published: 27 April 2024

    Author Tags

    1. cluster
    2. GPU
    3. collective communication
    4. PCIe

    Qualifiers

    • Research-article

    Conference

    ASPLOS '24

    Acceptance Rates

    Overall Acceptance Rate 535 of 2,713 submissions, 20%
