research-article

Bandwidth-optimal all-to-all exchanges in fat tree networks

Authors: Bogdan Prisacari, German Rodriguez, Cyriel Minkenberg, Torsten Hoefler

ICS '13: Proceedings of the 27th international ACM conference on International conference on supercomputing

Pages 139 - 148

https://doi.org/10.1145/2464996.2465434

Published: 10 June 2013

Abstract

The personalized all-to-all collective exchange is one of the most challenging communication patterns in HPC applications in terms of performance and scalability. In the context of the fat tree family of interconnection networks, widely used in current HPC systems and datacenters, we show that there is potential for optimizing this traffic pattern by deriving a tight theoretical lower bound for the bandwidth needed in the network to support such communication in a non-contending way. Current state-of-the-art methods require up to twice as much bisection bandwidth as this theoretical minimum. We propose a set of optimized exchanges that use exactly the minimum amount of resources and exhibit close-to-ideal performance. This enables cost-effective networks, i.e., networks with as little as half the bisection bandwidth required by current state-of-the-art methods, to exhibit quasi-optimal performance under all-to-all traffic. In addition to supporting our claims with mathematical proofs, we include simulation results that confirm their correctness in practical system configurations.
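To make the factor of two in the abstract more concrete, the following is a minimal counting sketch, not the construction or the proof from the paper. It assumes a simple linear shift exchange (in step k, node i sends to node (i + k) mod N) and a flat split of the nodes into two equal halves rather than a real fat tree; the function names `shift_crossings` and `summarize` are hypothetical. It counts how many left-to-right flows cross the bisection in each step and compares the peak with the per-step average that a perfectly balanced schedule of the same length would need.

```python
from math import ceil


def shift_crossings(n):
    """Per-step count of left-to-right flows crossing the bisection.

    Assumes a linear shift exchange: in step k (1 <= k < n), node src
    sends to (src + k) % n. Nodes 0 .. n/2 - 1 form the left half.
    """
    half = n // 2
    per_step = []
    for k in range(1, n):
        crossing = sum(1 for src in range(half) if (src + k) % n >= half)
        per_step.append(crossing)
    return per_step


def summarize(n):
    """Return (total crossings, peak per step, balanced per-step average)."""
    per_step = shift_crossings(n)
    total = sum(per_step)
    peak = max(per_step)
    # If the same crossings were spread evenly over the n - 1 steps,
    # no step would need more than this many concurrent crossings.
    balanced = ceil(total / len(per_step))
    return total, peak, balanced


if __name__ == "__main__":
    for n in (8, 16, 64):
        total, peak, balanced = summarize(n)
        print(f"N={n}: total={total}, shift peak={peak}, balanced={balanced}")
```

Under these assumptions the shift exchange peaks at N/2 simultaneous crossings, while the evenly spread figure is roughly N/4, which is consistent with the ratio of about two stated in the abstract. The sketch only computes the averaged figure; it does not show that a schedule achieving it exists, which is what the paper's optimized exchanges address.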



Index Terms

Bandwidth-optimal all-to-all exchanges in fat tree networks
1. Networks
  1. Network protocols


Published In

ICS '13: Proceedings of the 27th international ACM conference on International conference on supercomputing

June 2013

512 pages

ISBN:9781450321303

DOI:10.1145/2464996

General Chair: Allen D. Malony, University of Oregon, USA

Program Chairs: Mario Nemirovsky, Barcelona Supercomputing Center, Spain; Sam Midkiff, Purdue University, USA

Copyright © 2013 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGARCH: ACM Special Interest Group on Computer Architecture

Publisher

Association for Computing Machinery

New York, NY, United States


Qualifiers

Research-article

Conference

ICS'13: International Conference on Supercomputing

June 10 - 14, 2013

Eugene, Oregon, USA

Sponsor: SIGARCH

Acceptance Rates

ICS '13 Paper Acceptance Rate 43 of 202 submissions, 21%;

Overall Acceptance Rate 629 of 2,180 submissions, 29%

Bibliometrics

Total Citations: 32

Total Downloads: 433

Downloads (Last 12 months): 104

Downloads (Last 6 weeks): 3

Reflects downloads up to 30 Aug 2024

Cited By

Maniotis P, Kuchta D (2024). Exploring the benefits of using co-packaged optics in data center and AI supercomputer networks: a simulation-based analysis [Invited]. Journal of Optical Communications and Networking, 16:2 (A143). Online publication date: 8-Jan-2024.
https://doi.org/10.1364/JOCN.501427
Maniotis P, Schares L, Kuchta D (2023). How Data Center Networks Can Improve Through Co-packaged Optics. 2023 Optical Fiber Communications Conference and Exhibition (OFC) (1-3). Online publication date: Mar-2023.
https://doi.org/10.23919/OFC49934.2023.10116657
Warnke B, Fischer S, Groppe S (2023). Distributed SPARQL queries in collaboration with the routing protocol. Proceedings of the 27th International Database Engineered Applications Symposium (99-106). Online publication date: 5-May-2023.
https://dl.acm.org/doi/10.1145/3589462.3589497
Stein E, Bramas Q, Colombo T, Pelsser C (2023). Fault-adaptive Scheduling for Data Acquisition Networks. 2023 IEEE 48th Conference on Local Computer Networks (LCN) (1-4). Online publication date: 2-Oct-2023.
https://doi.org/10.1109/LCN58197.2023.10223324
Izzi D, Massini A (2023). Realizing Optimal All-to-All Personalized Communication Using Butterfly-Based Networks. IEEE Access, 11 (51064-51083). Online publication date: 2023.
https://doi.org/10.1109/ACCESS.2023.3279494
Weingram A, Li Y, Qi H, Ng D, Dai L, Lu X (2023). xCCL: A Survey of Industry-Led Collective Communication Libraries for Deep Learning. Journal of Computer Science and Technology, 38:1 (166-195). Online publication date: 31-Mar-2023.
https://doi.org/10.1007/s11390-023-2894-6
Maniotis P, Dupuis N, Schares L, Lee B, Kuchta D (2022). Intra-node High-performance Computing Network Architecture with Fast Optical Switch Fabrics. 2022 27th OptoElectronics and Communications Conference (OECC) and 2022 International Conference on Photonics in Switching and Computing (PSC) (1-4). Online publication date: 3-Jul-2022.
https://doi.org/10.23919/OECC/PSC53152.2022.9850165
Maniotis P, Schares L, Kuchta D, Karacali B (2022). Toward higher-radix switches with co-packaged optics for improved network locality in data center and HPC networks [Invited]. Journal of Optical Communications and Networking, 14:6 (C1). Online publication date: 4-Mar-2022.
https://doi.org/10.1364/JOCN.451449
Chochia G, Solt D, Hursey J (2022). Applying on Node Aggregation Methods to MPI Alltoall Collectives: Matrix Block Aggregation Algorithm. Proceedings of the 29th European MPI Users' Group Meeting (11-17). Online publication date: 14-Sep-2022.
https://dl.acm.org/doi/10.1145/3555819.3555821
Hoefler T, Bonato T, De Sensi D, Di Girolamo S, Li S, Heddes M, Belk J, Goel D, Castro M, Scott S (2022). HammingMesh: A Network Topology for Large-Scale Deep Learning. SC22: International Conference for High Performance Computing, Networking, Storage and Analysis (1-18). Online publication date: Nov-2022.
https://doi.org/10.1109/SC41404.2022.00016
