DOI: 10.1145/2464996.2465434
research-article

Bandwidth-optimal all-to-all exchanges in fat tree networks

Published: 10 June 2013

Abstract

The personalized all-to-all collective exchange is one of the most challenging communication patterns in HPC applications in terms of performance and scalability. In the context of the fat tree family of interconnection networks, widely used in current HPC systems and datacenters, we show that there is potential for optimizing this traffic pattern by deriving a tight theoretical lower bound for the bandwidth needed in the network to support such communication in a non-contending way. Current state-of-the-art methods require up to twice as much bisection bandwidth as this theoretical minimum. We propose a set of optimized exchanges that use exactly the minimum amount of resources and exhibit close-to-ideal performance. This enables cost-effective networks, i.e., networks with as little as half the bisection bandwidth required by current state-of-the-art methods, to exhibit quasi-optimal performance under all-to-all traffic. In addition to supporting our claims by mathematical proofs, we include simulation results that confirm their correctness in practical system configurations.
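For readers unfamiliar with the traffic pattern, the following is a minimal sketch of a personalized all-to-all decomposed into phases. It uses the classic linear-shift schedule as an illustration only; it is not the optimized exchange proposed in the paper, and it does not model the fat tree or its bandwidth bound. The function names (`shift_schedule`, `is_permutation`, `covers_all_pairs`) are hypothetical and chosen for this example.

```python
def shift_schedule(n):
    """Build a linear-shift all-to-all schedule for n nodes.

    Returns n - 1 phases. In phase s (1 <= s < n), node i sends its
    personalized message to node (i + s) % n.
    """
    return [[(i, (i + s) % n) for i in range(n)] for s in range(1, n)]


def is_permutation(phase, n):
    """True if every node sends exactly once and receives exactly once."""
    sources = sorted(src for src, _ in phase)
    dests = sorted(dst for _, dst in phase)
    return sources == list(range(n)) and dests == list(range(n))


def covers_all_pairs(schedule, n):
    """True if each ordered pair (i, j) with i != j appears exactly once."""
    pairs = [pair for phase in schedule for pair in phase]
    expected = {(i, j) for i in range(n) for j in range(n) if i != j}
    return len(pairs) == len(expected) and set(pairs) == expected


if __name__ == "__main__":
    n = 8  # illustrative node count, not taken from the paper
    schedule = shift_schedule(n)
    for s, phase in enumerate(schedule, start=1):
        print(f"phase {s}: {phase}")
    print("every phase is a permutation:",
          all(is_permutation(p, n) for p in schedule))
    print("all ordered pairs covered once:", covers_all_pairs(schedule, n))
```

Each phase is a permutation, so no end node sends or receives more than one message at a time. Whether the phases also avoid contention inside the network depends on the topology and routing, which is the question the abstract addresses for fat trees.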

    Published In

    ICS '13: Proceedings of the 27th international ACM conference on International conference on supercomputing
    June 2013
    512 pages
    ISBN:9781450321303
    DOI:10.1145/2464996

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. all-to-all
    2. bandwidth optimality
    3. fat tree networks

    Qualifiers

    • Research-article

    Conference

    ICS'13: International Conference on Supercomputing
    June 10 - 14, 2013
    Eugene, Oregon, USA

    Acceptance Rates

    ICS '13 Paper Acceptance Rate 43 of 202 submissions, 21%;
    Overall Acceptance Rate 629 of 2,180 submissions, 29%

    Article Metrics

    • Downloads (last 12 months): 104
    • Downloads (last 6 weeks): 3
    Reflects downloads up to 30 Aug 2024

    Cited By

    • (2024) Exploring the benefits of using co-packaged optics in data center and AI supercomputer networks: a simulation-based analysis [Invited]. Journal of Optical Communications and Networking 16:2 (A143). DOI: 10.1364/JOCN.501427. Online publication date: 8-Jan-2024
    • (2023) How Data Center Networks Can Improve Through Co-packaged Optics. 2023 Optical Fiber Communications Conference and Exhibition (OFC) (1-3). DOI: 10.23919/OFC49934.2023.10116657. Online publication date: Mar-2023
    • (2023) Distributed SPARQL queries in collaboration with the routing protocol. Proceedings of the 27th International Database Engineered Applications Symposium (99-106). DOI: 10.1145/3589462.3589497. Online publication date: 5-May-2023
    • (2023) Fault-adaptive Scheduling for Data Acquisition Networks. 2023 IEEE 48th Conference on Local Computer Networks (LCN) (1-4). DOI: 10.1109/LCN58197.2023.10223324. Online publication date: 2-Oct-2023
    • (2023) Realizing Optimal All-to-All Personalized Communication Using Butterfly-Based Networks. IEEE Access 11 (51064-51083). DOI: 10.1109/ACCESS.2023.3279494. Online publication date: 2023
    • (2023) xCCL: A Survey of Industry-Led Collective Communication Libraries for Deep Learning. Journal of Computer Science and Technology 38:1 (166-195). DOI: 10.1007/s11390-023-2894-6. Online publication date: 31-Mar-2023
    • (2022) Intra-node High-performance Computing Network Architecture with Fast Optical Switch Fabrics. 2022 27th OptoElectronics and Communications Conference (OECC) and 2022 International Conference on Photonics in Switching and Computing (PSC) (1-4). DOI: 10.23919/OECC/PSC53152.2022.9850165. Online publication date: 3-Jul-2022
    • (2022) Toward higher-radix switches with co-packaged optics for improved network locality in data center and HPC networks [Invited]. Journal of Optical Communications and Networking 14:6 (C1). DOI: 10.1364/JOCN.451449. Online publication date: 4-Mar-2022
    • (2022) Applying on Node Aggregation Methods to MPI Alltoall Collectives: Matrix Block Aggregation Algorithm. Proceedings of the 29th European MPI Users' Group Meeting (11-17). DOI: 10.1145/3555819.3555821. Online publication date: 14-Sep-2022
    • (2022) HammingMesh: A Network Topology for Large-Scale Deep Learning. SC22: International Conference for High Performance Computing, Networking, Storage and Analysis (1-18). DOI: 10.1109/SC41404.2022.00016. Online publication date: Nov-2022
