Modeling Universal Globally Adaptive Load-Balanced Routing

Published: 30 August 2019

Abstract

Universal globally adaptive load-balanced (UGAL) routing has been proposed for various interconnection networks and has been deployed in a number of current-generation supercomputers. Although UGAL-based schemes have been extensively studied, most existing results are based on either simulation or measurement. Without a theoretical understanding of UGAL, multiple questions remain: For which traffic patterns is UGAL most suited? What determines the performance of a UGAL-based scheme on a particular network configuration? In this work, we develop a set of throughput models for UGAL based on linear programming. We show that the throughput models are valid across the torus, Dragonfly, and Slim Fly network topologies. Finally, we identify a robust model that can accurately and efficiently predict UGAL throughput for a set of representative traffic patterns across different topologies. Our models not only provide a mechanism to predict UGAL performance on large-scale interconnection networks but also reveal the inner workings of UGAL and further our understanding of this type of routing.
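The abstract does not spell out how UGAL selects a path, so the following is only a minimal sketch of the decision rule as it is commonly described: at the source, compare the estimated delay of a minimal path against that of a nonminimal (Valiant-style) path, where delay is approximated as queue occupancy times hop count. The function name `ugal_choose` and its parameters are hypothetical and are not taken from the article, whose linear programming models are not reproduced here.

```python
def ugal_choose(q_min, hops_min, q_nonmin, hops_nonmin):
    """Pick a path class using a commonly described UGAL-style rule.

    All names here are illustrative assumptions, not the article's notation.
    q_min / q_nonmin: queue occupancy observed at the source for the
        minimal and the candidate nonminimal path.
    hops_min / hops_nonmin: hop counts of the two candidate paths.
    The estimated delay of a path is taken to be queue occupancy * hops.
    """
    delay_min = q_min * hops_min
    delay_nonmin = q_nonmin * hops_nonmin
    # Ties go to the minimal path, which consumes fewer link resources.
    if delay_min <= delay_nonmin:
        return "minimal"
    return "nonminimal"


if __name__ == "__main__":
    # Lightly loaded minimal path: stay minimal.
    print(ugal_choose(q_min=2, hops_min=3, q_nonmin=1, hops_nonmin=6))
    # Congested minimal path: divert to the nonminimal path.
    print(ugal_choose(q_min=10, hops_min=3, q_nonmin=1, hops_nonmin=6))
```

Under this rule, benign traffic tends to stay on minimal paths, while adversarial traffic that congests them is diverted, which is why throughput depends on both the traffic pattern and the topology.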

Published In

ACM Transactions on Parallel Computing, Volume 6, Issue 2
June 2019
109 pages
ISSN:2329-4949
EISSN:2329-4957
DOI:10.1145/3343018
© 2019 Association for Computing Machinery. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of the United States government. As such, the United States Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 30 August 2019
Accepted: 01 July 2019
Revised: 01 March 2019
Received: 01 August 2018
Published in TOPC Volume 6, Issue 2

Author Tags

  1. Adaptive routing
  2. UGAL routing
  3. high performance computing

Qualifiers

  • Research-article
  • Research
  • Refereed

Funding Sources

  • Advanced Simulation and Computing

Cited By

  • (2024) Analysis and prediction of performance variability in large-scale computing systems. The Journal of Supercomputing 80:10 (14978-15005). DOI: 10.1007/s11227-024-06040-w. Online publication date: 1-Jul-2024
  • (2023) An Analysis of Long-Tailed Network Latency Distribution and Background Traffic on Dragonfly+. Benchmarking, Measuring, and Optimizing (123-142). DOI: 10.1007/978-3-031-31180-2_8. Online publication date: 13-May-2023
  • (2021) Multi-Path Routing in the Jellyfish Network. 2021 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW) (832-841). DOI: 10.1109/IPDPSW52791.2021.00124. Online publication date: Jun-2021
  • (2020) Global link arrangement for practical Dragonfly. Proceedings of the 34th ACM International Conference on Supercomputing (1-11). DOI: 10.1145/3392717.3392756. Online publication date: 29-Jun-2020
