Research Article
DOI: 10.1145/2304576.2304594

Congestion avoidance on manycore high performance computing systems

Published: 25 June 2012

Abstract

Efficient communication is a requirement for application scalability on High Performance Computing systems. In this paper we argue for incorporating proactive congestion avoidance mechanisms into the design of communication layers on manycore systems. This contrasts with the status quo, which employs a reactive approach: congestion control mechanisms are activated only after resources have been exhausted. We present a core-stateless optimization approach based on open-loop end-point throttling, implemented for two UPC runtimes (Cray and Berkeley UPC) and validated on InfiniBand and Cray Gemini networks. Microbenchmark results indicate that throttling the number of messages in flight per core can provide up to 4X performance improvement, while throttling the number of active cores per node can provide an additional 40% and 6X performance improvement for UPC and MPI, respectively. We evaluate both inline (each task makes independent decisions) and proxy (server-based) congestion avoidance designs. Our runtime provides both performance and performance portability: it improves all-to-all collective performance by up to 4X and outperforms vendor-provided MPI and UPC implementations. We also demonstrate performance improvements of up to 60% in application settings. Overall, our results indicate that modern systems accommodate only a surprisingly small number of messages in flight per node. As Exascale projections indicate that future systems are likely to contain hundreds to thousands of cores per node, we believe their networks will be underprovisioned. In this situation, proactive congestion avoidance may become mandatory for performance improvement and portability.
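
The throttling mechanism itself lives inside the UPC runtimes described in the paper; the sketch below is only a minimal illustration of the open-loop, messages-in-flight idea using plain MPI nonblocking sends. The cap MAX_INFLIGHT and the helper throttled_send are hypothetical names introduced here for illustration and are not part of the authors' implementation.

```c
/* Illustrative sketch (not the paper's code): cap the number of
 * nonblocking sends a rank keeps in flight, injecting a new message
 * only after an older one has retired. */
#include <mpi.h>

#define MAX_INFLIGHT 4   /* hypothetical per-core limit on messages in flight */

static void throttled_send(char **bufs, const int *lens, const int *peers,
                           int count, MPI_Comm comm)
{
    MPI_Request reqs[MAX_INFLIGHT];
    int in_flight = 0;

    for (int i = 0; i < count; i++) {
        if (in_flight == MAX_INFLIGHT) {
            /* Open-loop throttle: block until one outstanding send
             * completes before injecting the next message. */
            int done;
            MPI_Waitany(in_flight, reqs, &done, MPI_STATUS_IGNORE);
            reqs[done] = reqs[--in_flight];   /* compact the request window */
        }
        MPI_Isend(bufs[i], lens[i], MPI_BYTE, peers[i], /*tag=*/0, comm,
                  &reqs[in_flight++]);
    }
    /* Drain whatever is still outstanding. */
    MPI_Waitall(in_flight, reqs, MPI_STATUSES_IGNORE);
}
```

In the paper's terminology this corresponds to the inline design, where each task throttles its own injection; the proxy design would instead funnel messages through a designated server task per node.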


      Published In

      ICS '12: Proceedings of the 26th ACM international conference on Supercomputing
      June 2012
      400 pages
      ISBN:9781450313162
      DOI:10.1145/2304576

      Publisher

      Association for Computing Machinery

      New York, NY, United States


      Author Tags

      1. avoidance
      2. congestion
      3. cray
      4. high performance computing
      5. infiniband
      6. management
      7. manycore
      8. multicore

      Conference

      ICS'12: International Conference on Supercomputing
      June 25 - 29, 2012
      San Servolo Island, Venice, Italy

      Acceptance Rates

      Overall Acceptance Rate 629 of 2,180 submissions, 29%
