Research Article
DOI: 10.1145/2304576.2304594

Congestion avoidance on manycore high performance computing systems

Published: 25 June 2012

Abstract

Efficient communication is a requirement for application scalability on High Performance Computing systems. In this paper we argue for incorporating proactive congestion avoidance mechanisms into the design of communication layers on manycore systems. This contrasts with the status quo, which employs a reactive approach: congestion control mechanisms are activated only after resources have been exhausted. We present a core-stateless optimization approach based on open-loop end-point throttling, implemented for two UPC runtimes (Cray and Berkeley UPC) and validated on InfiniBand and Cray Gemini networks. Microbenchmark results indicate that throttling the number of messages in flight per core can provide up to 4X performance improvement, while throttling the number of active cores per node can provide an additional 40% and 6X performance improvement for UPC and MPI, respectively. We evaluate both inline (each task makes independent decisions) and proxy (server-based) congestion avoidance designs. Our runtime provides both performance and performance portability: it improves all-to-all collective performance by up to 4X and outperforms vendor-provided MPI and UPC implementations. We also demonstrate performance improvements of up to 60% in application settings. Overall, our results indicate that modern systems accommodate only a surprisingly small number of messages in flight per node. As Exascale projections indicate that future systems are likely to contain hundreds to thousands of cores per node, we believe their networks will be underprovisioned. In this situation, proactive congestion avoidance may become mandatory for performance improvement and portability.
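
The throttling mechanism itself lives inside the UPC runtimes described in the paper; the sketch below is only a minimal illustration of the open-loop, messages-in-flight idea using plain MPI nonblocking sends. The cap MAX_INFLIGHT and the helper throttled_send are hypothetical names introduced here for illustration and are not part of the authors' implementation.

```c
/* Illustrative sketch (not the paper's code): cap the number of
 * nonblocking sends a rank keeps in flight, injecting a new message
 * only after an older one has retired. */
#include <mpi.h>

#define MAX_INFLIGHT 4   /* hypothetical per-core limit on messages in flight */

static void throttled_send(char **bufs, const int *lens, const int *peers,
                           int count, MPI_Comm comm)
{
    MPI_Request reqs[MAX_INFLIGHT];
    int in_flight = 0;

    for (int i = 0; i < count; i++) {
        if (in_flight == MAX_INFLIGHT) {
            /* Open-loop throttle: block until one outstanding send
             * completes before injecting the next message. */
            int done;
            MPI_Waitany(in_flight, reqs, &done, MPI_STATUS_IGNORE);
            reqs[done] = reqs[--in_flight];   /* compact the request window */
        }
        MPI_Isend(bufs[i], lens[i], MPI_BYTE, peers[i], /*tag=*/0, comm,
                  &reqs[in_flight++]);
    }
    /* Drain whatever is still outstanding. */
    MPI_Waitall(in_flight, reqs, MPI_STATUSES_IGNORE);
}
```

In the paper's terminology this corresponds to the inline design, where each task throttles its own injection; the proxy design would instead funnel messages through a designated server task per node.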


      Published In

      ICS '12: Proceedings of the 26th ACM international conference on Supercomputing
      June 2012
      400 pages
      ISBN:9781450313162
      DOI:10.1145/2304576

      Publisher

      Association for Computing Machinery

      New York, NY, United States


      Author Tags

      1. avoidance
      2. congestion
      3. cray
      4. high performance computing
      5. infiniband
      6. management
      7. manycore
      8. multicore

      Conference

      ICS'12: International Conference on Supercomputing
      June 25 - 29, 2012
      San Servolo Island, Venice, Italy

      Acceptance Rates

      Overall Acceptance Rate 629 of 2,180 submissions, 29%
