Maximizing Communication Overlap with Dynamic Program Analysis

Authors: Emmanuelle Saillard, Costin Iancu

HPCAsia '18: Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region

Pages 1 - 11

https://doi.org/10.1145/3149457.3149459

Published: 28 January 2018

Abstract

We present a dynamic program analysis approach to optimize communication overlap in scientific applications. Our tool instruments the code to generate a trace of the application's memory and synchronization behavior. An offline analysis then determines the points in the program that maximize overlap, considering several programming constructs: non-blocking one-sided communication operations, non-blocking collectives, and bespoke synchronization patterns and operations. Feedback about possible transformations is presented to the user, and the tool can perform the directed transformations, which are supported by a lightweight runtime. The value of our approach comes from: 1) the ability to optimize across the boundaries of software modules or libraries, while specializing for the intrinsics of the underlying communication runtime; and 2) providing upper bounds on the performance improvements to expect from communication optimizations. We have reduced the time spent in communication by as much as 64% for several applications that were already aggressively optimized for overlap, which indicates that manual optimizations leave performance untapped. Although demonstrated mainly for the UPC programming language, the methodology can be easily adapted to any other communication and synchronization API.
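As a rough illustration of the kind of offline analysis the abstract describes, the sketch below scans a single-thread trace and finds how far a blocking transfer could be split apart: how early its initiation could be hoisted and how late its completion could be sunk. This is not the authors' tool or algorithm. The `Event` record, the `conflicts` rule, and the `overlap_window` function are hypothetical names introduced here, and the trace format is an assumption. The conflict rule is deliberately conservative: any barrier or other transfer stops the search.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """One hypothetical trace record (an assumed format, not the tool's)."""
    kind: str       # "read", "write", "barrier", or "get"
    buf: str = ""   # buffer touched by a read/write; destination of a get
    src: str = ""   # source buffer of a get


def conflicts(ev, dst, src):
    """Return True if the transfer dst <- src must not be moved across ev."""
    if ev.kind in ("barrier", "get"):
        return True                      # conservative: never cross these
    if ev.buf == dst:
        return True                      # any access to the destination
    return ev.kind == "write" and ev.buf == src   # source is overwritten


def overlap_window(trace, i):
    """Return (init, sync) for the blocking get at trace[i].

    The initiation could be placed just before trace[init] and the
    completion wait just after trace[sync]; init == sync == i means
    the transfer cannot be moved at all.
    """
    get = trace[i]
    assert get.kind == "get"
    init = i
    while init > 0 and not conflicts(trace[init - 1], get.buf, get.src):
        init -= 1
    sync = i
    while sync + 1 < len(trace) and not conflicts(trace[sync + 1], get.buf, get.src):
        sync += 1
    return init, sync


trace = [
    Event("write", "src"),        # 0: producer fills the source buffer
    Event("read", "a"),           # 1: independent work
    Event("write", "b"),          # 2: independent work
    Event("get", "dst", "src"),   # 3: blocking transfer dst <- src
    Event("read", "c"),           # 4: independent work
    Event("read", "dst"),         # 5: first use of the transferred data
    Event("barrier"),             # 6
]
print(overlap_window(trace, 3))   # (1, 4)
```

In this toy trace the transfer could be initiated right after the source buffer is filled and completed just before the destination is first read, so events 1, 2 and 4 become candidates for overlap. The width of that window is only an upper bound on what can be hidden; a real analysis would also have to account for accesses made by other threads.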

Cited By

A. Lescouet, E. Brunet, F. Trahay, and G. Thomas. Transparent Overlapping of Blocking Communication in MPI Applications. In 2020 IEEE 22nd International Conference on High Performance Computing and Communications; IEEE 18th International Conference on Smart City; IEEE 6th International Conference on Data Science and Systems (HPCC/SmartCity/DSS), pages 744--749, December 2020. https://doi.org/10.1109/HPCC-SmartCity-DSS50907.2020.00097

Index Terms

Maximizing Communication Overlap with Dynamic Program Analysis
1. Computing methodologies
  1. Parallel computing methodologies
2. Software and its engineering
  1. Software notations and tools

Published In

HPCAsia '18: Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region

January 2018

322 pages

ISBN:9781450353724

DOI:10.1145/3149457

Copyright © 2018 ACM.

© 2018 Association for Computing Machinery. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of the United States government. As such, the United States Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

In-Cooperation

SIGHPC: ACM Special Interest Group on High Performance Computing
IPSJ: Information Processing Society of Japan
Cybermedia Center, Osaka University

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article
Research
Refereed limited

Conference

HPC Asia 2018

HPC Asia 2018: International Conference on High Performance Computing in Asia-Pacific Region

January 28 - 31, 2018

Tokyo, Chiyoda, Japan

Acceptance Rates

HPCAsia '18 Paper Acceptance Rate 30 of 67 submissions, 45%;

Overall Acceptance Rate 69 of 143 submissions, 48%
