DOI: 10.1145/3149457.3149459
research-article

Maximizing Communication Overlap with Dynamic Program Analysis

Published: 28 January 2018

Abstract

We present a dynamic program analysis approach to optimize communication overlap in scientific applications. Our tool instruments the code to generate a trace of the application's memory and synchronization behavior. An offline analysis determines the optimal program points for maximal overlap when considering several programming constructs: non-blocking one-sided communication operations, non-blocking collectives, and bespoke synchronization patterns and operations. Feedback about possible transformations is presented to the user, and the tool can perform the directed transformations, which are supported by a lightweight runtime. The value of our approach comes from: 1) the ability to optimize across boundaries of software modules or libraries, while specializing for the intrinsics of the underlying communication runtime; and 2) providing upper bounds on the expected performance improvements after communication optimizations. We have reduced the time spent in communication by as much as 64% for several applications that were already aggressively optimized for overlap; this indicates that manual optimizations leave untapped performance. Although demonstrated mainly for the UPC programming language, the methodology can be easily adapted to any other communication and synchronization API.
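To make the idea of trace-based overlap analysis concrete, the following is a minimal sketch and not the tool's actual implementation. It assumes a hypothetical, simplified per-thread trace in which each event is a local read, a local write, a blocking put of a local source buffer, or a barrier. The names `Event`, `overlap_window`, and `max_saving` are illustrative and do not come from the paper. For a given put, the sketch finds the earliest point where the transfer could be initiated (right after the last write to its source buffer) and the latest point where its completion must be awaited (right before the next write to that buffer), treating barriers conservatively as hard boundaries.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Event:
    """One entry in a hypothetical per-thread trace (illustrative format)."""
    kind: str      # "read", "write", "put" (blocking transfer), or "barrier"
    buf: str = ""  # local buffer touched; for "put", the source buffer


def _blocks(e: Event, buf: str) -> bool:
    # A barrier, or a write to the source buffer, limits how far the put can move.
    return e.kind == "barrier" or (e.kind == "write" and e.buf == buf)


def overlap_window(trace: List[Event], i: int) -> Tuple[int, int]:
    """Return (init, sync) trace positions for the blocking put at trace[i].

    init: earliest index at which the transfer could be initiated.
    sync: latest index before which completion must be awaited
          (len(trace) if nothing later conflicts).
    """
    put = trace[i]
    if put.kind != "put":
        raise ValueError("trace[i] is not a put")

    init = 0
    for j in range(i - 1, -1, -1):       # scan backwards for the last conflict
        if _blocks(trace[j], put.buf):
            init = j + 1
            break

    sync = len(trace)
    for j in range(i + 1, len(trace)):   # scan forwards for the next conflict
        if _blocks(trace[j], put.buf):
            sync = j
            break
    return init, sync


def max_saving(comm_time: float, overlappable_compute: float) -> float:
    """Upper bound on time saved if communication is fully hidden behind
    the independent computation that falls inside the overlap window."""
    return min(comm_time, overlappable_compute)


trace = [
    Event("write", "A"),
    Event("read", "B"),
    Event("write", "C"),
    Event("put", "A"),
    Event("read", "A"),
    Event("write", "C"),
    Event("write", "A"),
    Event("barrier"),
]
print(overlap_window(trace, 3))  # (1, 6)
```

In this example the put at position 3 could be split into an initiation at position 1 and a wait at position 6, so the work at positions 1, 2, 4, and 5 becomes available for overlap. A real analysis would also have to account for remote-side accesses, collectives, and the completion semantics of the specific communication runtime, which this sketch ignores.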

Cited By

  • (2020) Transparent Overlapping of Blocking Communication in MPI Applications. 2020 IEEE 22nd International Conference on High Performance Computing and Communications; IEEE 18th International Conference on Smart City; IEEE 6th International Conference on Data Science and Systems (HPCC/SmartCity/DSS), pages 744-749. DOI: 10.1109/HPCC-SmartCity-DSS50907.2020.00097. Online publication date: Dec-2020

      Published In

      HPCAsia '18: Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region
      January 2018
      322 pages
      ISBN: 9781450353724
      DOI: 10.1145/3149457
      © 2018 Association for Computing Machinery. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of the United States government. As such, the United States Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Author Tags

      1. Dynamic Analysis
      2. One-sided communication
      3. Optimization
      4. UPC

      Qualifiers

      • Research-article
      • Research
      • Refereed limited

      Conference

      HPC Asia 2018

      Acceptance Rates

      HPCAsia '18 Paper Acceptance Rate 30 of 67 submissions, 45%;
      Overall Acceptance Rate 69 of 143 submissions, 48%
