Barrier elision for production parallel programs

Authors:

John Mellor-Crummey,

Costin Iancu

ACM SIGPLAN Notices, Volume 50, Issue 8

Pages 109 - 119

https://doi.org/10.1145/2858788.2688502

Published: 24 January 2015

Abstract

Large scientific code bases are often composed of several layers of runtime libraries, implemented in multiple programming languages. In such situations, programmers often choose conservative synchronization patterns, which leads to suboptimal performance. In this paper, we present context-sensitive dynamic optimizations that elide barriers that are redundant during program execution. Our technique performs data race detection alongside the program to identify redundant barriers in their calling contexts; after an initial learning phase, it elides all future instances of barriers occurring in the same calling context. We present an automatic on-the-fly optimization and a multi-pass guided optimization. We apply our techniques to NWChem, a 6-million-line computational chemistry code written in C/C++/Fortran that uses several runtime libraries such as Global Arrays, ComEx, DMAPP, and MPI. Our technique elides a surprisingly high fraction of barriers (as many as 63%) in production runs. This redundancy elimination translates to application speedups as high as 14% on 2048 cores. Our techniques also provided valuable insight into the application's behavior, which NWChem developers later used. Overall, we demonstrate the value of holistic context-sensitive analyses that consider the domain science in conjunction with the associated runtime software stack.
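The learn-then-elide idea described in the abstract can be illustrated with a small toy model. This is only a sketch under stated assumptions, not the paper's implementation: the class `BarrierElider`, the `learning_threshold` parameter, the context tuples, and the `was_redundant` flag are all hypothetical names introduced here. In particular, the abstract says redundancy is determined by data race detection; the sketch abstracts that away as a boolean supplied by the caller.

```python
class BarrierElider:
    """Toy model of context-sensitive barrier elision (illustrative only)."""

    def __init__(self, learning_threshold=3):
        # Hypothetical knob: how many redundant observations a calling
        # context needs before its barriers are skipped.
        self.learning_threshold = learning_threshold
        self.redundant_count = {}  # calling context -> redundant observations
        self.needed = set()        # contexts where a barrier was ever required

    def should_elide(self, context):
        """A context is elidable once learning saw only redundant barriers."""
        if context in self.needed:
            return False
        return self.redundant_count.get(context, 0) >= self.learning_threshold

    def barrier(self, context, was_redundant, do_barrier):
        """Run or skip a barrier; returns True if the barrier was executed.

        `was_redundant` stands in for the verdict of a race detector
        (an assumption of this sketch) and is only consulted while learning.
        """
        if self.should_elide(context):
            return False
        do_barrier()
        if was_redundant:
            self.redundant_count[context] = self.redundant_count.get(context, 0) + 1
        else:
            self.needed.add(context)
        return True


def demo():
    """Same barrier routine reached through two hypothetical calling contexts."""
    elider = BarrierElider(learning_threshold=2)
    executed = []
    ctx_a = ("main", "solver", "sync")  # always redundant in this toy run
    ctx_b = ("main", "io", "sync")      # required once, so never elided
    for i in range(5):
        elider.barrier(ctx_a, True, lambda: executed.append("a"))
        elider.barrier(ctx_b, i != 1, lambda: executed.append("b"))
    return executed.count("a"), executed.count("b")
```

In this toy run, the barrier reached through `ctx_a` executes only during the two learning instances and is skipped afterwards, while the barrier reached through `ctx_b` keeps executing because it was observed to be necessary once. Keying the decision on the full calling context rather than on the barrier call site alone is what lets the same library routine be elided on one path and preserved on another.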

Index Terms

Barrier elision for production parallel programs
1. Computing methodologies
  1. Concurrent computing methodologies
    1. Concurrent programming languages
2. Software and its engineering
  1. Software notations and tools
    1. General programming languages
      1. Language types
        Concurrent programming languages

Published In

ACM SIGPLAN Notices Volume 50, Issue 8

PPoPP '15

August 2015

290 pages

ISSN:0362-1340

EISSN:1558-1160

DOI:10.1145/2858788

Editor:
Andy Gill
University of Kansas, Lawrence, KS

PPoPP 2015: Proceedings of the 20th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
January 2015
290 pages
ISBN:9781450332057
DOI:10.1145/2688500
General Chair:
Albert Cohen
INRIA, France
Program Chair:
David Grove
IBM Research, USA

Copyright © 2015 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States
