PPL: an abstract runtime system for hybrid parallel programming

Published: 15 November 2015
DOI: 10.1145/2832241.2832246

Abstract

Hardware trends indicate that supercomputers will see rapidly growing intra-node parallelism. To cope with this evolution, future programming models will need to carefully manage the interaction between inter- and intra-node parallelism. Many existing programming models expose both levels of parallelism, but they do not scale well as per-node thread counts rise: interoperability between threading and communication is limited, leading to avoidable software overheads and redundant communication. Addressing this requires understanding the limitations of current models and developing new approaches.
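To make the interoperability problem concrete, the following is a minimal hybrid MPI+threads sketch, assumed for illustration and not taken from the paper: each MPI rank spawns several C++11 threads that all issue MPI calls, which requires the MPI_THREAD_MULTIPLE support level and is exactly the regime where rising per-node thread counts expose contention inside the communication library. The thread count and message pattern are arbitrary placeholders.

    // Minimal hybrid MPI+threads sketch (illustrative assumption, not from the paper):
    // several C++11 threads per rank issue MPI calls concurrently.
    #include <mpi.h>
    #include <cstdio>
    #include <thread>
    #include <vector>

    int main(int argc, char** argv) {
        int provided = 0;
        // Request full multithreaded support; many MPI implementations serialize
        // internally at this level, which is part of the overhead discussed above.
        MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
        if (provided < MPI_THREAD_MULTIPLE) {
            std::fprintf(stderr, "MPI_THREAD_MULTIPLE not available\n");
            MPI_Abort(MPI_COMM_WORLD, 1);
        }

        int rank = 0, size = 0;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        const int threads_per_rank = 4;  // hypothetical intra-node thread count
        std::vector<std::thread> workers;
        for (int t = 0; t < threads_per_rank; ++t) {
            workers.emplace_back([=] {
                // Each thread exchanges one message with a neighboring rank; all
                // threads funnel through the same MPI progress engine and locks.
                int send = rank * threads_per_rank + t, recv = -1;
                int dst = (rank + 1) % size;
                int src = (rank - 1 + size) % size;
                MPI_Sendrecv(&send, 1, MPI_INT, dst, /*tag=*/t,
                             &recv, 1, MPI_INT, src, /*tag=*/t,
                             MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            });
        }
        for (auto& w : workers) w.join();

        MPI_Finalize();
        return 0;
    }

Reducing the cost of exactly this kind of interleaving between the threading and communication layers is the motivation for the runtime design described next.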
We propose a new runtime system design, PPL, which abstracts the important high-level concepts of a typical parallel system for distributed-memory machines. Modularizing these elements allows individual layers to be tested to better understand the needs of future programming models. We present the design and a development implementation of PPL in C++11 and evaluate the performance of several different module implementations through micro-benchmarks and three applications: Barnes-Hut, Monte Carlo particle tracking, and a sparse triangular solver.
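The sketch below illustrates the kind of modular layering the abstract describes; the interface names (CommLayer, ThreadLayer, StdThreadLayer, Runtime) are hypothetical stand-ins, not PPL's actual API. The point is only that when the runtime programs against small abstract interfaces, alternative module implementations can be swapped in and benchmarked independently.

    // Hypothetical C++11 layering sketch (not PPL's actual interfaces): the
    // runtime depends only on abstract communication and threading layers, so
    // concrete modules can be exchanged and measured one at a time.
    #include <cstddef>
    #include <functional>
    #include <memory>
    #include <thread>
    #include <vector>

    // Abstract communication layer; a concrete module might wrap MPI, GASNet,
    // or a vendor RDMA library.
    struct CommLayer {
        virtual ~CommLayer() = default;
        virtual int  rank() const = 0;
        virtual int  size() const = 0;
        virtual void put(int target, const void* buf, std::size_t bytes) = 0;
        virtual void barrier() = 0;
    };

    // Abstract intra-node threading layer; a concrete module might wrap
    // std::thread, a task pool, or user-level threads.
    struct ThreadLayer {
        virtual ~ThreadLayer() = default;
        virtual void parallel_for(std::size_t n,
                                  const std::function<void(std::size_t)>& body) = 0;
    };

    // A trivial std::thread-based module, included only to keep the sketch
    // self-contained and runnable.
    struct StdThreadLayer : ThreadLayer {
        void parallel_for(std::size_t n,
                          const std::function<void(std::size_t)>& body) override {
            std::vector<std::thread> ts;
            for (std::size_t i = 0; i < n; ++i) ts.emplace_back(body, i);
            for (auto& t : ts) t.join();
        }
    };

    // The runtime composes whichever modules were selected; benchmarking each
    // composition is the kind of experiment the abstract alludes to.
    struct Runtime {
        std::unique_ptr<CommLayer>   comm;     // e.g. an MPI- or RDMA-backed module
        std::unique_ptr<ThreadLayer> threads;  // e.g. StdThreadLayer above
    };

    int main() {
        StdThreadLayer threads;
        std::vector<int> out(4, 0);
        threads.parallel_for(out.size(),
                             [&](std::size_t i) { out[i] = static_cast<int>(i); });
        return (out[3] == 3) ? 0 : 1;
    }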


Cited By

  • Aluminum: An Asynchronous, GPU-Aware Communication Library Optimized for Large-Scale Training of Deep Neural Networks on HPC Systems. 2018 IEEE/ACM Machine Learning in HPC Environments (MLHPC), pages 1-13, November 2018. DOI: 10.1109/MLHPC.2018.8638639
  • Towards millions of communicating threads. Proceedings of the 23rd European MPI Users' Group Meeting, pages 1-14, September 2016. DOI: 10.1145/2966884.2966914

Published In

ESPM '15: Proceedings of the First International Workshop on Extreme Scale Programming Models and Middleware
November 2015
58 pages
ISBN:9781450339964
DOI:10.1145/2832241

Publisher

Association for Computing Machinery, New York, NY, United States


Author Tags

  1. PGAS
  2. RDMA
  3. distributed-memory parallelism
  4. multithreading
  5. one-sided communication
  6. programming models

Qualifiers

  • Research-article

Funding Sources

  • NSF (National Science Foundation)
  • Sandia National Laboratories

Conference

SC15

Acceptance Rates

ESPM '15 Paper Acceptance Rate: 5 of 10 submissions (50%)
Overall Acceptance Rate: 5 of 10 submissions (50%)

