Control replication: compiling implicit parallelism to efficient SPMD with logical regions

Authors: Elliott Slaughter, Sean Treichler, Patrick McCormick, Alex Aiken

SC '17: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis

Article No.: 14, Pages 1 - 12

https://doi.org/10.1145/3126908.3126949

Published: 12 November 2017

Abstract

We present control replication, a technique for generating high-performance and scalable SPMD code from implicitly parallel programs. In contrast to traditional parallel programming models that require the programmer to explicitly manage threads and the communication and synchronization between them, implicitly parallel programs have sequential execution semantics and naturally avoid the pitfalls of explicitly parallel code. However, without optimizations to distribute control overhead, scalability is often poor.

Performance on distributed-memory machines is especially sensitive to communication and synchronization in the program, and thus optimizations for these machines require an intimate understanding of a program's memory accesses. Control replication achieves particularly effective and predictable results by leveraging language support for first-class data partitioning in the source programming model. We evaluate an implementation of control replication for Regent and show that it achieves up to 99% parallel efficiency at 1024 nodes with absolute performance comparable to hand-written MPI(+X) codes.
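
As a rough illustration of the contrast the abstract draws, the toy Python sketch below runs the same computation two ways: once with a single control thread that launches every task (sequential semantics), and once in SPMD form, where each shard replays the same control flow over only the blocks it owns. This is not the Regent compiler's transformation. All names (`step`, `partition`, `run_implicit`, `run_spmd_shard`, `run_spmd`) are hypothetical, and the blocks are assumed independent, so the sketch omits the inter-shard communication and synchronization that real control replication would have to generate.

```python
# Toy model only: every name here is hypothetical and not part of Regent or Legion.

def step(block):
    """A 'task': double every element of one block of the partition."""
    return [2 * x for x in block]


def partition(data, num_blocks):
    """Split data into num_blocks contiguous blocks (an explicit data partition)."""
    size = (len(data) + num_blocks - 1) // num_blocks
    return [data[i * size:(i + 1) * size] for i in range(num_blocks)]


def run_implicit(blocks, num_steps):
    """Sequential semantics: one control thread issues a task for every block."""
    for _ in range(num_steps):
        blocks = [step(b) for b in blocks]
    return blocks


def run_spmd_shard(blocks, num_steps, shard, num_shards):
    """SPMD form: a shard runs the same loop, but only over the blocks it owns."""
    owned = {i: b for i, b in enumerate(blocks) if i % num_shards == shard}
    for _ in range(num_steps):
        owned = {i: step(b) for i, b in owned.items()}
    return owned


def run_spmd(blocks, num_steps, num_shards):
    """Run every shard (serially here, for simplicity) and reassemble the blocks."""
    result = {}
    for shard in range(num_shards):
        result.update(run_spmd_shard(blocks, num_steps, shard, num_shards))
    return [result[i] for i in range(len(blocks))]


blocks = partition(list(range(8)), 4)
# Both forms compute the same answer; only the distribution of control differs.
assert run_implicit(blocks, 3) == run_spmd(blocks, 3, num_shards=2)
```

In the SPMD form no single thread touches every block, which is the property that removes the control bottleneck; the cost in a real system is that data dependencies between shards must be satisfied by explicit copies and synchronization.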

Index Terms

Control replication: compiling implicit parallelism to efficient SPMD with logical regions
1. Software and its engineering
  1. Software notations and tools
    1. Compilers
    2. General programming languages
      1. Language types
        Parallel programming languages

Published In

SC '17: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis

November 2017

801 pages

ISBN:9781450351140

DOI:10.1145/3126908

General Chair: Bernd Mohr (Jülich Supercomputing Center, Jülich, Germany)

Program Chair: Padma Raghavan (Vanderbilt University, Nashville, TN)

Copyright © 2017 ACM.

Publication rights licensed to ACM. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of the United States government. As such, the Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

Sponsors

SIGHPC: ACM Special Interest Group on High Performance Computing

In-Cooperation

IEEE CS

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Funding Sources

Swiss National Supercomputing Centre (CSCS)

Conference

SC '17: The International Conference for High Performance Computing, Networking, Storage and Analysis

Sponsor: SIGHPC

November 12 - 17, 2017

Denver, Colorado

Acceptance Rates

SC '17 Paper Acceptance Rate: 61 of 327 submissions, 19%

Overall Acceptance Rate: 1,516 of 6,373 submissions, 24%

Bibliometrics

Total Citations: 13
Total Downloads: 242
Downloads (Last 12 months): 12
Downloads (Last 6 weeks): 1

Reflects downloads up to 25 Jan 2025

Cited By

Bauer M, Slaughter E, Treichler S, Lee W, Garland M, Aiken A, Dehnavi M, Kulkarni M, Krishnamoorthy S (2023). Visibility Algorithms for Dynamic Dependence Analysis and Distributed Coherence. Proceedings of the 28th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming, pp. 218-231. Online publication date: 25-Feb-2023.
https://dl.acm.org/doi/10.1145/3572848.3577515
Bauer M, Lee W, Slaughter E, Jia Z, Di Renzo M, Papadakis M, Shipman G, McCormick P, Garland M, Aiken A, Lee J, Petrank E (2021). Scaling implicit parallelism via dynamic control replication. Proceedings of the 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pp. 105-118. Online publication date: 17-Feb-2021.
https://dl.acm.org/doi/10.1145/3437801.3441587
Raut E, Anderson J, Araya-Polo M, Meng J (2021). Evaluation of Distributed Tasks in Stencil-based Application on GPUs. 2021 IEEE/ACM 6th International Workshop on Extreme Scale Programming Models and Middleware (ESPM2), pp. 45-52. Online publication date: Nov-2021.
https://doi.org/10.1109/ESPM254806.2021.00011
Slaughter E, Wu W, Fu Y, Brandenburg L, Garcia N, Kautz W, Marx E, Morris K, Cao Q, Bosilca G, Mirchandaney S, Lee W, Treichler S, McCormick P, Aiken A, Cuicchi C, Qualters I, Kramer W (2020). Task bench. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1-15. Online publication date: 9-Nov-2020.
https://dl.acm.org/doi/10.5555/3433701.3433783
Slaughter E, Wu W, Fu Y, Brandenburg L, Garcia N, Kautz W, Marx E, Morris K, Cao Q, Bosilca G, Mirchandaney S, Lee W, Treichler S, McCormick P, Aiken A (2020). Task Bench: A Parameterized Benchmark for Evaluating Parallel Runtime Performance. SC20: International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1-15. Online publication date: Nov-2020.
https://doi.org/10.1109/SC41405.2020.00066
Soi R, Mamidi N, Slaughter E, Prasun K, Nemili A, Deshpande S (2020). An Implicitly Parallel Meshfree Solver in Regent. 2020 IEEE/ACM 3rd Annual Parallel Applications Workshop: Alternatives To MPI+X (PAW-ATM), pp. 40-54. Online publication date: Nov-2020.
https://doi.org/10.1109/PAWATM51920.2020.00009
Lee W, Papadakis M, Slaughter E, Aiken A, Taufer M, Balaji P, Peña A (2019). A constraint-based approach to automatic data partitioning for distributed memory execution. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1-24. Online publication date: 17-Nov-2019.
https://dl.acm.org/doi/10.1145/3295500.3356199
Bauer M, Garland M, Taufer M, Balaji P, Peña A (2019). Legate NumPy. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1-23. Online publication date: 17-Nov-2019.
https://dl.acm.org/doi/10.1145/3295500.3356175
Slaughter E, Aiken A (2019). Pygion: Flexible, Scalable Task-Based Parallelism with Python. 2019 IEEE/ACM Parallel Applications Workshop, Alternatives To MPI (PAW-ATM), pp. 58-72. Online publication date: Nov-2019.
https://doi.org/10.1109/PAW-ATM49560.2019.00011
Lee W, Slaughter E, Bauer M, Treichler S, Warszawski T, Garland M, Aiken A (2018). Dynamic tracing. Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis, pp. 1-13. Online publication date: 11-Nov-2018.
https://dl.acm.org/doi/10.5555/3291656.3291702
