DOI: 10.1145/3126908.3126949
research-article

Control replication: compiling implicit parallelism to efficient SPMD with logical regions

Published: 12 November 2017

Abstract

We present control replication, a technique for generating high-performance and scalable SPMD code from implicitly parallel programs. In contrast to traditional parallel programming models that require the programmer to explicitly manage threads and the communication and synchronization between them, implicitly parallel programs have sequential execution semantics and naturally avoid the pitfalls of explicitly parallel code. However, without optimizations to distribute control overhead, scalability is often poor.
Performance on distributed-memory machines is especially sensitive to communication and synchronization in the program, and thus optimizations for these machines require an intimate understanding of a program's memory accesses. Control replication achieves particularly effective and predictable results by leveraging language support for first-class data partitioning in the source programming model. We evaluate an implementation of control replication for Regent and show that it achieves up to 99% parallel efficiency at 1024 nodes with absolute performance comparable to hand-written MPI(+X) codes.
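The idea of replicating control can be illustrated with a small single-process sketch. This is not Regent code and not the compiler transformation described in the paper; the function names, the averaging stencil, and the block partitioning are all assumptions chosen for illustration. `run_implicit` has ordinary sequential semantics over one array. `run_replicated` simulates shards that each own one block of a partition, exchange only boundary values, and then update their own block.

```python
from typing import List, Tuple


def step_global(data: List[float]) -> List[float]:
    """One averaging step over the whole array; an edge cell uses itself as the missing neighbor."""
    n = len(data)
    return [(data[max(i - 1, 0)] + data[min(i + 1, n - 1)]) / 2 for i in range(n)]


def run_implicit(data: List[float], steps: int) -> List[float]:
    """Sequential-semantics program: a single control loop over the global array."""
    for _ in range(steps):
        data = step_global(data)
    return data


def shard_step(block: List[float], left: float, right: float) -> List[float]:
    """Update one shard's block using only its own cells plus two halo values."""
    padded = [left] + block + [right]
    return [(padded[i - 1] + padded[i + 1]) / 2 for i in range(1, len(padded) - 1)]


def run_replicated(data: List[float], steps: int, num_shards: int) -> List[float]:
    """SPMD-style simulation (hypothetical): every shard runs the same loop on its own block."""
    n = len(data)
    if not 1 <= num_shards <= n:
        raise ValueError("need at least one cell per shard")
    # Contiguous, non-empty blocks: an assumed partitioning, fixed for the whole run.
    blocks = [data[s * n // num_shards:(s + 1) * n // num_shards] for s in range(num_shards)]
    for _ in range(steps):
        # Communication phase: each shard receives one value from each neighbor.
        halos: List[Tuple[float, float]] = []
        for s in range(num_shards):
            left = blocks[s - 1][-1] if s > 0 else blocks[s][0]
            right = blocks[s + 1][0] if s < num_shards - 1 else blocks[s][-1]
            halos.append((left, right))
        # Compute phase: shards update their blocks independently of one another.
        blocks = [shard_step(blocks[s], *halos[s]) for s in range(num_shards)]
    return [x for block in blocks for x in block]


if __name__ == "__main__":
    cells = [float(i * i) for i in range(16)]
    print(run_replicated(cells, 5, 4) == run_implicit(cells, 5))
```

In the sketch, the partition fixes which values cross shard boundaries, so the communication phase touches two values per shard rather than the whole array. Knowing the partitioning up front is what makes the required communication easy to identify.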

      Published In

      SC '17: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis
      November 2017
      801 pages
      ISBN:9781450351140
      DOI:10.1145/3126908
      • General Chair: Bernd Mohr
      • Program Chair: Padma Raghavan
      Publication rights licensed to ACM. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of the United States government. As such, the Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

      Sponsors

      In-Cooperation

      • IEEE CS

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Author Tags

      1. control replication
      2. legion
      3. regent
      4. regions
      5. task-based runtimes

      Qualifiers

      • Research-article

      Funding Sources

      • Swiss National Supercomputing Centre (CSCS)

      Conference

      SC '17
      Acceptance Rates

      SC '17 Paper Acceptance Rate 61 of 327 submissions, 19%;
      Overall Acceptance Rate 1,516 of 6,373 submissions, 24%

      Article Metrics

      • Downloads (last 12 months): 12
      • Downloads (last 6 weeks): 1
      Reflects downloads up to 25 Jan 2025

      Cited By

      • (2023) Visibility Algorithms for Dynamic Dependence Analysis and Distributed Coherence. Proceedings of the 28th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming, pp. 218-231. DOI: 10.1145/3572848.3577515. Online publication date: 25-Feb-2023
      • (2021) Scaling implicit parallelism via dynamic control replication. Proceedings of the 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pp. 105-118. DOI: 10.1145/3437801.3441587. Online publication date: 17-Feb-2021
      • (2021) Evaluation of Distributed Tasks in Stencil-based Application on GPUs. 2021 IEEE/ACM 6th International Workshop on Extreme Scale Programming Models and Middleware (ESPM2), pp. 45-52. DOI: 10.1109/ESPM254806.2021.00011. Online publication date: Nov-2021
      • (2020) Task bench. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1-15. DOI: 10.5555/3433701.3433783. Online publication date: 9-Nov-2020
      • (2020) Task Bench: A Parameterized Benchmark for Evaluating Parallel Runtime Performance. SC20: International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1-15. DOI: 10.1109/SC41405.2020.00066. Online publication date: Nov-2020
      • (2020) An Implicitly Parallel Meshfree Solver in Regent. 2020 IEEE/ACM 3rd Annual Parallel Applications Workshop: Alternatives To MPI+X (PAW-ATM), pp. 40-54. DOI: 10.1109/PAWATM51920.2020.00009. Online publication date: Nov-2020
      • (2019) A constraint-based approach to automatic data partitioning for distributed memory execution. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1-24. DOI: 10.1145/3295500.3356199. Online publication date: 17-Nov-2019
      • (2019) Legate NumPy. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1-23. DOI: 10.1145/3295500.3356175. Online publication date: 17-Nov-2019
      • (2019) Pygion: Flexible, Scalable Task-Based Parallelism with Python. 2019 IEEE/ACM Parallel Applications Workshop, Alternatives To MPI (PAW-ATM), pp. 58-72. DOI: 10.1109/PAW-ATM49560.2019.00011. Online publication date: Nov-2019
      • (2018) Dynamic tracing. Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis, pp. 1-13. DOI: 10.5555/3291656.3291702. Online publication date: 11-Nov-2018
