Scaling implicit parallelism via dynamic control replication

Authors: Elliott Slaughter, Mario Di Renzo, Manolis Papadakis, Patrick McCormick, Michael Garland, Alex Aiken

PPoPP '21: Proceedings of the 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming

Pages 105-118

https://doi.org/10.1145/3437801.3441587

Published: 17 February 2021

Abstract

We present dynamic control replication, a run-time program analysis that enables scalable execution of implicitly parallel programs on large machines through a distributed and efficient dynamic dependence analysis. Dynamic control replication distributes dependence analysis by executing multiple copies of an implicitly parallel program while ensuring that they still collectively behave as a single execution. By distributing and parallelizing the dependence analysis, dynamic control replication supports efficient, on-the-fly computation of dependences for programs with arbitrary control flow at scale. We describe an asymptotically scalable algorithm for implementing dynamic control replication that maintains the sequential semantics of implicitly parallel programs.

An implementation of dynamic control replication in the Legion runtime delivers the same programmer productivity as writing in other implicitly parallel programming models, such as Dask or TensorFlow, while providing better performance (11.4X and 14.9X, respectively, in our experiments) and scalability to hundreds of nodes. We also show that dynamic control replication provides good absolute performance and scaling for HPC applications, competitive in many cases with explicitly parallel programming systems.
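The core idea in the abstract, running several copies of one implicitly parallel program so that each copy performs only part of the dependence analysis while the copies together produce the result of a single execution, can be illustrated with a toy model. The sketch below is not the paper's algorithm and does not use the Legion API. The `Task` type, the two-region `program`, the pairwise conflict test, and the round-robin assignment of tasks to shards are all assumptions made for illustration, and the shards are simulated one after another in a loop rather than run on separate nodes.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, List, Set, Tuple


@dataclass(frozen=True)
class Task:
    """Hypothetical task record: a name plus the regions it reads and writes."""
    name: str
    reads: FrozenSet[str]
    writes: FrozenSet[str]


def conflicts(earlier: Task, later: Task) -> bool:
    """True if `later` must wait for `earlier` (a shared region with a writer)."""
    return bool(earlier.writes & (later.reads | later.writes)) or bool(
        earlier.reads & later.writes
    )


def program(steps: int) -> List[Task]:
    """Toy implicitly parallel program (an assumption, not from the paper).

    Its control flow is deterministic, so every replica that runs it
    issues exactly the same task stream in the same order.
    """
    tasks: List[Task] = []
    for t in range(steps):
        tasks.append(Task(f"update_a{t}", frozenset({"A", "B"}), frozenset({"A"})))
        tasks.append(Task(f"update_b{t}", frozenset({"A", "B"}), frozenset({"B"})))
    return tasks


def analyze(tasks: List[Task], shard: int = 0, num_shards: int = 1) -> Set[Tuple[int, int]]:
    """Dependence edges (i, j) for the tasks j owned by `shard`.

    Ownership is round-robin by task index (an illustrative choice).
    With one shard this is an ordinary centralized analysis.
    """
    edges: Set[Tuple[int, int]] = set()
    for j, later in enumerate(tasks):
        if j % num_shards != shard:
            continue  # another replica analyzes this task
        for i in range(j):
            if conflicts(tasks[i], later):
                edges.add((i, j))
    return edges


def replicated_analysis(
    make_program: Callable[[], List[Task]], num_shards: int
) -> Set[Tuple[int, int]]:
    """Simulate `num_shards` replicas; each re-executes the program itself."""
    edges: Set[Tuple[int, int]] = set()
    for shard in range(num_shards):
        tasks = make_program()  # every replica issues the same task stream
        edges |= analyze(tasks, shard, num_shards)
    return edges


if __name__ == "__main__":
    single = analyze(program(3))
    replicated = replicated_analysis(lambda: program(3), num_shards=4)
    print(replicated == single)
```

In this toy, the union of the per-shard edge sets equals the edge set computed by a single analysis, which mirrors the requirement that the copies collectively behave as one execution. Each replica here still enumerates every task and compares against all earlier ones, so the sketch says nothing about the asymptotically scalable algorithm the paper describes.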

Index Terms

Software and its engineering → Software notations and tools → Compilers → Runtime environments

Published In

PPoPP '21: Proceedings of the 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming

February 2021

507 pages

ISBN:9781450382946

DOI:10.1145/3437801

General Chair: Jaejin Lee (Seoul National University, South Korea)

Program Chair: Erez Petrank (Technion, Israel)

Copyright © 2021 ACM.

Publication rights licensed to ACM. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of the United States government. As such, the Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

PPoPP '21: 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming

February 27, 2021

Virtual Event, Republic of Korea

Acceptance Rates

PPoPP '21 Paper Acceptance Rate 31 of 150 submissions, 21%;

Overall Acceptance Rate 230 of 1,014 submissions, 23%
