DOI: 10.1145/3437801.3441587

Scaling implicit parallelism via dynamic control replication

Published: 17 February 2021

Abstract

We present dynamic control replication, a run-time program analysis that enables scalable execution of implicitly parallel programs on large machines through a distributed and efficient dynamic dependence analysis. Dynamic control replication distributes dependence analysis by executing multiple copies of an implicitly parallel program while ensuring that they still collectively behave as a single execution. By distributing and parallelizing the dependence analysis, dynamic control replication supports efficient, on-the-fly computation of dependences for programs with arbitrary control flow at scale. We describe an asymptotically scalable algorithm for implementing dynamic control replication that maintains the sequential semantics of implicitly parallel programs.
An implementation of dynamic control replication in the Legion runtime delivers the same programmer productivity as other implicitly parallel programming models, such as Dask or TensorFlow, while providing better performance (11.4X and 14.9X, respectively, in our experiments) and scalability to hundreds of nodes. We also show that dynamic control replication provides good absolute performance and scaling for HPC applications, competitive in many cases with explicitly parallel programming systems.
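The idea of running several copies of one program that together behave as a single execution can be illustrated with a toy sketch. This is not the algorithm from the paper or the Legion API: the `Task` type, the stencil-like `program`, and the round-robin ownership rule are all hypothetical, and each copy ("shard") here still walks the whole task stream to keep its bookkeeping, which a scalable implementation would avoid. The sketch only shows that when every shard issues the same task stream and records dependences for the tasks it owns, the union of the shards' results matches a single sequential analysis.

```python
from collections import defaultdict
from typing import NamedTuple


class Task(NamedTuple):
    tid: int            # position of the task in program order
    reads: frozenset    # data blocks the task reads
    writes: frozenset   # data blocks the task writes


def program(steps=3, blocks=4):
    """Hypothetical implicitly parallel program: a sequential stream of
    stencil-like tasks over two buffers of `blocks` blocks each."""
    tid = 0
    for step in range(steps):
        src, dst = step % 2, (step + 1) % 2
        for b in range(blocks):
            reads = frozenset({(src, b), (src, (b - 1) % blocks)})
            writes = frozenset({(dst, b)})
            yield Task(tid, reads, writes)
            tid += 1


def analyze(tasks, owns=lambda task: True):
    """Dependence analysis in program order. Returns edges
    (earlier_tid, later_tid), recorded only for tasks selected by `owns`."""
    last_writer = {}
    readers = defaultdict(list)  # readers of a block since its last write
    edges = set()
    for t in tasks:
        deps = set()
        for block in t.reads:
            if block in last_writer:
                deps.add(last_writer[block])      # read-after-write
        for block in t.writes:
            if block in last_writer:
                deps.add(last_writer[block])      # write-after-write
            deps.update(readers[block])           # write-after-read
        if owns(t):
            edges.update((d, t.tid) for d in deps)
        # Update the bookkeeping for every task, owned or not.
        for block in t.writes:
            last_writer[block] = t.tid
            readers[block] = []
        for block in t.reads:
            readers[block].append(t.tid)
    return edges


def replicated_analyze(num_shards, **program_args):
    """Run one copy of the program per shard; each shard records edges
    only for the tasks it owns (assumed rule: tid modulo num_shards)."""
    results = []
    for shard in range(num_shards):
        stream = list(program(**program_args))
        edges = analyze(
            stream, owns=lambda t, s=shard: t.tid % num_shards == s
        )
        results.append((tuple(stream), edges))
    # The copies only act as one execution if they issued identical streams.
    assert all(stream == results[0][0] for stream, _ in results)
    return set().union(*(edges for _, edges in results))


if __name__ == "__main__":
    assert replicated_analyze(num_shards=4) == analyze(program())
```

The final assertion in `replicated_analyze` stands in for the requirement that the copies agree on control flow; if one copy branched differently, the streams would diverge and the combined result would no longer correspond to any single sequential run.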

    Published In

    PPoPP '21: Proceedings of the 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
    February 2021
    507 pages
    ISBN: 9781450382946
    DOI: 10.1145/3437801
    Publication rights licensed to ACM. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of the United States government. As such, the Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. dynamic control replication
    2. implicit parallelism
    3. legion
    4. scalable dependence analysis
    5. task-based runtime

    Qualifiers

    • Research-article

    Conference

    PPoPP '21

    Acceptance Rates

    PPoPP '21 paper acceptance rate: 31 of 150 submissions (21%)
    Overall acceptance rate: 230 of 1,014 submissions (23%)

    Article Metrics

    • Downloads (last 12 months): 57
    • Downloads (last 6 weeks): 6
    Reflects downloads up to 12 Jan 2025

    Cited By

    • (2023) Visibility Algorithms for Dynamic Dependence Analysis and Distributed Coherence. Proceedings of the 28th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming (218-231). DOI: 10.1145/3572848.3577515. Online publication date: 25-Feb-2023
    • (2022) Task Fusion in Distributed Runtimes. 2022 IEEE/ACM Parallel Applications Workshop: Alternatives To MPI+X (PAW-ATM) (13-25). DOI: 10.1109/PAW-ATM56565.2022.00007. Online publication date: Nov-2022
    • (2021) A survey of software implementations used by application codes in the Exascale Computing Project. The International Journal of High Performance Computing Applications (109434202110289). DOI: 10.1177/10943420211028940. Online publication date: 25-Jun-2021
    • (2021) Index launches. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (1-18). DOI: 10.1145/3458817.3476175. Online publication date: 14-Nov-2021
    • (2021) Supercomputing in Python With Legate. Computing in Science & Engineering 23:4 (73-79). DOI: 10.1109/MCSE.2021.3088239. Online publication date: 1-Jul-2021
    • (2021) Portable and Composable Flow Graphs for In Situ Analytics. 2021 IEEE 11th Symposium on Large Data Analysis and Visualization (LDAV) (63-72). DOI: 10.1109/LDAV53230.2021.00014. Online publication date: Oct-2021
    • (2021) Evaluation of Distributed Tasks in Stencil-based Application on GPUs. 2021 IEEE/ACM 6th International Workshop on Extreme Scale Programming Models and Middleware (ESPM2) (45-52). DOI: 10.1109/ESPM254806.2021.00011. Online publication date: Nov-2021
