research-article

Heartbeat scheduling: provable efficiency for nested parallelism

Authors:

Arthur Charguéraud,

Filip Sieczkowski

ACM SIGPLAN Notices, Volume 53, Issue 4

Pages 769 - 782

https://doi.org/10.1145/3296979.3192391

Published: 11 June 2018

Abstract

A classic problem in parallel computing is to take a high-level parallel program written, for example, in nested-parallel style with fork-join constructs and run it efficiently on a real machine. The problem could be considered solved in theory, but not in practice, because the overheads of creating and managing parallel threads can overwhelm their benefits. Developing efficient parallel codes therefore usually requires extensive tuning and optimizations to reduce parallelism just to a point where the overheads become acceptable.

In this paper, we present a scheduling technique that delivers provably efficient results for arbitrary nested-parallel programs, without the tuning needed for controlling parallelism overheads. The basic idea behind our technique is to create threads only at a beat (which we refer to as the "heartbeat") and make sure to do useful work in between. We specify our heartbeat scheduler using an abstract-machine semantics and provide mechanized proofs that the scheduler guarantees low overheads for all nested parallel programs. We present a prototype C++ implementation and an evaluation that shows that Heartbeat competes well with manually optimized Cilk Plus codes, without requiring manual tuning.
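The amortization idea in the abstract (create a thread only at a beat, and do useful work in between) can be illustrated with a minimal single-worker simulation. This is a sketch under stated assumptions, not the paper's scheduler or its C++ prototype: one loop iteration is counted as one unit of work, the beat is a plain counter rather than a timer, and the names Range, HeartbeatStats, and heartbeat_sum are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <vector>

// Hypothetical names: nothing here is taken from the paper's implementation.
struct Range { std::size_t lo, hi; };  // half-open interval [lo, hi)

struct HeartbeatStats {
    std::uint64_t sum = 0;         // result of the computation
    std::size_t   promotions = 0;  // tasks created, each one at a beat
};

// Sums `data` while simulating a heartbeat on a single worker.
// Assumption: one loop iteration is one unit of work, and a new task may be
// created only after `beat` units of work since the previous task creation.
HeartbeatStats heartbeat_sum(const std::vector<std::uint32_t>& data,
                             std::size_t beat) {
    assert(beat > 0);
    HeartbeatStats st;
    std::deque<Range> tasks;             // pending tasks (LIFO order here)
    tasks.push_back({0, data.size()});
    std::size_t since_beat = 0;          // work done since the last promotion

    while (!tasks.empty()) {
        Range r = tasks.back();
        tasks.pop_back();
        while (r.lo < r.hi) {
            st.sum += data[r.lo++];      // useful work between beats
            ++since_beat;
            // At a beat, split off the upper half of the remaining range
            // as a new task, but only if there is enough left to split.
            if (since_beat >= beat && r.hi - r.lo >= 2) {
                std::size_t mid = r.lo + (r.hi - r.lo) / 2;
                tasks.push_back({mid, r.hi});
                r.hi = mid;
                ++st.promotions;
                since_beat = 0;
            }
        }
    }
    return st;
}
```

Because the counter resets at every promotion, at least `beat` units of work separate two task creations, so a run over n elements creates at most n / beat tasks. In this simplified model, that is how the cost of task creation stays proportional to a chosen fraction of the useful work.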

Index Terms

Heartbeat scheduling: provable efficiency for nested parallelism
1. Software and its engineering
  1. Software notations and tools
    1. General programming languages
      1. Language types
        Parallel programming languages

Published In

ACM SIGPLAN Notices Volume 53, Issue 4

PLDI '18

April 2018

834 pages

ISSN:0362-1340

EISSN:1558-1160

DOI:10.1145/3296979

Editor:
Matthew Fluet
Rochester Institute of Technology

PLDI 2018: Proceedings of the 39th ACM SIGPLAN Conference on Programming Language Design and Implementation
June 2018
825 pages
ISBN:9781450356985
DOI:10.1145/3192366
General Chair:
Jeffrey S. Foster
University of Maryland at College Park, USA
Program Chair:
Dan Grossman
University of Washington, USA

Copyright © 2018 ACM.

© 2018 Association for Computing Machinery. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of a national government. As such, the Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

Publisher

Association for Computing Machinery

New York, NY, United States

Funding Sources

European Research Council

Cited By

Da Silva J, Leão L, Petrucci V, Gamatié A, Pereira F. (2021). Mapping Computations in Heterogeneous Multicore Systems with Statistical Regression on Program Inputs. ACM Transactions on Embedded Computing Systems 20:6 (1-35). Online publication date: 18-Oct-2021.
https://dl.acm.org/doi/10.1145/3478288
Westrick S, Fluet M, Rainey M, Acar U. (2024). Automatic Parallelism Management. Proceedings of the ACM on Programming Languages 8:POPL (1118-1149). Online publication date: 5-Jan-2024.
https://dl.acm.org/doi/10.1145/3632880
Arora J, Westrick S, Acar U. (2023). Efficient Parallel Functional Programming with Effects. Proceedings of the ACM on Programming Languages 7:PLDI (1558-1583). Online publication date: 6-Jun-2023.
https://dl.acm.org/doi/10.1145/3591284
Muller S, Singer K, Keeney D, Neth A, Agrawal K, Lee I, Acar U. (2023). Responsive Parallelism with Synchronization. Proceedings of the ACM on Programming Languages 7:PLDI (712-735). Online publication date: 6-Jun-2023.
https://dl.acm.org/doi/10.1145/3591249
Pruiksma K, Pfenning F. (2022). Back to futures. Journal of Functional Programming 32. Online publication date: 28-Feb-2022.
https://doi.org/10.1017/S0956796822000016
Rainey M, Newton R, Hale K, Hardavellas N, Campanoni S, Dinda P, Acar U, Freund S, Yahav E. (2021). Task parallel assembly language for uncompromising parallelism. Proceedings of the 42nd ACM SIGPLAN International Conference on Programming Language Design and Implementation (1064-1079). Online publication date: 19-Jun-2021.
https://dl.acm.org/doi/10.1145/3453483.3460969
Arora J, Westrick S, Acar U. (2021). Provably space-efficient parallel functional programming. Proceedings of the ACM on Programming Languages 5:POPL (1-33). Online publication date: 4-Jan-2021.
https://dl.acm.org/doi/10.1145/3434299
Hale K, Campanoni S, Hardavellas N, Dinda P. (2021). The Case for an Interwoven Parallel Hardware/Software Stack. 2021 SC Workshops Supplementary Proceedings (SCWS) (50-59). Online publication date: Nov-2021.
https://doi.org/10.1109/SCWS55283.2021.00017
Ghosh S, Cuevas M, Campanoni S, Dinda P, Cuicchi C, Qualters I, Kramer W. (2020). Compiler-based timing for extremely fine-grain preemptive parallelism. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (1-15). Online publication date: 9-Nov-2020.
https://dl.acm.org/doi/10.5555/3433701.3433771
Iwasaki S, Amer A, Taura K, Balaji P. (2020). Analyzing the Performance Trade-Off in Implementing User-Level Threads. IEEE Transactions on Parallel and Distributed Systems 31:8 (1859-1877). Online publication date: 1-Aug-2020.
https://doi.org/10.1109/TPDS.2020.2976057
