Scheduling multithreaded computations by work stealing

Published: 01 September 1999
Abstract

This paper studies the problem of efficiently scheduling fully strict (i.e., well-structured) multithreaded computations on parallel computers. A popular and practical method of scheduling this kind of dynamic MIMD-style computation is “work stealing,” in which processors needing work steal computational threads from other processors. In this paper, we give the first provably good work-stealing scheduler for multithreaded computations with dependencies.
Specifically, our analysis shows that the expected time to execute a fully strict computation on P processors using our work-stealing scheduler is T1/P + O(T∞), where T1 is the minimum serial execution time of the multithreaded computation and T∞ is the minimum execution time with an infinite number of processors. Moreover, the space required by the execution is at most S1P, where S1 is the minimum serial space requirement. We also show that the expected total communication of the algorithm is at most O(PT∞(1 + nd)Smax), where Smax is the size of the largest activation record of any thread and nd is the maximum number of times that any thread synchronizes with its parent. This communication bound justifies the folk wisdom that work-stealing schedulers are more communication efficient than their work-sharing counterparts. All three of these bounds are existentially optimal to within a constant factor.
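To give a feel for what the time bound promises, here is a hypothetical worked example (the numbers are invented for illustration and do not appear in the paper): whenever the average parallelism T1/T∞ greatly exceeds the number of processors P, the T1/P term dominates and the scheduler delivers near-perfect linear speedup.

```latex
% Hypothetical numbers for illustration only; they are not taken from the paper.
% Bound from the abstract: expected P-processor execution time is T_1/P + O(T_\infty).
\[
  T_1 = 10^{10} \ \text{(total work)}, \qquad
  T_\infty = 10^{5} \ \text{(critical-path length)}, \qquad
  P = 64 .
\]
\[
  \frac{T_1}{P} = \frac{10^{10}}{64} \approx 1.6 \times 10^{8} \;\gg\; T_\infty ,
  \qquad\text{so}\qquad
  \mathbb{E}[T_P] \;\le\; \frac{T_1}{P} + O(T_\infty) \;\approx\; \frac{T_1}{P} ,
\]
\[
  \text{giving speedup} \quad \frac{T_1}{\mathbb{E}[T_P]} \approx P = 64 .
\]
```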

    Reviews

    Alexander Romanovsky

    A randomized work-stealing algorithm for scheduling fully strict multithreaded computation on MIMD computers is proposed. A formula bounding the algorithm's expected execution time on a fixed number of processors is given and rigorously proven. Under the stated assumptions and for this type of computation, this first provably good work-stealing algorithm performs better than all known work-sharing algorithms. The authors begin with a discussion of a simple scheduling algorithm (a “busy-leaves” algorithm), which is not really practical because it is centralized, but which clearly shows important characteristics of the general algorithm. The subsequent part of the paper presents the main algorithm itself. To analyze it, the authors introduce the atomic access model and prove important properties that hold for their computational model. The use of these results allows the authors to analyze the time, communication overhead, and space required by the algorithm. The computational model, assumptions, and type of computation are very general and realistic, making the approach useful for many existing and potential applications. In this well-written paper, the authors have succeeded in delivering their results in an attractive and convincing fashion. I have only one concern: the last part of the paper, which discusses the applications of the Cilk language, whose runtime incorporates the algorithm, looks too much like an advertisement. It would have been more appropriate to include technical information there.
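
To make the stealing discipline concrete, the following is a minimal, sequential toy simulation added for this summary (it is not the paper's scheduler or the Cilk runtime, and names such as fib_task and spawn are made up). Each simulated processor treats its own deque like a stack, pushing and popping spawned tasks at the bottom, while an idle processor picks a victim uniformly at random and steals the oldest task from the top of the victim's deque.

```python
import random
from collections import deque

# Toy, sequential simulation of randomized work stealing on a fork-join
# computation.  This is NOT the scheduler analyzed in the paper; it only
# illustrates the two rules described above: a processor works on the
# bottom of its own deque, and an idle processor steals from the top of a
# randomly chosen victim's deque.  Names (fib_task, spawn) are made up.

P = 4                                   # number of simulated processors
deques = [deque() for _ in range(P)]    # one ready deque per processor

def spawn(task, p):
    """Processor p pushes a newly spawned task onto the bottom of its deque."""
    deques[p].append(task)

def fib_task(n):
    """One unit of work: return the child tasks it spawns (empty for a leaf)."""
    if n < 2:
        return []
    return [n - 1, n - 2]

spawn(10, 0)                            # seed processor 0 with the root task
steals = 0

while any(deques):                      # run until every deque is empty
    for p in range(P):
        if deques[p]:
            n = deques[p].pop()         # busy: pop own work from the bottom
            for child in fib_task(n):
                spawn(child, p)
        else:
            victim = random.randrange(P)
            if victim != p and deques[victim]:
                # idle: steal the oldest task from the top of the victim's deque
                deques[p].append(deques[victim].popleft())
                steals += 1

print(f"simulation finished after {steals} steals")
```

Stealing from the top end hands the thief the shallowest outstanding task, which tends to represent a large piece of work; that is one common intuition behind the space and communication bounds quoted in the abstract.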



Published In

Journal of the ACM, Volume 46, Issue 5 (September 1999), 210 pages
ISSN: 0004-5411
EISSN: 1557-735X
DOI: 10.1145/324133

Publisher

Association for Computing Machinery, New York, NY, United States

    Publication History

    Published: 01 September 1999
    Published in JACM Volume 46, Issue 5


    Author Tags

    1. critical-path length
    2. multiprocessor
    3. multithreading
    4. randomized algorithm
    5. thread scheduling
    6. work stealing

    Qualifiers

    • Article



Article Metrics

• Downloads (last 12 months): 480
• Downloads (last 6 weeks): 54

Reflects downloads up to 11 Aug 2024

    Cited By

• (2024) Performance of Text-Independent Automatic Speaker Recognition on a Multicore System. Tsinghua Science and Technology, 29:2 (447-456). DOI: 10.26599/TST.2023.9010018. Online publication date: Apr-2024.
• (2024) Zero-Overhead Parallel Scans for Multi-Core CPUs. Proceedings of the 15th International Workshop on Programming Models and Applications for Multicores and Manycores (52-61). DOI: 10.1145/3649169.3649248. Online publication date: 3-Mar-2024.
• (2024) Configuring BDD Compilation Techniques for Feature Models. Proceedings of the 28th ACM International Systems and Software Product Line Conference - Volume B (209-216). DOI: 10.1145/3646548.3676538. Online publication date: 2-Sep-2024.
• (2024) Parallel Algorithms for Hierarchical Nucleus Decomposition. Proceedings of the ACM on Management of Data, 2:1 (1-27). DOI: 10.1145/3639287. Online publication date: 26-Mar-2024.
• (2024) Automatic Parallelism Management. Proceedings of the ACM on Programming Languages, 8:POPL (1118-1149). DOI: 10.1145/3632880. Online publication date: 5-Jan-2024.
• (2024) Octopus: Scaling Value-Flow Analysis via Parallel Collection of Realizable Path Conditions. ACM Transactions on Software Engineering and Methodology, 33:3 (1-33). DOI: 10.1145/3632743. Online publication date: 24-Jan-2024.
• (2024) ParlayANN: Scalable and Deterministic Parallel Graph-Based Approximate Nearest Neighbor Search Algorithms. Proceedings of the 29th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming (270-285). DOI: 10.1145/3627535.3638475. Online publication date: 2-Mar-2024.
• (2024) Brief Announcement: Work Stealing through Partial Asynchronous Delegation. Proceedings of the 36th ACM Symposium on Parallelism in Algorithms and Architectures (281-283). DOI: 10.1145/3626183.3660261. Online publication date: 17-Jun-2024.
• (2024) Efficient Multi-Processor Scheduling in Increasingly Realistic Models. Proceedings of the 36th ACM Symposium on Parallelism in Algorithms and Architectures (463-474). DOI: 10.1145/3626183.3659972. Online publication date: 17-Jun-2024.
• (2024) When Is Parallelism Fearless and Zero-Cost with Rust? Proceedings of the 36th ACM Symposium on Parallelism in Algorithms and Architectures (27-40). DOI: 10.1145/3626183.3659966. Online publication date: 17-Jun-2024.
    • Show More Cited By
