research-article

Scalable work stealing

Authors:

D. Brian Larkins,

Sriram Krishnamoorthy,

Jarek NieplochaAuthors Info & Claims

SC '09: Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis

Article No.: 53, Pages 1 - 11

https://doi.org/10.1145/1654059.1654113

Published: 14 November 2009 Publication History

Abstract

Irregular and dynamic parallel applications pose significant challenges to achieving scalable performance on large-scale multicore clusters. These applications often require ongoing, dynamic load balancing in order to maintain efficiency. Scalable dynamic load balancing on large clusters is a challenging problem which can be addressed with distributed dynamic load balancing systems. Work stealing is a popular approach to distributed dynamic load balancing; however its performance on large-scale clusters is not well understood. Prior work on work stealing has largely focused on shared memory machines. In this work we investigate the design and scalability of work stealing on modern distributed memory systems. We demonstrate high efficiency and low overhead when scaling to 8,192 processors for three benchmark codes: a producer-consumer benchmark, the unbalanced tree search benchmark, and a multiresolution analysis kernel.

References

[1]

N. S. Arora, R. D. Blumofe, and C. G. Plaxton. Thread scheduling for multiprogrammed multiprocessors. In Proc. 10th ACM Symposium on Parallel Algorithms and Architectures (SPAA), pages 119--129, 1998.

Digital Library

[2]

G. Baumgartner, A. Auer, D. E. Bernholdt, A. Bibireata, V. Choppella, D. Cociorva, X. Gao, R. J. Harrison, S. Hirata, S. Krishanmoorthy, S. Krishnan, C.-C. Lam, Q. Lu, M. Nooijen, R. M. Pitzer, J. Ramanujam, P. Sadayappan, and A. Sibiryakov. Synthesis of high-performance parallel programs for a class of ab initio quantum chemistry models. Proceedings of the IEEE, 93(2):276--292, Feb. 2005.

[3]

P. Berenbrink, T. Friedetzky, and L. Goldberg. The natural work-stealing algorithm is stable. In Proc. 42nd IEEE Symposium on Foundations of Computer Science (FOCS), pages 178--187, 2001.

Digital Library

[4]

G. E. Blelloch and J. Greiner. A provable time and space efflcient implementation of NESL. In Proc. 1st ACM SIGPLAN Intl. Conf. on Functional Programming (ICFP), pages 213--225, Philadelphia, Pennsylvania, May 1996.

Digital Library

[5]

R. D. Blumofe and C. Leiserson. Scheduling multithreaded computations by work stealing. In Proc. 35th Symposium on Foundations of Computer Science (FOCS), pages 356--368, Nov. 1994.

Digital Library

[6]

R. D. Blumofe and P. A. Lisiecki. Adaptive and reliable parallel computing on networks of workstations. In Proc. USENIX Annual Technical Conference (ATEC), pages 10--10, 1997.

Digital Library

[7]

Ü. V. Çatalyürek, E. G. Boman, K. D. Devine, D. Bozdag, R. Heaphy, and L. A. Riesen. Hypergraph-based dynamic load balancing for adaptive scientific computations. In Proc. 21st Intl. Parallel and Distributed Processing Symposium (IPDPS), pages 1--11. IEEE, 2007.

[8]

B. Chamberlain, D. Callahan, and H. Zima. Parallel programmability and the chapel language. Int. J. High Performance Computing Applications (IJHPCA), 21(3):291--312, 2007.

Digital Library

[9]

P. Charles, C. Grothoff, V. Saraswat, C. Donawa, A. Kielstra, K. Ebcioglu, C. von Praun, and V. Sarkar. X10: an object-oriented approach to non-uniform cluster computing. In Proc. Conf. on Object Oriented Prog. Systems, Languages, and Applications (OOPSLA), pages 519--538, 2005.

Digital Library

[10]

G. Cong, S. Kodali, S. Krishnamoorty, D. Lea, V. Saraswat, and T. Wen. Solving irregular graph problems using adaptive work-stealing. In Proc. 37th Int Conf. on Parallel Processing (ICPP), Portland, OR, Sept. 2008.

Digital Library

[11]

K. D. Devine, E. G. Boman, R. T. Heaphy, B. A. Hendrickson, J. D. Teresco, J. Faik, J. E. Flaherty, and L. G. Gervasio. New challanges in dynamic load balancing. J. Appl. Numer. Math., 52(2--3):133--152, 2005.

Digital Library

[12]

J. Dinan, S. Krishnamoorthy, D. B. Larkins, J. Nieplocha, and P. Sadayappan. Scioto: A framework for global-view task parallelism. In Proc. 37th Intl. Conf. on Parallel Processing (ICPP), pages 586--593, 2008.

Digital Library

[13]

J. Dinan, S. Olivier, G. Sabin, J. Prins, P. Sadayappan, and C.-W. Tseng. Dynamic load balancing of unbalanced computations using message passing. In Proc. of 6th Intl. Workshop on Performance Modeling, Evaluation, and Optimization of Parallel and Distributed Systems (PMEO-PDS), pages 1--8, 2007.

[14]

J. Dinan, S. Olivier, G. Sabin, J. Prins, P. Sadayappan, and C.-W. Tseng. A message passing benchmark for unbalanced applications. J. Simulation Modelling Practice and Theory, 16(9):1177--1189, 2008.

[15]

N. Francez and M. Rodeh. Achieving distributed termination without freezing. IEEE Trans. on Software Engineering, SE-8(3):287--292, May 1982.

Digital Library

[16]

M. Frigo, C. E. Leiserson, and K. H. Randall. The implementation of the Cilk-5 multithreaded language. In Proc. Conf. on Prog. Language Design and Implementation (PLDI), pages 212--223. ACM SIGPLAN, 1998.

Digital Library

[17]

Y. Guo, R. Barik, R. Raman, and V. Sarkar. Work-first and help-first scheduling policies for terminally strict parallel programs. In Proc. 23rd Intl. Parallel and Distributed Processing Symposium (IPDPS), 2009.

[18]

D. Hendler, Y. Lev, M. Moir, and N. Shavit. A dynamic-sized nonblocking work stealing deque. J. Distributed Computing, 18(3):189--207, 2006.

Digital Library

[19]

Intel Corporation. Cluster OpenMP user's guide v9.1. (309096-002 US), 2005--2006.

[20]

L. V. Kalé and S. Krishnan. CHARM++: A portable concurrent object oriented system based on C++. In Proc. Conf. on Object Oriented Prog. Systems, Languages, and Applications (OOPSLA), pages 91--108, 1993.

Digital Library

[21]

G. Karypis and V. Kumar. MeTis: Unstrctured Graph Partitioning and Sparse Matrix Ordering System, Version 4.0, Sept. 1998.

[22]

S. Krishnamoorthy, Ü. Çatalyürek, J. Nieplocha, A. Rountev, and P. Sadayappan. Hypergraph partitioning for automatic memory hierarchy management. In Proc. ACM/IEEE Conference Supercomputing (SC), page 98, 2006.

Digital Library

[23]

V. Kumar, A. Y. Grama, and N. R. Vempaty. Scalable load balancing techniques for parallel computers. J. Parallel Distrib. Comput., 22(1):60--79, 1994.

Digital Library

[24]

Y.-K. Kwok and I. Ahmad. Static scheduling algorithms for allocating directed task graphs to multiprocessors. ACM Computing Surveys, 31(4):406--471, 1999.

Digital Library

[25]

L. Lamport. A new solution of dijkstra's concurrent programming problem. Commun. ACM, 17(8):453--455, 1974.

Digital Library

[26]

M. M. Michael, M. T. Vechev, and V. A. Saraswat. Idempotent work stealing. In Proc. of the 14th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPOPP), pages 45--54, Feb. 2009.

Digital Library

[27]

M. Mitzenmacher. Analyses of load stealing models based on differential equations. In Proc. 10th Symposium on Parallel Algorithms and Architectures (SPAA), pages 212--221. ACM, 1998.

Digital Library

[28]

MPI Forum. MPI: A message-passing interface standard. Technical Report UT-CS-94-230, 1994.

Digital Library

[29]

J. Nieplocha and B. Carpenter. ARMCI: A portable remote memory copy library for distributed array libraries and compiler run-time systems. Lecture Notes in Computer Science, 1586:533--546, 1999.

Digital Library

[30]

J. Nieplocha, R. J. Harrison, and R. J. Littlefield. Global arrays: a portable "shared-memory" programming model for distributed memory computers. In Proc. ACM/IEEE Conference Supercomputing (SC), pages 340--349, 1994.

Digital Library

[31]

S. Olivier, J. Huan, J. Liu, J. Prins, J. Dinan, P. Sadayappan, and C.-W. Tseng. UTS: An unbalanced tree search benchmark. In Proc. 19th Intl Workshop on Languages and Compilers for Parallel Computing (LCPC), pages 235--250, 2006.

Digital Library

[32]

S. Olivier and J. Prins. Scalable dynamic load balancing using UPC. In Proc. of 37th Intl. Conference on Parallel Processing (ICPP), Sept. 2008.

Digital Library

[33]

A. Sinha and L. V. Kalé. A load balancing strategy for prioritized execution of tasks. In Proc. 7th Intl. Parallel Processing Symposium (IPPS), pages 230--237, 1993.

Digital Library

[34]

G. L. Steele Jr. Parallel programming and parallel abstractions in fortress. In Proc. 14th Intl. Conf. on Parallel Architecture and Compilation Techniques (PACT), page 157, 2005.

Digital Library

[35]

A. Trifunović and W. J. Knottenbelt. Parallel multilevel algorithms for hypergraph partitioning. J. Parallel Distrib. Comput., 68(5):563--581, 2008.

Digital Library

[36]

UPC Consortium. UPC language specifications, v1.2. Technical Report LBNL-59208, Lawrence Berkeley National Lab, 2005.

[37]

R. V. van Nieuwpoort, T. Kielmann, and H. E. Bal. Efficient load balancing for wide-area divide-and-conquer applications. In Proc. of the ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPOPP), 2001.

Digital Library

[38]

T. Yanai, G. I. Fann, Z. Gan, R. J. Harrison, and G. Beylkin. Multiresolution quantum chemistry in multiwavelet bases: Analytic derivatives for hartree--fock and density functional theory. J. Chemical Physics, 121(7):2866--2876, 2004.

[39]

K. A. Yelick, L. Semenzato, G. Pike, C. Miyamoto, B. Liblit, A. Krishnamurthy, P. N. Hilfinger, S. L. Graham, D. Gay, P. Colella, and A. Aiken. Titanium: A high-performance java dialect. Concurrency: Practice and Experience, 10(11--13):825--836, 1998.

Cited By

Yang JJiang JWen ZMian A(2024)Parallel and Distributed Bayesian Network Structure LearningIEEE Transactions on Parallel and Distributed Systems10.1109/TPDS.2023.332683235:4(517-530)Online publication date: Apr-2024
https://doi.org/10.1109/TPDS.2023.3326832
Kiridana TKapukotuwa SDissanayake BGunasekera K(2024)Mobile Crowd Computing with Mobile Agents and Work Stealing2024 Moratuwa Engineering Research Conference (MERCon)10.1109/MERCon63886.2024.10688742(97-102)Online publication date: 8-Aug-2024
https://doi.org/10.1109/MERCon63886.2024.10688742
Nookala PChard KRaicu I(2024)X-OpenMP — eXtreme fine-grained tasking using lock-less work stealingFuture Generation Computer Systems10.1016/j.future.2024.05.019159(444-458)Online publication date: Oct-2024
https://doi.org/10.1016/j.future.2024.05.019
Show More Cited By

Index Terms

Recommendations

Enabling a highly-scalable global address space model for petascale computing
CF '10: Proceedings of the 7th ACM international conference on Computing frontiers

Over the past decade, the trajectory to the petascale has been built on increased complexity and scale of the underlying parallel architectures. Meanwhile, software developers have struggled to provide tools that maintain the productivity of ...
Optimized distributed work-stealing
IA^3 '16: Proceedings of the Sixth Workshop on Irregular Applications: Architectures and Algorithms

Work-stealing is a popular approach for dynamic load balancing of task-parallel programs. However, as has been widely studied, the use of classical work-stealing algorithms on massively parallel and distributed supercomputers introduces several ...
Work stealing and persistence-based load balancers for iterative overdecomposed applications
HPDC '12: Proceedings of the 21st international symposium on High-Performance Parallel and Distributed Computing

Applications often involve iterative execution of identical or slowly evolving calculations. Such applications require incremental rebalancing to improve load balance across iterations. In this paper, we consider the design and evaluation of two ...

Comments

Information & Contributors

Information

Published In

cover image ACM Conferences

SC '09: Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis

November 2009

778 pages

ISBN:9781605587448

DOI:10.1145/1654059

Conference Chair:
Wilfred Pinfold

Copyright © 2009 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGARCH: ACM Special Interest Group on Computer Architecture
IEEE-CS: Computer Society

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 14 November 2009

Permissions

Request permissions for this article.

Request Permissions

Check for updates

Author Tags

Qualifiers

Research-article

Funding Sources

Conference

SC '09

Sponsor:

SIGARCH
IEEE-CS

SC '09: International Conference for High Performance Computing, Networking, Storage and Analysis

November 14 - 20, 2009

Oregon, Portland

Acceptance Rates

SC '09 Paper Acceptance Rate 59 of 261 submissions, 23%;

Overall Acceptance Rate 1,516 of 6,373 submissions, 24%

Upcoming Conference

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

209
Total Citations
View Citations
315
Total Downloads

Downloads (Last 12 months)76
Downloads (Last 6 weeks)13

Reflects downloads up to 13 Nov 2024

Other Metrics

View Author Metrics

Citations

Cited By

Yang JJiang JWen ZMian A(2024)Parallel and Distributed Bayesian Network Structure LearningIEEE Transactions on Parallel and Distributed Systems10.1109/TPDS.2023.332683235:4(517-530)Online publication date: Apr-2024
https://doi.org/10.1109/TPDS.2023.3326832
Kiridana TKapukotuwa SDissanayake BGunasekera K(2024)Mobile Crowd Computing with Mobile Agents and Work Stealing2024 Moratuwa Engineering Research Conference (MERCon)10.1109/MERCon63886.2024.10688742(97-102)Online publication date: 8-Aug-2024
https://doi.org/10.1109/MERCon63886.2024.10688742
Nookala PChard KRaicu I(2024)X-OpenMP — eXtreme fine-grained tasking using lock-less work stealingFuture Generation Computer Systems10.1016/j.future.2024.05.019159(444-458)Online publication date: Oct-2024
https://doi.org/10.1016/j.future.2024.05.019
Helbecque GCarneiro TMelab NGmys JBouvry P(2024)PGAS Data Structure for Unbalanced Tree-Based Algorithms at ScaleComputational Science – ICCS 202410.1007/978-3-031-63759-9_13(103-111)Online publication date: 29-Jun-2024
https://doi.org/10.1007/978-3-031-63759-9_13
Cheng LRuttenberg MJung DRichmond DTaylor MOskin MBatten CAamodt TJerger NSwift M(2023)Beyond Static Parallel Loops: Supporting Dynamic Task Parallelism on Manycore Architectures with Software-Managed Scratchpad MemoriesProceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 310.1145/3582016.3582020(46-58)Online publication date: 25-Mar-2023
https://dl.acm.org/doi/10.1145/3582016.3582020
Shiina STaura KMohror KArnold DBadia R(2023)Itoyori: Reconciling Global Address Space and Global Fork-Join Task ParallelismProceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis10.1145/3581784.3607049(1-15)Online publication date: 12-Nov-2023
https://dl.acm.org/doi/10.1145/3581784.3607049
Xu JGuo HShen HRaj MWurster SPeterka T(2023)Reinforcement Learning for Load-Balanced Parallel Particle TracingIEEE Transactions on Visualization and Computer Graphics10.1109/TVCG.2022.314874529:6(3052-3066)Online publication date: 1-Jun-2023
https://doi.org/10.1109/TVCG.2022.3148745
Chen CWeng QWang WLi BLi B(2023)Accelerating Distributed Learning in Non-Dedicated EnvironmentsIEEE Transactions on Cloud Computing10.1109/TCC.2021.310259311:1(515-531)Online publication date: 1-Jan-2023
https://doi.org/10.1109/TCC.2021.3102593
Firmin TTalbi E(2023)Massively Parallel Asynchronous Fractal Optimization2023 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)10.1109/IPDPSW59300.2023.00151(930-938)Online publication date: May-2023
https://doi.org/10.1109/IPDPSW59300.2023.00151
Dash SHaque Monil MYin JAnandakrishnan RWang F(2023)Distributing Simplex-Shaped Nested for-Loops to Identify Carcinogenic Gene Combinations2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS)10.1109/IPDPS54959.2023.00101(974-984)Online publication date: May-2023
https://doi.org/10.1109/IPDPS54959.2023.00101
Show More Cited By

View Options

Get Access

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

View options

PDF

View or Download as a PDF file.

eReader

View online with eReader.

Media

Figures

Other

Tables

View Table of Contents