Location via proxy:   [ UP ]  
[Report a bug]   [Manage cookies]                
skip to main content
10.1145/1654059.1654113acmconferencesArticle/Chapter ViewAbstractPublication PagesscConference Proceedingsconference-collections
research-article

Scalable work stealing

Published: 14 November 2009 Publication History

Abstract

Irregular and dynamic parallel applications pose significant challenges to achieving scalable performance on large-scale multicore clusters. These applications often require ongoing, dynamic load balancing in order to maintain efficiency. Scalable dynamic load balancing on large clusters is a challenging problem which can be addressed with distributed dynamic load balancing systems. Work stealing is a popular approach to distributed dynamic load balancing; however its performance on large-scale clusters is not well understood. Prior work on work stealing has largely focused on shared memory machines. In this work we investigate the design and scalability of work stealing on modern distributed memory systems. We demonstrate high efficiency and low overhead when scaling to 8,192 processors for three benchmark codes: a producer-consumer benchmark, the unbalanced tree search benchmark, and a multiresolution analysis kernel.

References

[1]
N. S. Arora, R. D. Blumofe, and C. G. Plaxton. Thread scheduling for multiprogrammed multiprocessors. In Proc. 10th ACM Symposium on Parallel Algorithms and Architectures (SPAA), pages 119--129, 1998.
[2]
G. Baumgartner, A. Auer, D. E. Bernholdt, A. Bibireata, V. Choppella, D. Cociorva, X. Gao, R. J. Harrison, S. Hirata, S. Krishanmoorthy, S. Krishnan, C.-C. Lam, Q. Lu, M. Nooijen, R. M. Pitzer, J. Ramanujam, P. Sadayappan, and A. Sibiryakov. Synthesis of high-performance parallel programs for a class of ab initio quantum chemistry models. Proceedings of the IEEE, 93(2):276--292, Feb. 2005.
[3]
P. Berenbrink, T. Friedetzky, and L. Goldberg. The natural work-stealing algorithm is stable. In Proc. 42nd IEEE Symposium on Foundations of Computer Science (FOCS), pages 178--187, 2001.
[4]
G. E. Blelloch and J. Greiner. A provable time and space efflcient implementation of NESL. In Proc. 1st ACM SIGPLAN Intl. Conf. on Functional Programming (ICFP), pages 213--225, Philadelphia, Pennsylvania, May 1996.
[5]
R. D. Blumofe and C. Leiserson. Scheduling multithreaded computations by work stealing. In Proc. 35th Symposium on Foundations of Computer Science (FOCS), pages 356--368, Nov. 1994.
[6]
R. D. Blumofe and P. A. Lisiecki. Adaptive and reliable parallel computing on networks of workstations. In Proc. USENIX Annual Technical Conference (ATEC), pages 10--10, 1997.
[7]
Ü. V. Çatalyürek, E. G. Boman, K. D. Devine, D. Bozdag, R. Heaphy, and L. A. Riesen. Hypergraph-based dynamic load balancing for adaptive scientific computations. In Proc. 21st Intl. Parallel and Distributed Processing Symposium (IPDPS), pages 1--11. IEEE, 2007.
[8]
B. Chamberlain, D. Callahan, and H. Zima. Parallel programmability and the chapel language. Int. J. High Performance Computing Applications (IJHPCA), 21(3):291--312, 2007.
[9]
P. Charles, C. Grothoff, V. Saraswat, C. Donawa, A. Kielstra, K. Ebcioglu, C. von Praun, and V. Sarkar. X10: an object-oriented approach to non-uniform cluster computing. In Proc. Conf. on Object Oriented Prog. Systems, Languages, and Applications (OOPSLA), pages 519--538, 2005.
[10]
G. Cong, S. Kodali, S. Krishnamoorty, D. Lea, V. Saraswat, and T. Wen. Solving irregular graph problems using adaptive work-stealing. In Proc. 37th Int Conf. on Parallel Processing (ICPP), Portland, OR, Sept. 2008.
[11]
K. D. Devine, E. G. Boman, R. T. Heaphy, B. A. Hendrickson, J. D. Teresco, J. Faik, J. E. Flaherty, and L. G. Gervasio. New challanges in dynamic load balancing. J. Appl. Numer. Math., 52(2--3):133--152, 2005.
[12]
J. Dinan, S. Krishnamoorthy, D. B. Larkins, J. Nieplocha, and P. Sadayappan. Scioto: A framework for global-view task parallelism. In Proc. 37th Intl. Conf. on Parallel Processing (ICPP), pages 586--593, 2008.
[13]
J. Dinan, S. Olivier, G. Sabin, J. Prins, P. Sadayappan, and C.-W. Tseng. Dynamic load balancing of unbalanced computations using message passing. In Proc. of 6th Intl. Workshop on Performance Modeling, Evaluation, and Optimization of Parallel and Distributed Systems (PMEO-PDS), pages 1--8, 2007.
[14]
J. Dinan, S. Olivier, G. Sabin, J. Prins, P. Sadayappan, and C.-W. Tseng. A message passing benchmark for unbalanced applications. J. Simulation Modelling Practice and Theory, 16(9):1177--1189, 2008.
[15]
N. Francez and M. Rodeh. Achieving distributed termination without freezing. IEEE Trans. on Software Engineering, SE-8(3):287--292, May 1982.
[16]
M. Frigo, C. E. Leiserson, and K. H. Randall. The implementation of the Cilk-5 multithreaded language. In Proc. Conf. on Prog. Language Design and Implementation (PLDI), pages 212--223. ACM SIGPLAN, 1998.
[17]
Y. Guo, R. Barik, R. Raman, and V. Sarkar. Work-first and help-first scheduling policies for terminally strict parallel programs. In Proc. 23rd Intl. Parallel and Distributed Processing Symposium (IPDPS), 2009.
[18]
D. Hendler, Y. Lev, M. Moir, and N. Shavit. A dynamic-sized nonblocking work stealing deque. J. Distributed Computing, 18(3):189--207, 2006.
[19]
Intel Corporation. Cluster OpenMP user's guide v9.1. (309096-002 US), 2005--2006.
[20]
L. V. Kalé and S. Krishnan. CHARM++: A portable concurrent object oriented system based on C++. In Proc. Conf. on Object Oriented Prog. Systems, Languages, and Applications (OOPSLA), pages 91--108, 1993.
[21]
G. Karypis and V. Kumar. MeTis: Unstrctured Graph Partitioning and Sparse Matrix Ordering System, Version 4.0, Sept. 1998.
[22]
S. Krishnamoorthy, Ü. Çatalyürek, J. Nieplocha, A. Rountev, and P. Sadayappan. Hypergraph partitioning for automatic memory hierarchy management. In Proc. ACM/IEEE Conference Supercomputing (SC), page 98, 2006.
[23]
V. Kumar, A. Y. Grama, and N. R. Vempaty. Scalable load balancing techniques for parallel computers. J. Parallel Distrib. Comput., 22(1):60--79, 1994.
[24]
Y.-K. Kwok and I. Ahmad. Static scheduling algorithms for allocating directed task graphs to multiprocessors. ACM Computing Surveys, 31(4):406--471, 1999.
[25]
L. Lamport. A new solution of dijkstra's concurrent programming problem. Commun. ACM, 17(8):453--455, 1974.
[26]
M. M. Michael, M. T. Vechev, and V. A. Saraswat. Idempotent work stealing. In Proc. of the 14th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPOPP), pages 45--54, Feb. 2009.
[27]
M. Mitzenmacher. Analyses of load stealing models based on differential equations. In Proc. 10th Symposium on Parallel Algorithms and Architectures (SPAA), pages 212--221. ACM, 1998.
[28]
MPI Forum. MPI: A message-passing interface standard. Technical Report UT-CS-94-230, 1994.
[29]
J. Nieplocha and B. Carpenter. ARMCI: A portable remote memory copy library for distributed array libraries and compiler run-time systems. Lecture Notes in Computer Science, 1586:533--546, 1999.
[30]
J. Nieplocha, R. J. Harrison, and R. J. Littlefield. Global arrays: a portable "shared-memory" programming model for distributed memory computers. In Proc. ACM/IEEE Conference Supercomputing (SC), pages 340--349, 1994.
[31]
S. Olivier, J. Huan, J. Liu, J. Prins, J. Dinan, P. Sadayappan, and C.-W. Tseng. UTS: An unbalanced tree search benchmark. In Proc. 19th Intl Workshop on Languages and Compilers for Parallel Computing (LCPC), pages 235--250, 2006.
[32]
S. Olivier and J. Prins. Scalable dynamic load balancing using UPC. In Proc. of 37th Intl. Conference on Parallel Processing (ICPP), Sept. 2008.
[33]
A. Sinha and L. V. Kalé. A load balancing strategy for prioritized execution of tasks. In Proc. 7th Intl. Parallel Processing Symposium (IPPS), pages 230--237, 1993.
[34]
G. L. Steele Jr. Parallel programming and parallel abstractions in fortress. In Proc. 14th Intl. Conf. on Parallel Architecture and Compilation Techniques (PACT), page 157, 2005.
[35]
A. Trifunović and W. J. Knottenbelt. Parallel multilevel algorithms for hypergraph partitioning. J. Parallel Distrib. Comput., 68(5):563--581, 2008.
[36]
UPC Consortium. UPC language specifications, v1.2. Technical Report LBNL-59208, Lawrence Berkeley National Lab, 2005.
[37]
R. V. van Nieuwpoort, T. Kielmann, and H. E. Bal. Efficient load balancing for wide-area divide-and-conquer applications. In Proc. of the ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPOPP), 2001.
[38]
T. Yanai, G. I. Fann, Z. Gan, R. J. Harrison, and G. Beylkin. Multiresolution quantum chemistry in multiwavelet bases: Analytic derivatives for hartree--fock and density functional theory. J. Chemical Physics, 121(7):2866--2876, 2004.
[39]
K. A. Yelick, L. Semenzato, G. Pike, C. Miyamoto, B. Liblit, A. Krishnamurthy, P. N. Hilfinger, S. L. Graham, D. Gay, P. Colella, and A. Aiken. Titanium: A high-performance java dialect. Concurrency: Practice and Experience, 10(11--13):825--836, 1998.

Cited By

View all
  • (2024)Parallel and Distributed Bayesian Network Structure LearningIEEE Transactions on Parallel and Distributed Systems10.1109/TPDS.2023.332683235:4(517-530)Online publication date: Apr-2024
  • (2024)Mobile Crowd Computing with Mobile Agents and Work Stealing2024 Moratuwa Engineering Research Conference (MERCon)10.1109/MERCon63886.2024.10688742(97-102)Online publication date: 8-Aug-2024
  • (2024)X-OpenMP — eXtreme fine-grained tasking using lock-less work stealingFuture Generation Computer Systems10.1016/j.future.2024.05.019159(444-458)Online publication date: Oct-2024
  • Show More Cited By

Recommendations

Comments

Information & Contributors

Information

Published In

cover image ACM Conferences
SC '09: Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis
November 2009
778 pages
ISBN:9781605587448
DOI:10.1145/1654059
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 14 November 2009

Permissions

Request permissions for this article.

Check for updates

Author Tags

  1. ARMCI
  2. PGAS
  3. dynamic load balancing
  4. global arrays
  5. task pools
  6. work stealing

Qualifiers

  • Research-article

Funding Sources

Conference

SC '09
Sponsor:

Acceptance Rates

SC '09 Paper Acceptance Rate 59 of 261 submissions, 23%;
Overall Acceptance Rate 1,516 of 6,373 submissions, 24%

Upcoming Conference

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)76
  • Downloads (Last 6 weeks)13
Reflects downloads up to 13 Nov 2024

Other Metrics

Citations

Cited By

View all
  • (2024)Parallel and Distributed Bayesian Network Structure LearningIEEE Transactions on Parallel and Distributed Systems10.1109/TPDS.2023.332683235:4(517-530)Online publication date: Apr-2024
  • (2024)Mobile Crowd Computing with Mobile Agents and Work Stealing2024 Moratuwa Engineering Research Conference (MERCon)10.1109/MERCon63886.2024.10688742(97-102)Online publication date: 8-Aug-2024
  • (2024)X-OpenMP — eXtreme fine-grained tasking using lock-less work stealingFuture Generation Computer Systems10.1016/j.future.2024.05.019159(444-458)Online publication date: Oct-2024
  • (2024)PGAS Data Structure for Unbalanced Tree-Based Algorithms at ScaleComputational Science – ICCS 202410.1007/978-3-031-63759-9_13(103-111)Online publication date: 29-Jun-2024
  • (2023)Beyond Static Parallel Loops: Supporting Dynamic Task Parallelism on Manycore Architectures with Software-Managed Scratchpad MemoriesProceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 310.1145/3582016.3582020(46-58)Online publication date: 25-Mar-2023
  • (2023)Itoyori: Reconciling Global Address Space and Global Fork-Join Task ParallelismProceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis10.1145/3581784.3607049(1-15)Online publication date: 12-Nov-2023
  • (2023)Reinforcement Learning for Load-Balanced Parallel Particle TracingIEEE Transactions on Visualization and Computer Graphics10.1109/TVCG.2022.314874529:6(3052-3066)Online publication date: 1-Jun-2023
  • (2023)Accelerating Distributed Learning in Non-Dedicated EnvironmentsIEEE Transactions on Cloud Computing10.1109/TCC.2021.310259311:1(515-531)Online publication date: 1-Jan-2023
  • (2023)Massively Parallel Asynchronous Fractal Optimization2023 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)10.1109/IPDPSW59300.2023.00151(930-938)Online publication date: May-2023
  • (2023)Distributing Simplex-Shaped Nested for-Loops to Identify Carcinogenic Gene Combinations2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS)10.1109/IPDPS54959.2023.00101(974-984)Online publication date: May-2023
  • Show More Cited By

View Options

Get Access

Login options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media