
Adaptive work-stealing with parallelism feedback

Published: 22 September 2008

Abstract

Multiprocessor scheduling in a shared multiprogramming environment can be structured as two-level scheduling, where a kernel-level job scheduler allots processors to jobs and a user-level thread scheduler schedules the work of a job on its allotted processors. We present a randomized work-stealing thread scheduler for fork-join multithreaded jobs that provides continual parallelism feedback to the job scheduler in the form of requests for processors. Our A-STEAL algorithm is appropriate for large parallel servers where many jobs share a common multiprocessor resource and in which the number of processors available to a particular job may vary during the job's execution. Assuming that the job scheduler never allots a job more processors than requested by the job's thread scheduler, A-STEAL guarantees that the job completes in near-optimal time while utilizing at least a constant fraction of the allotted processors.
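To make the parallelism-feedback loop concrete, the following Python sketch shows the kind of multiplicative-increase, multiplicative-decrease desire estimation that A-STEAL performs between scheduling quanta. It is an illustration added here, not the authors' code: the constants RHO and DELTA and the helper name next_desire are hypothetical, and the utilization test is a simplification of the paper's efficiency criterion.

    # Sketch of A-STEAL-style parallelism feedback (assumed names and constants).
    RHO = 2.0     # responsiveness parameter, rho > 1
    DELTA = 0.8   # utilization threshold, 0 < delta < 1

    def next_desire(desire, allotment, quantum_len, useful_cycles):
        """Processor request to report to the job scheduler for the next quantum.

        desire        -- processors requested for the quantum just ended
        allotment     -- processors the job scheduler actually granted
        useful_cycles -- allotted cycles spent working rather than stealing
        """
        efficient = useful_cycles >= DELTA * allotment * quantum_len
        if not efficient:
            # Too many allotted cycles were wasted on steal attempts: back off.
            return desire / RHO
        if allotment == desire:
            # Efficient and fully satisfied: probe for more processors.
            return desire * RHO
        # Efficient but deprived (granted fewer than requested): hold steady.
        return desire

Because the job scheduler never allots more than the request, a rule of this shape is what lets the thread scheduler keep utilization high on whatever it is granted.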
We model the job scheduler as the thread scheduler's adversary, challenging the thread scheduler to be robust to the operating environment as well as to the job scheduler's administrative policies. For example, the job scheduler might make a large number of processors available exactly when the job has little use for them. To analyze the performance of our adaptive thread scheduler under this stringent adversarial assumption, we introduce a new technique called trim analysis, which allows us to prove that our thread scheduler performs poorly on no more than a small number of time steps, exhibiting near-optimal behavior on the vast majority.
More precisely, suppose that a job has work T1 and span T∞. On a machine with P processors, A-STEAL completes the job in an expected duration of O(T1/P̃ + T∞ + L lg P) time steps, where L is the length of a scheduling quantum and P̃ denotes the O(T∞ + L lg P)-trimmed availability. This quantity is the average of the processor availability over all time steps except the O(T∞ + L lg P) time steps that have the highest processor availability. When the job's parallelism dominates the trimmed availability, that is, when P̃ ≪ T1/T∞, the job achieves nearly perfect linear speedup. Conversely, when the trimmed availability dominates the parallelism, the asymptotic running time of the job is nearly the length of its span, which is optimal.
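To make the trimmed-availability definition concrete, here is a small Python helper, again ours rather than the paper's, that computes the R-trimmed availability of an availability profile by discarding the R steps with the highest availability and averaging the rest:

    def trimmed_availability(profile, r):
        # Drop the r largest availability values, then average the remainder.
        kept = sorted(profile)[:max(len(profile) - r, 0)]
        return sum(kept) / len(kept) if kept else 0.0

    # A single burst of 64 free processors does not inflate the trimmed mean:
    print(trimmed_availability([4, 4, 4, 64, 4, 4], r=1))  # -> 4.0

Trimming is what neutralizes the adversary described above: a burst of availability that the job cannot exploit falls into the trimmed steps rather than degrading the bound.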
We measured the performance of A-STEAL on a simulated multiprocessor system using synthetic workloads. For jobs with sufficient parallelism, our experiments confirm that A-STEAL provides almost perfect linear speedup across a variety of processor availability profiles. We compared A-STEAL with the ABP algorithm, an adaptive work-stealing thread scheduler developed by Arora et al. [1998] which does not employ parallelism feedback. On moderately to heavily loaded machines with large numbers of processors, A-STEAL typically completed jobs more than twice as quickly as ABP, despite being allotted the same number or fewer processors on every step, while wasting only 10% of the processor cycles wasted by ABP.

References

[1]
Acar, U. A., Blelloch, G. E., and Blumofe, R. D. 2000. The data locality of work stealing. In Proceedings of the ACM Symposium on Parallel Algorithms and Architectures (SPAA). 1--12.
[2]
Agrawal, K., He, Y., Hsu, W. J., and Leiserson, C. E. 2006a. Adaptive task scheduling with parallelism feedback. In Proceedings of the ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP).
[3]
Agrawal, K., He, Y., and Leiserson, C. E. 2006b. An empirical evaluation of work stealing with parallelism feedback. In Proceedings of the 2006 IEEE International Conference on Distributed Computing Systems (ICDCS'06).
[4]
Agrawal, K., He, Y., and Leiserson, C. E. 2007. Adaptive work stealing with parallelism feedback. In Proceedings of the IEEE International Parallel & Distributed Processing Symposium (IPDPS '07).
[5]
Arora, N. S., Blumofe, R. D., and Plaxton, C. G. 1998. Thread scheduling for multiprogrammed multiprocessors. In Proceedings of the ACM Symposium on Parallel Algorithms and Architectures (SPAA). 119--129.
[6]
Aspnes, J., Herlihy, M., and Shavit, N. 1994. Counting networks. J. ACM 41, 5, 1020--1048.
[7]
Bansal, N., Dhamdhere, K., Könemann, J., and Sinha, A. 2004. Non-clairvoyant scheduling for minimizing mean slowdown. Algorithmica 40, 4, 305--318.
[8]
Blelloch, G., Gibbons, P., and Matias, Y. 1999. Provably efficient scheduling for languages with fine-grained parallelism. J. ACM 46, 2, 281--321.
[9]
Blelloch, G. E., Gibbons, P. B., and Matias, Y. 1995. Provably efficient scheduling for languages with fine-grained parallelism. In Proceedings of the ACM Symposium on Parallel Algorithms and Architectures (SPAA). 1--12.
[10]
Blelloch, G. E. and Greiner, J. 1996. A provable time and space efficient implementation of NESL. In Proceedings of the ACM SIGPLAN International Conference on Functional Programming (ICFP'96). 213--225.
[11]
Blumofe, R. D. 1995. Executing multithreaded programs efficiently. Ph.D. thesis, Massachusetts Institute of Technology.
[12]
Blumofe, R. D., Joerg, C. F., Kuszmaul, B. C., Leiserson, C. E., Randall, K. H., and Zhou, Y. 1995. Cilk: an efficient multithreaded runtime system. In Proceedings of the ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP). 207--216.
[13]
Blumofe, R. D., Joerg, C. F., Kuszmaul, B. C., Leiserson, C. E., Randall, K. H., and Zhou, Y. 1996. Cilk: an efficient multithreaded runtime system. J. Parallel Distrib. Comput. 37, 1, 55--69.
[14]
Blumofe, R. D. and Leiserson, C. E. 1998. Space-efficient scheduling of multithreaded computations. SIAM J. Comput. 27, 1 (Feb.), 202--229.
[15]
Blumofe, R. D. and Leiserson, C. E. 1999. Scheduling multithreaded computations by work stealing. J. ACM 46, 5, 720--748.
[16]
Blumofe, R. D., Leiserson, C. E., and Song, B. 1998. Automatic processor allocation for work-stealing jobs. Unpublished manuscript.
[17]
Blumofe, R. D. and Lisiecki, P. A. 1997. Adaptive and reliable parallel computing on networks of workstations. In Proceedings of the USENIX 1997 Annual Technical Conference (USENIX'97). 133--147.
[18]
Blumofe, R. D. and Papadopoulos, D. 1998. The performance of work stealing in multiprogrammed environments. In Proceedings of the Joint International Conference on Measurement and Modeling of Computer Systems (SIGMETRICS). 266--267.
[19]
Blumofe, R. D. and Papadopoulos, D. 1999. Hood: a user-level threads library for multiprogrammed multiprocessors. Tech. Rep., University of Texas at Austin.
[20]
Blumofe, R. D. and Park, D. S. 1994. Scheduling large-scale parallel computations on networks of workstations. In Proceedings of the IEEE International Symposium on High Performance Distributed Computing (HPDC'94). 96--105.
[21]
Brent, R. P. 1974. The parallel evaluation of general arithmetic expressions. J. ACM 21, 2, 201--206.
[22]
Burton, F. W. and Sleep, M. R. 1981. Executing functional programs on a virtual tree of processors. In Proceedings of the 1981 Conference on Functional Programming Languages and Computer Architecture (FPCA'81). 187--194.
[23]
Chase, D. and Lev, Y. 2005. Dynamic circular work-stealing deque. In Proceedings of the ACM Symposium on Parallelism in Algorithms and Architectures (SPAA). 21--28.
[24]
Chiang, S.-H. and Vernon, M. K. 1996. Dynamic vs. static quantum-based parallel processor allocation. In Proceedings of the International Workshop on Job Scheduling Strategies for Parallel Processing (JSSPP). 200--223.
[25]
Cirne, W. and Berman, F. 2001. A model for moldable supercomputer jobs. In Proceedings of the IEEE International Parallel & Distributed Processing Symposium (IPDPS'01). IEEE Computer Society. 50--59.
[26]
Cormen, T. H., Leiserson, C. E., Rivest, R. L., and Stein, C. 2001. Introduction to Algorithms, 2nd ed. The MIT Press and McGraw-Hill.
[27]
Deng, X. and Dymond, P. 1996. On multiprocessor system scheduling. In Proceedings of the ACM Symposium on Parallel Algorithms and Architectures (SPAA). 82--88.
[28]
Deng, X., Gu, N., Brecht, T., and Lu, K. 1996. Preemptive scheduling of parallel jobs on multiprocessors. In Proceedings of the 7th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA'96). Society for Industrial and Applied Mathematics. 159--167.
[29]
DESMOJ. 1999. DESMO-J: a framework for discrete-event modelling and simulation. http://asi-www.informatik.uni-hamburg.de/desmoj/.
[30]
Downey, A. B. 1998. A parallel workload model and its implications for processor allocation. Cluster Comput. 1, 1, 133--145.
[31]
Eager, D. L., Zahorjan, J., and Lazowska, E. D. 1989. Speedup versus efficiency in parallel systems. IEEE Trans. Comput. 38, 3, 408--423.
[32]
Edmonds, J. 1999. Scheduling in the dark. In Proceedings of the 31st Annual ACM Symposium on Theory of Computing (STOC'99). 179--188.
[33]
Edmonds, J., Chinn, D. D., Brecht, T., and Deng, X. 2003. Non-clairvoyant multiprocessor scheduling of jobs with changing execution characteristics. J. Sched. 6, 3, 231--250.
[34]
Fang, Z., Tang, P., Yew, P.-C., and Zhu, C.-Q. 1990. Dynamic processor self-scheduling for general parallel nested loops. IEEE Trans. Comput. 39, 7, 919--929.
[35]
Feitelson, D. 2005. Parallel workloads archive. http://www.cs.huji.ac.il/labs/parallel/workload/.
[36]
Feitelson, D. G. 1996. Packing schemes for gang scheduling. In Proceedings of the International Workshop on Job Scheduling Strategies for Parallel Processing (JSSPP), D. G. Feitelson and L. Rudolph, Eds. Vol. 1162. Springer. 89--110.
[37]
Feitelson, D. G. 1997. Job scheduling in multiprogrammed parallel systems (extended version). Tech. Rep., IBM Research Report RC 19790 (87657) 2nd Revision.
[38]
Finkel, R. and Manber, U. 1987. DIB—a distributed implementation of backtracking. ACM Trans. Program. Lang. Syst. 9, 2 (Apr.), 235--256.
[39]
Frigo, M., Leiserson, C. E., and Randall, K. H. 1998. The implementation of the Cilk-5 multithreaded language. In Proceedings of the Conference on Programming Language Design and Implementation (PLDI'98). 212--223.
[40]
Ghosal, D., Serazzi, G., and Tripathi, S. K. 1991. The processor working set and its use in scheduling multiprocessor systems. IEEE Trans. Softw. Eng. 17, 5, 443--453.
[41]
Graham, R. L. 1969. Bounds on multiprocessing timing anomalies. SIAM J. Appl. Math. 17, 2, 416--429.
[42]
Gu, N. 1995. Competitive analysis of dynamic processor allocation strategies. Master's thesis, York University.
[43]
Halbherr, M., Zhou, Y., and Joerg, C. F. 1994. MIMD-style parallel programming with continuation-passing threads. In Proceedings of the International Workshop on Massive Parallelism: Hardware, Software, and Applications.
[44]
Halstead Jr., R. H. 1984. Implementation of Multilisp: Lisp on a multiprocessor. In Proceedings of the 1984 ACM Symposium on LISP and Functional Programming (LFP'84). 9--17.
[45]
Harchol-Balter, M. 1999. The effect of heavy-tailed job size distributions on computer system design. In Proceedings of the Conference on Applications of Heavy Tailed Distributions in Economics.
[46]
Harchol-Balter, M. and Downey, A. B. 1997. Exploiting process lifetime distributions for dynamic load balancing. ACM Trans. Comput. Syst. 15, 3, 253--285.
[47]
Hendler, D., Lev, Y., Moir, M., and Shavit, N. 2006. A dynamic-sized nonblocking work stealing deque. Distrib. Comput. 18, 3, 189--207.
[48]
Hendler, D. and Shavit, N. 2002. Non-blocking steal-half work queues. In Proceedings of the Annual Symposium on Principles of Distributed Computing (PODC). 280--289.
[49]
Hummel, S. F. and Schonberg, E. 1991. Low-overhead scheduling of nested parallelism. IBM J. Res. Develop. 35, 5-6, 743--765.
[50]
Karp, R. M. and Zhang, Y. 1988. A randomized parallel branch-and-bound procedure. In Proceedings of the Annual ACM Symposium on Theory of Computing (STOC'88). 290--300.
[51]
Leland, W. and Ott, T. J. 1986. Load-balancing heuristics and process behavior. In Proceedings of the Joint International Conference on Measurement and Modeling of Computer Systems (SIGMETRICS). 54--69.
[52]
Leutenegger, S. T. and Vernon, M. K. 1990. The performance of multiprogrammed multiprocessor scheduling policies. In Proceedings of the Joint International Conference on Measurement and Modeling of Computer Systems (SIGMETRICS). 226--236.
[53]
Lublin, U. and Feitelson, D. G. 2003. The workload on parallel supercomputers: modeling the characteristics of rigid jobs. J. Parallel Distrib. Comput. 63, 11, 1105--1122.
[54]
Martorell, X., Corbalán, J., Nikolopoulos, D. S., Navarro, N., Polychronopoulos, E. D., Papatheodorou, T. S., and Labarta, J. 2000. A tool to schedule parallel applications on multiprocessors: the NANOS CPU manager. In Proceedings of the International Workshop on Job Scheduling Strategies for Parallel Processing (JSSPP). 87--112.
[55]
McCann, C., Vaswani, R., and Zahorjan, J. 1993. A dynamic processor allocation policy for multiprogrammed shared-memory multiprocessors. ACM Trans. Comput. Syst. 11, 2, 146--178.
[56]
Mohr, E., Kranz, D. A., and Halstead, Jr., R. H. 1990. Lazy task creation: a technique for increasing the granularity of parallel programs. In Proceedings of the 1990 ACM Symposium on LISP and Functional Programming (LFP'90). 185--197.
[57]
Motwani, R., Phillips, S., and Torng, E. 1993. Non-clairvoyant scheduling. In Proceedings of the Fourth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA'93). 422--431.
[58]
Motwani, R. and Raghavan, P. 1995. Randomized Algorithms, 1st ed. Cambridge University Press.
[59]
Narlikar, G. J. and Blelloch, G. E. 1999. Space-efficient scheduling of nested parallelism. ACM Trans. Prog. Lang. Syst. 21, 1, 138--173.
[60]
Nguyen, T. D., Vaswani, R., and Zahorjan, J. 1996a. Maximizing speedup through self-tuning of processor allocation. In Proceedings of the 10th International Parallel Processing Symposium (IPPS'96). 463--468.
[61]
Nguyen, T. D., Vaswani, R., and Zahorjan, J. 1996b. Using runtime measured workload characteristics in parallel processor scheduling. In Proceedings of the International Workshop on Job Scheduling Strategies for Parallel Processing (JSSPP). 155--174.
[62]
Parsons, E. W. and Sevcik, K. C. 1995. Multiprocessor scheduling for high-variability service time distributions. In Proceedings of the 9th International Parallel Processing Symposium (IPPS'95). 127--145.
[63]
Rosti, E., Smirni, E., Dowdy, L. W., Serazzi, G., and Carlson, B. M. 1994. Robust partitioning schemes of multiprocessor systems. Perform. Eval. 19, 2-3, 141--165.
[64]
Rosti, E., Smirni, E., Serazzi, G., and Dowdy, L. W. 1995. Analysis of non-work-conserving processor partitioning policies. In Proceedings of the 9th International Parallel Processing Symposium (IPPS'95). 165--181.
[65]
Rudolph, L., Slivkin-Allalouf, M., and Upfal, E. 1991. A simple load balancing scheme for task allocation in parallel machines. In Proceedings of the ACM Symposium on Parallel Algorithms and Architectures (SPAA). 237--245.
[66]
Sen, S. 2004. Dynamic processor allocation for adaptively parallel jobs. Master's thesis, Massachusetts Institute of Technology.
[67]
Sevcik, K. C. 1989. Characterizations of parallelism in applications and their use in scheduling. In Proceedings of the Joint International Conference on Measurement and Modeling of Computer Systems (SIGMETRICS). 171--180.
[68]
Sevcik, K. C. 1994. Application scheduling and processor allocation in multiprogrammed parallel processing systems. Perform. Eval. 19, 2-3, 107--140.
[69]
Song, B. 1998. Scheduling adaptively parallel jobs. Master's thesis, Massachusetts Institute of Technology.
[70]
Squillante, M. S. 1995. On the benefits and limitations of dynamic partitioning in parallel computer systems. In Proceedings of the 9th International Parallel Processing Symposium (IPPS'95). 219--238.
[71]
Supercomputing Technologies Group. 2001. Cilk 5.3.2 Reference Manual. MIT Laboratory for Computer Science.
[72]
Brecht, T. B. and Guha, K. 1996. Using parallel program characteristics in dynamic processor allocation policies. Perform. Eval. 27-28, 519--539.
[73]
Tucker, A. and Gupta, A. 1989. Process control and scheduling issues for multiprogrammed shared-memory multiprocessors. In Proceedings of the Twelfth ACM Symposium on Operating Systems Principles (SOSP'89). 159--166.
[74]
Yue, K. K. and Lilja, D. J. 2001. Implementing a dynamic processor allocation policy for multiprogrammed parallel applications in the Solaris™ operating system. Concurrency Computat. Pract. Exper. 13, 6, 449--464.
[75]
Zipf, G. K. 1949. Human Behavior and the Principle of Least Effort. Addison-Wesley.

Published In

ACM Transactions on Computer Systems, Volume 26, Issue 3
September 2008
108 pages
ISSN: 0734-2071
EISSN: 1557-7333
DOI: 10.1145/1394441
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 22 September 2008
Revised: 01 July 2008
Accepted: 01 October 2007
Received: 01 February 2007
Published in TOCS Volume 26, Issue 3


Author Tags

  1. Adaptive scheduling
  2. adversary
  3. instantaneous parallelism
  4. job scheduling
  5. multiprocessing
  6. multiprogramming
  7. parallel computation
  8. parallelism feedback
  9. processor allocation
  10. randomized algorithm
  11. space sharing
  12. span
  13. thread scheduling
  14. trim analysis
  15. two-level scheduling
  16. work
  17. work-stealing

Qualifiers

  • Research-article
  • Research
  • Refereed
