Flexible architectural support for fine-grain scheduling

Authors: Daniel Sanchez, Richard M. Yoo, Christos Kozyrakis

ASPLOS XV: Proceedings of the fifteenth International Conference on Architectural support for programming languages and operating systems

Pages 311 - 322

https://doi.org/10.1145/1736020.1736055

Published: 13 March 2010

Abstract

To make efficient use of CMPs with tens to hundreds of cores, it is often necessary to exploit fine-grain parallelism. However, managing tasks of a few thousand instructions is particularly challenging, as the runtime must ensure load balance without compromising locality, while introducing only small overheads. Software-only schedulers can implement various scheduling algorithms that match the characteristics of different applications and programming models, but suffer significant overheads as they synchronize and communicate task information over the deep cache hierarchy of a large-scale CMP. To reduce these costs, hardware-only schedulers like Carbon, which implement task queuing and scheduling in hardware, have been proposed. However, a hardware-only solution fixes the scheduling algorithm and leaves no room for other uses of the custom hardware.

This paper presents a combined hardware-software approach to build fine-grain schedulers that retain the flexibility of software schedulers while being as fast and scalable as hardware ones. We propose asynchronous direct messages (ADM), a simple architectural extension that provides direct exchange of asynchronous, short messages between threads in the CMP without going through the memory hierarchy. ADM is sufficient to implement a family of novel, software-mostly schedulers that rely on low-overhead messaging to efficiently coordinate scheduling and transfer task information. These schedulers match and often exceed the performance and scalability of Carbon when using the same scheduling algorithm. When the ADM runtime tailors its scheduling algorithm to application characteristics, it outperforms Carbon by up to 70%.
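As a rough illustration of the idea of coordinating scheduling through short messages, the sketch below simulates two workers that exchange steal requests and task batches through per-worker mailboxes. It is not the paper's ADM interface or any of its scheduling algorithms: the names `Worker`, `send`, `step`, and `run`, the message kinds, the steal-half policy, and the use of an in-memory `deque` as a stand-in for a hardware receive buffer are all assumptions made for this example, and the simulation is single-threaded so that it stays deterministic.

```python
from collections import deque

# Hypothetical message kinds; the paper's actual message formats are not given here.
STEAL_REQ, TASKS = "steal_req", "tasks"


class Worker:
    def __init__(self, wid, tasks=()):
        self.wid = wid
        self.local = deque(tasks)  # private task queue, touched only by its owner
        self.mailbox = deque()     # software stand-in for a hardware receive buffer
        self.waiting = False       # True while a steal request is outstanding
        self.done = []             # results of the tasks this worker executed


def send(workers, dst, msg):
    """Hypothetical send: deliver a short message to the destination mailbox."""
    workers[dst].mailbox.append(msg)


def step(workers, w, victim):
    """One scheduling step for worker w."""
    # Handle pending messages before touching the local queue.
    while w.mailbox:
        kind, payload = w.mailbox.popleft()
        if kind == STEAL_REQ:
            # Assumed policy: hand over half of the local tasks, taken from the tail.
            batch = [w.local.pop() for _ in range(len(w.local) // 2)]
            send(workers, payload, (TASKS, batch))
        elif kind == TASKS:
            w.local.extend(payload)
            w.waiting = False
    if w.local:
        # Run one task from the head of the local queue.
        w.done.append(w.local.popleft()())
    elif not w.waiting:
        # Out of work: ask the victim for tasks by message, not via shared queues.
        w.waiting = True
        send(workers, victim, (STEAL_REQ, w.wid))


def run(workers, total_tasks):
    """Round-robin the workers until every task has been executed."""
    n = len(workers)
    while sum(len(w.done) for w in workers) < total_tasks:
        for w in workers:
            step(workers, w, victim=(w.wid + 1) % n)
    return workers


if __name__ == "__main__":
    ws = [Worker(0, [lambda i=i: i * i for i in range(8)]), Worker(1)]
    run(ws, 8)
    print([len(w.done) for w in ws])
```

The point of the sketch is only that task queues stay private to their owners and all coordination happens through explicit messages; a real implementation would rely on the hardware messaging support and on the scheduling policies described in the paper itself.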

Index Terms

Flexible architectural support for fine-grain scheduling
1. Computer systems organization
  1. Architectures
    1. Parallel architectures
      1. Multiple instruction, multiple data
2. Software and its engineering
  1. Software notations and tools
    1. Compilers
      1. Runtime environments

Published In

ASPLOS XV: Proceedings of the fifteenth International Conference on Architectural support for programming languages and operating systems

March 2010

422 pages

ISBN:9781605588391

DOI:10.1145/1736020

General Chair: James C. Hoe, Carnegie Mellon University, USA
Program Chair: Vikram S. Adve, University of Illinois at Urbana-Champaign, USA

ACM SIGPLAN Notices Volume 45, Issue 3
ASPLOS '10
March 2010
399 pages
ISSN:0362-1340
EISSN:1558-1160
DOI:10.1145/1735971
ACM SIGARCH Computer Architecture News Volume 38, Issue 1
ASPLOS '10
March 2010
399 pages
ISSN:0163-5964
DOI:10.1145/1735970

Copyright © 2010 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

ASPLOS '10

ASPLOS '10: Architectural Support for Programming Languages and Operating Systems

March 13 - 17, 2010

Pittsburgh, Pennsylvania, USA

Acceptance Rates

ASPLOS XV Paper Acceptance Rate 32 of 181 submissions, 18%

Overall Acceptance Rate 535 of 2,713 submissions, 20%
