Article

Scheduling threads for constructive cache sharing on CMPs

Authors:

Phillip B. Gibbons,

Michael Kozuch,

Vasileios Liaskovitis,

Anastassia Ailamaki,

Guy E. Blelloch,

Nikos Hardavellas,

Chris Wilkerson

SPAA '07: Proceedings of the nineteenth annual ACM symposium on Parallel algorithms and architectures

Pages 105 - 115

https://doi.org/10.1145/1248377.1248396

Published: 09 June 2007

Abstract

In chip multiprocessors (CMPs), limiting the number of off-chip cache misses is crucial for good performance. Many multithreaded programs provide opportunities for constructive cache sharing, in which concurrently scheduled threads share a largely overlapping working set. In this paper, we compare the performance of two state-of-the-art schedulers proposed for fine-grained multithreaded programs: Parallel Depth First (PDF), which is specifically designed for constructive cache sharing, and Work Stealing (WS), which is a more traditional design. Our experimental results indicate that PDF scheduling yields a 1.3-1.6X performance improvement relative to WS for several fine-grained parallel benchmarks on projected future CMP configurations; we also report several issues that may limit the advantage of PDF in certain applications. These results also indicate that PDF more effectively utilizes off-chip bandwidth, making it possible to trade off on-chip cache for a larger number of cores. Moreover, we find that task granularity plays a key role in cache performance. Therefore, we present an automatic approach for selecting effective grain sizes, based on a new working set profiling algorithm that is an order of magnitude faster than previous approaches. This is the first paper demonstrating the effectiveness of PDF on real benchmarks, providing a direct comparison between PDF and WS, revealing the limiting factors for PDF in practice, and presenting an approach for overcoming these factors.
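To make the contrast between the two schedulers concrete, the following toy simulation may help. It is only an illustrative sketch, not the implementation evaluated in the paper. The assumptions are all ours: the program is a binary fork tree over `n_leaves` leaf tasks (joins are ignored), every task takes one time step, the work-stealing thief picks the first non-empty victim rather than a random one, and the helper names (`children`, `pdf_schedule`, `ws_schedule`, `max_spread`) are hypothetical. The "spread" between leaf indices that run in the same step is used as a rough stand-in for how far apart the concurrently active working sets are.

```python
import heapq
from collections import deque


def children(task):
    """Split a task covering leaves [lo, hi) into two halves; a leaf has no children."""
    lo, hi = task
    if hi - lo == 1:
        return []
    mid = (lo + hi) // 2
    return [(lo, mid), (mid, hi)]


def pdf_schedule(n_leaves, n_cores):
    """PDF-style: each step runs the ready tasks that are earliest in sequential order."""
    ready = [(0, n_leaves)]  # min-heap; ready tasks cover disjoint ranges, so lo orders them
    steps = []
    while ready:
        batch = [heapq.heappop(ready) for _ in range(min(n_cores, len(ready)))]
        steps.append([lo for lo, hi in batch if hi - lo == 1])
        for task in batch:
            for child in children(task):
                heapq.heappush(ready, child)
    return steps


def ws_schedule(n_leaves, n_cores):
    """WS-style: pop the newest task from the own deque, else steal the oldest from a victim."""
    deques = [deque() for _ in range(n_cores)]
    deques[0].append((0, n_leaves))
    steps = []
    while any(deques):
        picked = []
        for w in range(n_cores):
            if deques[w]:
                picked.append((w, deques[w].pop()))
            else:
                # Assumption: deterministic victim choice; real work stealing picks at random.
                victim = next((v for v in range(n_cores) if deques[v]), None)
                if victim is not None:
                    picked.append((w, deques[victim].popleft()))
        steps.append([lo for _, (lo, hi) in picked if hi - lo == 1])
        for w, task in picked:
            for child in reversed(children(task)):
                deques[w].append(child)  # right half first, so the left half is popped next
    return steps


def max_spread(steps):
    """Largest gap between leaf indices that execute in the same time step."""
    return max((max(s) - min(s) for s in steps if len(s) > 1), default=0)


if __name__ == "__main__":
    print("PDF:", pdf_schedule(8, 2), "spread", max_spread(pdf_schedule(8, 2)))
    print("WS: ", ws_schedule(8, 2), "spread", max_spread(ws_schedule(8, 2)))
```

With 8 leaves and 2 cores, the PDF-style schedule runs neighbouring leaves together (0 with 1, 2 with 3, and so on), while the WS-style schedule pairs leaves from opposite halves of the tree (0 with 4, 1 with 5, and so on). If neighbouring leaves touch overlapping data, the first pattern is the kind of co-scheduling that constructive cache sharing relies on; whether that holds for a given program is exactly what the paper's benchmarks examine.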



Index Terms

Scheduling threads for constructive cache sharing on CMPs
1. Software and its engineering
  1. Software organization and properties
    1. Contextual software domains
      1. Operating systems
        Process management
        Multithreading
        Scheduling



Published In


SPAA '07: Proceedings of the nineteenth annual ACM symposium on Parallel algorithms and architectures

June 2007

376 pages

ISBN:9781595936677

DOI:10.1145/1248377

General Chair: Phillip B. Gibbons, Intel Research, USA

Program Chair: Christian Scheideler, Technische Universität München, Germany

Copyright © 2007 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.


Publisher

Association for Computing Machinery

New York, NY, United States


Qualifiers

Article

Conference

SPAA07: 19th ACM Symposium on Parallelism in Algorithms and Architectures

June 9 - 11, 2007

San Diego, California, USA

Acceptance Rates

Overall Acceptance Rate 447 of 1,461 submissions, 31%


Bibliometrics

Total Citations: 112
Total Downloads: 1,535
Downloads (Last 12 months): 18
Downloads (Last 6 weeks): 2

Reflects downloads up to 20 Feb 2025
