Article

Effectively sharing a cache among threads

Authors:

Guy E. Blelloch and

Phillip B. Gibbons

SPAA '04: Proceedings of the sixteenth annual ACM symposium on Parallelism in algorithms and architectures

June 2004

Pages 235 - 244

https://doi.org/10.1145/1007912.1007948

Published: 27 June 2004

Abstract

We compare the number of cache misses M₁ for running a computation on a single processor with cache size C₁ to the total number of misses M_p for the same computation when using p processors or threads and a shared cache of size C_p. We show that for any computation, and with an appropriate (greedy) parallel schedule, if C_p ≥ C₁ + pd then M_p ≤ M₁. The depth d of the computation is the length of the critical path of dependences. This gives the perhaps surprising result that for sufficiently parallel computations the shared cache need only be an additive size larger than the single-processor cache, and gives some theoretical justification for designing machines with shared caches.

We model a computation as a DAG and the sequential execution as a depth first schedule of the DAG. The parallel schedule we study is a parallel depth-first schedule (PDF schedule) based on the sequential one. The schedule is greedy and therefore work-efficient. Our main results assume the Ideal Cache model, but we also present results for other more realistic cache models.
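
To make the scheduling idea concrete, the following is a minimal Python sketch of a sequential depth-first order and a greedy parallel depth-first (PDF) schedule on a small DAG. It assumes the common formulation in which each step runs up to p ready nodes, preferring those that come earliest in the sequential order. The function names, the dictionary-based DAG encoding, and the example graph are illustrative assumptions, not taken from the paper, and the sketch does not model the cache itself.

```python
def sequential_df_order(dag):
    """Return the nodes of `dag` in sequential depth-first order.

    `dag` maps every node to the ordered list of its children
    (hypothetical encoding). A node runs only after all of its
    parents have run.
    """
    indegree = {v: 0 for v in dag}
    for v in dag:
        for child in dag[v]:
            indegree[child] += 1

    # Stack of ready nodes; the leftmost ready node ends up on top.
    stack = [v for v in reversed(list(dag)) if indegree[v] == 0]
    order = []
    while stack:
        v = stack.pop()
        order.append(v)
        # Push children right to left so the leftmost is explored first.
        for child in reversed(dag[v]):
            indegree[child] -= 1
            if indegree[child] == 0:
                stack.append(child)
    return order


def pdf_schedule(dag, p):
    """Greedy schedule: each step runs up to `p` ready nodes, choosing
    the ones that appear earliest in the sequential depth-first order."""
    rank = {v: i for i, v in enumerate(sequential_df_order(dag))}
    indegree = {v: 0 for v in dag}
    for v in dag:
        for child in dag[v]:
            indegree[child] += 1

    ready = [v for v in dag if indegree[v] == 0]
    steps = []
    while ready:
        ready.sort(key=rank.__getitem__)
        step, ready = ready[:p], ready[p:]
        steps.append(step)
        for v in step:
            for child in dag[v]:
                indegree[child] -= 1
                if indegree[child] == 0:
                    ready.append(child)
    return steps


if __name__ == "__main__":
    # Illustrative fork-join "diamond": a forks b and c, which join at d.
    example = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}
    print(sequential_df_order(example))
    print(pdf_schedule(example, 2))
```

With p = 1 the schedule reproduces the sequential order. On the diamond example with p = 2 it finishes in three steps, which is the depth of that DAG.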

Index Terms

Effectively sharing a cache among threads
1. Computer systems organization
  1. Architectures
    1. Parallel architectures
      1. Multiple instruction, multiple data
2. Theory of computation
  1. Design and analysis of algorithms

Published In

SPAA '04: Proceedings of the sixteenth annual ACM symposium on Parallelism in algorithms and architectures

June 2004

332 pages

ISBN:1581138407

DOI:10.1145/1007912

General Chair: Phil Gibbons, Intel Research

Program Chair: Micah Adler, University of Massachusetts

Copyright © 2004 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Article

Conference

SPAA04

Sponsor:

SPAA04: 16th ACM Symposium on Parallelism in Algorithms and Architectures 2004

June 27 - 30, 2004

Barcelona, Spain

Acceptance Rates

Overall Acceptance Rate 447 of 1,461 submissions, 31%

Bibliometrics & Citations

Article Metrics

Total Citations: 58
Total Downloads: 1,010
Downloads (Last 12 months): 5
Downloads (Last 6 weeks): 0

Citations

Cited By

Shen Z, Wan Z, Gu Y, Sun Y, Agrawal K, Lee I (2022). Many Sequential Iterative Algorithms Can Be Parallel and (Nearly) Work-efficient. Proceedings of the 34th ACM Symposium on Parallelism in Algorithms and Architectures, 273-286. Online publication date: 11-Jul-2022.
https://dl.acm.org/doi/10.1145/3490148.3538574
Shiina S, Taura K (2022). Improving Cache Utilization of Nested Parallel Programs by Almost Deterministic Work Stealing. IEEE Transactions on Parallel and Distributed Systems 33:12, 4530-4546. Online publication date: 1-Dec-2022.
https://doi.org/10.1109/TPDS.2022.3196192
Arora J, Westrick S, Acar U (2021). Provably space-efficient parallel functional programming. Proceedings of the ACM on Programming Languages 5:POPL, 1-33. Online publication date: 4-Jan-2021.
https://dl.acm.org/doi/10.1145/3434299
Ahmad Z, Chowdhury R, Das R, Ganapathi P, Gregory A, Javanmard M, Agrawal K, Azar Y (2021). Low-Span Parallel Algorithms for the Binary-Forking Model. Proceedings of the 33rd ACM Symposium on Parallelism in Algorithms and Architectures, 22-34. Online publication date: 6-Jul-2021.
https://dl.acm.org/doi/10.1145/3409964.3461802
Dong X, Gu Y, Sun Y, Zhang Y, Agrawal K, Azar Y (2021). Efficient Stepping Algorithms and Implementations for Parallel Shortest Paths. Proceedings of the 33rd ACM Symposium on Parallelism in Algorithms and Architectures, 184-197. Online publication date: 6-Jul-2021.
https://dl.acm.org/doi/10.1145/3409964.3461782
Westrick S, Yadav R, Fluet M, Acar U (2019). Disentanglement in nested-parallel programs. Proceedings of the ACM on Programming Languages 4:POPL, 1-32. Online publication date: 20-Dec-2019.
https://dl.acm.org/doi/10.1145/3371115
Ren B, Balakrishna S, Jo Y, Krishnamoorthy S, Agrawal K, Kulkarni M (2019). Extracting SIMD Parallelism from Recursive Task-Parallel Programs. ACM Transactions on Parallel Computing 6:4, 1-37. Online publication date: 26-Dec-2019.
https://dl.acm.org/doi/10.1145/3365663
Muller S, Westrick S, Acar U (2019). Fairness in responsive parallelism. Proceedings of the ACM on Programming Languages 3:ICFP, 1-30. Online publication date: 26-Jul-2019.
https://dl.acm.org/doi/10.1145/3341685
Carra D, Michiardi P (2019). Memory Partitioning and Management in Memcached. IEEE Transactions on Services Computing 12:4, 564-576. Online publication date: 1-Jul-2019.
https://doi.org/10.1109/TSC.2016.2613048
Acar U, Charguéraud A, Guatto A, Rainey M, Sieczkowski F (2018). Heartbeat scheduling: provable efficiency for nested parallelism. ACM SIGPLAN Notices 53:4, 769-782. Online publication date: 11-Jun-2018.
https://dl.acm.org/doi/10.1145/3296979.3192391
