DOI: 10.1145/1007912.1007948
Article

Effectively sharing a cache among threads

Published: 27 June 2004

    Abstract

    We compare the number of cache misses $M_1$ for running a computation on a single processor with cache size $C_1$ to the total number of misses $M_p$ for the same computation when using $p$ processors or threads and a shared cache of size $C_p$. We show that for any computation, and with an appropriate (greedy) parallel schedule, if $C_p \ge C_1 + pd$ then $M_p \le M_1$. The depth $d$ of the computation is the length of the critical path of dependences. This gives the perhaps surprising result that for sufficiently parallel computations the shared cache need only be an additive size larger than the single-processor cache, and gives some theoretical justification for designing machines with shared caches.

    We model a computation as a DAG and the sequential execution as a depth first schedule of the DAG. The parallel schedule we study is a parallel depth-first schedule (PDF schedule) based on the sequential one. The schedule is greedy and therefore work-efficient. Our main results assume the Ideal Cache model, but we also present results for other more realistic cache models.
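
    The scheduling idea in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the example DAG, the function names, and the choice to measure depth as the number of nodes on the longest path are all assumptions made for illustration. The sketch builds a sequential depth-first order, derives a greedy parallel depth-first (PDF) schedule that runs, at each step, the (up to) p ready nodes that come earliest in the sequential order, and evaluates the cache size C1 + p*d from the abstract's condition.

```python
import heapq

# Hypothetical example DAG (not from the paper): node -> ordered list of children.
EXAMPLE_DAG = {
    "a": ["b", "c"],
    "b": ["d", "e"],
    "c": ["f"],
    "d": ["g"],
    "e": ["g"],
    "f": ["h"],
    "g": ["h"],
    "h": [],
}


def _indegrees(children):
    """Count the parents of every node."""
    indeg = {v: 0 for v in children}
    for kids in children.values():
        for c in kids:
            indeg[c] += 1
    return indeg


def one_df_order(children):
    """Sequential depth-first schedule: a stack-driven topological order."""
    indeg = _indegrees(children)
    # Push roots so that the first root listed ends up on top of the stack.
    stack = [v for v in reversed(list(children)) if indeg[v] == 0]
    order = []
    while stack:
        v = stack.pop()
        order.append(v)
        # Push newly ready children right to left, so the leftmost runs next.
        for c in reversed(children[v]):
            indeg[c] -= 1
            if indeg[c] == 0:
                stack.append(c)
    return order


def pdf_schedule(children, p):
    """Greedy PDF schedule: each step runs the (up to) p ready nodes
    with the smallest rank in the sequential depth-first order."""
    order = one_df_order(children)
    rank = {v: i for i, v in enumerate(order)}
    indeg = _indegrees(children)
    ready = [rank[v] for v in children if indeg[v] == 0]
    heapq.heapify(ready)
    steps = []
    while ready:
        batch = [order[heapq.heappop(ready)] for _ in range(min(p, len(ready)))]
        steps.append(batch)
        for v in batch:
            for c in children[v]:
                indeg[c] -= 1
                if indeg[c] == 0:
                    heapq.heappush(ready, rank[c])
    return steps


def depth(children):
    """Critical-path length, counted here in nodes (an assumption)."""
    longest = {v: 1 for v in children}
    for v in one_df_order(children):  # any topological order works
        for c in children[v]:
            longest[c] = max(longest[c], longest[v] + 1)
    return max(longest.values())


def sufficient_shared_cache(c1, p, d):
    """Shared cache size C1 + p*d from the abstract's condition Cp >= C1 + p*d."""
    return c1 + p * d
```

    On this example DAG the sequential order is a, b, d, e, g, c, f, h, and with p = 2 the PDF schedule takes five steps: [a], [b, c], [d, e], [g, f], [h]. With p = 1 the PDF schedule reduces to the sequential order. The depth is 5, so for a hypothetical single-processor cache of size 64 the condition asks for a shared cache of size at least 64 + 2*5 = 74, in the same unit.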

    Published In

    SPAA '04: Proceedings of the sixteenth annual ACM symposium on Parallelism in algorithms and architectures
    June 2004
    332 pages
    ISBN: 1581138407
    DOI: 10.1145/1007912
    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. chip multiprocessors
    2. multithreaded architectures
    3. scheduling algorithms
    4. shared cache

    Conference

    SPAA04

    Acceptance Rates

    Overall Acceptance Rate 447 of 1,461 submissions, 31%
