Cooperative Caching for Chip Multiprocessors

Published: 01 May 2006

Abstract

This paper presents CMP Cooperative Caching, a unified framework to manage a CMP's aggregate on-chip cache resources. Cooperative caching combines the strengths of private and shared cache organizations by forming an aggregate "shared" cache through cooperation among private caches. Locally active data are attracted to the private caches by their accessing processors to reduce remote on-chip references, while globally active data are cooperatively identified and kept in the aggregate cache to reduce off-chip accesses. Examples of cooperation include cache-to-cache transfers of clean data, replication-aware data replacement, and global replacement of inactive data. These policies can be implemented by modifying an existing cache replacement policy and cache coherence protocol, or by the new implementation of a directory-based protocol presented in this paper. Our evaluation using full-system simulation shows that cooperative caching achieves an off-chip miss rate similar to that of a shared cache, and a local cache hit rate similar to that of using private caches. Cooperative caching performs robustly over a range of system/cache sizes and memory latencies. For an 8-core CMP with 1MB L2 cache per core, the best cooperative caching scheme improves the performance of multithreaded commercial workloads by 5-11% compared with a shared cache and 4-38% compared with private caches. For a 4-core CMP running multiprogrammed SPEC2000 workloads, cooperative caching is on average 11% and 6% faster than shared and private cache organizations, respectively.
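
The replication-aware replacement mentioned in the abstract can be pictured with a small sketch. This is an illustration under stated assumptions, not the paper's implementation: the `Block` record, its `is_replica` flag and `last_use` timestamp, and the `pick_victim` helper are hypothetical names, and the policy shown (evict the least recently used replica first, otherwise fall back to plain LRU) is only one plausible reading of "replication-aware".

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Block:
    tag: int          # address tag of the cached block
    last_use: int     # logical timestamp of the most recent access
    is_replica: bool  # assumed flag: another on-chip cache also holds this block


def pick_victim(cache_set: List[Block]) -> Optional[Block]:
    """Choose a block to evict from one set of a private cache.

    Assumed policy: prefer the least recently used replica, because another
    on-chip copy can still serve later requests; if the set holds no
    replicas, fall back to plain LRU over all blocks.
    """
    if not cache_set:
        return None
    replicas = [b for b in cache_set if b.is_replica]
    candidates = replicas if replicas else cache_set
    return min(candidates, key=lambda b: b.last_use)
```

Given a set with one unreplicated block and two replicas, this helper evicts the older replica even when the unreplicated block was used less recently. That choice gives up some local hits in exchange for keeping more unique data on chip, which is the balance between private and shared organizations that the abstract describes.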

Published In

ACM SIGARCH Computer Architecture News, Volume 34, Issue 2
May 2006
383 pages
ISSN: 0163-5964
DOI: 10.1145/1150019
  • ISCA '06: Proceedings of the 33rd annual international symposium on Computer Architecture
    June 2006
    383 pages
    ISBN: 076952608X

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

  • Article

Bibliometrics

Article Metrics

  • Downloads (Last 12 months): 50
  • Downloads (Last 6 weeks): 4
Reflects downloads up to 27 Dec 2024

Cited By

  • (2022) Morpheus: Extending the Last Level Cache Capacity in GPU Systems Using Idle GPU Core Resources. 2022 55th IEEE/ACM International Symposium on Microarchitecture (MICRO), pp. 228-244. DOI: 10.1109/MICRO56248.2022.00029. Online publication date: Oct-2022
  • (2022) Coherency Traffic Reduction in Manycore Systems. 2022 25th Euromicro Conference on Digital System Design (DSD), pp. 262-267. DOI: 10.1109/DSD57027.2022.00043. Online publication date: Aug-2022
  • (2020) DELTA: Distributed Locality-Aware Cache Partitioning for Tile-based Chip Multiprocessors. 2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pp. 578-589. DOI: 10.1109/IPDPS47924.2020.00066. Online publication date: May-2020
  • (2019) Adaptive memory-side last-level GPU caching. Proceedings of the 46th International Symposium on Computer Architecture, pp. 411-423. DOI: 10.1145/3307650.3322235. Online publication date: 22-Jun-2019
  • (2019) CA Based Detection of Coherence Exploiting Hardware Trojans. Journal of Circuits, Systems and Computers. DOI: 10.1142/S0218126620501200. Online publication date: 9-Sep-2019
  • (2019) D3N: A multi-layer cache for the rest of us. 2019 IEEE International Conference on Big Data (Big Data), pp. 327-338. DOI: 10.1109/BigData47090.2019.9006396. Online publication date: Dec-2019
  • (2019) Cache Memory Architectures for Handling Big Data Applications: A Survey. Smart Computing Paradigms: New Progresses and Challenges, pp. 211-220. DOI: 10.1007/978-981-13-9680-9_18. Online publication date: 1-Dec-2019
  • (2019) Data Similarity-Aware Computation Infrastructure for the Cloud. Searchable Storage in Cloud Computing, pp. 153-178. DOI: 10.1007/978-981-13-2721-6_7. Online publication date: 9-Feb-2019
  • (2017) Enhance the Performance of Associative Memory by Using New Methods. VFAST Transactions on Software Engineering, pp. 49-56. DOI: 10.21015/vtse.v12i3.504. Online publication date: 1-Nov-2017
  • (2017) Experience from Two Years of Visualizing Flash with SSDPlayer. ACM Transactions on Storage, 13:4, pp. 1-24. DOI: 10.1145/3149356. Online publication date: 17-Nov-2017
