research-article

Locality-oblivious cache organization leveraging single-cycle multi-hop NoCs

Published: 24 February 2014

Abstract

Locality has always been a critical factor in on-chip data placement on CMPs, as accessing further-away caches has in the past been more costly than accessing nearby ones. Substantial research on locality-aware designs has thus focused on keeping a copy of the data private. However, this complicates the problem of data tracking and search/invalidation; tracking the state of a line at all on-chip caches at a directory and performing full-chip broadcasts are both non-scalable and extremely expensive solutions. In this paper, we make the case for Locality-Oblivious Cache Organization (LOCO), a CMP cache organization that leverages the on-chip network to create virtual single-cycle paths between distant caches, thus redefining the notion of locality. LOCO is a clustered cache organization that supports both homogeneous and heterogeneous cluster sizes and provides near single-cycle access to data anywhere within the cluster, just like a private cache. Globally, LOCO dynamically creates a virtual mesh connecting all the clusters and performs an efficient global data search and migration over this virtual mesh, without having to resort to full-chip broadcasts or expensive directory lookups. Trace-driven and full-system simulations running SPLASH-2 and PARSEC benchmarks show that LOCO improves application run time by up to 44.5% over baseline private and shared caches.
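
To make the abstract's point about "redefining the notion of locality" concrete, the following is a minimal sketch of how access latency might scale with distance on a 2D mesh, with and without single-cycle multi-hop paths. It is not taken from the paper: the function names, the `cycles_per_hop` and `hops_per_cycle` parameters, and the numbers used are all hypothetical, and the model ignores router pipelines, path setup, and contention.

```python
from math import ceil

def mesh_hops(src, dst):
    """Manhattan distance between two (x, y) tiles on a 2D mesh."""
    return abs(src[0] - dst[0]) + abs(src[1] - dst[1])

def hop_by_hop_cycles(src, dst, cycles_per_hop=1):
    """Conventional NoC (assumed model): every hop costs cycles_per_hop."""
    return mesh_hops(src, dst) * cycles_per_hop

def multi_hop_cycles(src, dst, hops_per_cycle=4):
    """Single-cycle multi-hop NoC (assumed model): up to hops_per_cycle
    hops are covered in one cycle, so latency grows in coarse steps."""
    return ceil(mesh_hops(src, dst) / hops_per_cycle)

# Hypothetical example: a cache three tiles away in each dimension.
near, far = (0, 0), (3, 3)
print(hop_by_hop_cycles(near, far))   # 6 hops, one cycle each
print(multi_hop_cycles(near, far))    # same 6 hops in ceil(6 / 4) cycles
```

Under these assumptions, any cache within `hops_per_cycle` hops costs a single cycle regardless of where it sits, which is the sense in which a cluster of that size can behave like a private cache.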

    Published In

    ACM SIGARCH Computer Architecture News, Volume 42, Issue 1
    ASPLOS '14
    March 2014
    729 pages
    ISSN:0163-5964
    DOI:10.1145/2654822

    ASPLOS '14: Proceedings of the 19th international conference on Architectural support for programming languages and operating systems
    February 2014
    780 pages
    ISBN:9781450323055
    DOI:10.1145/2541940
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 24 February 2014
    Published in SIGARCH Volume 42, Issue 1

    Author Tags

    1. cache coherence
    2. cmp cache design
    3. locality
    4. multiprocessor
    5. network-on-chip

    Qualifiers

    • Research-article
