research-article

Locality-oblivious cache organization leveraging single-cycle multi-hop NoCs

Authors:

Woo-Cheol Kwon,

Tushar Krishna,

Li-Shiuan Peh

ACM SIGARCH Computer Architecture News, Volume 42, Issue 1

Pages 715 - 728

https://doi.org/10.1145/2654822.2541976

Published: 24 February 2014

Abstract

Locality has always been a critical factor in on-chip data placement on CMPs, as accessing further-away caches has in the past been more costly than accessing nearby ones. Substantial research on locality-aware designs has thus focused on keeping a copy of the data private. However, this complicates the problem of data tracking and search/invalidation: tracking the state of a line at all on-chip caches in a directory and performing full-chip broadcasts are both non-scalable and extremely expensive solutions. In this paper, we make the case for Locality-Oblivious Cache Organization (LOCO), a CMP cache organization that leverages the on-chip network to create virtual single-cycle paths between distant caches, thus redefining the notion of locality. LOCO is a clustered cache organization that supports both homogeneous and heterogeneous cluster sizes and provides near single-cycle access to data anywhere within the cluster, just like a private cache. Globally, LOCO dynamically creates a virtual mesh connecting all the clusters and performs an efficient global data search and migration over this virtual mesh, without having to resort to full-chip broadcasts or expensive directory lookups. Trace-driven and full-system simulations running SPLASH-2 and PARSEC benchmarks show that LOCO improves application run time by up to 44.5% over baseline private and shared caches.

Index Terms

Locality-oblivious cache organization leveraging single-cycle multi-hop NoCs
1. Computer systems organization
  1. Architectures
    1. Parallel architectures
      1. Multiple instruction, multiple data

Published In

ACM SIGARCH Computer Architecture News Volume 42, Issue 1

ASPLOS '14

March 2014

729 pages

ISSN:0163-5964

DOI:10.1145/2654822

Editor:
Doug DeGroot

ASPLOS '14: Proceedings of the 19th international conference on Architectural support for programming languages and operating systems
February 2014
780 pages
ISBN:9781450323055
DOI:10.1145/2541940
General Chairs: Rajeev Balasubramonian, University of Utah; Al Davis, University of Utah
Program Chair: Sarita Adve, University of Illinois at Urbana-Champ

Copyright © 2014 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States
