Group-caching for NoC based multicore cache coherent systems

Authors:

Qiao Baojun

DATE '09: Proceedings of the Conference on Design, Automation and Test in Europe

Pages 755 - 760

Published: 20 April 2009

Abstract

Most chip multiprocessors (CMPs) use on-chip networks to connect cores, and they tend to integrate a growing number of simple cores on a single die. Low-radix networks, such as 2D-MESH, are widely used in tiled CMPs because they can be mapped to on-chip networks efficiently. However, low-radix networks suffer from high network latency because of their large diameter. In this paper, we propose a group-caching design for NoC-based multicore cache-coherent systems. In our design, the on-chip L2 banks are organized into multiple groups. Each cache group behaves like a shared L2 cache for the cores inside that group, while coherence between cache groups is maintained by coherence messages. Group-caching also adopts a new cache replacement policy to make more efficient use of the aggregate L2 cache capacity. Because most L2 accesses are served by the local cache group, the hop count is significantly lower than in a banked, shared L2 design. Experimental results based on full-system simulation show that, for 2D-MESH, group-caching improves performance by 2% to 8% over the banked, shared L2 design and reduces network energy consumption by 11% to 13%. The results also show that the communication overhead inside a cache group plays an important role in the performance of group-caching.
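The hop-count argument can be illustrated with a small back-of-the-envelope sketch. It is not the paper's simulation setup: the 4x4 mesh, the 2x2 groups, the uniform spread of blocks over banks, and the function names `hops`, `avg_hops_shared`, and `avg_hops_grouped` are all assumptions made for illustration. It also ignores the coherence messages exchanged between groups, which the abstract identifies as part of the design.

```python
from itertools import product


def hops(a, b):
    """Manhattan distance between two tiles, as under XY routing on a 2D mesh."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def avg_hops_shared(n):
    """Mean core-to-bank distance when a block may reside in any bank
    of an n x n mesh (assumed uniform interleaving across all banks)."""
    tiles = list(product(range(n), repeat=2))
    total = sum(hops(core, bank) for core in tiles for bank in tiles)
    return total / (len(tiles) ** 2)


def avg_hops_grouped(n, g):
    """Mean core-to-bank distance when each core only accesses banks
    inside its own g x g group (assumed uniform within the group)."""
    if n % g != 0:
        raise ValueError("group size must divide mesh size")
    total = 0
    count = 0
    for core in product(range(n), repeat=2):
        # Top-left corner of the group that contains this core.
        gx = (core[0] // g) * g
        gy = (core[1] // g) * g
        for bank in product(range(gx, gx + g), range(gy, gy + g)):
            total += hops(core, bank)
            count += 1
    return total / count


if __name__ == "__main__":
    print("shared, 4x4 mesh:", avg_hops_shared(4))
    print("grouped, 2x2 groups:", avg_hops_grouped(4, 2))
```

Under these assumptions the average distance to an L2 bank drops from 2.5 hops to 1.0 hop. The real benefit depends on how often a request must leave its group, which this sketch does not model.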

Cited By

Hu S, Shi F, Ji W, Chen X, Talpur S (2017). Exploring grouped coherence for clustered hierarchical cache. The Journal of Supercomputing 73:9 (4137-4157). DOI: 10.1007/s11227-017-2024-8. Online publication date: 1-Sep-2017.
https://dl.acm.org/doi/10.1007/s11227-017-2024-8
Li J, Shi L, Xue C, Xu Y (2014). Dual partitioning multicasting for high-performance on-chip networks. Journal of Parallel and Distributed Computing 74:1 (1858-1871). DOI: 10.1016/j.jpdc.2013.07.002. Online publication date: 1-Jan-2014.
https://dl.acm.org/doi/10.1016/j.jpdc.2013.07.002
Huang L, Wang Z, Xiao N, Brunvard E, Stevens K, Cavallaro J, Zhang T (2012). An optimized multicore cache coherence design for exploiting communication locality. Proceedings of the great lakes symposium on VLSI (59-62). DOI: 10.1145/2206781.2206797. Online publication date: 3-May-2012.
https://dl.acm.org/doi/10.1145/2206781.2206797

Published In

DATE '09: Proceedings of the Conference on Design, Automation and Test in Europe

April 2009

1776 pages

ISBN:9783981080155

General Chairs: Luca Benini (University of Bologna, IT), Giovanni De Micheli (EPFL, CH)

Program Chairs: Bashir Al-Hashimi (University of Southampton, UK), Wolfgang Mueller (University of Paderborn, DE)

Sponsors

EDAA: European Design Automation Association
ECSI
EDAC: Electronic Design Automation Consortium
SIGDA: ACM Special Interest Group on Design Automation
The IEEE Computer Society TTTC
The IEEE Computer Society DATC
The Russian Academy of Sciences

Publisher

European Design and Automation Association

Leuven, Belgium

Qualifiers

Research-article

Conference

DATE '09

Sponsor:

EDAA
EDAC
SIGDA
The Russian Academy of Sciences

DATE '09: Design, Automation and Test in Europe

April 20 - 24, 2009

Nice, France

Acceptance Rates

Overall Acceptance Rate 518 of 1,794 submissions, 29%
