On-chip traffic regulation to reduce coherence protocol cost on a microthreaded many-core architecture with distributed caches

Authors:

Chris Jesshope

ACM Transactions on Embedded Computing Systems (TECS), Volume 13, Issue 3s

Article No.: 103, Pages 1 - 21

https://doi.org/10.1145/2567931

Published: 28 March 2014

Abstract

When hardware cache coherence scales to many cores on a chip, oversaturated traffic in the shared memory system may offset the benefit of massive hardware concurrency. In this article, we investigate the cost of a write-update protocol in terms of on-chip memory network traffic, and its adverse effects on system performance, on a multithreaded many-core architecture with distributed caches. We discuss possible software and hardware solutions to alleviate the network pressure. We find that, in the context of massive concurrency, adding a write-merging buffer with 0.46% area overhead to each core improves the performance of applications with good locality and concurrency by 18.74% on average. Other applications also benefit from this addition, achieving a throughput increase of 5.93%. In addition, this improvement indicates that higher levels of concurrency per core can be exploited without impacting performance, thus tolerating latency better and giving higher processor efficiency than other solutions.
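To make the traffic argument concrete, the following is a minimal sketch of the general write-merging idea, not the buffer design evaluated in the article. It assumes 64-byte cache lines, a four-entry buffer, and oldest-first eviction; the names `WriteMergingBuffer` and `count_messages` are hypothetical. Under a write-update protocol without merging, each store would produce its own update message; here, stores to the same line are coalesced and sent as a single update when the line leaves the buffer.

```python
from collections import OrderedDict

LINE_SIZE = 64  # bytes per cache line (assumed for illustration)


class WriteMergingBuffer:
    """Coalesces stores to the same cache line before they reach the network."""

    def __init__(self, capacity=4):
        self.capacity = capacity      # number of lines held at once (assumed)
        self.lines = OrderedDict()    # line index -> {byte offset: value}
        self.messages_sent = 0        # update messages placed on the network

    def store(self, addr, value):
        line, offset = divmod(addr, LINE_SIZE)
        if line not in self.lines:
            if len(self.lines) == self.capacity:
                self._evict_oldest()
            self.lines[line] = {}
        # A later store to the same byte simply overwrites the earlier one.
        self.lines[line][offset] = value

    def _evict_oldest(self):
        # The whole merged line leaves as one update message.
        self.lines.popitem(last=False)
        self.messages_sent += 1

    def flush(self):
        while self.lines:
            self._evict_oldest()


def count_messages(addresses, capacity=4):
    """Return how many update messages a sequence of byte stores generates."""
    buf = WriteMergingBuffer(capacity)
    for addr in addresses:
        buf.store(addr, 0xFF)
    buf.flush()
    return buf.messages_sent
```

In this model, 64 sequential byte stores collapse into one message, while stores that each touch a different line gain nothing from merging, which is consistent with the abstract's observation that applications with good locality benefit most.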

Cited By

Zakkak F, Pratikakis P, Binder W, Schoeberl M (2016). Building a Java™ Virtual Machine for Non-Cache-Coherent Many-core Architectures. Proceedings of the 14th International Workshop on Java Technologies for Real-Time and Embedded Systems, 1-10. Online publication date: 29-Aug-2016.
https://dl.acm.org/doi/10.1145/2990509.2990510
Zakkak F, Pratikakis P, Zheng Y, Binder W, Tůma P (2016). DiSquawk. Proceedings of the 13th International Conference on Principles and Practices of Programming on the Java Platform: Virtual Machines, Languages, and Tools, 1-12. Online publication date: 29-Aug-2016.
https://dl.acm.org/doi/10.1145/2972206.2972212
Rawat T, Shrivastava A, Nebel W, Atienza D (2015). Enabling multi-threaded applications on hybrid shared memory manycore architectures. Proceedings of the 2015 Design, Automation & Test in Europe Conference & Exhibition, 742-747. Online publication date: 9-Mar-2015.
https://dl.acm.org/doi/10.5555/2755753.2755922

Index Terms

On-chip traffic regulation to reduce coherence protocol cost on a microthreaded many-core architecture with distributed caches
1. Computer systems organization
  1. Embedded and cyber-physical systems
    1. Embedded systems
  2. Real-time systems
2. Software and its engineering
  1. Software organization and properties
    1. Contextual software domains
      1. Operating systems
        Process management
        Scheduling
    2. Software system structures
      1. Embedded software
      2. Real-time systems software

Published In

ACM Transactions on Embedded Computing Systems Volume 13, Issue 3s

Special Issue on Design Challenges for Many-Core Processors, Special Section on ESTIMedia'13 and Regular Papers

March 2014

403 pages

ISSN: 1539-9087

EISSN: 1558-3465

DOI: 10.1145/2597868

Editors:
Sandeep K. Shukla (Virginia Tech),
Masoud Daneshtalab (University of Turku, Finland),
Maurizio Palesi (Kore University, Italy),
Juha Plosila (University of Turku, Finland),
Todor Stefanov (Leiden University, The Netherlands)

Copyright © 2014 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Journal Family

ACM Journals for the Design of Smart and Connected Systems

Publication History

Published: 28 March 2014

Accepted: 01 August 2013

Revised: 01 May 2013

Received: 01 December 2012

Published in TECS Volume 13, Issue 3s

Qualifiers

Research-article
Research
Refereed
