On-chip traffic regulation to reduce coherence protocol cost on a microthreaded many-core architecture with distributed caches

Published: 28 March 2014

Abstract

When hardware cache coherence scales to many cores on chip, oversaturated traffic in the shared memory system may offset the benefit of massive hardware concurrency. In this article, we investigate the cost of a write-update protocol in terms of on-chip memory network traffic, and its adverse effects on system performance, on a multithreaded many-core architecture with distributed caches. We discuss possible software and hardware solutions to alleviate the network pressure. We find that, in the context of massive concurrency, introducing a write-merging buffer with 0.46% area overhead to each core improves the performance of applications with good locality and concurrency by 18.74% on average. Other applications also benefit from this addition, with a throughput increase of 5.93%. In addition, this improvement indicates that higher levels of concurrency per core can be exploited without impacting performance, thus tolerating latency better and giving higher processor efficiency than other solutions.
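
As a rough illustration of why merging writes can relieve a write-update protocol, the sketch below counts update messages for a stream of stores with and without a small per-core buffer that coalesces stores to the same cache line. It is a toy model under stated assumptions, not the design evaluated in the article: the 64-byte line size, the four-entry capacity, the least-recently-written eviction policy, and the names `WriteMergingBuffer` and `count_messages` are all hypothetical.

```python
from collections import OrderedDict

LINE_BYTES = 64  # assumed cache-line size; not taken from the article


class WriteMergingBuffer:
    """Toy per-core buffer that coalesces stores to the same cache line."""

    def __init__(self, entries=4):
        self.entries = entries      # assumed capacity in cache lines
        self.lines = OrderedDict()  # line index -> {byte offset: value}
        self.messages = 0           # update messages sent to the network

    def store(self, addr, value):
        line, offset = divmod(addr, LINE_BYTES)
        if line in self.lines:
            # Hit: merge into the pending line, no network traffic yet.
            self.lines.move_to_end(line)
        elif len(self.lines) >= self.entries:
            # Miss on a full buffer: evict the least recently written line.
            self._evict_one()
        self.lines.setdefault(line, {})[offset] = value

    def _evict_one(self):
        # One update message carries all merged bytes of the evicted line.
        self.lines.popitem(last=False)
        self.messages += 1

    def flush(self):
        # Drain every pending line, one message per line.
        self.messages += len(self.lines)
        self.lines.clear()


def count_messages(addresses, entries=4):
    """Return (messages without buffer, messages with buffer)."""
    unbuffered = len(addresses)  # plain write-update: one message per store
    buf = WriteMergingBuffer(entries)
    for addr in addresses:
        buf.store(addr, 0)
    buf.flush()
    return unbuffered, buf.messages


if __name__ == "__main__":
    print(count_messages(list(range(256))))             # byte-wise sequential stores
    print(count_messages([i * 64 for i in range(8)]))   # one store per line
```

In this model, stores with good spatial locality collapse into one message per line, while a stream that touches a new line on every store sees no reduction, which is consistent with the abstract's distinction between applications with good locality and the rest.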

Cited By

  • (2016) Building a Java™ Virtual Machine for Non-Cache-Coherent Many-core Architectures. Proceedings of the 14th International Workshop on Java Technologies for Real-Time and Embedded Systems, pp. 1-10. DOI: 10.1145/2990509.2990510. Online publication date: 29-Aug-2016
  • (2016) DiSquawk. Proceedings of the 13th International Conference on Principles and Practices of Programming on the Java Platform: Virtual Machines, Languages, and Tools, pp. 1-12. DOI: 10.1145/2972206.2972212. Online publication date: 29-Aug-2016
  • (2015) Enabling multi-threaded applications on hybrid shared memory manycore architectures. Proceedings of the 2015 Design, Automation & Test in Europe Conference & Exhibition, pp. 742-747. DOI: 10.5555/2755753.2755922. Online publication date: 9-Mar-2015

Published In

ACM Transactions on Embedded Computing Systems  Volume 13, Issue 3s
Special Issue on Design Challenges for Many-Core Processors, Special Section on ESTIMedia'13 and Regular Papers
March 2014
403 pages
ISSN:1539-9087
EISSN:1558-3465
DOI:10.1145/2597868
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 28 March 2014
Accepted: 01 August 2013
Revised: 01 May 2013
Received: 01 December 2012
Published in TECS Volume 13, Issue 3s

Author Tags

  1. Hardware coherence
  2. distributed cache
  3. many-core system
  4. massive parallelism
  5. on-chip memory network
  6. write combination

Qualifiers

  • Research-article
  • Research
  • Refereed
