Adaptive Cache Coherence Mechanisms with Producer–Consumer Sharing Optimization for Chip Multiprocessors

Authors:

Olivier Serres,

Tarek El-Ghazawi

IEEE Transactions on Computers, Volume 64, Issue 2

Pages 316 - 328

https://doi.org/10.1109/TC.2013.217

Published: 01 February 2015

Abstract

In chip multiprocessors (CMPs), maintaining cache coherence can account for a major performance overhead. Write-invalidate protocols, adopted by most CMPs, generate a high number of cache-to-cache misses under producer-consumer sharing patterns. Accordingly, this paper presents three cache coherence mechanisms optimized for CMPs. First, to reduce the coherence misses observed in write-invalidate-based protocols, we propose a dynamic write-update mechanism layered on top of a write-invalidate protocol. This mechanism is triggered specifically upon detection of a producer-consumer sharing pattern. Second, we extend this adaptive protocol with a bandwidth-adaptive mechanism to eliminate the performance degradation that write-updates cause under limited bandwidth. Finally, a proximity-aware mechanism is proposed to extend the base adaptive protocol with latency-based optimizations. Experimental analysis is conducted on a set of scientific applications from the SPLASH-2 and NAS parallel benchmark suites. The proposed mechanisms were shown to reduce coherence misses by up to 48% and, in turn, to improve application performance by up to 30%. The bandwidth-adaptive mechanism is shown to perform well under varying levels of available bandwidth. Results from our proposed proximity-aware extension demonstrated up to 6% performance gain over the base adaptive protocol for 64-core tiled CMP runs. In addition, the analytical model provided good estimates for performance gains from our adaptive protocols.
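
The abstract does not spell out how a producer-consumer pattern is detected or how the bandwidth check is made, so the following is only a toy sketch of the general idea, not the paper's protocol. It assumes a hypothetical per-block directory entry, a made-up detection rule (the same core writes a block again after other cores have read it), and invented names and thresholds (`AdaptiveDirectory`, `pc_threshold`, `util_limit`, `link_utilization`).

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set, Tuple


@dataclass
class DirectoryEntry:
    """Sharing history for one cache block (hypothetical layout)."""
    last_writer: Optional[int] = None
    readers_since_write: Set[int] = field(default_factory=set)
    pc_rounds: int = 0  # consecutive write -> remote-read rounds by one writer


class AdaptiveDirectory:
    """Toy trace-driven model; assumes the directory sees every access."""

    def __init__(self, pc_threshold: int = 2, util_limit: float = 0.8):
        self.pc_threshold = pc_threshold  # assumed detection threshold
        self.util_limit = util_limit      # assumed bandwidth cut-off
        self.entries: Dict[str, DirectoryEntry] = {}

    def on_read(self, block: str, core: int) -> None:
        entry = self.entries.setdefault(block, DirectoryEntry())
        if core != entry.last_writer:
            entry.readers_since_write.add(core)

    def on_write(self, block: str, core: int,
                 link_utilization: float = 0.0) -> Tuple[str, List[int]]:
        entry = self.entries.setdefault(block, DirectoryEntry())
        # Assumed rule: one more producer-consumer round is counted when the
        # previous writer writes again after at least one remote read.
        if core == entry.last_writer and entry.readers_since_write:
            entry.pc_rounds += 1
        else:
            entry.pc_rounds = 0
        consumers = sorted(entry.readers_since_write - {core})
        entry.last_writer = core
        entry.readers_since_write = set()
        # Push updates only for a detected pattern and when the (assumed)
        # utilization estimate leaves headroom; otherwise invalidate.
        if (entry.pc_rounds >= self.pc_threshold
                and link_utilization < self.util_limit):
            return "update", consumers
        return "invalidate", consumers
```

In this sketch, a block written by core 0 and read by core 1 in alternation switches from `"invalidate"` to `"update"` on the second repeated round, and falls back to `"invalidate"` whenever the supplied utilization is at or above `util_limit`. A real design would also have to account for reads that hit in updated copies and never reach the directory, which the toy model ignores.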

Published In

IEEE Transactions on Computers Volume 64, Issue 2

Feb. 2015

297 pages

ISSN:0018-9340

Copyright © 2013.

Publisher

IEEE Computer Society

United States

Cited By

Chirkov G, Wentzlaff D, Gallivan K, Nikolopoulos D, Beivide R, Gallopoulos E (2023). Seizing the Bandwidth Scaling of On-Package Interconnect in a Post-Moore's Law World. Proceedings of the 37th International Conference on Supercomputing, pp. 410-422. doi:10.1145/3577193.3593702. Online publication date: 21-Jun-2023
https://dl.acm.org/doi/10.1145/3577193.3593702
Uma V, Marimuthu R (2023). D-wash – A dynamic workload aware adaptive cache coherance protocol for multi-core processor system. Microelectronics Journal 132:C. doi:10.1016/j.mejo.2022.105675. Online publication date: 1-Feb-2023
https://dl.acm.org/doi/10.1016/j.mejo.2022.105675
Gerzhoy D, Yeung D (2021). Pipelined CPU-GPU Scheduling to Reduce Main Memory Accesses. Proceedings of the International Symposium on Memory Systems, pp. 1-10. doi:10.1145/3488423.3519319. Online publication date: 27-Sep-2021
https://dl.acm.org/doi/10.1145/3488423.3519319
Gade S, Deb S (2021). A Novel Hybrid Cache Coherence with Global Snooping for Many-core Architectures. ACM Transactions on Design Automation of Electronic Systems 27:1, pp. 1-31. doi:10.1145/3462775. Online publication date: 13-Sep-2021
https://dl.acm.org/doi/10.1145/3462775
