research-article

Adaptive Cache Coherence Mechanisms with Producer–Consumer Sharing Optimization for Chip Multiprocessors

Published: 01 February 2015 Publication History
    Abstract

    In chip multiprocessors (CMPs), maintaining cache coherence can account for a major performance overhead. The write-invalidate protocols adopted by most CMPs generate a high number of cache-to-cache misses under producer-consumer sharing patterns. Accordingly, this paper presents three cache coherence mechanisms optimized for CMPs. First, to reduce the coherence misses observed in write-invalidate-based protocols, we propose a dynamic write-update mechanism layered on top of a write-invalidate protocol. This mechanism is triggered only when a producer-consumer sharing pattern is detected. Second, we extend this adaptive protocol with a bandwidth-adaptive mechanism to eliminate the performance degradation that write-updates cause under limited bandwidth. Finally, we propose a proximity-aware mechanism that extends the base adaptive protocol with latency-based optimizations. Experimental analysis is conducted on a set of scientific applications from the SPLASH-2 and NAS parallel benchmark suites. The proposed mechanisms reduce coherence misses by up to 48% and, in turn, speed up application performance by up to 30%. The bandwidth-adaptive mechanism is shown to perform well under varying levels of available bandwidth. Our proximity-aware extension demonstrates up to 6% performance gain over the base adaptive protocol for 64-core tiled CMP runs. In addition, our analytical model provides good estimates of the performance gains from the adaptive protocols.
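To make the idea in the abstract concrete, the following is a minimal Python sketch of a directory entry that switches a cache line from write-invalidate to write-update once a producer-consumer pattern has been seen, and falls back to invalidation when bandwidth is scarce. It is an illustration under stated assumptions, not the paper's protocol: the detection heuristic (the same core writes, another core reads, the same core writes again), the names `DirectoryEntry`, `on_read`, and `on_write`, and the values of `PC_THRESHOLD` and `max_utilization` are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional, Set

PC_THRESHOLD = 2  # assumed number of producer-consumer rounds before switching


@dataclass
class DirectoryEntry:
    """Toy per-line directory state (hypothetical, not the paper's design)."""
    sharers: Set[int] = field(default_factory=set)
    last_writer: Optional[int] = None
    read_since_write: bool = False  # did a different core read after the last write?
    pc_rounds: int = 0              # completed write -> remote read -> same-writer rounds


def on_read(entry: DirectoryEntry, core: int) -> None:
    """Record a read; this toy model assumes the directory observes every read."""
    if entry.last_writer is not None and core != entry.last_writer:
        entry.read_since_write = True
    entry.sharers.add(core)


def on_write(entry: DirectoryEntry, core: int,
             link_utilization: float = 0.0,
             max_utilization: float = 0.75):
    """Return (action, targets), where action is 'update' or 'invalidate'."""
    if core == entry.last_writer and entry.read_since_write:
        entry.pc_rounds += 1   # same producer wrote again after being consumed
    elif core != entry.last_writer:
        entry.pc_rounds = 0    # a different writer breaks the pattern
    entry.last_writer = core
    entry.read_since_write = False

    consumers = entry.sharers - {core}
    # Assumed bandwidth-adaptive rule: only send updates while the link is not saturated.
    use_update = (entry.pc_rounds >= PC_THRESHOLD
                  and link_utilization < max_utilization)
    if use_update:
        entry.sharers.add(core)   # consumers keep their copies and receive new data
        return "update", sorted(consumers)
    entry.sharers = {core}        # default write-invalidate behaviour
    return "invalidate", sorted(consumers)


if __name__ == "__main__":
    line = DirectoryEntry()
    print(on_write(line, 0))                        # first write, no sharers yet
    on_read(line, 1)
    print(on_write(line, 0))                        # one round seen, still invalidating
    on_read(line, 1)
    print(on_write(line, 0))                        # threshold reached, switch to update
    print(on_write(line, 0, link_utilization=0.9))  # congested link, fall back
```

In this sketch the consumer (core 1) stops missing on the line once the entry switches to update mode, which is the kind of coherence miss the abstract says the adaptive protocol targets; the utilization check stands in for the bandwidth-adaptive extension.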




          Published In

          IEEE Transactions on Computers  Volume 64, Issue 2
          Feb. 2015
          297 pages

          Publisher

          IEEE Computer Society

          United States


          Cited By

          • (2023) Seizing the Bandwidth Scaling of On-Package Interconnect in a Post-Moore's Law World. Proceedings of the 37th International Conference on Supercomputing, pp. 410-422. DOI: 10.1145/3577193.3593702. Online publication date: 21-Jun-2023
          • (2023) D-wash – A dynamic workload aware adaptive cache coherance protocol for multi-core processor system. Microelectronics Journal, 132:C. DOI: 10.1016/j.mejo.2022.105675. Online publication date: 1-Feb-2023
          • (2021) Pipelined CPU-GPU Scheduling to Reduce Main Memory Accesses. Proceedings of the International Symposium on Memory Systems, pp. 1-10. DOI: 10.1145/3488423.3519319. Online publication date: 27-Sep-2021
          • (2021) A Novel Hybrid Cache Coherence with Global Snooping for Many-core Architectures. ACM Transactions on Design Automation of Electronic Systems, 27:1, pp. 1-31. DOI: 10.1145/3462775. Online publication date: 13-Sep-2021
