DOI: 10.1145/2830772.2830832

Coherence domain restriction on large scale systems

Published: 05 December 2015

    Abstract

    Designing massive-scale cache coherence systems has been an elusive goal. Whether on large-scale GPUs, future thousand-core chips, or million-core warehouse-scale computers, having shared memory, even to a limited extent, improves programmability. This work sidesteps the traditional challenges of creating massively scalable cache coherence by restricting coherence to flexible subsets (domains) of a system's total cores and home nodes. This paper proposes Coherence Domain Restriction (CDR), a novel coherence framework that enables the creation of thousand- to million-core systems that use shared memory while maintaining low storage and energy overhead. Inspired by the observation that the majority of cache lines are shared by only a subset of cores, due to either limited application parallelism or limited page sharing, CDR restricts the coherence domain from global cache coherence to the VM level, application level, or page level. We explore two types of restriction: one that limits the total number of sharers that can access a coherence domain, and one that limits the number and location of home nodes that participate in a coherence domain. Each independent coherence domain tracks only the cores in its domain instead of the whole system, thereby removing the need for a coherence scheme built on top of CDR to scale. Sharer Restriction achieves constant storage overhead as core count increases, while Home Restriction provides localized communication, enabling higher performance. Unlike previous systems, CDR is flexible and does not restrict the location of the home nodes or sharers within a domain. We evaluate CDR in the context of a 1024-core chip and in the novel application of shared memory to a 1,000,000-core warehouse-scale computer. Sharer Restriction results in significant area savings, while Home Restriction in the 1024-core chip and the 1,000,000-core system increases performance by 29% and 23.04x, respectively, compared with global home placement. We implemented the entire CDR framework in a 25-core processor taped out in IBM's 32 nm SOI process and present a detailed area characterization.
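
    The two restrictions described in the abstract can be illustrated with a small model. This is only a sketch under stated assumptions, not the paper's implementation: the `CoherenceDomain` class, its field names, the bit-vector sharer encoding, and the modulo interleaving of lines across home nodes are all hypothetical choices made for illustration. It contrasts a full-map directory entry, which needs one presence bit per core in the system, with an entry that tracks only the members of one domain.

```python
from dataclasses import dataclass
from typing import List


def full_map_bits_per_entry(total_cores: int) -> int:
    """A full-map directory keeps one presence bit per core in the system."""
    return total_cores


@dataclass
class CoherenceDomain:
    """Hypothetical model of one coherence domain (all names are illustrative)."""
    sharers: List[int]     # global core IDs allowed to share the domain's lines
    home_nodes: List[int]  # global node IDs that may act as home for the domain

    def bits_per_entry(self) -> int:
        # One presence bit per domain member, independent of system size.
        return len(self.sharers)

    def home_for(self, line_addr: int) -> int:
        # Assumed policy: interleave cache lines across the domain's home nodes.
        return self.home_nodes[line_addr % len(self.home_nodes)]

    def sharer_mask(self, core_ids: List[int]) -> int:
        # Encode sharers as a bit vector indexed by position within the domain.
        # A core outside the domain raises ValueError, modelling the restriction.
        mask = 0
        for core in core_ids:
            mask |= 1 << self.sharers.index(core)
        return mask


if __name__ == "__main__":
    # Members may sit anywhere in the system; the IDs below are arbitrary.
    domain = CoherenceDomain(sharers=[3, 17, 512, 900], home_nodes=[17, 512])
    for total in (1024, 1_000_000):
        print(total, full_map_bits_per_entry(total), domain.bits_per_entry())
    print(domain.home_for(0x40), bin(domain.sharer_mask([17, 900])))
```

    In this model the per-entry sharer storage depends only on the domain size, so it stays fixed as the system grows from 1024 to 1,000,000 cores, and every home lookup resolves to a node chosen from the domain's own list.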

      Published In

      MICRO-48: Proceedings of the 48th International Symposium on Microarchitecture
      December 2015
      787 pages
      ISBN: 9781450340342
      DOI: 10.1145/2830772
      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Author Tags

      1. cache coherence
      2. home placement
      3. shared memory

      Qualifiers

      • Research-article

      Conference

      MICRO-48
      Acceptance Rates

      MICRO-48 Paper Acceptance Rate: 61 of 283 submissions, 22%
      Overall Acceptance Rate: 484 of 2,242 submissions, 22%

      Article Metrics

      • Downloads (last 12 months): 105
      • Downloads (last 6 weeks): 7

      Cited By

      • (2023) Affinity Alloc: Taming Not-So Near-Data Computing. Proceedings of the 56th Annual IEEE/ACM International Symposium on Microarchitecture, pp. 784-799. DOI: 10.1145/3613424.3623778. Online publication date: 28-Oct-2023
      • (2023) SMAPPIC: Scalable Multi-FPGA Architecture Prototype Platform in the Cloud. Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, pp. 733-746. DOI: 10.1145/3575693.3575753. Online publication date: 27-Jan-2023
      • (2023) Tag-Sharer-Fusion Directory: A Scalable Coherence Directory With Flexible Entry Formats. IEEE Transactions on Parallel and Distributed Systems, 34:1, pp. 262-274. DOI: 10.1109/TPDS.2022.3217956. Online publication date: 1-Jan-2023
      • (2022) täkō. Proceedings of the 49th Annual International Symposium on Computer Architecture, pp. 42-58. DOI: 10.1145/3470496.3527379. Online publication date: 18-Jun-2022
      • (2022) A Case for Second-Level Software Cache Coherency on Many-Core Accelerators. 2022 IEEE International Workshop on Rapid System Prototyping (RSP), pp. 29-35. DOI: 10.1109/RSP57251.2022.10038999. Online publication date: 13-Oct-2022
      • (2021) IntAct: A 96-Core Processor With Six Chiplets 3D-Stacked on an Active Interposer With Distributed Interconnects and Integrated Power Management. IEEE Journal of Solid-State Circuits, 56:1, pp. 79-97. DOI: 10.1109/JSSC.2020.3036341. Online publication date: Jan-2021
      • (2021) Stream Floating: Enabling Proactive and Decentralized Cache Optimizations. 2021 IEEE International Symposium on High-Performance Computer Architecture (HPCA), pp. 640-653. DOI: 10.1109/HPCA51647.2021.00060. Online publication date: Feb-2021
      • (2021) WiDir: A Wireless-Enabled Directory Cache Coherence Protocol. 2021 IEEE International Symposium on High-Performance Computer Architecture (HPCA), pp. 304-317. DOI: 10.1109/HPCA51647.2021.00034. Online publication date: Feb-2021
      • (2021) Developing a Multicore Platform Utilizing Open RISC-V Cores. IEEE Access, 9, pp. 120010-120023. DOI: 10.1109/ACCESS.2021.3108475. Online publication date: 2021
      • (2021) DynaCo: Dynamic Coherence Management for Tiled Manycore Architectures. International Journal of Parallel Programming. DOI: 10.1007/s10766-020-00688-6. Online publication date: 3-Jan-2021
