
Improving Multiprocessor Performance with Coarse-Grain Coherence Tracking

Published: 01 May 2005
    Abstract

    To maintain coherence in conventional shared-memory multiprocessor systems, processors first check other processors' caches before obtaining data from memory. This coherence checking adds latency to memory requests and leads to large amounts of interconnect traffic in broadcast-based systems. Our results for a set of commercial, scientific, and multiprogrammed workloads show that on average 67% (and up to 94%) of broadcasts are unnecessary. Coarse-Grain Coherence Tracking is a new technique that supplements a conventional coherence mechanism and optimizes the performance of coherence enforcement. The Coarse-Grain Coherence mechanism monitors the coherence status of large regions of memory and uses that information to avoid unnecessary broadcasts. Coarse-Grain Coherence Tracking is shown to eliminate 55-97% of the unnecessary broadcasts and to improve performance by 8.8% on average (and up to 21.7%).
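The abstract gives no implementation detail, so the following is only a toy Python sketch of the general idea of tracking coherence status per region rather than per cache line. The class name `RegionTracker`, the 1 KB region size, the single "no other sharers" bit per region, and the `others_have_region` flag (a stand-in for the snoop response a real broadcast would return) are all assumptions made for illustration, not the paper's design.

```python
from dataclasses import dataclass, field

REGION_BYTES = 1024  # assumed region size; the abstract does not specify one


@dataclass
class RegionTracker:
    """Toy per-processor tracker of coarse-grain (region-level) coherence status."""
    region_bytes: int = REGION_BYTES
    # region index -> True when no other processor is known to cache the region
    exclusive: dict = field(default_factory=dict)
    broadcasts: int = 0  # misses that had to be broadcast
    avoided: int = 0     # misses sent straight to memory

    def region_of(self, addr: int) -> int:
        return addr // self.region_bytes

    def request(self, addr: int, others_have_region: bool) -> str:
        """Handle a local cache miss to addr.

        others_have_region is a hypothetical stand-in for the snoop response;
        it is only consulted when a broadcast actually happens.
        """
        region = self.region_of(addr)
        if self.exclusive.get(region, False):
            # Region is known to be unshared: skip the broadcast.
            self.avoided += 1
            return "direct"
        # Status unknown or shared: broadcast and record what was learned.
        self.broadcasts += 1
        self.exclusive[region] = not others_have_region
        return "broadcast"

    def external_request(self, addr: int) -> None:
        """Another processor requested a line in this region: it may now be shared."""
        self.exclusive[self.region_of(addr)] = False


tracker = RegionTracker()
first = tracker.request(0x1000, others_have_region=False)   # status unknown
second = tracker.request(0x1040, others_have_region=False)  # same region
tracker.external_request(0x1080)                            # region becomes shared
third = tracker.request(0x10C0, others_have_region=True)
```

In this sketch the first miss to a region pays for a broadcast, and later misses to the same region go directly to memory until another processor touches it. A real design would need more region states and a bounded table with replacement, which this sketch omits.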

    Published In

    ACM SIGARCH Computer Architecture News, Volume 33, Issue 2
    ISCA 2005
    May 2005
    531 pages
    ISSN: 0163-5964
    DOI: 10.1145/1080695

    ISCA '05: Proceedings of the 32nd Annual International Symposium on Computer Architecture
    June 2005
    541 pages
    ISBN: 076952270X

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 01 May 2005
    Published in SIGARCH Volume 33, Issue 2

    Cited By

    • (2023) Fine-grain data classification to filter token coherence traffic. Journal of Parallel and Distributed Computing, 171, pp. 40-53. DOI: 10.1016/j.jpdc.2022.09.004. Online publication date: Jan-2023
    • (2021) Dvé. Proceedings of the 48th Annual International Symposium on Computer Architecture, pp. 526-539. DOI: 10.1109/ISCA52012.2021.00048. Online publication date: 14-Jun-2021
    • (2021) Efficient classification of private memory blocks. Journal of Parallel and Distributed Computing. DOI: 10.1016/j.jpdc.2021.07.005. Online publication date: Jul-2021
    • (2020) TLB-based Block-Grain Classification of Private Data. 2020 28th Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP), pp. 122-130. DOI: 10.1109/PDP50117.2020.00025. Online publication date: Mar-2020
    • (2017) Designs of Low Power Snoop for Multiprocessor System on Chip. Journal of Signal Processing Systems, 88:1, pp. 83-89. DOI: 10.1007/s11265-016-1135-4. Online publication date: 1-Jul-2017
    • (2014) Tree-based scheme for reducing shared cache miss rate leveraging regional, statistical and temporal similarities. IET Computers & Digital Techniques, 8:1, pp. 30-48. DOI: 10.1049/iet-cdt.2011.0066. Online publication date: Jan-2014
    • (2013) Generating efficient data movement code for heterogeneous architectures with distributed-memory. Proceedings of the 22nd International Conference on Parallel Architectures and Compilation Techniques, pp. 375-386. DOI: 10.1109/PACT.2013.6618826. Online publication date: Oct-2013
    • (2013) Built-in fast gather control network for efficient support of coherence protocols. IET Computers & Digital Techniques, 7:2, pp. 69-80. DOI: 10.1049/iet-cdt.2012.0056. Online publication date: Mar-2013
    • (2012) Spatiotemporal Coherence Tracking. Proceedings of the 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture, pp. 341-350. DOI: 10.1109/MICRO.2012.39. Online publication date: 1-Dec-2012
    • (2012) Switch-based packing technique to reduce traffic and latency in token coherence. Journal of Parallel and Distributed Computing, 72:3, pp. 409-423. DOI: 10.1016/j.jpdc.2011.11.010. Online publication date: 1-Mar-2012
