Software Assisted Hardware Cache Coherence for Heterogeneous Processors

Authors:

Arkaprava Basu,

Sooraj Puthoor,

Bradford M. Beckmann

MEMSYS '16: Proceedings of the Second International Symposium on Memory Systems

Pages 279 - 288

https://doi.org/10.1145/2989081.2989092

Published: 03 October 2016

Abstract

Current trends suggest that future computing platforms will be increasingly heterogeneous. While these heterogeneous processors physically integrate disparate computing elements like CPUs and GPUs on a single chip, their programmability critically depends upon the ability to efficiently support cache coherence and shared virtual memory across tightly-integrated CPUs and GPUs. However, throughput-oriented GPUs easily overwhelm existing hardware coherence mechanisms that long kept the cache hierarchies in multi-core CPUs coherent.

This paper proposes a novel solution called Software Assisted Hardware Coherence (SAHC) to scale cache coherence to future heterogeneous processors. We observe that the system software (operating system and runtime) often has semantic knowledge about how data is shared between the CPU and the GPU. This high-level knowledge can be used to provide cache coherence effectively across throughput-oriented GPUs and latency-sensitive CPUs in a heterogeneous processor. SAHC is therefore a hybrid software-hardware mechanism that uses hardware coherence only when needed and relies on the software's knowledge to filter out most of the unnecessary coherence traffic. Our evaluation suggests that SAHC can often eliminate up to 98-100% of the hardware coherence lookups, resulting in up to a 49% reduction in runtime.

Cited By

Puthoor S, Lipasti M (2023). Turn-based Spatiotemporal Coherence for GPUs. ACM Transactions on Architecture and Code Optimization 20:3 (1-27). DOI: 10.1145/3593054. Online publication date: 19-Jul-2023.
https://dl.acm.org/doi/10.1145/3593054
Muthukrishnan H, Lustig D, Villa O, Wenisch T, Nellans D (2023). FinePack: Transparently Improving the Efficiency of Fine-Grained Transfers in Multi-GPU Systems. 2023 IEEE International Symposium on High-Performance Computer Architecture (HPCA) (516-529). DOI: 10.1109/HPCA56546.2023.10070949. Online publication date: Feb-2023.
https://doi.org/10.1109/HPCA56546.2023.10070949
Muthukrishnan H, Nellans D, Lustig D, Fessler J, Wenisch T, Martínez J, Duato J, John L (2021). Efficient multi-GPU shared memory via automatic optimization of fine-grained transfers. Proceedings of the 48th Annual International Symposium on Computer Architecture (139-152). DOI: 10.1109/ISCA52012.2021.00020. Online publication date: 14-Jun-2021.
https://dl.acm.org/doi/10.1109/ISCA52012.2021.00020

Published In

MEMSYS '16: Proceedings of the Second International Symposium on Memory Systems

October 2016

463 pages

ISBN:9781450343053

DOI:10.1145/2989081

General Chair:
Bruce Jacob
University of Maryland

Copyright © 2016 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article
Research
Refereed limited

Conference

MEMSYS '16

MEMSYS '16: The Second International Symposium on Memory Systems

October 3 - 6, 2016

Alexandria, VA, USA

