DOI: 10.1145/2897937.2897966

A model-driven approach to warp/thread-block level GPU cache bypassing

Published: 05 June 2016

Abstract

The large volume of memory requests issued by massive numbers of threads can easily cause cache contention and cache-miss-related resource congestion on GPUs. This paper proposes a simple yet effective performance model that estimates the impact of cache contention and resource congestion as a function of the number of warps/thread blocks (TBs) that bypass the cache. Building on this model, we design a hardware-based dynamic warp/thread-block-level GPU cache bypassing scheme, which achieves a 1.68x speedup on average over the baseline on a set of memory-intensive benchmarks. Compared to prior work, our scheme achieves an average performance improvement of 21.6% over SWL-best [29] and 11.9% over CBWT-best [4].
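The bypassing decision in the paper is made in hardware, but the model-driven selection idea can be illustrated in a few lines: evaluate an estimated-performance function for each candidate number of bypassing warps and pick the best one. The following Python sketch uses a deliberately toy cost model; the capacity-based hit-rate estimate, `footprint_per_warp`, and the latency constants are all hypothetical illustrations, not the paper's actual model:

```python
def estimated_throughput(n_bypass, n_warps, footprint_per_warp,
                         cache_size, hit_latency=1, miss_latency=200):
    """Toy cost model: warps that use the cache share its capacity;
    bypassing warps always pay the miss latency but relieve contention."""
    n_cached = n_warps - n_bypass
    if n_cached > 0:
        # Fraction of each cached warp's footprint that fits in the cache.
        hit_rate = min(1.0, cache_size / (n_cached * footprint_per_warp))
    else:
        hit_rate = 0.0
    avg_cached_latency = hit_rate * hit_latency + (1 - hit_rate) * miss_latency
    # Throughput proxy: requests per cycle, summed over both warp groups.
    return n_cached / avg_cached_latency + n_bypass / miss_latency

def best_bypass_count(n_warps, footprint_per_warp, cache_size):
    """Pick the bypass count that maximizes the modeled throughput."""
    return max(range(n_warps + 1),
               key=lambda n: estimated_throughput(n, n_warps,
                                                  footprint_per_warp,
                                                  cache_size))
```

In the paper's scheme this search would be driven at run time by hardware-observed behavior; the sketch above is a static illustration of the selection step only.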

References

[1] AMD GCN Architecture white paper, 2012.
[2] A. Bakhoda et al. Analyzing CUDA workloads using a detailed GPU simulator. In Proceedings of ISPASS, 2009.
[3] M. Burtscher et al. A quantitative study of irregular programs on GPUs. In Proceedings of IISWC, 2012.
[4] X. Chen et al. Adaptive cache management for energy-efficient GPU computing. In Proceedings of MICRO, 2014.
[5] S. Che et al. Rodinia: A benchmark suite for heterogeneous computing. In Proceedings of IISWC, 2009.
[6] C. K. Chow. Determination of cache capacity and its matching storage hierarchy. IEEE Transactions on Computers, 1976.
[7] N. Duong et al. Improving cache management policies using dynamic reuse distances. In Proceedings of MICRO, 2012.
[8] J. Gaur et al. Bypass and insertion algorithms for exclusive last-level caches. In Proceedings of ISCA, 2011.
[9] A. González et al. Eliminating cache conflict misses through XOR-based placement functions. In Proceedings of ICS, 1997.
[10] S. Grauer-Gray et al. Auto-tuning a high-level language targeted to GPU codes. In Proceedings of InPar, 2012.
[11] Z. Guz et al. Many-core vs. many-thread machines: Stay away from the valley. IEEE Computer Architecture Letters, 2009.
[12] A. Hartstein et al. Cache miss behavior: is it √2? In Proceedings of Computing Frontiers, 2006.
[13] S. Hong et al. An analytical model for a GPU architecture with memory-level and thread-level parallelism awareness. In Proceedings of ISCA, 2009.
[14] J. Huang et al. GPUMech: GPU performance modeling technique based on interval analysis. In Proceedings of MICRO, 2014.
[15] W. Jia et al. MRPB: Memory request prioritization for massively parallel processors. In Proceedings of HPCA, 2014.
[16] O. Kayıran et al. Neither more nor less: Optimizing thread-level parallelism for GPGPUs. In Proceedings of PACT, 2013.
[17] M. Kharbutli et al. Using prime numbers for cache indexing to eliminate conflict misses. IEE Proceedings - Software, 2004.
[18] A. Li et al. Adaptive and transparent cache bypassing for GPUs. In Proceedings of SC, 2015.
[19] C. Li et al. Locality-driven dynamic GPU cache bypassing. In Proceedings of ICS, 2015.
[20] C. Li et al. Understanding the tradeoffs between software-managed vs. hardware-managed caches in GPUs. In Proceedings of ISPASS, 2014.
[21] J. Leng et al. GPUWattch: Enabling energy optimizations in GPGPUs. In Proceedings of ISCA, 2013.
[22] D. Li et al. Priority-based cache allocation in throughput processors. In Proceedings of HPCA, 2015.
[23] V. Narasiman et al. Improving GPU performance via large warps and two-level warp scheduling. In Proceedings of MICRO, 2011.
[24] NVIDIA Kepler GK110 white paper, 2012.
[25] NVIDIA. CUDA C/C++ SDK code samples, 2011.
[26] NVIDIA's CUDA compute architecture: Fermi, 2009.
[27] NVIDIA Parallel Thread Execution ISA Version 4.2.
[28] M. Qureshi et al. Utility-based cache partitioning: A low-overhead, high-performance, runtime mechanism to partition shared caches. In Proceedings of MICRO, 2006.
[29] T. Rogers et al. Cache-conscious wavefront scheduling. In Proceedings of MICRO, 2012.
[30] I. Singh et al. Cache coherence for GPU architectures. In Proceedings of HPCA, 2013.
[31] Y. Tian et al. Adaptive GPU cache bypassing. In Proceedings of the 8th Workshop on General Purpose Processing using GPUs, 2015.
[32] S. Wilton et al. CACTI: An enhanced cache access and cycle time model. IEEE Journal of Solid-State Circuits, 1996.
[33] X. Xie et al. Coordinated static and dynamic cache bypassing for GPUs. In Proceedings of HPCA, 2015.
[34] Y. Zhang et al. A quantitative performance analysis model for GPU architectures. In Proceedings of HPCA, 2011.



Published In

DAC '16: Proceedings of the 53rd Annual Design Automation Conference
June 2016, 1048 pages
ISBN: 9781450342360
DOI: 10.1145/2897937
© 2016 Association for Computing Machinery. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of the United States government. As such, the United States Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

Publisher

Association for Computing Machinery, New York, NY, United States



    Qualifiers

    • Research-article

    Conference

    DAC '16

    Acceptance Rates

    Overall Acceptance Rate 1,770 of 5,499 submissions, 32%

Article Metrics

• Downloads (last 12 months): 90
• Downloads (last 6 weeks): 8

Reflects downloads up to 23 Dec 2024

Cited By
• (2023) A Survey of Memory-Centric Energy Efficient Computer Architecture. IEEE Transactions on Parallel and Distributed Systems, 34(10):2657-2670, Oct 2023. DOI: 10.1109/TPDS.2023.3297595
• (2023) Boustrophedonic Frames: Quasi-Optimal L2 Caching for Textures in GPUs. Proceedings of the 32nd International Conference on Parallel Architectures and Compilation Techniques, 124-136, 21 Oct 2023. DOI: 10.1109/PACT58117.2023.00019
• (2022) DTexL: Decoupled Raster Pipeline for Texture Locality. 55th IEEE/ACM International Symposium on Microarchitecture (MICRO), 213-227, Oct 2022. DOI: 10.1109/MICRO56248.2022.00028
• (2021) Locality-Based Cache Management and Warp Scheduling for Reducing Cache Contention in GPU. Micromachines, 12(10):1262, 17 Oct 2021. DOI: 10.3390/mi12101262
• (2021) MIPSGPU: Minimizing Pipeline Stalls for GPUs With Non-Blocking Execution. IEEE Transactions on Computers, 70(11):1804-1816, 1 Nov 2021. DOI: 10.1109/TC.2020.3026043
• (2020) MDM: The GPU Memory Divergence Model. 53rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), 1009-1021, Oct 2020. DOI: 10.1109/MICRO50266.2020.00085
• (2019) MAC. Proceedings of the 48th International Conference on Parallel Processing, 1-10, 5 Aug 2019. DOI: 10.1145/3337821.3337867
• (2019) Modeling Emerging Memory-Divergent GPU Applications. IEEE Computer Architecture Letters, 18(2):95-98, 1 Jul 2019. DOI: 10.1109/LCA.2019.2923618
• (2018) Efficient Cache Performance Modeling in GPUs Using Reuse Distance Analysis. ACM Transactions on Architecture and Code Optimization, 15(4):1-24, 19 Dec 2018. DOI: 10.1145/3291051
• (2018) Memory Coalescing for Hybrid Memory Cube. Proceedings of the 47th International Conference on Parallel Processing, 1-10, 13 Aug 2018. DOI: 10.1145/3225058.3225062
