DOI: 10.1145/3613424.3623782
Snake: A Variable-length Chain-based Prefetching for GPUs

Published: 08 December 2023

  • Abstract

    Graphics Processing Units (GPUs) rely on their memory hierarchy and Thread-Level Parallelism (TLP) to tolerate off-chip memory latency, a significant bottleneck for memory-bound applications. However, parallel threads generate a large number of memory requests, which increases average memory latency and degrades cache performance through contention. Prefetching is an effective technique for reducing memory access latency, and prior research shows the positive impact of stride-based prefetching on GPU performance. Existing prefetching methods, however, rely only on fixed strides. To address this limitation, this paper proposes Snake, a new prefetching technique built on chains of variable strides combined with throttling and memory-decoupling strategies. Snake achieves 80% coverage and 75% accuracy in prefetching demand memory requests, improving overall GPU performance and energy consumption by 17% for memory-bound General-Purpose Graphics Processing Unit (GPGPU) applications.
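    To make the general idea concrete, the following is a minimal software sketch of a stride-chain prefetcher, not the paper's actual hardware mechanism. All names and parameters (`ChainPrefetcher`, `chain_len`, `degree`) are hypothetical, and the sketch assumes `chain_len` matches the period of the recurring stride pattern, which a real design would have to detect dynamically.

    ```python
    # Toy model of variable-stride chain prefetching (illustrative only).
    # Per instruction address (PC), it remembers a short chain of recent
    # strides and replays that chain to predict upcoming addresses.
    from collections import defaultdict, deque

    class ChainPrefetcher:
        def __init__(self, chain_len=4, degree=2):
            self.degree = degree                  # prefetches issued per access
            self.last_addr = {}                   # PC -> last address seen
            # PC -> bounded chain of recent strides (oldest stride is next up
            # when chain_len equals the pattern period)
            self.chains = defaultdict(lambda: deque(maxlen=chain_len))

        def access(self, pc, addr):
            """Record a demand access; return predicted prefetch addresses."""
            prefetches = []
            if pc in self.last_addr:
                chain = self.chains[pc]
                chain.append(addr - self.last_addr[pc])  # observed stride
                # Walk the stride chain cyclically to project future addresses.
                next_addr = addr
                for i in range(self.degree):
                    next_addr += chain[i % len(chain)]
                    prefetches.append(next_addr)
            self.last_addr[pc] = addr
            return prefetches
    ```

    For an alternating +4/+8 access pattern (addresses 0, 4, 12, 16, 24, ...) with `chain_len=2`, the learned chain is the repeating stride pair, so after warm-up each access yields the next two addresses in the sequence. The fixed-stride prefetchers the paper contrasts against would correspond to `chain_len=1`.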


    Cited By

    • (2024) Cross-Core Data Sharing for Energy-Efficient GPUs. ACM Transactions on Architecture and Code Optimization. DOI: 10.1145/3653019. Online publication date: 18 March 2024.


      Published In

      MICRO '23: Proceedings of the 56th Annual IEEE/ACM International Symposium on Microarchitecture
      October 2023
      1528 pages
      ISBN:9798400703294
      DOI:10.1145/3613424

      Publisher

      Association for Computing Machinery, New York, NY, United States


      Author Tags

      1. GPU
      2. On-Chip Memory
      3. Performance
      4. Prefetching

      Qualifiers

      • Research-article
      • Research
      • Refereed limited

      Conference

      MICRO '23

      Acceptance Rates

      Overall Acceptance Rate 484 of 2,242 submissions, 22%


      Article Metrics

      • Downloads (last 12 months): 447
      • Downloads (last 6 weeks): 77

      Reflects downloads up to 27 Jul 2024

