research-article

Open access

Efficient Nearest-Neighbor Data Sharing in GPUs

Authors:

Negin Nematollahi,

Mohammad Sadrosadati,

Hajar Falahati,

Marzieh Barkhordar,

Mario Paulo Drumond,

Hamid Sarbazi-Azad,

Babak Falsafi

ACM Transactions on Architecture and Code Optimization (TACO), Volume 18, Issue 1

Article No.: 6, Pages 1 - 26

https://doi.org/10.1145/3429981

Published: 30 December 2020

Abstract

Stencil codes (a.k.a. nearest-neighbor computations) are widely used in image processing, machine learning, and scientific applications. Stencil codes incur nearest-neighbor data exchange because the value of each point in the structured grid is calculated as a function of its own value and the values of a subset of its nearest-neighbor points. When running on Graphics Processing Units (GPUs), stencil codes exhibit a high degree of data sharing between nearest-neighbor threads. This sharing is typically implemented through shared memories, shuffle instructions, and on-chip caches, and it often incurs performance overheads due to redundant memory accesses.
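As a minimal illustration of the pattern described above, the following sketch computes a one-dimensional three-point stencil and counts how often each input point is read. It is not taken from the article: the function name `stencil_1d`, the averaging rule, and the sample grid are assumptions chosen only to show why neighboring "threads" end up loading the same data.

```python
# Illustrative 1D three-point stencil (hypothetical example, not from the article).
from collections import Counter


def stencil_1d(grid):
    """Return (output, reads).

    Each interior point becomes the mean of itself and its two nearest
    neighbors; `reads` counts how many times each input index is loaded.
    """
    reads = Counter()
    out = list(grid)  # boundary points are copied unchanged
    for i in range(1, len(grid) - 1):  # one conceptual "thread" per interior point
        for j in (i - 1, i, i + 1):
            reads[j] += 1  # every thread loads its own point and both neighbors
        out[i] = (grid[i - 1] + grid[i] + grid[i + 1]) / 3.0
    return out, reads


out, reads = stencil_1d([0.0, 3.0, 6.0, 9.0, 12.0])
print(out)             # updated grid
print(dict(reads))     # per-index load counts
```

In this toy grid the middle point is loaded by three different threads, which is the kind of redundancy the abstract attributes to nearest-neighbor sharing.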

In this article, we propose Neighbor Data (NeDa), a direct nearest-neighbor data sharing mechanism that uses two registers embedded in each streaming processor (SP) core that can be accessed by nearest-neighbor SP cores. The registers are compiler-allocated and serve as a data exchange mechanism to eliminate nearest-neighbor shared accesses. NeDa is embedded carefully with local wires between SP cores so as to minimize the impact on density. We place and route NeDa in an open-source GPU and show a small area overhead of 1.3%. Cycle-accurate simulation indicates an average performance improvement of 21.8% and a power reduction of up to 18.3% for stencil codes in standard General-Purpose Graphics Processing Unit (GPGPU) benchmark suites. We show that NeDa’s performance is within 13.2% of an ideal GPU with no overhead for nearest-neighbor data exchange.
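The idea of exchanging data through neighbor-visible registers can be sketched with a toy software model. This is only a conceptual illustration under stated assumptions: the abstract says each SP has two registers readable by its nearest neighbors, but the names `to_left` and `to_right`, the way they are filled, and the three-point averaging rule are hypothetical and do not describe the actual NeDa hardware or compiler support.

```python
# Toy model only: register names and usage are assumptions, not the NeDa design.
def neighbor_exchange_stencil(grid):
    """Three-point mean in which each lane loads its own point once and
    obtains neighbor values from per-lane exchange registers, not memory."""
    n = len(grid)
    memory_loads = 0
    # Two hypothetical neighbor-visible registers per lane.
    to_left = [None] * n
    to_right = [None] * n
    own = [None] * n
    for lane in range(n):
        own[lane] = grid[lane]      # the only memory access made by this lane
        memory_loads += 1
        to_left[lane] = own[lane]   # readable by lane - 1
        to_right[lane] = own[lane]  # readable by lane + 1
    out = list(own)  # boundary lanes keep their own value
    for lane in range(1, n - 1):
        left = to_right[lane - 1]   # value published by the left neighbor
        right = to_left[lane + 1]   # value published by the right neighbor
        out[lane] = (left + own[lane] + right) / 3.0
    return out, memory_loads


out, loads = neighbor_exchange_stencil([0.0, 3.0, 6.0, 9.0, 12.0])
print(out, loads)
```

In this model each point is fetched from memory exactly once, and all neighbor values travel through the exchange registers; the real mechanism, its timing, and its register allocation are described in the full article.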

Cited By

Darabi S, Yousefzadeh-Asl-Miandoab E, Akbarzadeh N, Falahati H, Lotfi-Kamran P, Sadrosadati M, Sarbazi-Azad H. (2022). OSM: Off-Chip Shared Memory for GPUs. IEEE Transactions on Parallel and Distributed Systems 33:12 (3415-3429). Online publication date: 1-Dec-2022. https://doi.org/10.1109/TPDS.2022.3154315

Darabi S, Sadrosadati M, Akbarzadeh N, Lindegger J, Hosseini M, Park J, Gomez-Luna J, Mutlu O, Sarbazi-Azad H. (2022). Morpheus: Extending the Last Level Cache Capacity in GPU Systems Using Idle GPU Core Resources. 2022 55th IEEE/ACM International Symposium on Microarchitecture (MICRO) (228-244). Online publication date: Oct-2022. https://doi.org/10.1109/MICRO56248.2022.00029

Index Terms

Efficient Nearest-Neighbor Data Sharing in GPUs
1. Computer systems organization
  1. Architectures
    1. Parallel architectures
      1. Single instruction, multiple data

Published In

ACM Transactions on Architecture and Code Optimization Volume 18, Issue 1

March 2021

402 pages

ISSN:1544-3566

EISSN:1544-3973

DOI:10.1145/3446348

Editor:
David Kaeli
Northeastern University, USA

Copyright © 2020 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 30 December 2020

Accepted: 01 October 2020

Revised: 01 September 2020

Received: 01 March 2020

Published in TACO Volume 18, Issue 1

Qualifiers

Research-article
Research
Refereed
