DOI: 10.1145/3307650.3322235

Adaptive memory-side last-level GPU caching

Published: 22 June 2019

Abstract

Emerging GPU applications exhibit increasingly high computation demands, which have led GPU manufacturers to build GPUs with an increasingly large number of streaming multiprocessors (SMs). Providing data to the SMs at high bandwidth puts significant pressure on the memory hierarchy and the Network-on-Chip (NoC). Current GPUs typically partition the memory-side last-level cache (LLC) into equally-sized slices that are shared by all SMs. Although a shared LLC typically results in a lower miss rate, we find that for workloads with high degrees of data sharing across SMs, a private LLC leads to a significant performance advantage because of the increased bandwidth to cache lines replicated across different LLC slices.
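To make the shared-versus-private contrast concrete, the sketch below (in Python, purely illustrative; the slice count, line size, and mapping functions are assumptions, not the paper's configuration) shows how a shared LLC sends every SM to the same slice for a given cache line, whereas a private organization maps each SM to its local slice, so a read-only shared line ends up replicated across slices:

NUM_SLICES = 8     # assumed number of memory-side LLC slices
LINE_SIZE = 128    # assumed cache-line size in bytes

def shared_slice(addr):
    # Shared LLC: the line address alone selects the slice, so all SMs
    # touching this line contend for the single copy in that slice.
    return (addr // LINE_SIZE) % NUM_SLICES

def private_slice(sm_id):
    # Private LLC: an SM's requests go to its local slice, so a read-only
    # line shared by many SMs gets replicated across slices and can be
    # served from all of them in parallel.
    return sm_id % NUM_SLICES

addr = 0x4000  # one read-only line accessed by 8 SMs
print({sm: shared_slice(addr) for sm in range(8)})   # every SM maps to the same slice
print({sm: private_slice(sm) for sm in range(8)})    # copies spread over all slices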
In this paper, we propose adaptive memory-side last-level GPU caching to boost performance for sharing-intensive workloads that need high bandwidth to read-only shared data. Adaptive caching leverages a lightweight performance model that balances increased LLC bandwidth against increased miss rate under private caching. In addition to improving performance for sharing-intensive workloads, adaptive caching also saves energy in a (co-designed) hierarchical two-stage crossbar NoC by power-gating and bypassing the second stage if the LLC is configured as a private cache. Our experimental results using 17 GPU workloads show that adaptive caching improves performance by 28.1% on average (up to 38.1%) compared to a shared LLC for sharing-intensive workloads. In addition, adaptive caching reduces NoC energy by 26.6% on average (up to 29.7%) and total system energy by 6.1% on average (up to 27.2%) when configured as a private cache. Finally, we demonstrate through a GPU NoC design space exploration that a hierarchical two-stage crossbar is both more power- and area-efficient than full and concentrated crossbars with the same bisection bandwidth, thus providing a low-cost cooperative solution to exploit workload sharing behavior in memory-side last-level caches.
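The abstract does not spell out the lightweight performance model, so the following is only a minimal first-order sketch of the kind of decision it makes: choose private caching when the bandwidth gained from replicated read-only data outweighs the cost of the higher miss rate it causes. The inputs and the AMAT-style cost estimate are assumptions for illustration, not the paper's actual model.

def choose_llc_mode(shared_miss_rate, private_miss_rate,
                    llc_access_time, dram_access_time, private_bw_gain):
    # Crude average-memory-access-time comparison (an assumption, not the
    # paper's model): private caching divides the effective LLC access cost
    # by the bandwidth gained from replicated lines, but pays DRAM latency
    # for the extra misses it introduces.
    shared_cost = llc_access_time + shared_miss_rate * dram_access_time
    private_cost = (llc_access_time / private_bw_gain
                    + private_miss_rate * dram_access_time)
    return "private" if private_cost < shared_cost else "shared"

# Example: replication roughly doubles effective LLC bandwidth while the miss
# rate barely rises, so the model selects the private configuration.
print(choose_llc_mode(shared_miss_rate=0.10, private_miss_rate=0.12,
                      llc_access_time=100.0, dram_access_time=300.0,
                      private_bw_gain=2.0))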

Published In

ISCA '19: Proceedings of the 46th International Symposium on Computer Architecture
June 2019
849 pages
ISBN: 9781450366694
DOI: 10.1145/3307650

In-Cooperation

  • IEEE-CS\DATC: IEEE Computer Society

Publisher

Association for Computing Machinery

New York, NY, United States


Qualifiers

  • Research-article

Conference

ISCA '19

Acceptance Rates

ISCA '19 Paper Acceptance Rate: 62 of 365 submissions (17%)
Overall Acceptance Rate: 543 of 3,203 submissions (17%)


