DOI: 10.1145/3575693.3575745

NUBA: Non-Uniform Bandwidth GPUs

Published: 30 January 2023

Abstract

The parallel execution model of GPUs enables scaling to hundreds of thousands of threads, a key capability that many modern high-performance applications exploit. GPU vendors hence increase compute and memory resources with every GPU generation, creating the need to efficiently stitch together a plethora of Streaming Multiprocessors (SMs), Last-Level Cache (LLC) slices, and memory controllers while maximizing bandwidth and keeping power consumption and design complexity in check. Conventional GPUs are Uniform Bandwidth Architectures (UBAs): they provide equal bandwidth between all SMs and all LLC slices. UBA GPUs require a uniform high-bandwidth Network-on-Chip (NoC), and our key observation is that provisioning a NoC to match the LLC slice bandwidth incurs a hefty power and complexity overhead. We propose the Non-Uniform Bandwidth Architecture (NUBA), a GPU system architecture aimed at fully utilizing LLC slice bandwidth. A NUBA GPU consists of partitions, each featuring a few SMs and LLC slices as well as a memory controller, connected by a NoC that enables access to remote data. Because the SMs and LLC slices within a partition can be connected with point-to-point links, each partition exposes its complete LLC bandwidth to its local SMs.

Exploiting the potential of NUBA GPUs, however, requires carefully co-designing system software, the compiler, and architectural policies. The critical system software component is our Local-And-Balanced (LAB) page placement policy, which enables the GPU driver to place data in local partitions while avoiding load imbalance. Moreover, we propose Model-Driven Replication (MDR), which identifies read-only shared data with data-flow analysis at compile time. At run time, MDR leverages an architectural mechanism that replicates read-only shared data across LLC slices when this can be done without pressuring cache capacity.
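To make the LAB idea concrete, the following is a minimal first-touch placement sketch. The abstract only states that LAB balances local placement against load imbalance; the `imbalance_threshold` parameter, the least-loaded fallback rule, and the `LabPlacer` name are illustrative assumptions, not the paper's actual driver algorithm.

```python
# Hedged sketch of a Local-And-Balanced (LAB)-style page placement policy.
# A page is preferentially placed in the partition local to the SM that
# first touches it, unless that partition is already over-subscribed
# relative to the average load (the threshold below is an assumption).

class LabPlacer:
    def __init__(self, num_partitions, imbalance_threshold=1.5):
        self.pages = [0] * num_partitions      # pages placed per partition
        self.threshold = imbalance_threshold   # max allowed load ratio

    def place(self, local_partition):
        """Return the partition a newly touched page is mapped to."""
        total = sum(self.pages)
        avg = total / len(self.pages) if total else 0.0
        # Prefer the partition local to the first-touching SM ...
        if avg == 0 or self.pages[local_partition] <= self.threshold * avg:
            target = local_partition
        else:
            # ... but fall back to the least-loaded partition when the
            # local one already holds disproportionately many pages.
            target = min(range(len(self.pages)), key=self.pages.__getitem__)
        self.pages[target] += 1
        return target
```

With a tight threshold, repeated first touches from one partition spill to the least-loaded partition, capturing the "balanced" half of the policy.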
With LAB and MDR, our NUBA GPU improves average performance by 23.1% and 22.2% (and up to 183.9% and 182.4%) compared to iso-resource memory-side and SM-side UBA GPUs, respectively. When the NUBA concept is leveraged to reduce overhead while maintaining similar performance, NUBA reduces NoC power consumption by 12.1× and 9.4×, respectively.
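The compile-time half of MDR can likewise be sketched as a trivial data-flow pass: an array that a kernel loads but never stores is a candidate for replication across LLC slices. The tuple-based toy IR and the `read_only_arrays` helper below are hypothetical; the paper's analysis operates on real compiler IR and defers the capacity-pressure decision to the run-time mechanism.

```python
# Hedged sketch of MDR's compile-time identification of read-only shared
# data. kernel_ir is a toy representation: a list of (opcode, array_name)
# tuples such as ('load', 'A') or ('store', 'B').

def read_only_arrays(kernel_ir):
    """Return the set of arrays that are read but never written."""
    loaded, stored = set(), set()
    for op, array in kernel_ir:
        if op == 'load':
            loaded.add(array)
        elif op == 'store':
            stored.add(array)
    # Replication candidates: referenced at least once, never written.
    return loaded - stored
```

At run time, only these candidates would be eligible for replication, and only when spare LLC capacity makes replication free.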


Cited By

  • (2023) Characterizing Multi-Chip GPU Data Sharing. ACM Transactions on Architecture and Code Optimization 20:4, 1–24. DOI: 10.1145/3629521. Online publication date: 14 December 2023.

Published In

ASPLOS 2023: Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2
January 2023, 947 pages
ISBN: 9781450399166
DOI: 10.1145/3575693

Publisher

Association for Computing Machinery, New York, NY, United States

Author Tags

  1. GPU
  2. Non-Uniform Bandwidth Architecture (NUBA)

Qualifiers

  • Research-article
