UMH: A Hardware-Based Unified Memory Hierarchy for Systems with Multiple Discrete GPUs

Authors:

Amir Kavyan Ziabari,

José L. Abellán,

David Kaeli

ACM Transactions on Architecture and Code Optimization (TACO), Volume 13, Issue 4

Article No.: 35, Pages 1 - 25

https://doi.org/10.1145/2996190

Published: 02 December 2016

Abstract

In this article, we describe how to ease memory management between a Central Processing Unit (CPU) and one or more discrete Graphics Processing Units (GPUs) by architecting a novel hardware-based Unified Memory Hierarchy (UMH). With UMH, a GPU accesses CPU memory only if it does not find the required data in the directories associated with its high-bandwidth memory, or if the NMOESI coherency protocol limits access to that data. Using UMH with NMOESI improves the performance of a CPU-multiGPU system by at least 1.92× compared with alternative software-based approaches. It also allows the CPU to access GPU-modified data at least 13× faster.
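
The access rule stated in the abstract can be pictured with a small sketch. This is a toy model under stated assumptions, not the article's hardware design: the `State` enum, the `READABLE` and `WRITABLE` sets, and the `serve_gpu_access` function are hypothetical names, and the permission sets are a textbook MOESI-style simplification (with N assumed to be a non-coherent state that permits local reads and writes), not rules taken from the article.

```python
from enum import Enum


class State(Enum):
    """Block states named after the letters in NMOESI (meanings assumed)."""
    N = "non-coherent"
    M = "modified"
    O = "owned"
    E = "exclusive"
    S = "shared"
    I = "invalid"


# Assumption: any valid copy may be read locally.
READABLE = {State.N, State.M, State.O, State.E, State.S}
# Assumption: only exclusive-style or non-coherent copies may be written locally.
WRITABLE = {State.N, State.M, State.E}


def serve_gpu_access(directory, block, is_write):
    """Return where a GPU access is served from in this toy model.

    `directory` maps block addresses to the state recorded in the
    directory of the GPU's high-bandwidth memory (HBM). A block that
    is absent is treated as invalid, i.e. a directory miss.
    """
    state = directory.get(block, State.I)
    allowed = WRITABLE if is_write else READABLE
    # Served locally only when the block is present and the protocol permits it.
    return "hbm" if state in allowed else "cpu_memory"
```

In this model, a read of a block held in the shared state stays in HBM, while a write to the same block, or any access to a block missing from the directory, falls through to CPU memory.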

Index Terms

1. Computer systems organization
  1. Architectures
    1. Other architectures
      1. Heterogeneous (hybrid) systems
2. Hardware
  1. Integrated circuits

Published In

ACM Transactions on Architecture and Code Optimization Volume 13, Issue 4

December 2016

648 pages

ISSN:1544-3566

EISSN:1544-3973

DOI:10.1145/3012405

Editor:
Koen De Bosschere
Ghent University

Copyright © 2016 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 02 December 2016

Accepted: 01 September 2016

Revised: 01 August 2016

Received: 01 May 2016

Published in TACO Volume 13, Issue 4

Qualifiers

Research-article
Research
Refereed

Funding Sources

NRF
Spanish MINECO
