DOI: 10.1109/MICRO.2014.55

Multi-GPU System Design with Memory Networks

Published: 13 December 2014

Abstract

GPUs are widely used to accelerate diverse workloads, and multi-GPU systems can provide higher performance by interconnecting multiple discrete GPUs. However, multi-GPU systems face two main communication bottlenecks: accessing remote GPU memory and communicating between the GPUs and the host CPU. Recent advances in multi-GPU programming, including NVIDIA's unified virtual addressing and unified memory, have made programming simpler, but costly remote memory accesses still make multi-GPU programming difficult. To overcome these communication limitations, we propose leveraging a memory network based on hybrid memory cubes (HMCs) to simplify multi-GPU memory management and improve programmability. In particular, we propose scalable kernel execution (SKE), in which multiple GPUs are viewed as a single virtual GPU and a single kernel can be executed across multiple GPUs without modifying the source code. To fully realize the benefits of SKE, we explore alternative memory network designs for a multi-GPU system. We propose a GPU memory network (GMN) to simplify data sharing between the discrete GPUs, and a CPU memory network (CMN) to simplify data communication between the host CPU and the discrete GPUs. These two networks can be combined into a unified memory network (UMN), which significantly reduces the communication bottleneck in multi-GPU systems because both the CPU and the GPUs share the memory network. We evaluate alternative network designs and propose a sliced flattened butterfly topology for the memory network that scales better than previously proposed topologies by removing local HMC channels. In addition, we propose an overlay network organization for the unified memory network that minimizes latency for CPU accesses while providing high bandwidth for the GPUs. We evaluate trade-offs between the different memory network organizations and show how UMN significantly reduces the communication bottleneck in multi-GPU systems.
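As background for the SKE idea, here is a minimal sketch, in plain Python and purely illustrative, of how a single kernel launch over a global grid of thread blocks might be split into contiguous per-GPU sub-ranges so that one unmodified launch transparently spans several GPUs. The helper name `partition_grid` is hypothetical, not an API from the paper.

```python
def partition_grid(num_blocks, num_gpus):
    """Split a 1-D grid of thread blocks into contiguous per-GPU ranges.

    Returns a list of (start, end) half-open block ranges, one per GPU,
    with sizes differing by at most one block.
    """
    base, rem = divmod(num_blocks, num_gpus)
    ranges = []
    start = 0
    for gpu in range(num_gpus):
        count = base + (1 if gpu < rem else 0)  # first `rem` GPUs get one extra block
        ranges.append((start, start + count))
        start += count
    return ranges

print(partition_grid(10, 4))  # → [(0, 3), (3, 6), (6, 8), (8, 10)]
```

A real SKE runtime would additionally have to place data so that most accesses land in each GPU's local HMC stacks; the memory network is what keeps the remaining remote accesses affordable.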
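The sliced flattened butterfly itself is detailed in the paper; as a rough, generic illustration of the trade-off behind flattened-butterfly-style topologies (function names here are hypothetical), a 1-D flattened butterfly fully connects its routers, buying single-hop latency at quadratic link cost compared with a linear chain:

```python
def fbfly_links(n):
    """Links in a 1-D flattened butterfly (complete graph on n routers), diameter 1."""
    return n * (n - 1) // 2

def mesh_links(n):
    """Links in a 1-D mesh (linear chain of n routers), diameter n - 1."""
    return n - 1

for n in (4, 8, 16):
    print(n, fbfly_links(n), mesh_links(n))
# 4 routers:  6 vs 3 links; 16 routers: 120 vs 15 links
```

This is why cost-conscious designs prune links from the full flattened butterfly; the paper's sliced variant does so by removing local HMC channels.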




Published In

MICRO-47: Proceedings of the 47th Annual IEEE/ACM International Symposium on Microarchitecture
December 2014
697 pages
ISBN: 9781479969982

Publisher

IEEE Computer Society

United States


Author Tags

  1. Flattened butterfly
  2. Hybrid Memory Cubes
  3. Memory network
  4. Multi-GPU

Qualifiers

  • Tutorial
  • Research
  • Refereed limited

Conference

MICRO-47

Acceptance Rates

MICRO-47 Paper Acceptance Rate: 53 of 279 submissions, 19%
Overall Acceptance Rate: 484 of 2,242 submissions, 22%

Cited By

  • (2022) "GPU-Based In Situ Visualization for Large-Scale Discrete Element Simulations," Wireless Communications & Mobile Computing. DOI: 10.1155/2022/3485505. Online: 1-Jan-2022.
  • (2022) "NaviSim," Proceedings of the International Conference on Parallel Architectures and Compilation Techniques, pp. 333-345. DOI: 10.1145/3559009.3569666. Online: 8-Oct-2022.
  • (2022) "Locality-Aware Optimizations for Improving Remote Memory Latency in Multi-GPU Systems," Proceedings of the International Conference on Parallel Architectures and Compilation Techniques, pp. 304-316. DOI: 10.1145/3559009.3569649. Online: 8-Oct-2022.
  • (2022) "A traffic-aware memory-cube network using bypassing," Microprocessors & Microsystems, vol. 90. DOI: 10.1016/j.micpro.2022.104471. Online: 1-Apr-2022.
  • (2021) "Ghost routing to enable oblivious computation on memory-centric networks," Proceedings of the 48th Annual International Symposium on Computer Architecture, pp. 930-943. DOI: 10.1109/ISCA52012.2021.00077. Online: 14-Jun-2021.
  • (2019) "Energy Efficient Chip-to-Chip Wireless Interconnection for Heterogeneous Architectures," ACM Transactions on Design Automation of Electronic Systems, vol. 24, no. 5, pp. 1-27. DOI: 10.1145/3340109. Online: 26-Jul-2019.
  • (2019) "Optimal Schedule for All-to-All Personalized Communication in Multiprocessor Systems," ACM Transactions on Parallel Computing, vol. 6, no. 1, pp. 1-29. DOI: 10.1145/3329867. Online: 24-Jun-2019.
  • (2019) "MGPUSim," Proceedings of the 46th International Symposium on Computer Architecture, pp. 197-209. DOI: 10.1145/3307650.3322230. Online: 22-Jun-2019.
  • (2019) "Quantifying the NUMA Behavior of Partitioned GPGPU Applications," Proceedings of the 12th Workshop on General Purpose Processing Using GPUs, pp. 53-62. DOI: 10.1145/3300053.3319420. Online: 13-Apr-2019.
  • (2019) "HiWayLib," Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, pp. 153-166. DOI: 10.1145/3297858.3304032. Online: 4-Apr-2019.
