GPUpd: a fast and scalable multi-GPU architecture using cooperative projection and distribution

Authors:

Jangwoo Kim

MICRO-50 '17: Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture

Pages 574 - 586

https://doi.org/10.1145/3123939.3123968

Published: 14 October 2017

Abstract

Graphics Processing Unit (GPU) vendors have been scaling single-GPU architectures to satisfy the ever-increasing user demands for faster graphics processing. However, as it gets extremely difficult to further scale single-GPU architectures, the vendors are aiming to achieve the scaled performance by simultaneously using multiple GPUs connected with newly developed, fast inter-GPU networks (e.g., NVIDIA NVLink, AMD XDMA). With fast inter-GPU networks, it is now promising to employ split frame rendering (SFR) which improves both frame rate and single-frame latency by assigning disjoint regions of a frame to different GPUs. Unfortunately, the scalability of current SFR implementations is seriously limited as they suffer from a large amount of redundant computation among GPUs.
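The redundancy mentioned above can be illustrated with a toy model. The sketch below is not the paper's simulator; it assumes a hypothetical setup in which the frame is cut into vertical strips, each primitive is summarized by its screen-space bounding rectangle, and every GPU must project every primitive before it can tell whether the primitive lands in its own strip. The names `Rect`, `split_frame`, and `naive_sfr` are illustrative only.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Rect:
    """Axis-aligned screen rectangle, half-open: [x0, x1) x [y0, y1)."""
    x0: int
    y0: int
    x1: int
    y1: int

    def overlaps(self, other: "Rect") -> bool:
        return (self.x0 < other.x1 and other.x0 < self.x1
                and self.y0 < other.y1 and other.y0 < self.y1)


def split_frame(width: int, height: int, num_gpus: int) -> List[Rect]:
    """Cut the frame into num_gpus disjoint vertical strips, one per GPU."""
    edges = [width * i // num_gpus for i in range(num_gpus + 1)]
    return [Rect(edges[i], 0, edges[i + 1], height) for i in range(num_gpus)]


def naive_sfr(prim_bounds: List[Rect],
              regions: List[Rect]) -> Tuple[int, List[List[int]]]:
    """Toy model: every GPU projects every primitive, then keeps only the
    primitives whose screen-space bounds touch its own region."""
    projections = 0
    kept_per_gpu = []
    for region in regions:
        kept = []
        for prim_id, bounds in enumerate(prim_bounds):
            projections += 1  # geometry work repeated on each GPU
            if bounds.overlaps(region):
                kept.append(prim_id)
        kept_per_gpu.append(kept)
    return projections, kept_per_gpu


# Hypothetical 1920x1080 frame, 4 GPUs, 3 primitives given by screen bounds.
regions = split_frame(1920, 1080, 4)
prims = [Rect(10, 10, 100, 100),
         Rect(400, 0, 600, 50),
         Rect(1800, 500, 1900, 900)]
projections, kept = naive_sfr(prims, regions)
print(projections, kept)
```

In this model the projection count grows as (number of primitives) x (number of GPUs), which is one way to read the scalability limit the abstract describes.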

This paper proposes GPUpd, a novel multi-GPU architecture for fast and scalable SFR. With small hardware extensions, GPUpd introduces a new graphics pipeline stage called Cooperative Projection & Distribution (C-PD) where all GPUs cooperatively project 3D objects to 2D screen and efficiently redistribute the objects to their corresponding GPUs. C-PD not only eliminates the redundant computation among GPUs, but also incurs minimal inter-GPU network traffic by transferring object IDs instead of mid-pipeline outcomes between GPUs. To further reduce the redistribution overheads, GPUpd minimizes inter-GPU synchronizations by implementing batching and runahead-execution of draw commands. Our detailed cycle-level simulations with 8 real-world game traces show that GPUpd achieves a geomean speedup of 4.98X in single-frame latency with 16 GPUs, whereas the current SFR implementations achieve only 3.07X geomean speedup which saturates on 4 or more GPUs.
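As a rough illustration of the cooperative idea, the sketch below splits the projection work across GPUs and forwards only primitive IDs to the GPUs that own the affected screen regions. It is a simplified model under stated assumptions, not the GPUpd hardware design: the frame is cut into vertical strips, each primitive is reduced to a horizontal screen extent, primitives are assigned to projecting GPUs round-robin, and batching and runahead execution are not modeled. The function name `cooperative_distribute` is hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Bounds = Tuple[int, int]  # a primitive's horizontal screen extent [x0, x1)


def cooperative_distribute(prim_bounds: List[Bounds], width: int,
                           num_gpus: int) -> Tuple[int, int, Dict[int, List[int]]]:
    """Toy model: each GPU projects a disjoint slice of the primitives and
    sends only primitive IDs to the GPUs whose strips they overlap."""
    edges = [width * i // num_gpus for i in range(num_gpus + 1)]
    inbox: Dict[int, List[int]] = defaultdict(list)
    projections = 0
    ids_sent = 0
    for src in range(num_gpus):
        # Assumption: round-robin assignment of primitives to projecting GPUs.
        for prim_id in range(src, len(prim_bounds), num_gpus):
            x0, x1 = prim_bounds[prim_id]
            projections += 1  # each primitive is projected exactly once
            for dst in range(num_gpus):
                if x0 < edges[dst + 1] and edges[dst] < x1:
                    inbox[dst].append(prim_id)
                    if dst != src:
                        ids_sent += 1  # only the ID crosses the inter-GPU link
    # Sort received IDs so each GPU sees primitives in submission order.
    owners = {g: sorted(inbox[g]) for g in range(num_gpus)}
    return projections, ids_sent, owners


# Hypothetical 1920-pixel-wide frame, 4 GPUs, 3 primitives.
prims = [(10, 100), (400, 600), (1800, 1900)]
projections, ids_sent, owners = cooperative_distribute(prims, 1920, 4)
print(projections, ids_sent, owners)
```

Here the projection count equals the number of primitives regardless of the GPU count, and the inter-GPU traffic is counted in IDs rather than in transformed geometry, which mirrors the two benefits the abstract attributes to C-PD.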

Cited By

Gothi A, Kolhare N, Kapse Y. (2023). Comparative Study of Various Methods proposed to improve performance of 3D Graphics Pipeline. 2023 First International Conference on Advances in Electrical, Electronics and Computational Intelligence (ICAEECI), 1-6. Online publication date: 19-Oct-2023.
https://doi.org/10.1109/ICAEECI58247.2023.10370813
Ma B, Zhang Z, Li Y, Cai W, Wang G, Liu X. (2022). SPIDER: An Effective, Efficient and Robust Load Scheduler for Real-time Split Frame Rendering. 2022 IEEE International Parallel and Distributed Processing Symposium (IPDPS), 672-682. Online publication date: May-2022.
https://doi.org/10.1109/IPDPS53621.2022.00071
Jaroš M, Říha L, Strakoš P, Špeťko M. (2021). GPU Accelerated Path Tracing of Massive Scenes. ACM Transactions on Graphics 40:2, 1-17. Online publication date: 27-Apr-2021.
https://dl.acm.org/doi/10.1145/3447807

Index Terms

GPUpd: a fast and scalable multi-GPU architecture using cooperative projection and distribution
1. Computer systems organization
  1. Architectures
    1. Distributed architectures
2. Computing methodologies
  1. Computer graphics
    1. Graphics systems and interfaces
      1. Graphics processors

Information

Published In

MICRO-50 '17: Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture

October 2017

850 pages

ISBN: 9781450349529

DOI: 10.1145/3123939

General Chairs: Hillery Hunter (IBM Research), Jaime Moreno (IBM Research)

Program Chairs: Joel Emer (NVIDIA and MIT), Daniel Sanchez (MIT)

Copyright © 2017 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGMICRO: ACM Special Interest Group on Microarchitectural Research and Processing
IEEE-CS\DATC: IEEE Computer Society

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Funding Sources

Samsung Research

Conference

MICRO-50: The 50th Annual IEEE/ACM International Symposium on Microarchitecture

October 14 - 18, 2017

Cambridge, Massachusetts

Acceptance Rates

Overall Acceptance Rate 484 of 2,242 submissions, 22%

Bibliometrics & Citations

Bibliometrics

Article Metrics

Total Citations: 6
Total Downloads: 972
Downloads (Last 12 months): 71
Downloads (Last 6 weeks): 11

Reflects downloads up to 11 Aug 2024

Citations

Cited By

Xie C, Li X, Hu Y, Peng H, Taylor M, Song S, Sherwood T, Berger E, Kozyrakis C. (2021). Q-VR: system-level design for future mobile collaborative virtual reality. Proceedings of the 26th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, 587-599. Online publication date: 19-Apr-2021.
https://dl.acm.org/doi/10.1145/3445814.3446715
Ren X, Lis M. (2021). CHOPIN: Scalable Graphics Rendering in Multi-GPU Systems via Parallel Image Composition. 2021 IEEE International Symposium on High-Performance Computer Architecture (HPCA), 709-722. Online publication date: Feb-2021.
https://doi.org/10.1109/HPCA51647.2021.00065
Xie C, Xin F, Chen M, Song S, Manne S, Hunter H, Altman E. (2019). OO-VR. Proceedings of the 46th International Symposium on Computer Architecture, 53-65. Online publication date: 22-Jun-2019.
https://dl.acm.org/doi/10.1145/3307650.3322247