DOI: 10.1145/3123939.3123968
research-article

GPUpd: a fast and scalable multi-GPU architecture using cooperative projection and distribution

Published: 14 October 2017

Abstract

    Graphics Processing Unit (GPU) vendors have been scaling single-GPU architectures to satisfy ever-increasing user demand for faster graphics processing. However, as further scaling of single-GPU architectures becomes extremely difficult, vendors are aiming to scale performance by using multiple GPUs simultaneously, connected by newly developed, fast inter-GPU networks (e.g., NVIDIA NVLink, AMD XDMA). With fast inter-GPU networks, split frame rendering (SFR) becomes promising: it improves both frame rate and single-frame latency by assigning disjoint regions of a frame to different GPUs. Unfortunately, the scalability of current SFR implementations is seriously limited because they perform a large amount of redundant computation across GPUs.
    This paper proposes GPUpd, a novel multi-GPU architecture for fast and scalable SFR. With small hardware extensions, GPUpd introduces a new graphics pipeline stage called Cooperative Projection & Distribution (C-PD), in which all GPUs cooperatively project 3D objects onto the 2D screen and efficiently redistribute the objects to their corresponding GPUs. C-PD not only eliminates the redundant computation among GPUs, but also incurs minimal inter-GPU network traffic by transferring object IDs instead of mid-pipeline outcomes between GPUs. To further reduce redistribution overheads, GPUpd minimizes inter-GPU synchronization through batching and runahead execution of draw commands. Our detailed cycle-level simulations with 8 real-world game traces show that GPUpd achieves a geomean speedup of 4.98X in single-frame latency with 16 GPUs, whereas current SFR implementations achieve a geomean speedup of only 3.07X, which saturates at 4 or more GPUs.
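    The C-PD idea described above can be illustrated with a minimal software sketch. It is not the paper's hardware design: the vertical-strip screen partition, the round-robin assignment of primitives to projecting GPUs, the simple perspective projection, and all names (`Triangle`, `project`, `owners`, `cooperative_projection_and_distribution`) are assumptions made for illustration. It only shows the two properties the abstract states: each primitive is projected once rather than by every GPU, and only primitive IDs cross between GPUs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Triangle:
    prim_id: int     # hypothetical ID, assumed to increase in draw order
    vertices: tuple  # three (x, y, z) camera-space points with z > 0


def project(v, width, height):
    """Perspective-project a camera-space point to pixel coordinates."""
    x, y, z = v
    return (x / z * 0.5 + 0.5) * width, (y / z * 0.5 + 0.5) * height


def owners(tri, width, height, num_gpus):
    """GPUs whose vertical screen strip overlaps the triangle's 2D extent."""
    xs = [project(v, width, height)[0] for v in tri.vertices]
    strip = width / num_gpus  # assumed partition: equal vertical strips
    lo = max(0, int(min(xs) // strip))
    hi = min(num_gpus - 1, int(max(xs) // strip))
    return range(lo, hi + 1)  # empty when the triangle is off-screen


def cooperative_projection_and_distribution(triangles, width, height, num_gpus):
    """Return, per destination GPU, the primitive IDs it must render."""
    # outboxes[src][dst] holds the IDs that GPU src sends to GPU dst.
    outboxes = [[[] for _ in range(num_gpus)] for _ in range(num_gpus)]
    # Projection: primitives are dealt round-robin (an assumption), so each
    # one is projected by exactly one GPU instead of by all of them.
    for i, tri in enumerate(triangles):
        src = i % num_gpus
        for dst in owners(tri, width, height, num_gpus):
            outboxes[src][dst].append(tri.prim_id)
    # Distribution: only IDs are exchanged; sorting restores draw order
    # under the assumption that IDs increase in submission order.
    return [
        sorted(pid for src in range(num_gpus) for pid in outboxes[src][dst])
        for dst in range(num_gpus)
    ]
```

    In this sketch, a triangle that straddles a strip boundary appears in the ID list of both neighbouring GPUs, while an off-screen triangle is sent to none.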

        Published In

        MICRO-50 '17: Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture
        October 2017
        850 pages
        ISBN: 9781450349529
        DOI: 10.1145/3123939

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Author Tags

        1. graphics pipeline
        2. graphics processing units (GPUs)
        3. multi-GPU systems
        4. split frame rendering (SFR)

        Qualifiers

        • Research-article

        Funding Sources

        • Samsung Research

        Conference

        MICRO-50
        Acceptance Rates

        Overall Acceptance Rate 484 of 2,242 submissions, 22%
