research-article
Open access

GPU Accelerated Path Tracing of Massive Scenes

Published: 27 April 2021
  • Abstract

    This article presents a solution to path tracing of massive scenes on multiple GPUs. Our approach analyzes the memory access pattern of a path tracer and defines how the scene data should be distributed across up to 16 GPUs with minimal effect on performance. The key concept is that the parts of the scene that are accessed most often are replicated on all GPUs.
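The replication idea can be illustrated with a minimal sketch. This is not the authors' implementation: the chunk granularity, the function name `plan_placement`, the `replicate_fraction` parameter, and the round-robin distribution of the remaining chunks are all assumptions made for illustration. It only shows the policy stated above: the most frequently accessed parts of the scene get a copy on every GPU, and everything else is stored once.

```python
import math
from typing import Dict, List


def plan_placement(access_counts: List[int],
                   num_gpus: int,
                   replicate_fraction: float) -> Dict[int, List[int]]:
    """Map each scene-data chunk to the GPU ids that hold a copy of it.

    access_counts[i] is a measured access count for chunk i (hypothetical
    input; how the counts are gathered is outside this sketch).
    """
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    if not 0.0 <= replicate_fraction <= 1.0:
        raise ValueError("replicate_fraction must be in [0, 1]")

    n = len(access_counts)
    num_replicated = math.ceil(n * replicate_fraction)

    # Rank chunks hottest-first; break ties by index so the result is stable.
    order = sorted(range(n), key=lambda i: (-access_counts[i], i))
    hot = set(order[:num_replicated])

    placement: Dict[int, List[int]] = {}
    next_gpu = 0
    for i in range(n):
        if i in hot:
            # Hot chunk: every GPU gets its own local copy.
            placement[i] = list(range(num_gpus))
        else:
            # Cold chunk: stored once, assigned round-robin (an assumption).
            placement[i] = [next_gpu]
            next_gpu = (next_gpu + 1) % num_gpus
    return placement


if __name__ == "__main__":
    counts = [5, 100, 1, 50, 2, 3, 4, 7]  # made-up access counts
    for chunk, gpus in plan_placement(counts, num_gpus=4,
                                      replicate_fraction=0.25).items():
        print(chunk, gpus)
```

In this sketch a GPU that needs a cold chunk it does not own would have to read it from the owning GPU; how such remote reads are served is not modeled here.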
    We propose two methods for maximizing the performance of path tracing when working with partially distributed scene data. Both methods work on the memory management level and therefore path tracer data structures do not have to be redesigned, making our approach applicable to other path tracers with only minor changes in their code. As a proof of concept, we have enhanced the open-source Blender Cycles path tracer.
    The approach was validated on scenes of sizes up to 169 GB. We show that only 1–5% of the scene data needs to be replicated to all machines for such large scenes. On smaller scenes we have verified that the performance is very close to rendering a fully replicated scene. In terms of scalability we have achieved a parallel efficiency of over 94% using up to 16 GPUs.
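For readers unfamiliar with the term, parallel efficiency is commonly defined as the speedup divided by the number of processors, E = T_ref / (N * T_N). The sketch below uses that standard definition. The timings are hypothetical and are not results from the article, and the article may use a different baseline than a single-GPU run (a scene that does not fit on one GPU cannot be timed that way).

```python
def parallel_efficiency(t_reference: float, t_parallel: float, num_gpus: int) -> float:
    """Return speedup / num_gpus, with speedup = t_reference / t_parallel.

    t_reference is the render time of the baseline run (assumed here to be
    a single GPU); t_parallel is the render time on num_gpus GPUs.
    """
    if t_reference <= 0 or t_parallel <= 0:
        raise ValueError("render times must be positive")
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    speedup = t_reference / t_parallel
    return speedup / num_gpus


if __name__ == "__main__":
    # Hypothetical timings in seconds, chosen only to illustrate the formula.
    print(f"{parallel_efficiency(160.0, 10.5, 16):.3f}")
```

An efficiency of 1.0 means perfect linear scaling; values below 1.0 reflect overhead such as remote memory accesses and load imbalance.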


    Cited By

    • (2022) Data Parallel Path Tracing with Object Hierarchies. Proceedings of the ACM on Computer Graphics and Interactive Techniques 5:3 (1-16). DOI: 10.1145/3543861. Online publication date: 27-Jul-2022

    Information

    Published In

    ACM Transactions on Graphics  Volume 40, Issue 2
    April 2021
    174 pages
    ISSN:0730-0301
    EISSN:1557-7368
    DOI:10.1145/3454118
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 27 April 2021
    Accepted: 01 January 2021
    Revised: 01 December 2020
    Received: 01 September 2020
    Published in TOG Volume 40, Issue 2

    Author Tags

    1. Multi GPU path tracing
    2. NVLink
    3. CUDA unified memory
    4. data distributed path tracing
    5. distributed shared memory path tracing

    Qualifiers

    • Research-article
    • Refereed

    Funding Sources

    • Ministry of Education, Youth, and Sports from the Large Infrastructures for Research, Experimental Development, and Innovations project “e-Infrastructure CZ–LM2018140.”
