Softshell: dynamic scheduling on GPUs

Published: 01 November 2012

Abstract

In this paper we present Softshell, a novel execution model for devices composed of multiple processing cores operating in a single instruction, multiple data fashion, such as graphics processing units (GPUs). The Softshell model is intuitive and more flexible than the kernel-based adaptation of the stream processing model, which is currently the dominant model for general-purpose GPU computation. Using the Softshell model, algorithms with a relatively low local degree of parallelism can execute efficiently on massively parallel architectures. Softshell has the following distinct advantages: (1) work can be dynamically issued directly on the device, eliminating the need for synchronization with an external source, i.e., the CPU; (2) its three-tier dynamic scheduler supports arbitrary scheduling strategies, including dynamic priorities and real-time scheduling; and (3) the user can influence, pause, and cancel work already submitted for parallel execution. The Softshell processing model thus brings capabilities to GPU architectures that were previously known only from operating-system designs and reserved for CPU programming. As a proof of our claims, we present a publicly available implementation of the Softshell processing model realized on top of CUDA. The benchmarks of this implementation demonstrate that our processing model is easy to use and also performs substantially better than the state-of-the-art kernel-based processing model for problems that have been difficult to parallelize in the past.
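The scheduling ideas named in the abstract (work issued from inside running work, dynamic priorities, and cancellation of already submitted work) can be illustrated with a small single-threaded toy. The sketch below is not Softshell's API and does not model its three-tier scheduler or the GPU: `ToyScheduler`, its method names, and the work-item names are all hypothetical, and the whole thing runs on the CPU purely to show the queue semantics. Pausing work is not modeled.

```python
import heapq
import itertools


class ToyScheduler:
    """Hypothetical CPU-side toy: a priority work queue with cancel and re-prioritize."""

    def __init__(self):
        self._heap = []                # entries: [priority, seq, name, fn]
        self._entries = {}             # name -> live entry still in the heap
        self._seq = itertools.count()  # unique tie-breaker, keeps FIFO order
        self.log = []                  # names of work items in execution order

    def submit(self, name, fn, priority):
        """Queue work; lower priority values run first."""
        entry = [priority, next(self._seq), name, fn]
        self._entries[name] = entry
        heapq.heappush(self._heap, entry)

    def cancel(self, name):
        """Cancel queued work lazily by marking its heap entry as dead."""
        entry = self._entries.pop(name, None)
        if entry is None:
            return False
        entry[3] = None
        return True

    def reprioritize(self, name, priority):
        """Change the priority of queued work by cancelling and resubmitting it."""
        entry = self._entries.get(name)
        if entry is None:
            return False
        fn = entry[3]
        self.cancel(name)
        self.submit(name, fn, priority)
        return True

    def run(self):
        """Execute queued work in priority order, skipping cancelled entries."""
        while self._heap:
            _, _, name, fn = heapq.heappop(self._heap)
            if fn is None:
                continue
            del self._entries[name]
            self.log.append(name)
            fn(self)  # running work may submit or cancel other work


def trace(sched):
    # Work issued from inside running work, with no external driver involved.
    sched.submit("shade", lambda s: None, priority=0)
    sched.cancel("denoise")


sched = ToyScheduler()
sched.submit("denoise", lambda s: None, priority=5)
sched.submit("trace", trace, priority=1)
sched.submit("present", lambda s: None, priority=9)
sched.reprioritize("present", 2)
sched.run()
print(sched.log)  # ['trace', 'shade', 'present']
```

In this toy, "trace" runs first, issues "shade" at a higher priority, and cancels "denoise" before it ever starts, while "present" runs earlier than originally requested because its priority was raised after submission. A real device-side scheduler would additionally need concurrent queues and synchronization, which this sketch leaves out.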

Published In

ACM Transactions on Graphics, Volume 31, Issue 6
November 2012
794 pages
ISSN:0730-0301
EISSN:1557-7368
DOI:10.1145/2366145

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. GPU
  2. dynamic parallelism
  3. persistent threads
  4. priority scheduling
  5. priority work queue
  6. real-time scheduling

Qualifiers

  • Research-article
