Softshell: dynamic scheduling on GPUs

Authors: Markus Steinberger, Bernhard Kainz, Bernhard Kerbl, Stefan Hauswiesner, Michael Kenzel, Dieter Schmalstieg

ACM Transactions on Graphics (TOG), Volume 31, Issue 6

Article No.: 161, Pages 1 - 11

https://doi.org/10.1145/2366145.2366180

Published: 01 November 2012

Abstract

In this paper, we present Softshell, a novel execution model for devices composed of multiple processing cores operating in a single instruction, multiple data fashion, such as graphics processing units (GPUs). The Softshell model is intuitive and more flexible than the kernel-based adaptation of the stream processing model, which is currently the dominant model for general purpose GPU computation. Using the Softshell model, algorithms with a relatively low local degree of parallelism can execute efficiently on massively parallel architectures. Softshell has the following distinct advantages: (1) work can be dynamically issued directly on the device, eliminating the need for synchronization with an external source, i.e., the CPU; (2) its three-tier dynamic scheduler supports arbitrary scheduling strategies, including dynamic priorities and real-time scheduling; and (3) the user can influence, pause, and cancel work already submitted for parallel execution. The Softshell processing model thus brings capabilities to GPU architectures that were previously only known from operating-system designs and reserved for CPU programming. To support our claims, we present a publicly available implementation of the Softshell processing model realized on top of CUDA. The benchmarks of this implementation demonstrate that our processing model is easy to use and also performs substantially better than the state-of-the-art kernel-based processing model for problems that have been difficult to parallelize in the past.
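
The three advantages listed in the abstract can be illustrated with a small CPU-side toy. This sketch is not the Softshell API and does not model its three-tier scheduler or any GPU execution. It assumes a single priority queue in which lower values run first. All names (`ToyScheduler`, `issue`, `cancel`, `demo`) are hypothetical and chosen only for illustration. It shows running work issuing new work without returning to an outer host loop, priority-ordered execution, and cancellation of work that was already submitted.

```python
import heapq
import itertools


class ToyScheduler:
    """CPU-side toy, not the Softshell API: a priority queue of work items
    that may issue further work while running and may be cancelled."""

    def __init__(self):
        self._heap = []                    # entries: (priority, seq, item_id)
        self._seq = itertools.count()      # tie-breaker keeps FIFO order
        self._tasks = {}                   # item_id -> callable, pending only
        self._next_id = itertools.count()

    def issue(self, priority, fn):
        """Queue fn with the given priority (assumption: lower runs first)."""
        item_id = next(self._next_id)
        self._tasks[item_id] = fn
        heapq.heappush(self._heap, (priority, next(self._seq), item_id))
        return item_id

    def cancel(self, item_id):
        """Cancel a pending item; returns False if it is no longer pending."""
        return self._tasks.pop(item_id, None) is not None

    def run(self):
        """Drain the queue; items can call issue() or cancel() while running."""
        order = []
        while self._heap:
            _, _, item_id = heapq.heappop(self._heap)
            fn = self._tasks.pop(item_id, None)
            if fn is None:                 # cancelled before it was scheduled
                continue
            order.append(fn(self))
        return order


def demo():
    sched = ToyScheduler()

    def leaf(name):
        return lambda s: name

    def parent(s):
        # New work is issued from inside running work, with no outer host loop.
        s.issue(0, leaf("urgent-child"))
        s.issue(9, leaf("lazy-child"))
        return "parent"

    sched.issue(5, parent)
    doomed = sched.issue(7, leaf("cancelled"))
    sched.issue(6, leaf("sibling"))
    sched.cancel(doomed)                   # submitted work is withdrawn
    return sched.run()
```

Calling `demo()` runs the parent first, then the high-priority child it issued, ahead of work that had been queued earlier with a lower priority. The cancelled item never runs. A real GPU scheduler would additionally have to deal with concurrent queue access and SIMD execution granularity, which this toy ignores.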


Index Terms

Softshell: dynamic scheduling on GPUs
1. Computing methodologies
  1. Computer graphics
    1. Graphics systems and interfaces
  2. Parallel computing methodologies
2. Human-centered computing
  1. Human computer interaction (HCI)
    1. Interaction techniques

Information

Published In

ACM Transactions on Graphics, Volume 31, Issue 6

November 2012

794 pages

ISSN: 0730-0301

EISSN: 1557-7368

DOI: 10.1145/2366145

Copyright © 2012 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Funding Sources

Austrian Science Fund

Article Metrics

Total Citations: 42

Total Downloads: 939

Downloads (Last 12 months): 44

Downloads (Last 6 weeks): 3

Reflects downloads up to 18 Aug 2024

Cited By

Park S, Hong J, Song J, Kim H, Kim Y, Lee J, Lee I, Chabbi M, Steuwer M (2024) AGAThA: Fast and Efficient GPU Acceleration of Guided Sequence Alignment for Long Read Mapping. Proceedings of the 29th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming, 10.1145/3627535.3638474, (431-444). Online publication date: 2-Mar-2024
https://dl.acm.org/doi/10.1145/3627535.3638474
Zhao C, Gao W, Nie F, Zhou H (2022) A Survey of GPU Multitasking Methods Supported by Hardware Architecture. IEEE Transactions on Parallel and Distributed Systems, 10.1109/TPDS.2021.3115630, 33:6, (1451-1463). Online publication date: 1-Jun-2022
https://doi.org/10.1109/TPDS.2021.3115630
Fey F, Gorlatch S (2021) CPRIC: Collaborative Parallelism for Randomized Incremental Constructions. 2021 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), 10.1109/IPDPSW52791.2021.00081, (490-499). Online publication date: Jun-2021
https://doi.org/10.1109/IPDPSW52791.2021.00081
Mayr B, Weinrauch A, Parger M, Steinberger M (2021) Are van Emde Boas trees viable on the GPU? 2021 IEEE High Performance Extreme Computing Conference (HPEC), 10.1109/HPEC49654.2021.9622837, (1-7). Online publication date: 20-Sep-2021
https://doi.org/10.1109/HPEC49654.2021.9622837
Winter M, Mlakar D, Parger M, Steinberger M, Ayguadé E, Hwu W, Badia R, Hofstee H (2020) Ouroboros. Proceedings of the 34th ACM International Conference on Supercomputing, 10.1145/3392717.3392742, (1-12). Online publication date: 29-Jun-2020
https://dl.acm.org/doi/10.1145/3392717.3392742
Jradi W, do Nascimento H, Martins W (2020) A GPU-Based Parallel Reduction Implementation. High Performance Computing Systems, 10.1007/978-3-030-41050-6_11, (168-182). Online publication date: 14-Feb-2020
https://doi.org/10.1007/978-3-030-41050-6_11
Helal A, Aji A, Chu M, Beckmann B, Feng W (2019) Adaptive Task Aggregation for High-Performance Sparse Solvers on GPUs. 2019 28th International Conference on Parallel Architectures and Compilation Techniques (PACT), 10.1109/PACT.2019.00033, (324-336). Online publication date: Sep-2019
https://doi.org/10.1109/PACT.2019.00033
Moche M, Busse H, Futterer J, Hinestrosa C, Seider D, Brandmaier P, Kolesnik M, Jenniskens S, Blanco Sequeiros R, Komar G, Pollari M, Eibisberger M, Portugaller H, Voglreiter P, Flanagan R, Mariappan P, Reinhardt M (2019) Clinical evaluation of in silico planning and real-time simulation of hepatic radiofrequency ablation (ClinicIMPPACT Trial). European Radiology, 10.1007/s00330-019-06411-5. Online publication date: 30-Aug-2019
https://doi.org/10.1007/s00330-019-06411-5
Hu Y, Rallapalli S, Ko B, Govindan R, Pierre G, Ferreira P, Shrira L (2018) Olympian. Proceedings of the 19th International Middleware Conference, 10.1145/3274808.3274813, (53-65). Online publication date: 26-Nov-2018
https://dl.acm.org/doi/10.1145/3274808.3274813
Kenzel M, Kerbl B, Tatzgern W, Ivanchenko E, Schmalstieg D, Steinberger M (2018) On-the-fly Vertex Reuse for Massively-Parallel Software Geometry Processing. Proceedings of the ACM on Computer Graphics and Interactive Techniques, 10.1145/3233303, 1:2, (1-17). Online publication date: 24-Aug-2018
https://dl.acm.org/doi/10.1145/3233303
