research-article

PTask: operating system abstractions to manage GPUs as compute devices

Authors:

Christopher J. Rossbach,

Mark Silberstein,

Emmett Witchel

SOSP '11: Proceedings of the Twenty-Third ACM Symposium on Operating Systems Principles

Pages 233 - 248

https://doi.org/10.1145/2043556.2043579

Published: 23 October 2011

Abstract

We propose a new set of OS abstractions to support GPUs and other accelerator devices as first-class computing resources. These new abstractions, collectively called the PTask API, support a dataflow programming model. Because a PTask graph consists of OS-managed objects, the kernel has sufficient visibility and control to provide system-wide guarantees such as fairness and performance isolation, and can streamline data movement in ways that are impossible under current GPU programming models.
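
To make the dataflow model mentioned above concrete, the following is a minimal sketch of a graph of tasks connected by queues, in which a task fires once all of its inputs hold data. It assumes nothing about the actual PTask API: the names Channel, Task, and run_graph are hypothetical, the code runs entirely on the CPU in user space, and it only illustrates the general idea that a graph of explicit objects gives a scheduler visibility into what can run next.

```python
from collections import deque


class Channel:
    """FIFO queue connecting a producer to a consumer (hypothetical name)."""

    def __init__(self):
        self.items = deque()

    def push(self, value):
        self.items.append(value)

    def pop(self):
        return self.items.popleft()

    def ready(self):
        return len(self.items) > 0


class Task:
    """A graph node that fires once every input channel holds a value."""

    def __init__(self, name, fn, inputs, output):
        self.name = name
        self.fn = fn
        self.inputs = inputs
        self.output = output

    def runnable(self):
        # Require at least one input so a source-less task cannot fire forever.
        return bool(self.inputs) and all(ch.ready() for ch in self.inputs)

    def fire(self):
        # Consume one value per input and publish the result downstream.
        args = [ch.pop() for ch in self.inputs]
        self.output.push(self.fn(*args))


def run_graph(tasks):
    """Fire runnable tasks until none remain; return the firing order."""
    order = []
    progress = True
    while progress:
        progress = False
        for task in tasks:
            if task.runnable():
                task.fire()
                order.append(task.name)
                progress = True
    return order


# Example graph: (a, b) -> add -> scale -> result
a, b, mid, result = Channel(), Channel(), Channel(), Channel()
tasks = [
    Task("scale", lambda x: x * 10, [mid], result),
    Task("add", lambda x, y: x + y, [a, b], mid),
]
a.push(2)
b.push(3)
order = run_graph(tasks)
final_value = result.pop()
```

Although "scale" is listed first, it cannot fire until "add" has produced a value, so execution order follows data availability rather than declaration order. In this sketch the scheduler is a simple loop; the abstract's point is that when such graph objects are OS-managed, the kernel can make these scheduling and data-movement decisions itself.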

Our experience developing the PTask API, along with a gestural interface on Windows 7 and a FUSE-based encrypted file system on Linux, shows that the PTask API can provide important system-wide guarantees where there were previously none and can enable significant performance improvements, for example a 5× improvement in maximum throughput for the gestural interface.

Published In

SOSP '11: Proceedings of the Twenty-Third ACM Symposium on Operating Systems Principles

October 2011

417 pages

ISBN:9781450309776

DOI:10.1145/2043556

General Chair: Ted Wobber (MSR Silicon Valley)

Program Chair: Peter Druschel (MPI-SWS)

Copyright © 2011 ACM.

Sponsors

INESC: Systems and Computer Engineering Institute
SIGOPS: ACM Special Interest Group on Operating Systems

Publisher

Association for Computing Machinery

New York, NY, United States

Funding Sources

Division of Computer and Network Systems

Conference

SOSP '11: ACM SIGOPS 23rd Symposium on Operating Systems Principles

October 23 - 26, 2011

Cascais, Portugal

Acceptance Rates

Overall Acceptance Rate 131 of 716 submissions, 18%
