DOI: 10.1145/2043556.2043579

PTask: operating system abstractions to manage GPUs as compute devices

Published: 23 October 2011

Abstract

We propose a new set of OS abstractions to support GPUs and other accelerator devices as first-class computing resources. These new abstractions, collectively called the PTask API, support a dataflow programming model. Because a PTask graph consists of OS-managed objects, the kernel has sufficient visibility and control to provide system-wide guarantees like fairness and performance isolation, and can streamline data movement in ways that are impossible under current GPU programming models.
Our experience developing the PTask API, along with a gestural interface on Windows 7 and a FUSE-based encrypted file system on Linux, shows that the PTask API can provide important system-wide guarantees where there were previously none, and can enable significant performance improvements, for example a 5× improvement in maximum throughput for the gestural interface.
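
To make the dataflow model mentioned above concrete, the following is a minimal sketch of a task graph in which tasks are connected by channels and a central scheduler decides when each task fires. It is not the PTask API: `Channel`, `Task`, and `run` are hypothetical names chosen for illustration, and everything runs in ordinary user-space Python rather than in a kernel or on a GPU. The only point it illustrates is that once the graph is a set of objects owned by the scheduler, the scheduler (the OS, in the paper's proposal) can see every dependency and control both execution order and data movement.

```python
from collections import deque


class Channel:
    """FIFO queue connecting one task's output to another task's input (hypothetical)."""

    def __init__(self):
        self.items = deque()

    def push(self, value):
        self.items.append(value)

    def pop(self):
        return self.items.popleft()

    def ready(self):
        return len(self.items) > 0


class Task:
    """Graph node that may fire once every input channel holds a value (hypothetical)."""

    def __init__(self, name, fn, inputs, outputs):
        self.name = name
        self.fn = fn
        self.inputs = inputs    # list of Channel; must be non-empty in this sketch
        self.outputs = outputs  # list of Channel

    def can_fire(self):
        return all(ch.ready() for ch in self.inputs)

    def fire(self):
        # Consume one value per input, compute, and forward the result downstream.
        args = [ch.pop() for ch in self.inputs]
        result = self.fn(*args)
        for ch in self.outputs:
            ch.push(result)


def run(tasks):
    """Fire ready tasks until none can make progress; return the firing order."""
    order = []
    progressed = True
    while progressed:
        progressed = False
        for task in tasks:
            if task.can_fire():
                task.fire()
                order.append(task.name)
                progressed = True
    return order


# Two-stage pipeline: raw -> scale -> scaled -> offset -> out
raw, scaled, out = Channel(), Channel(), Channel()
tasks = [
    Task("offset", lambda x: x + 1, [scaled], [out]),
    Task("scale", lambda x: x * 2, [raw], [scaled]),
]
for sample in (1, 2, 3):
    raw.push(sample)

order = run(tasks)
results = list(out.items)
```

Note that the tasks are listed in the "wrong" order and the program never says when to run them or when to copy data between stages; readiness of the channels alone determines both, which is the property a dataflow runtime can exploit for scheduling.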


Published In

SOSP '11: Proceedings of the Twenty-Third ACM Symposium on Operating Systems Principles
October 2011
417 pages
ISBN:9781450309776
DOI:10.1145/2043556

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. GPGPU
  2. GPUs
  3. OS design
  4. accelerators
  5. dataflow
  6. gestural interface
  7. operating systems

Qualifiers

  • Research-article


Acceptance Rates

Overall Acceptance Rate 131 of 716 submissions, 18%
