GPUnet: Networking Abstractions for GPU Programs

Authors:

Mark Silberstein,

Emmett Witchel

ACM Transactions on Computer Systems (TOCS), Volume 34, Issue 3

Article No.: 9, Pages 1 - 31

https://doi.org/10.1145/2963098

Published: 17 September 2016

Abstract

Despite the popularity of GPUs in high-performance and scientific computing, and despite increasingly general-purpose hardware capabilities, the use of GPUs in network servers or distributed systems poses significant challenges.

GPUnet is a native GPU networking layer that provides a socket abstraction and high-level networking APIs for GPU programs. We use GPUnet to streamline the development of high-performance, distributed applications like in-GPU-memory MapReduce and a new class of low-latency, high-throughput GPU-native network services such as a face verification server.

Index Terms

GPUnet: Networking Abstractions for GPU Programs
1. Networks
  1. Network architectures
    1. Programming interfaces
2. Software and its engineering
  1. Software organization and properties
    1. Contextual software domains
      1. Operating systems
        Communications management
        Input / output

Published In

ACM Transactions on Computer Systems Volume 34, Issue 3

September 2016

103 pages

ISSN:0734-2071

EISSN:1557-7333

DOI:10.1145/2966277

Editor:
Todd C. Mowry
Carnegie Mellon University, Pittsburgh, PA

Copyright © 2016 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 17 September 2016

Accepted: 01 June 2016

Received: 01 January 2016

Qualifiers

Research-article
Research
Refereed

Funding Sources

National Science Foundation
Israel Science Foundation
Israeli Ministry of Science
