
GPUnet: Networking Abstractions for GPU Programs

Published: 17 September 2016

    Abstract

    Despite the popularity of GPUs in high-performance and scientific computing, and despite increasingly general-purpose hardware capabilities, the use of GPUs in network servers or distributed systems poses significant challenges.
    GPUnet is a native GPU networking layer that provides a socket abstraction and high-level networking APIs for GPU programs. We use GPUnet to streamline the development of high-performance, distributed applications like in-GPU-memory MapReduce and a new class of low-latency, high-throughput GPU-native network services such as a face verification server.

    Published In

    ACM Transactions on Computer Systems, Volume 34, Issue 3
    September 2016
    103 pages
    ISSN: 0734-2071
    EISSN: 1557-7333
    DOI: 10.1145/2966277
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 17 September 2016
    Accepted: 01 June 2016
    Received: 01 January 2016
    Published in TOCS Volume 34, Issue 3

    Author Tags

    1. GPGPUs
    2. Operating systems design
    3. accelerators
    4. network servers

    Qualifiers

    • Research-article
    • Research
    • Refereed

    Bibliometrics & Citations

    Bibliometrics

    Article Metrics

    • Downloads (Last 12 months): 211
    • Downloads (Last 6 weeks): 40
    Reflects downloads up to 12 Aug 2024

    Citations

    Cited By

    • (2024) GPU-Initiated Resource Allocation for Irregular Workloads. Proceedings of the 3rd International Workshop on Extreme Heterogeneity Solutions, pp. 1-8. DOI: 10.1145/3642961.3643799. Online publication date: 2-Mar-2024
    • (2024) CARE: Cloudified Android With Optimized Rendering Platform. IEEE Transactions on Multimedia 26, pp. 958-971. DOI: 10.1109/TMM.2023.3274303. Online publication date: 1-Jan-2024
    • (2023) GPU Graph Processing on CXL-Based Microsecond-Latency External Memory. Proceedings of the SC '23 Workshops of The International Conference on High Performance Computing, Network, Storage, and Analysis, pp. 962-972. DOI: 10.1145/3624062.3624173. Online publication date: 12-Nov-2023
    • (2023) SplitRPC: A {Control + Data} Path Splitting RPC Stack for ML Inference Serving. Proceedings of the ACM on Measurement and Analysis of Computing Systems 7:2, pp. 1-26. DOI: 10.1145/3589974. Online publication date: 22-May-2023
    • (2023) Multi-GPU Communication Schemes for Iterative Solvers: When CPUs are Not in Charge. Proceedings of the 37th International Conference on Supercomputing, pp. 192-202. DOI: 10.1145/3577193.3593713. Online publication date: 21-Jun-2023
    • (2022) Gaviss: Boosting the Performance of GPU-Accelerated NFV Systems via Data Sharing. IEEE Transactions on Parallel and Distributed Systems 33:12, pp. 4472-4483. DOI: 10.1109/TPDS.2022.3193368. Online publication date: 1-Dec-2022
    • (2022) Unleashing GPUs for Network Function Virtualization: an open architecture based on Vulkan and Kubernetes. NOMS 2022-2022 IEEE/IFIP Network Operations and Management Symposium, pp. 1-8. DOI: 10.1109/NOMS54207.2022.9789822. Online publication date: 25-Apr-2022
    • (2022) Parallelizing CPU-GPU Network Processing Flows. 2022 International Conference on Innovative Trends in Information Technology (ICITIIT), pp. 1-5. DOI: 10.1109/ICITIIT54346.2022.9744209. Online publication date: 12-Feb-2022
    • (2022) Intelligent deep fusion network for urban traffic flow anomaly identification. Computer Communications 189:C, pp. 175-181. DOI: 10.1016/j.comcom.2022.03.021. Online publication date: 1-May-2022
    • (2021) CARE. Proceedings of the 29th ACM International Conference on Multimedia, pp. 4582-4590. DOI: 10.1145/3474085.3475617. Online publication date: 17-Oct-2021
