DOI: 10.5555/2523721.2523758
research-article

RSVM: a region-based software virtual memory for GPU

Published: 07 October 2013

Abstract

    Although Graphics Processing Units (GPUs) have been widely adopted for general-purpose computing in recent years, they remain difficult to program, largely because GPU memory is explicitly managed and CPU-GPU data transfers are manual. Despite recent calls to manage GPU resources as first-class citizens in the operating system, a mature GPU memory management mechanism is still missing, which leads to reinventing the wheel in various GPU system software. Meanwhile, ever-growing problem sizes create an urgent need for a system-level mechanism for unified CPU-GPU memory management.
    In this work, we present the design of Region-based Software Virtual Memory (RSVM), a software virtual memory that runs on both the CPU and the GPU in a distributed and cooperative way. In addition to automatic GPU memory management and GPU-CPU data transfer, RSVM offers two novel features: 1) on-demand data fetching from the host into GPU memory, issued by GPU kernels, and 2) transparent intra-kernel swapping of GPU memory out to main memory. Our study reveals important insights into the challenges and opportunities of building unified virtual memory systems for heterogeneous computing. Experimental results on real GPU benchmarks demonstrate that, although it incurs a small overhead, RSVM can transparently scale GPU kernels to large problem sizes that exceed the device memory size; developers write the same code for different problem sizes, yet can still optimize the data layout definition accordingly. Our evaluation also identifies GPU architecture features that are currently missing and would improve system software efficiency.
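
To make the two features named in the abstract more concrete, the following is a minimal host-side sketch in Python of region-granularity on-demand fetching and swapping. It is an illustration under stated assumptions, not the RSVM implementation: the class `RegionStore`, its methods `create` and `map`, and the least-recently-used eviction policy are all hypothetical and do not come from the paper. The sketch also simulates the device with ordinary Python objects, so it does not model the cooperative CPU/GPU runtime or requests issued from inside a running kernel.

```python
from collections import OrderedDict


class RegionStore:
    """Toy model of a region-based software virtual memory (hypothetical API)."""

    def __init__(self, device_capacity):
        self.device_capacity = device_capacity  # bytes on the simulated device
        self.host = {}               # region id -> backing copy in main memory
        self.device = OrderedDict()  # resident regions, least recently used first
        self.used = 0                # bytes currently resident on the device
        self.fetches = 0             # host -> device transfers
        self.swap_outs = 0           # device -> host transfers

    def create(self, region_id, size):
        """Register a region; its data initially lives only on the host."""
        if size > self.device_capacity:
            raise ValueError("a single region must fit in device memory")
        self.host[region_id] = bytearray(size)

    def map(self, region_id):
        """Return the device copy of a region, fetching it on demand."""
        if region_id in self.device:
            self.device.move_to_end(region_id)  # mark as most recently used
            return self.device[region_id]
        size = len(self.host[region_id])
        # Assumed policy: evict least recently used regions until there is room.
        while self.used + size > self.device_capacity:
            self._swap_out_lru()
        self.device[region_id] = bytearray(self.host[region_id])
        self.used += size
        self.fetches += 1
        return self.device[region_id]

    def _swap_out_lru(self):
        """Write the least recently used region back to the host and free it."""
        victim, data = self.device.popitem(last=False)
        self.host[victim] = data
        self.used -= len(data)
        self.swap_outs += 1


def demo():
    """Touch 8 KiB of regions through a 4 KiB device, twice."""
    store = RegionStore(device_capacity=4 * 1024)
    for i in range(8):
        store.create(i, 1024)
    for i in range(8):
        store.map(i)[0] = i + 1  # write through the device copy
    total = sum(store.map(i)[0] for i in range(8))  # read everything back
    return total, store.fetches, store.swap_outs
```

Calling `demo()` works on a data set twice the size of the simulated device memory without the caller ever checking residency, and values written before a region was evicted are read back intact. This mirrors, in a very reduced form, the claim that the same code can serve problem sizes beyond the device memory limit.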

    Published In

    PACT '13: Proceedings of the 22nd international conference on Parallel architectures and compilation techniques
    October 2013
    422 pages
    ISBN: 9781479910212

    Publisher

    IEEE Press

    Publication History

    Published: 07 October 2013

    Author Tags

    1. GPGPU
    2. GPU memory management
    3. heterogeneous system

    Qualifiers

    • Research-article

    Acceptance Rates

    PACT '13 Paper Acceptance Rate 36 of 208 submissions, 17%;
    Overall Acceptance Rate 121 of 471 submissions, 26%

    Article Metrics

    • Downloads (last 12 months): 5
    • Downloads (last 6 weeks): 0
    Reflects downloads up to 12 Aug 2024

    Cited By

    • (2020) AvA: Accelerated Virtualization of Accelerators. Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems, pp. 807-825. DOI: 10.1145/3373376.3378466. Online publication date: 9-Mar-2020.
    • (2019) Automatic Virtualization of Accelerators. Proceedings of the Workshop on Hot Topics in Operating Systems, pp. 58-65. DOI: 10.1145/3317550.3321423. Online publication date: 13-May-2019.
    • (2019) A case study on machine learning for synthesizing benchmarks. Proceedings of the 3rd ACM SIGPLAN International Workshop on Machine Learning and Programming Languages, pp. 38-46. DOI: 10.1145/3315508.3329976. Online publication date: 22-Jun-2019.
    • (2018) ActivePointers. ACM SIGOPS Operating Systems Review 52:1, pp. 84-95. DOI: 10.1145/3273982.3273990. Online publication date: 28-Aug-2018.
    • (2017) Efficient exception handling support for GPUs. Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture, pp. 109-122. DOI: 10.1145/3123939.3123950. Online publication date: 14-Oct-2017.
    • (2016) ActivePointers. ACM SIGARCH Computer Architecture News 44:3, pp. 596-608. DOI: 10.1145/3007787.3001200. Online publication date: 18-Jun-2016.
    • (2016) GPUnet. ACM Transactions on Computer Systems 34:3, pp. 1-31. DOI: 10.1145/2963098. Online publication date: 17-Sep-2016.
    • (2016) Supporting data-driven I/O on GPUs using GPUfs. Proceedings of the 9th ACM International on Systems and Storage Conference, pp. 1-11. DOI: 10.1145/2928275.2928276. Online publication date: 6-Jun-2016.
    • (2016) ActivePointers. Proceedings of the 43rd International Symposium on Computer Architecture, pp. 596-608. DOI: 10.1109/ISCA.2016.58. Online publication date: 18-Jun-2016.
    • (2015) Managing GPU buffers for caching more apps in mobile systems. Proceedings of the 12th International Conference on Embedded Software, pp. 207-216. DOI: 10.5555/2830865.2830888. Online publication date: 4-Oct-2015.
