research-article

To move or not to move?: page migration for irregular applications in over-subscribed GPU memory systems with DynaMap

Authors:

Chia-Hao Chang,

Anand Sivasubramaniam

SYSTOR '21: Proceedings of the 14th ACM International Conference on Systems and Storage

Article No.: 1, Pages 1 - 12

https://doi.org/10.1145/3456727.3463766

Published: 14 June 2021

Abstract

This paper focuses on the severe page thrashing that can arise when large applications with irregular memory access patterns run on GPUs with limited memory. Such memory over-subscription causes very poor performance under the page migration mechanisms currently found in NVIDIA's UVM drivers, whether on-demand (eager) or access-counter based at page-group granularity (lazy). Our detailed analysis of these executions reveals a novel insight: rather than having both the GPU caches and GPU memory cater to both temporal and spatial locality, it is better for the caches to cater only to temporal locality and for memory to cater only to spatial locality, thereby saving precious memory system capacity. Based on this insight, we build an adaptive page migration scheme, called DynaMap, that (i) uses a compiler pass to instrument off-the-shelf CUDA UVM applications for spatial utilization tracking, (ii) dynamically sets a spatial utilization threshold that determines migration, based on memory pressure and access characteristics, and (iii) enhances the current NVIDIA UVM driver to migrate a page from host memory to the GPU dynamically, based on that threshold. We implement DynaMap on a real system and evaluate it with 7 irregular applications from public benchmark suites under different over-subscription ratios, showing speedups of as much as 2.5X (34% on average) over state-of-the-art UVM implementations.
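To make the threshold idea in the abstract concrete, the following is a minimal sketch in Python, not the authors' implementation. It assumes a 4 KiB page tracked in 128-byte chunks, and a threshold that rises linearly with memory pressure; the page size, the chunk size, the names `PageTracker`, `pick_threshold`, and `should_migrate`, and the linear policy are all hypothetical choices made for illustration only.

```python
from dataclasses import dataclass, field

PAGE_SIZE = 4096    # bytes; assumed page size, for illustration only
CHUNK_SIZE = 128    # bytes; assumed tracking granularity
CHUNKS_PER_PAGE = PAGE_SIZE // CHUNK_SIZE


@dataclass
class PageTracker:
    """Hypothetical per-page record of which chunks have been touched."""
    touched: set = field(default_factory=set)

    def record(self, offset: int) -> None:
        # Map a byte offset within the page to its chunk index.
        self.touched.add(offset // CHUNK_SIZE)

    def spatial_utilization(self) -> float:
        # Fraction of the page's chunks that have been accessed so far.
        return len(self.touched) / CHUNKS_PER_PAGE


def pick_threshold(memory_pressure: float, lo: float = 0.0, hi: float = 0.75) -> float:
    """Assumed policy: raise the threshold linearly as pressure grows.

    memory_pressure is clamped to [0, 1]; lo and hi are illustrative bounds.
    """
    p = min(max(memory_pressure, 0.0), 1.0)
    return lo + (hi - lo) * p


def should_migrate(tracker: PageTracker, memory_pressure: float) -> bool:
    # Migrate only pages whose spatial utilization meets the current threshold.
    return tracker.spatial_utilization() >= pick_threshold(memory_pressure)


if __name__ == "__main__":
    sparse = PageTracker()
    for off in (0, 130, 4000):   # touches 3 distinct chunks out of 32
        sparse.record(off)
    print(sparse.spatial_utilization())   # fraction of chunks touched
    print(should_migrate(sparse, 0.0))    # no pressure: behaves like eager migration
    print(should_migrate(sparse, 0.5))    # under pressure: sparse page stays on the host
```

Under this assumed policy, a sparsely touched page is migrated only when GPU memory is plentiful, while a densely touched page is migrated even under high pressure. The actual threshold policy and tracking granularity used by DynaMap are not specified in the abstract.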

Index Terms

To move or not to move?: page migration for irregular applications in over-subscribed GPU memory systems with DynaMap
1. Computer systems organization
  1. Architectures
    1. Other architectures
      1. Heterogeneous (hybrid) systems
2. Software and its engineering
  1. Software organization and properties
    1. Contextual software domains
      1. Operating systems
        Memory management

Published In

SYSTOR '21: Proceedings of the 14th ACM International Conference on Systems and Storage

June 2021

226 pages

ISBN:9781450383981

DOI:10.1145/3456727

General Chairs: Bruno Wassermann (IBM Research - Haifa, Israel), Michal Malka (IBM Research - Haifa, Israel)

Program Chairs: Vijay Chidambaram (University of Texas at Austin/VMWare Research), Danny Raz (Technion - Israel Institute of Technology, Israel)

Copyright © 2021 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGOPS: ACM Special Interest Group on Operating Systems

In-Cooperation

Technion: Israel Institute of Technology
USENIX Assoc: USENIX Assoc

Publisher

Association for Computing Machinery

New York, NY, United States

Badges

Best Paper

Qualifiers

Research-article

Funding Sources

DARPA/SRC, NSF

Conference

SYSTOR '21

Sponsor:

SIGOPS

SYSTOR '21: The 14th ACM International Systems and Storage Conference

June 14 - 16, 2021

Haifa, Israel

Acceptance Rates

SYSTOR '21 Paper Acceptance Rate 18 of 63 submissions, 29%;

Overall Acceptance Rate 108 of 323 submissions, 33%

Cited By

B P, Cox G, Vesely J, Basu A. (2024). SUV: Static Analysis Guided Unified Virtual Memory. 2024 57th IEEE/ACM International Symposium on Microarchitecture (MICRO), 293-308. Online publication date: 2-Nov-2024. https://doi.org/10.1109/MICRO61859.2024.00030

Abdullah R, Lee H, Zhou H, Awad A. (2024). Salus: Efficient Security Support for CXL-Expanded GPU Memory. 2024 IEEE International Symposium on High-Performance Computer Architecture (HPCA), 1-15. Online publication date: 2-Mar-2024. https://doi.org/10.1109/HPCA57654.2024.00027

Allen T, Cooper B, Ge R. (2023). Fine-grain Quantitative Analysis of Demand Paging in Unified Virtual Memory. ACM Transactions on Architecture and Code Optimization 21:1, 1-24. Online publication date: 14-Nov-2023. https://dl.acm.org/doi/10.1145/3632953

Giannoula C, Huang K, Tang J, Koziris N, Goumas G, Chishti Z, Vijaykumar N. (2023). DaeMon: Architectural Support for Efficient Data Movement in Fully Disaggregated Systems. Proceedings of the ACM on Measurement and Analysis of Computing Systems 7:1, 1-36. Online publication date: 2-Mar-2023. https://dl.acm.org/doi/10.1145/3579445
