DOI: 10.1145/3456727.3463766
research-article

To move or not to move?: page migration for irregular applications in over-subscribed GPU memory systems with DynaMap

Published: 14 June 2021

Abstract

This paper focuses on the severe page thrashing that can arise when large applications with irregular memory access patterns run on GPUs with limited memory. Such memory over-subscription causes very poor performance under the current on-demand (eager) and page-group-granularity, access-counter-based (lazy) page migration mechanisms found in NVIDIA's UVM drivers. Our detailed analysis of these executions reveals a novel insight: rather than having both the GPU caches and the GPU memory cater to both temporal and spatial locality, it is better for the caches to cater only to temporal locality and the memory only to spatial locality, thereby saving precious memory system capacity. Based on this, we build an adaptive page migration scheme, called DynaMap, that (i) uses a compiler pass to instrument off-the-shelf CUDA UVM applications for spatial utilization tracking, (ii) dynamically sets a spatial utilization threshold for migration based on memory pressure and access characteristics, and (iii) enhances the current NVIDIA UVM driver to dynamically migrate pages (from host memory to the GPU) based on that threshold. We implement DynaMap on a real system and, using 7 irregular applications from public benchmark suites at different over-subscription ratios, show speedups of up to 2.5X (34% on average) over state-of-the-art UVM implementations.
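The threshold idea in steps (ii) and (iii) can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the DynaMap implementation: the abstract does not say how utilization is measured or how the threshold is derived, so the 4 KiB page, the 128-byte tracking sector, the linear threshold formula, and the names `PageStats`, `migration_threshold`, and `should_migrate` are all hypothetical.

```python
from dataclasses import dataclass, field

# Assumed sizes for illustration only; the abstract does not specify them.
PAGE_SIZE = 4096                              # bytes per page
SECTOR_SIZE = 128                             # bytes per tracked sector
SECTORS_PER_PAGE = PAGE_SIZE // SECTOR_SIZE   # 32 sectors per page


@dataclass
class PageStats:
    """Distinct sectors of one host-resident page touched by the GPU."""
    touched: set = field(default_factory=set)

    def record(self, offset: int) -> None:
        """Record one access at a byte offset within the page."""
        if not 0 <= offset < PAGE_SIZE:
            raise ValueError("offset outside page")
        self.touched.add(offset // SECTOR_SIZE)

    @property
    def utilization(self) -> float:
        """Fraction of the page's sectors touched so far (0.0 to 1.0)."""
        return len(self.touched) / SECTORS_PER_PAGE


def migration_threshold(used_pages: int, capacity_pages: int,
                        low: float = 0.0, high: float = 0.75) -> float:
    """Hypothetical policy: threshold grows linearly with memory pressure."""
    if capacity_pages <= 0:
        raise ValueError("capacity_pages must be positive")
    pressure = min(max(used_pages / capacity_pages, 0.0), 1.0)
    return low + (high - low) * pressure


def should_migrate(stats: PageStats, used_pages: int,
                   capacity_pages: int) -> bool:
    """Migrate only if the page is used densely enough for the current pressure."""
    return stats.utilization >= migration_threshold(used_pages, capacity_pages)


if __name__ == "__main__":
    sparse = PageStats()
    for off in (0, 64, 130):                  # touches sectors 0 and 1 only
        sparse.record(off)
    print(should_migrate(sparse, 0, 100))     # empty GPU memory
    print(should_migrate(sparse, 100, 100))   # GPU memory full
```

In this sketch, a sparsely used page migrates while GPU memory is empty (the policy then behaves like on-demand migration) and stays in host memory once GPU memory is full, so only pages with high spatial utilization compete for the scarce capacity.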

Information & Contributors

Information

Published In

SYSTOR '21: Proceedings of the 14th ACM International Conference on Systems and Storage
June 2021
226 pages
ISBN:9781450383981
DOI:10.1145/3456727

Sponsors

In-Cooperation

  • Technion: Israel Institute of Technology
  • USENIX Association

Publisher

Association for Computing Machinery

New York, NY, United States

Badges

  • Best Paper

Author Tags

  1. GPGPU
  2. UVM
  3. memory oversubscription

Qualifiers

  • Research-article

Funding Sources

  • DARPA/SRC, NSF

Conference

SYSTOR '21

Acceptance Rates

SYSTOR '21 Paper Acceptance Rate 18 of 63 submissions, 29%;
Overall Acceptance Rate 108 of 323 submissions, 33%

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months): 171
  • Downloads (Last 6 weeks): 17

Reflects downloads up to 20 Feb 2025

Citations

Cited By

  • (2024) SUV: Static Analysis Guided Unified Virtual Memory. 2024 57th IEEE/ACM International Symposium on Microarchitecture (MICRO), 293-308. DOI: 10.1109/MICRO61859.2024.00030. Online publication date: 2-Nov-2024
  • (2024) Salus: Efficient Security Support for CXL-Expanded GPU Memory. 2024 IEEE International Symposium on High-Performance Computer Architecture (HPCA), 1-15. DOI: 10.1109/HPCA57654.2024.00027. Online publication date: 2-Mar-2024
  • (2023) Fine-grain Quantitative Analysis of Demand Paging in Unified Virtual Memory. ACM Transactions on Architecture and Code Optimization 21:1, 1-24. DOI: 10.1145/3632953. Online publication date: 14-Nov-2023
  • (2023) DaeMon: Architectural Support for Efficient Data Movement in Fully Disaggregated Systems. Proceedings of the ACM on Measurement and Analysis of Computing Systems 7:1, 1-36. DOI: 10.1145/3579445. Online publication date: 2-Mar-2023