PERKS: a Locality-Optimized Execution Model for Iterative Memory-bound GPU Applications

Zhang, Lingqi; Wahib, Mohamed; Chen, Peng; Meng, Jintao; Wang, Xiao; Toshio, Endo; Matsuoka, Satoshi

doi:10.1145/3577193.3593705


arXiv:2204.02064v3 (cs)

[Submitted on 5 Apr 2022 (v1), revised 1 May 2023 (this version, v3), latest version 12 May 2023 (v4)]


Abstract: Iterative memory-bound solvers commonly occur in HPC codes. Typical GPU implementations have a loop on the host side that invokes the GPU kernel once per time/algorithm step. The termination of each kernel implicitly acts as the barrier required after advancing the solution every time step. We propose an execution model for running memory-bound iterative GPU kernels: PERsistent KernelS (PERKS). In this model, the time loop is moved inside a persistent kernel, and device-wide barriers are used for synchronization. We then reduce the traffic to device memory by caching a subset of the output of each time step in the unused registers and shared memory. PERKS can be generalized to any iterative solver: they are largely independent of the solver's implementation. We explain the design principle of PERKS and demonstrate the effectiveness of PERKS for a wide range of iterative 2D/3D stencil benchmarks (geomean speedup of $2.12$x for 2D stencils and $1.24$x for 3D stencils over state-of-the-art libraries), and a Krylov subspace conjugate gradient solver (geomean speedup of $4.86$x in smaller SpMV datasets from SuiteSparse and $1.43$x in larger SpMV datasets over a state-of-the-art library). All PERKS-based implementations are available at: this https URL.
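The difference between the two execution models described in the abstract can be illustrated with a small CPU-side toy model. This is only a sketch under stated assumptions, not the PERKS implementation: the 3-point averaging stencil, the function names (`host_loop`, `persistent_loop`), and the `cached` parameter are hypothetical, device-memory traffic is simply counted in array elements, and the on-chip storage (registers and shared memory) is treated as one pool, so halo exchange between thread blocks is ignored.

```python
def stencil_point(left, mid, right):
    # Illustrative 3-point averaging stencil (an assumption, not from the paper).
    return (left + mid + right) / 3.0


def advance(src):
    # One time step over the interior points; boundary values stay fixed.
    out = src[:]
    for i in range(1, len(src) - 1):
        out[i] = stencil_point(src[i - 1], src[i], src[i + 1])
    return out


def host_loop(u, steps):
    """Baseline: one 'kernel launch' per time step. Every launch loads the
    whole array from device memory and stores the whole result back."""
    n = len(u)
    dev = list(u)          # stands in for device memory
    traffic = 0            # elements moved to/from device memory
    for _ in range(steps):             # time loop on the host
        src = dev[:]
        traffic += n                   # kernel loads its input
        dev = advance(src)
        traffic += n                   # kernel stores its output
        # kernel termination acts as the implicit barrier
    return dev, traffic


def persistent_loop(u, steps, cached):
    """Persistent variant: the time loop runs inside a single 'kernel'.
    Elements [0, cached) stay on chip across steps; only the remaining
    elements travel through device memory in each step."""
    n = len(u)
    dev = list(u)
    on_chip = dev[:cached]
    traffic = cached                   # one-time load of the cached part
    for _ in range(steps):             # time loop inside the kernel
        src = on_chip + dev[cached:]
        traffic += n - cached          # load only the uncached part
        out = advance(src)
        on_chip = out[:cached]         # stays in registers/shared memory
        dev[cached:] = out[cached:]
        traffic += n - cached          # store only the uncached part
        # a device-wide barrier would separate time steps here
    dev[:cached] = on_chip
    traffic += cached                  # one-time write-back at the end
    return dev, traffic


if __name__ == "__main__":
    u0 = [float(i % 5) for i in range(16)]
    base, base_traffic = host_loop(u0, steps=10)
    perks, perks_traffic = persistent_loop(u0, steps=10, cached=8)
    print(base == perks, base_traffic, perks_traffic)
```

In this toy setting, with 16 elements, 10 steps, and half of the domain cached, both variants produce the same result while the counted traffic drops from 320 to 176 element transfers. These counts only illustrate the mechanism; they say nothing about the speedups reported in the abstract.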

Comments:	This paper will be published in the 2023 International Conference on Supercomputing (ICS23)
Subjects:	Distributed, Parallel, and Cluster Computing (cs.DC)
Cite as:	arXiv:2204.02064 [cs.DC]
	(or arXiv:2204.02064v3 [cs.DC] for this version)
	https://doi.org/10.48550/arXiv.2204.02064
Related DOI:	https://doi.org/10.1145/3577193.3593705

Submission history

From: Lingqi Zhang
[v1] Tue, 5 Apr 2022 08:59:18 UTC (5,023 KB)
[v2] Sat, 21 May 2022 05:10:32 UTC (5,400 KB)
[v3] Mon, 1 May 2023 06:08:30 UTC (6,719 KB)
[v4] Fri, 12 May 2023 11:16:55 UTC (6,712 KB)
