GVProf: A value profiler for GPU-based clusters

K Zhou, Y Hao, J Mellor-Crummey… - … Conference for High …, 2020 - ieeexplore.ieee.org
SC20: International Conference for High Performance Computing …, 2020
GPGPUs are widely used in high-performance computing systems to accelerate scientific and machine learning workloads. Developing efficient GPU kernels is critically important to obtain “bare-metal” performance on GPU-based clusters. In this paper, we describe the design and implementation of GVPROF, the first value profiler that pinpoints value-related inefficiencies in applications running on NVIDIA GPU-based clusters. The novelty of GVPROF resides in its ability to detect temporal and spatial value redundancies, which provides useful information to guide code optimization. GVPROF can monitor production multi-node multi-GPU executions in clusters. Our experiments with well-known GPU benchmarks and HPC applications show that GVPROF incurs acceptable overhead and scales to large executions. Using GVPROF, we optimized several HPC and machine learning workloads on one NVIDIA V100 GPU. In one case study of LAMMPS, optimizations based on information from GVPROF led to whole-program speedups ranging from 1.37x on a single GPU to 1.08x on 64 GPUs.