Optimising GPGPU Execution Through Runtime Micro-Architecture Parameter Analysis

Giuseppe M. Sarda (KU Leuven, Leuven, Belgium; imec, Leuven, Belgium), Nimish Shah (KU Leuven, Leuven, Belgium), Debjyoti Bhattacharjee (imec, Leuven, Belgium), Peter Debacker (imec, Leuven, Belgium), Marian Verhelst (KU Leuven, Leuven, Belgium; imec, Leuven, Belgium)
Abstract

GPGPU execution analysis has always been tied to closed-source, proprietary benchmarking tools that provide high-level, non-exhaustive, and/or statistical information, preventing a thorough understanding of bottlenecks and optimization possibilities. Open-source hardware platforms offer opportunities to overcome such limits and co-optimize the full hardware-mapping-algorithm compute stack. Yet, so far, this has remained under-explored. In this work, we exploit micro-architecture parameter analysis to develop a hardware-aware, runtime mapping technique for OpenCL kernels on the open Vortex RISC-V GPGPU. Our method is based on trace observations and targets optimal hardware resource utilization to achieve superior performance and flexibility compared to hardware-agnostic mapping approaches. The technique was validated on different architectural GPU configurations across several OpenCL kernels. Overall, our approach significantly enhances the performance of the open-source Vortex GPGPU, contributing to unlocking its potential and usability.

1 Introduction

©2023 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works. DOI: paper DOI.
Figure 1: Execution traces of the vecadd kernel under 4 different lws. Each plot shows tagged instruction wavefronts, the PC, the active thread mask and the timestamp of instruction issues from different warps.
Figure 2: Violin plots showing the comparison (ratio) of latencies from our methodology vs fixed (lws=32, right in blue) and naive mapping (lws=1, left in yellow) on 450 different HW architectural configurations. Data tables show the average, the worst result, and the result count <1 (x/450) in percentage.

The rise of AI algorithms, demanding more compute power, and the slowdown of Moore’s law, hindering performance improvements from pure silicon technology scaling, led to the development of data-parallel, application-specific architectures like GPUs and NPUs. However, most of these architectures are proprietary, which limits their versatility and prevents inspection of their internals. An opportunity to address these issues lies in open-source hardware and software platforms, with the Vortex RISC-V-based GPGPU [5] being a promising and versatile option for exploration and characterization across various algorithms. The open hardware, ISA, and software stack allow deep analysis and understanding of the execution on the platform. This makes it possible to identify bottlenecks and creates more opportunities to co-optimize the GPU across the whole stack. Our work demonstrates the impact of leveraging low-level, micro-architecture information to improve, with a single flexible approach, the execution of several kernels on the open-source Vortex GPU. To ensure algorithmic versatility, we chose different math kernels and synthetic layers from typical Deep Neural Networks (DNN) [1] and Graph Convolutional Networks (GCN) [2].

Specifically, this paper reports our contributions in terms of:

  • trace analysis from the RISC-V-based Vortex GPGPU

  • our hardware-aware optimal, runtime OpenCL kernel mapping

  • the impact of our technique on the execution of typical math kernels and layers in DNN and GCN

2 Analysis and workload mapping

The Vortex POCL compiler [4] accepts standard OpenCL kernels. Compiled code is linked with runtime libraries, which take care of initializing the platform and spatially and temporally mapping the parallel instances of the kernel.

Before calling the kernel, the Vortex runtime maps the workload equally across cores. Within each core, the kernel iterations are further distributed among threads first and then warps, depending on the local_work_size (lws). This lws is one of the arguments passed by the host platform when calling the GPGPU execution [3] and, in essence, determines how many kernel iterations each thread loops over for each internal call. Fig. 1 shows an example of the impact of changing the lws parameter on the execution of a 128-element vector addition (vecadd) in a simple 1-core, 2-warp, 4-thread (1c2w4t) GPU configuration; the plot provides the PC, the instruction thread mask, and warp issue information over time, for 4 different lws values. For better visualization, we tagged instruction addresses with different semantic sections of the code (shown as a waveform graph above every plot). Depending on the relationship between the lws mapping parameter, the global workload size gws (i.e., the total number of iterations of the kernel), and the hardware parallelism hp, defined in Eq. 1, there are 3 possible scenarios (gws=128 and hp=8 in the example):

  • lws < gws/hp: the software will spawn more warps than the hw can support. The execution will be scheduled at different timesteps with multiple kernel calls, cf. the uppermost “lws=1” scenario in Fig. 1.

  • lws = gws/hp: all warps will be loaded in parallel into the hardware with a single kernel call, cf. the “lws=16” scenario.

  • lws > gws/hp: all warps will be loaded in parallel into the hardware, yet with reduced hardware utilization, cf. the “lws=32/64” scenarios.
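The three scenarios can be sketched as a small classifier. This is an illustrative Python sketch, not part of the Vortex runtime; the names `Scenario` and `classify_mapping` are hypothetical. It compares lws × hp with gws, which is equivalent to comparing lws with gws/hp but avoids fractional division.

```python
from enum import Enum


class Scenario(Enum):
    """Hypothetical labels for the three lws scenarios described in the text."""
    OVERSUBSCRIBED = "lws < gws/hp"   # more warps than the hardware supports
    BALANCED = "lws = gws/hp"         # single kernel call, full utilization
    UNDERUTILIZED = "lws > gws/hp"    # single kernel call, idle resources


def classify_mapping(lws: int, gws: int, hp: int) -> Scenario:
    """Classify an lws choice against the gws/hp threshold."""
    if lws <= 0 or gws <= 0 or hp <= 0:
        raise ValueError("lws, gws and hp must be positive integers")
    # lws * hp is the number of iterations covered by one fully loaded call.
    covered = lws * hp
    if covered < gws:
        return Scenario.OVERSUBSCRIBED
    if covered == gws:
        return Scenario.BALANCED
    return Scenario.UNDERUTILIZED


# The vecadd example from the text: gws=128 on a 1c2w4t configuration (hp=8).
for lws in (1, 16, 32, 64):
    print(f"lws={lws}: {classify_mapping(lws, gws=128, hp=8).name}")
```

With the example values from the text, lws=1 falls in the first scenario, lws=16 in the second, and lws=32 and lws=64 in the third.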

The optimal lws value is, hence, both hardware and algorithm dependent, and can be determined as:

$$lws = \frac{gws}{hp}, \quad \text{with } hp = cores \times warps \times threads \qquad (1)$$

This value can be evaluated at runtime based on the hardware properties and the workload size, without being explicitly specified by the programmer.
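As a minimal sketch of such a runtime evaluation, the Python below derives hp from a configuration string such as 1c2w4t and applies Eq. 1. The helper names (`parse_config`, `hardware_parallelism`, `optimal_lws`) are hypothetical and do not correspond to Vortex runtime functions. The text does not state how a non-integer gws/hp is rounded; rounding up is an assumption made here, chosen so that a single kernel call still covers the whole workload.

```python
import re
from typing import Tuple


def parse_config(name: str) -> Tuple[int, int, int]:
    """Parse a configuration label such as '1c2w4t' into (cores, warps, threads)."""
    match = re.fullmatch(r"(\d+)c(\d+)w(\d+)t", name)
    if match is None:
        raise ValueError(f"unrecognized configuration label: {name!r}")
    cores, warps, threads = (int(group) for group in match.groups())
    return cores, warps, threads


def hardware_parallelism(cores: int, warps: int, threads: int) -> int:
    """hp = cores x warps x threads, as defined in Eq. 1."""
    return cores * warps * threads


def optimal_lws(gws: int, hp: int) -> int:
    """Evaluate Eq. 1, lws = gws / hp.

    Assumption (not stated in the text): non-integer results are rounded up.
    With this choice, hp > gws yields lws = 1.
    """
    if gws <= 0 or hp <= 0:
        raise ValueError("gws and hp must be positive integers")
    return -(-gws // hp)  # ceiling division on integers


# The vecadd example from the text: 128 elements on 1c2w4t.
hp = hardware_parallelism(*parse_config("1c2w4t"))
print(hp, optimal_lws(128, hp))
```

For the vecadd example this gives hp=8 and lws=16, the second scenario listed above.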

3 Validation

To validate our observations, we analyzed the execution on 450 different hardware GPU configurations, spanning from 1 core, 2 warps, and 2 threads (1c2w2t) to 64c32w32t, running stand-alone math kernels and combined ones for DNN and GCN layers. We compared our mapping, obtained with Eq. 1, with a naive one (lws=1, i.e., never unrolling the kernel temporally over one thread) and a fixed one (lws=32).
Fig. 2 compares the resulting number of execution cycles, plotting the ratio between other mappings and ours. Our technique shows an average 1.3× and 3.7× performance boost for the math kernels over the lws=1 mapping and the lws=32 mapping, respectively.

From the plots, we can observe that, across different hardware configurations, running the kernel with the same fixed lws results in large performance variability: from optimal to up to 20× slower. This shows that our mapping, being aware of both hardware and software, adapts to and benefits a wide range of kernels. Note that in a few specific hardware configurations, spawning more or fewer warps can bring small benefits to the execution (because of, e.g., reduced overhead or improved memory bandwidth utilization). This is visible in the plot, as some distribution cut-offs are slightly below the bold, red line at 1. Also, when the hardware parallelism hp exceeds the gws of the executed kernel, Eq. 1 resolves to lws=1. This justifies the peaks around 0 on the left, yellow side of the violin plots. Finally, the Gaussian blur filter, the near-neighbor search, and the GCN aggregation kernels show atypical trends; we will explore the reasons in future work.
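The per-kernel statistics reported alongside Fig. 2 (average ratio, worst ratio, and share of configurations with a ratio below 1) could be computed along the following lines. This is a hedged sketch: the function name `summarize_ratios` is hypothetical, the cycle counts are made-up placeholders rather than measured data, and reading "worst" as the smallest ratio (the configuration least favorable to our mapping) is an assumption.

```python
from statistics import mean
from typing import Dict, Sequence


def summarize_ratios(cycles_other: Sequence[float],
                     cycles_ours: Sequence[float]) -> Dict[str, float]:
    """Summarize the latency ratio (other mapping / our mapping) per configuration.

    A ratio above 1 means the other mapping is slower than ours.
    """
    if len(cycles_other) != len(cycles_ours) or not cycles_ours:
        raise ValueError("need two non-empty sequences of equal length")
    ratios = [other / ours for other, ours in zip(cycles_other, cycles_ours)]
    below_one = sum(1 for ratio in ratios if ratio < 1.0)
    return {
        "average": mean(ratios),
        # Assumption: the worst result is the smallest ratio.
        "worst": min(ratios),
        "below_one_pct": 100.0 * below_one / len(ratios),
    }


# Placeholder cycle counts for four hypothetical configurations (not measured data).
fixed_lws_cycles = [1200, 900, 4000, 950]
our_lws_cycles = [1000, 900, 1000, 1000]
print(summarize_ratios(fixed_lws_cycles, our_lws_cycles))
```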

4 Conclusions

In this work, we analyzed the software-to-hardware mapping flow on Vortex through execution traces, showing a method to optimize the lws parameter at runtime and abstract its hardware impact away from the programmer. We validated the approach on diverse math kernels and ML layers. Other factors clearly still impact the runtime kernel execution in Vortex. Going further, these will be analyzed in more depth, to improve the end-to-end execution of neural networks from a combined software and hardware point of view.

Acknowledgments

Project funded by the European Research Council (ERC) grant No. 101088865, EU H2020 grant No. 101070374, the Flanders AI Research Program, and KU Leuven.

References

  • [1] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
  • [2] T. N. Kipf and M. Welling, “Semi-supervised classification with graph convolutional networks,” in International Conference on Learning Representations (ICLR), 2017.
  • [3] The Khronos Group Inc., “OpenCL 2.1 API specification,” p. 242, 2017.
  • [4] B. Tine, S. Lee, J. Vetter, and H. Kim, “Bringing OpenCL to commodity RISC-V CPUs,” in Workshop on RISC-V for Computer Architecture Research (CARRV), 2021.
  • [5] B. Tine, K. P. Yalamarthy, F. Elsabbagh, and H. Kim, “Vortex: Extending the RISC-V ISA for GPGPU and 3D-graphics,” in MICRO-54, 2021.