As CUDA has become the de facto programming language for data-parallel applications such as high-performance computing and machine learning, running CUDA on other platforms has become a compelling option. Although several efforts have attempted to support CUDA on devices other than NVIDIA GPUs, due to the extra translation steps, their support always lags a few years behind CUDA's latest features. In particular, the new CUDA programming model exposes the warp concept in the programming language, which greatly changes how CUDA code should be mapped to CPU programs. In this article, hierarchical collapsing, which correctly supports CUDA warp-level functions on CPUs, is proposed. To verify hierarchical collapsing, we build a framework, COX, that supports executing CUDA source code on the CPU backend. With hierarchical collapsing, 90% of the kernels in the CUDA SDK samples can be executed on CPUs, much higher than in previous works (68%). We also evaluate performance with benchmarks of real applications and show that hierarchical collapsing generally generates CPU programs with comparable or even higher performance than previous projects.
1 Introduction
The high computing power of GPUs has fostered a strong software ecosystem of GPU programs. Although there are other choices for GPU programming, such as the Heterogeneous-compute Interface for Portability (HIP) [4], OpenMP, and DPC++ [19], in recent years CUDA has remained dominant. In the deep learning realm, as concluded in Reference [43], before 2014 all popular frameworks (e.g., Caffe, TensorFlow, Torch, and MXNet) supported only CUDA for the GPU backend. Although these frameworks have now started to support non-NVIDIA GPUs, due to NVIDIA's high share of the GPU market, most researchers still tend to develop and release GPU programs written in CUDA [7]. In the multimedia realm, a survey of video editing applications [30] shows that CUDA is supported by 14 of the 17 applications, while OpenCL is supported by only 9 of them.
Unfortunately, despite the popularity of CUDA programs, NVIDIA GPUs remain the main hardware platform for running them. Although there have been several efforts to run CUDA on non-NVIDIA GPUs, they lack support for newer CUDA features. There are several challenges in running CUDA on other platforms. One of the biggest is converting Single Program Multiple Data (SPMD) programs into programs for architectures that are not SPMD-friendly. In the SPMD programming model, the same kernel is executed by many threads at runtime, and the GPU is built for throughput-oriented processing by supporting many threads (or warps). However, other architectures often do not have that many threads, so CUDA kernels need to be converted efficiently to use fewer threads. Supporting CUDA with fewer threads is an essential component of running CUDA on X86, ARM, RISC-V, or Intel GPUs, which have fewer hardware threads than NVIDIA GPUs. We call this mechanism the SPMD-to-MPMD (Multiple Program Multiple Data) transformation.
Several existing projects support running GPU programs on CPUs [1, 6, 11, 12, 13, 15, 19, 21, 23, 26, 37, 53, 55]. Most apply compiler-level transformations to translate GPU programs into a form suitable for CPUs. These projects use the same granularity of transformation: a CPU thread is responsible for executing all the workload within a CUDA block (or OpenCL work-group). This CUDA-block-to-CPU-thread mapping is considered optimal based on the following observations. First, the transformed programs require only a few CPU threads (one per GPU block of the original CUDA program); fewer CPU threads lead to less context switching. Second, in CUDA programs, memory accesses within a CUDA block typically have good locality, so wrapping these accesses into a CPU thread may also bring spatial/temporal locality for CPU caches [54]. Last, threads within a block have similar computation patterns, which makes them amenable to optimization with Single Instruction Multiple Data (SIMD) instructions [24, 26] on modern CPU architectures, with further optimizations in [25, 31, 34, 35, 36, 46, 47]. Other projects apply optimizations at the hardware level [11] or system level [15] to support executing GPU programs on CPUs; these are beyond the scope of this article.
The transformation is shown in Figure 1(b). For a given GPU kernel, the translator first splits it into regions according to synchronization instructions. Each region is then wrapped with a for-loop whose iteration count equals the GPU block size. As this transformation uses single-layer for-loops to wrap each region, we name it flat collapsing. Because flat collapsing wraps the original program with single-layer loops, all threads within the block are mapped to iterations of the for-loops. After mapping, there is no difference among threads within a block except that they have different values of the loop iteration index. This transformation was proposed based on early CUDA programming models and cannot easily support several important features of recent CUDA programming models. One significant change is the warp-level programming model in CUDA. This new feature is critical for achieving high performance. One of its most important consequences is that threads within a CUDA warp should no longer be treated the same as threads in other CUDA warps. Thus, in this article, hierarchical collapsing is proposed to support these warp-level CUDA features on CPUs. Throughout the article, the focus is on the CUDA programming model; however, the same techniques also apply to other SPMD programming models (e.g., OpenCL [32], HIP [4], and DPC++ [52]).
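As a concrete illustration, the effect of flat collapsing on a hypothetical two-region kernel can be sketched in plain C++ (the kernel body and names below are illustrative, not taken from Figure 1):

```cpp
#include <cassert>
#include <vector>

// Hypothetical GPU kernel, split into two regions by __syncthreads():
//   statements-A:  tmp[threadIdx.x] = in[threadIdx.x] * 2.0f;
//   __syncthreads();
//   statements-B:  out[threadIdx.x] = tmp[0] + tmp[threadIdx.x];
// Flat collapsing wraps each region in one single-layer loop over the
// block size and replaces threadIdx.x with the loop variable tx.
void kernel_cpu(float* out, const float* in, int b_size) {
  std::vector<float> tmp(b_size);        // shared memory becomes a per-block array
  for (int tx = 0; tx < b_size; ++tx)    // loop for the first region (statements-A)
    tmp[tx] = in[tx] * 2.0f;
  // the __syncthreads() barrier separates the two generated loops
  for (int tx = 0; tx < b_size; ++tx)    // loop for the second region (statements-B)
    out[tx] = tmp[0] + tmp[tx];
}
```

Because the barrier sits between the two loops, every simulated thread finishes statements-A before any simulated thread starts statements-B, matching the GPU semantics.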
Fig. 1.
Fig. 1. The programming model for input CUDA SPMD and output CPU programs by flat collapsing and hierarchical collapsing .
The main contributions of this article are as follows:
•
Propose hierarchical collapsing for the SPMD-to-MPMD transformation and implement it with LLVM passes. The new transformation ensures correctness for GPU programs that use warp-level functions;
•
Extend the Parallel Region concept [23] into the Hierarchical Parallel Region to provide correct translation when GPU programs contain warp-level functions;
•
Evaluate the SPMD-to-MPMD transformation on real applications and compare the coverage and performance with existing frameworks.
To support executing CUDA on CPUs, besides the SPMD-to-MPMD transformation, there is another technical issue that needs to be solved. CUDA provides runtime APIs (e.g., cudaMalloc and kernel launch) to communicate between the host and the device; these APIs must be replaced or re-implemented for the CPU backend. This CUDA runtime migration is beyond the scope of this article, and we only briefly discuss it in Section 6.
To verify the hierarchical collapsing, we build COX, a framework that supports executing CUDA source code on CPU backends. COX integrates the hierarchical collapsing proposed in this article and also implements a runtime system to use pthreads to execute the transformed kernels.
2 Background
2.1 Running SPMD Programming Models on CPUs
The basic mechanism to support SPMD programming models on CPUs is to map a GPU block/work-group to a CPU thread and iterate over the threads of the block/work-group using loops [12, 23, 25, 26, 55]. Figure 1(a) and (b) show the input and output of this basic process. The transformation has different names in different projects: microthreading [53], thread aggregation [58], thread-fusion [12], region-based serialization [54], loop chunking [51], and kernel serialization [8]. In this article, this transformation is called flat collapsing, as it uses single-layer loops to represent all threads within a block.
When a GPU program contains synchronization primitives (e.g., __syncthreads()), the code before/after these instructions should be wrapped in different loops. Below are some of the terminologies used in References [23, 26, 55] to support flat collapsing in general cases:
•
Parallel Region (PR): A Region is a group of consecutive instructions. A Parallel Region is a group of instructions that must be executed by all threads within a block before proceeding to the next region. In CUDA programs, the instructions before/after __syncthreads belong to different PRs. For example, in Figure 1, statements-A and statements-B form two PRs.
•
Implicit barrier: Unlike explicit barriers declared by programmers (e.g., __syncthreads), flat collapsing inserts implicit barriers that are necessary to define the Parallel Regions. For simple functions that do not contain barriers (e.g., __syncthreads) inside conditional statements, implicit barriers are not needed. However, when barriers are present in conditional statements, the situation becomes more complicated. For example, to transform a CUDA kernel that has a barrier within an if–then construct, COX needs to insert implicit barriers into the original program so the correct PRs can be formed in subsequent steps.
The definition of the Parallel Region in previous flat collapsing projects cannot support warp-level features. Flat collapsing generates a single-layer loop for each PR to simulate all threads within a block; this coarse-grain simulation cannot distinguish threads in different CUDA warps. In this article, an extended definition, the hierarchical Parallel Region, is proposed and used to support warp-level features (Section 5.5).
2.2 Warp-level Programming Features
2.2.1 Warp-level Collectives.
CUDA provides a list of warp-level collective functions [39], which are necessary for high-performance reduction. Two of these functions that are commonly used in existing benchmarks are introduced below.
•
Warp shuffle: In early CUDA versions, although most GPUs had local memory and global memory to support data exchange between threads, there was no efficient way to exchange data within a warp. To efficiently exchange data stored in registers, CUDA provides a series of warp shuffle instructions. When a warp shuffle instruction is invoked, a thread sends its local data to another thread in the same warp.
•
Warp vote: Beyond exchanging data among threads within a warp, warp vote instructions can directly perform logical reductions (e.g., all, any) on local variables, controlled by the mask argument. These features, used together with warp shuffle and cooperative groups, are necessary to implement high-performance reduction kernels.
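The semantics of these collectives, as described above, can be modeled sequentially in C++ (a behavioral sketch only; the edge-of-warp handling in the shuffle model, where lanes past the end keep their own value, is an assumption of this sketch, and real GPUs execute these operations in hardware):

```cpp
#include <cassert>
#include <cstdint>

// Model of a warp vote "all": logical AND-reduction of pred across the
// lanes selected by mask (warp size assumed to be 32).
bool warp_all_model(uint32_t mask, const int pred[32]) {
  for (int lane = 0; lane < 32; ++lane)
    if ((mask >> lane) & 1u)
      if (!pred[lane]) return false;  // any selected lane voting 0 fails "all"
  return true;
}

// Model of a shuffle-down: each lane receives the value held by
// lane + delta; in this model, lanes past the end keep their own value.
void warp_shfl_down_model(const int in[32], int out[32], int delta) {
  for (int lane = 0; lane < 32; ++lane)
    out[lane] = (lane + delta < 32) ? in[lane + delta] : in[lane];
}
```

Emulating exactly this per-lane behavior is what the CPU translation in Section 5.2 has to reproduce with loops and thread-local buffers.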
2.2.2 Cooperative Group.
In early CUDA programming models, there are only two explicit groups of threads: the block and the grid. Users cannot easily organize a small group of threads within a block as a sub-group. NVIDIA introduced a new concept, the cooperative group, in CUDA 9.0. The corresponding instructions allow users to group threads within a block, and the group can further be used for data exchange. There are two kinds of grouping strategies: static and dynamic. For static grouping, whether a thread belongs to a group is known at compile time (e.g., group the threads with indices 0 and 1), while for dynamic grouping it is only known at runtime (e.g., group all activated threads).
3 Motivation
Section 2.1 introduces the concepts of the PR and the implicit barrier. This section discusses, with examples, the limitations of these concepts and how to extend them. Assume the input CUDA kernel shown in Code 1, with block size b_size. The code accumulates the variable val within the first warp and stores the accumulated value in the first thread of that warp.
Code 1.
Code 1. Input GPU reduction kernel.
The warp shuffle function (shfl_down_sync) contains a warp-level barrier: each thread within a warp has to first calculate the value of val; after all threads in the warp complete the calculation, they invoke shfl_down_sync at the same time (see Section 5.2 for details). In Code 1, this warp-level barrier is inside a branch of an if–then construct (if(threadIdx.x < 32)); thus, only the first warp will reach the warp-level barrier, assuming the warp size of 32 used in NVIDIA GPUs.
With flat collapsing, Code 1 would become Code 2. Here are the steps:
•
Split the code into PRs according to the barriers and wrap each PR with a single-layer for-loop. In Code 2, as shfl_down_sync contains a barrier, three PRs are separately wrapped with single-layer for-loops at line 2, line 5, and line 8 (note that the loop at line 4 comes from the source input and is not generated by flat collapsing);
•
Replace uses of threadIdx.x with the loop iteration variable tx;
•
Replicate variables (e.g., val) that are used by more than one PR into array form.
Unfortunately, these steps are insufficient to generate a correct CPU program for this GPU kernel. The key reason is that shfl_down_sync contains a warp-level barrier, which cannot easily be represented by single-layer loops.
For NVIDIA GPUs with compute capability lower than 3.0, there are only block-level barriers, so it can be safely assumed that, for each barrier instruction, all or none of the threads within the block reach the barrier. Thus, flat collapsing wraps the PRs before/after the barrier with single-layer loops of b_size iterations. In Code 2, these generated loops are located in the branch of an if–then construct. With this transformation, if it is known at runtime that the barrier will be reached, then the control flow reaches this branch and executes the generated for-loops. In other words, flat collapsing assumes that all threads within the BLOCK always reach the same barriers. However, in Code 1, only the threads within the first WARP should reach the warp barrier, which violates the assumption of flat collapsing. To get the correct result, the transformer needs to insert extra instructions into line 6 and line 9 of Code 2 (shown in gray).
These transformations are quite complicated even for the demo example, not to mention implementing them in compilers for all possible CFGs (control flow graphs). The source of the complexity is that flat collapsing uses single-layer loops to represent all threads within a block, which cannot easily represent the CUDA warp concept. Based on the above analysis, hierarchical collapsing is proposed to produce Code 3. The concept is also illustrated in Figure 1(c). Code 3 has two nested loops: the outer loop with induction variable wid (line 4) operates at the block level and is called the inter-warp loop, while the inner loops with induction variable tx (lines 5, 7, 12, and 14) operate at the warp level and are called intra-warp loops. The inserted loops are shown in gray text in Code 3.
For flat collapsing, to control the different control flows of different warps, if–then constructs must be inserted into the transformed code. For hierarchical collapsing, however, only a simple loop peeling (line 10 in Code 3) is needed. Owing to this low complexity, hierarchical collapsing can easily be implemented and integrated into compilers. In COX, hierarchical collapsing is implemented as a new LLVM pass that automatically transforms the GPU kernels.
Code 2.
Code 2. CPU warp shuffle program generated by flat collapsing.
In the given example, although only flag[0] is used, the transformed program also calculates the other elements of flag. This is because these calculations may have side effects; to guarantee correctness, they have to be executed even if their outputs are not needed.
Code 3.
Code 3. CPU warp shuffle program by hierarchical collapsing.
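A hedged sketch of the nested inter/intra-warp loop structure, written as plain C++ and applied to the reduction of Code 1, is shown below (function and variable names are illustrative; the in-place update is safe here only because each shuffle step reads and writes disjoint lane ranges, whereas the general translation buffers values as in Section 5.2):

```cpp
#include <cassert>
#include <vector>

// Warp size 32; b_size is assumed to be a multiple of 32.
int reduce_first_warp(const std::vector<int>& input, int b_size) {
  std::vector<int> val(input);                   // replicated local variable, one slot per thread
  for (int wid = 0; wid < b_size / 32; ++wid) {  // inter-warp loop
    if (wid * 32 < 32) {                         // loop peeling: lane 0 of the warp decides for all lanes
      for (int offset = 16; offset > 0; offset /= 2)
        for (int tx = 0; tx < 32; ++tx)          // intra-warp loop per shuffle step
          if (tx + offset < 32)
            val[wid * 32 + tx] += val[wid * 32 + tx + offset];  // shuffle-down + add
    }
  }
  return val[0];                                 // accumulated value of the first warp
}
```

Only the first warp enters the peeled branch, mirroring the if(threadIdx.x < 32) guard in Code 1, while every other warp skips the warp-level barrier entirely.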
4 COX
We build a framework, COX, to support executing CUDA source code on X86 with high performance. An overview of COX is shown in Figure 2. In the first step, the CUDA kernel code is compiled to Intermediate Representation (NVVM IR) by Clang. After the SPMD-to-MPMD transformation (hierarchical collapsing), the NVVM IR is translated to LLVM IR in MPMD form, and the LLVM backend emits an object file. The object file is linked with built-in libraries and migrated host code to generate executable files for X86. In this article, we mainly discuss the kernel SPMD-to-MPMD transformation. Although COX is implemented for X86, it can easily be migrated to other architectures.
Fig. 2.
Fig. 2. The overview of COX framework. In this article, we only discuss the CUDA kernel transformation.
5 Kernel Transformation
5.1 Overview of Kernel Transformation
In this section, the details of applying hierarchical collapsing to a simple CUDA function (Code 4) are introduced. This function contains a CUDA warp vote function. First, each thread calculates its local variable (tx%2). Then, threads within a warp combine their local variables with the all operation: if all local variables are non-zero, the reduction returns true; otherwise, it returns false.
Figure 3 shows the pipeline of applying the hierarchical collapsing transformation to Code 4. First, warp-level functions are replaced with built-in functions defined in the runtime library, as shown in Step 1 (Section 5.2). In Step 2, the implicit barriers are identified and inserted (Section 5.3). Then, in Steps 3 to 5, hierarchical Parallel Regions are identified, and intra/inter-warp loops are generated accordingly (Section 5.4). The generated LLVM IR is compiled and linked with host programs and runtime libraries to generate a CPU-executable file.
Fig. 3.
Fig. 3. Steps of NVVM to LLVM-IR transformer.
Code 4.
Code 4. CUDA Warp Vote example.
5.2 Support Warp-level Functions
In GPU architectures, when threads within a warp invoke warp functions, the GPU performs internal communication to accumulate and/or exchange the local variables within the warp. To support these features on CPUs, the accumulation and communication need to be performed explicitly. The implementation of the warp-vote function on the CPU backend is shown in Code 5. In this section, we introduce the details of this transformation.
At initialization, an array warp_vote of length 32 is allocated. The warp_vote array should be stored in CPU thread-local memory, as a CPU thread is used to simulate a GPU block. A GPU warp vote instruction is translated into the following CPU instructions: for the threads within a warp, each thread first stores its local variable into a unique position in warp_vote; after all the elements are set, the result of the warp vote can easily be computed. The function warp_all is defined in a runtime library. To utilize the computation resources of X86, warp_all is implemented with AVX instructions; the benefits of AVX are evaluated in Section 7.2.3. Warp shuffle is supported in a similar way.
Code 5.
Code 5. Implement CUDA Warp Vote on CPU.
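A scalar stand-in for this translation is sketched below (COX implements warp_all with AVX; a portable loop is shown instead, and the function names are illustrative):

```cpp
#include <array>
#include <cassert>

// Each simulated lane writes its predicate into a thread-local slot; the
// buffer is thread-local because one CPU thread simulates one GPU block.
static thread_local std::array<int, 32> warp_vote;

// Scalar replacement for COX's AVX warp_all: AND-reduce the 32 slots.
bool warp_all_cpu() {
  for (int lane = 0; lane < 32; ++lane)
    if (!warp_vote[lane]) return false;  // any zero vote fails "all"
  return true;
}

// The intra-warp loop produced by hierarchical collapsing fills the slots,
// then the vote is reduced once the warp barrier is reached.
bool vote_all_example(const int pred[32]) {
  for (int tx = 0; tx < 32; ++tx)        // intra-warp loop
    warp_vote[tx] = pred[tx];
  return warp_all_cpu();
}
```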
5.3 Insert Implicit Barriers
In Steps 3, 4, and 5, hierarchical collapsing needs barrier information to identify the Parallel Regions and generate intra/inter-warp loops. Thus, it is important to insert implicit barriers, which do not appear in the input GPU code but are necessary for identifying the Parallel Regions. Implicit barriers originate from barriers in conditional statements. An example is shown in Figure 4. The input CUDA kernel (Figure 4(a)) has a barrier in a for-loop construct. As there is a barrier in the conditional statement, implicit barriers are inserted (Figure 4(b)) to guide the generation of intra/inter-warp loops in future steps. Finally, according to the barriers, two pairs of nested for-loops are generated (Figure 4(c)). Note that all transformations in COX are done at the LLVM IR level; this source-level example is only used for explanation.
Fig. 4.
Fig. 4. An example of implicit barriers needed for identifying PRs.
To make hierarchical collapsing work, extra block-level barriers are inserted at the beginning of the entry block and at the end of the exit block. The remaining implicit barriers are needed when there are barriers inside conditional statements (e.g., if–then constructs, for-loop constructs). As CUDA is a high-level, flexible language, even a single concept can generate quite different CFGs. For example, the loop concept can be implemented by different CFGs, such as a do-while-loop, while-loop, or for-loop. Fortunately, with existing LLVM transformations, COX can convert the input CFGs to canonical formats, and we only need to focus on these canonical formats in the discussion below. Here we list some important features of the canonical format:
•
Each branch instruction has only two successors; most input CFGs already have this feature except the CFGs that use switch-case constructs.
•
All loops are in canonical format such that (1) they all have pre-headers; (2) each loop has only a single latch, in other words, a single backedge; and (3) the loop headers dominate all exit blocks.
COX invokes the LLVM loopSimplify and lowerSwitch passes to obtain canonical-format CFGs. After the transformation to canonical format, the only two kinds of conditional statements remaining in the CFGs are if–then constructs and for-loop constructs.
Barriers in if–then construct. The CFG of a classical if–then construct is shown on the left side of Figure 5(a). There is a barrier in the block if.body. According to Reference [41], for a block/warp barrier, none or all of the threads within the block/warp reach the barrier. Thus, COX can safely apply loop peeling to the CFG, using the branch directions of the first thread as the branch directions of all other threads. In other words, COX peels off the first thread to evaluate the branch direction, and the rest of the threads within the warp/block simply follow the same direction (see Code 3 for a loop-peeling example). The result after inserting implicit barriers and splitting blocks is shown on the right side of Figure 5(a). Several details are worth mentioning:
Fig. 5.
Fig. 5. After transformation, the inserted barriers are shown in bold. The PRs identified in further step are also shown in the figure.
•
insert implicit barriers with the same type as the barrier in if.body. In the example, there is a warp barrier in if.body; thus, COX also inserts warp barriers as implicit barriers;
•
after transformation, all blocks will be wrapped with intra-warp loops, except if.cond, which is used for loop peeling;
•
if.cond should contain only a single conditional-branch instruction and it should not have any side effects. In Figure 5(a), all computation instructions are put into if.head, which are executed b_size times, as the original GPU program does.
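The loop-peeling pattern described by these rules can be sketched at the source level as follows (an illustrative kernel body, not COX's IR-level output; the snapshot of data[0] models the visibility guarantee of the barrier):

```cpp
#include <cassert>
#include <vector>

// if.head runs for every simulated thread; if.cond holds only the branch
// and is evaluated once with tx = 0; the remaining simulated threads
// follow the same direction, which is safe under the all-or-none barrier
// assumption.
void peeled_if(std::vector<int>& data, int limit) {
  int b_size = (int)data.size();
  std::vector<int> local(b_size);
  for (int tx = 0; tx < b_size; ++tx)      // if.head: every thread's computation
    local[tx] = data[tx] + 1;
  if (local[0] < limit) {                  // if.cond: peeled, evaluated for tx = 0 only
    for (int tx = 0; tx < b_size; ++tx)    // PR before the barrier in if.body
      data[tx] = local[tx] * 2;
    int base = data[0];                    // barrier: all prior writes are now visible
    for (int tx = 0; tx < b_size; ++tx)    // PR after the barrier in if.body
      data[tx] += base;
  }
}
```

Note that if.cond performs no computation of its own, matching the rule that it contains only a single conditional branch with no side effects.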
The detailed algorithm for inserting implicit barriers derived from barriers in an if–then construct is described in Algorithm 1. COX has to do some additional checking to avoid infinite loops caused by for-loop constructs in CFGs. For simplicity, this checking part is not shown in Algorithm 1.
Barriers in for-loop construct. The canonical loops contain single latch blocks and loop headers that dominate all exit blocks. COX inserts implicit barriers before/after the branch instructions (the back edge of the loop). Figure 5(b) shows an example of inserting implicit barriers for a for-loop construct that contains a block barrier. Just as with the if–then construct, all of these inserted implicit barriers (shown in bold in the figure) should have the same type as the barrier in for.header (a block barrier in this example).
5.4 Split Blocks before/after Each Barrier
The instructions before/after a barrier should be wrapped with different intra/inter-warp loops. Thus, in this step, COX splits the blocks that contain barriers. See Step 3 in Figure 3 for an example.
5.5 Hierarchical Parallel Region
After inserting the implicit warp/block barriers, loops are generated to wrap the instructions between barriers. The instructions between two consecutive barriers are called a Parallel Region and should be wrapped into the same loops. Compared with flat collapsing, which only uses single-layer loops to wrap PRs, hierarchical collapsing uses nested loops: inter-warp loops and intra-warp loops. Thus, two kinds of Parallel Regions are required: warp-level Parallel Regions and block-level Parallel Regions. These two kinds of PRs are wrapped with intra- and inter-warp loops, respectively. Since a warp-level PR is always a subset of a block-level PR (a GPU warp is always within a GPU block), the new concept is called the Hierarchical Parallel Region. An example of a Hierarchical Parallel Region is shown in Figure 6. The remaining steps find the block/warp-level PRs and wrap them with inter/intra-warp loops. Algorithm 2 is used to find the set of warp-level PRs. The algorithm for finding block-level PRs is very similar, except that it concerns only block barriers. COX has to find the PRs for the warp level and block level sequentially: COX first finds all warp-level PRs and generates intra-warp loops to wrap them; then, COX finds the block-level PRs in the new CFG and wraps them with inter-warp loops.
Fig. 6.
Fig. 6. Due to the warp barrier in block1, there are two warp-level PRs ({ block1},{block2}) and a single block-level PR ({block1, block2}).
5.6 Wrap PR with For-Loop
In this step, COX wraps the warp/block-level PRs with intra/inter-warp loops (Figure 3(e) and (f)). Although this step is quite straightforward, it requires a correctness argument: after inserting the intra/inter-warp loops, each instruction from the input GPU kernel is executed b_size times (where b_size is the block size), except for the instructions used for loop peeling.
Finally, after generating the intra/inter-warp loops, some local variables need to be replicated: local variables that are used in several warp-level PRs but only in a single block-level PR are replicated as arrays of length 32, while local variables that are used across different block-level PRs are replicated as arrays whose length equals the block size.
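The replication rule can be sketched as a simple data-layout decision (the struct and field names are illustrative):

```cpp
#include <cassert>
#include <vector>

// A variable live across warp-level PRs (but within one block-level PR)
// needs one slot per lane: length 32. A variable live across block-level
// PRs needs one slot per thread in the block: length b_size.
struct ReplicatedLocals {
  int warp_scope[32];              // crosses warp-level PRs only
  std::vector<int> block_scope;    // crosses block-level PRs
  explicit ReplicatedLocals(int b_size) : warp_scope{}, block_scope(b_size) {}
};
```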
5.7 Limitation of Hierarchical Collapsing
Both flat collapsing and hierarchical collapsing are compile-time transformations; thus, only static features are addressed. The latest CUDA version supports several dynamic features. For example, for warp-level collective operations, users can assign only a sub-group of threads within the warp to perform the collective operation, where the sub-group is specified by a mask argument at runtime. For the cooperative group, users can also group all activated threads into a group at runtime. Neither flat collapsing nor hierarchical collapsing can easily support these dynamic features. However, although these dynamic features provide more flexibility, they can be harmful to performance, as they may incur warp divergence. Thus, most high-performance implementations [33, 40] use static warp-level collectives and cooperative groups to avoid warp divergence. In this article, only these static use cases are of concern. For the same reason, only aligned barriers [38] are considered; in other words, for a block/warp barrier, it is assumed that all or none of the threads within a block/warp reach the barrier.
6 Runtime System
Besides CUDA kernel programs, to execute CUDA programs on CPUs, CUDA host functions also have to be migrated. To support executing CUDA host programs on CPUs, the CUDA runtime APIs, which involve memory management and kernel execution, need to be implemented on CPUs. As both host and device are CPUs, CUDA malloc and memcpy are replaced by C malloc and memcpy. COX uses pthread fork/join to replace kernel launch/synchronization in CUDA. There are several potential optimizations for the runtime system, such as using a thread pool instead of fork/join for kernel launch and synchronization, or using regular expressions or LLVM transformations to automatically generate COX host programs from CUDA source code. These optimizations are beyond the scope of this article and are open for future research.
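A minimal fork/join launch in the spirit of this runtime can be sketched as follows (COX uses pthreads; std::thread is used here for brevity, and the names are illustrative):

```cpp
#include <cassert>
#include <functional>
#include <thread>
#include <vector>

// One CPU thread runs the transformed (MPMD) kernel for each GPU block;
// joining the threads plays the role of kernel synchronization. A thread
// pool, the optimization mentioned above, would reuse workers across
// launches instead of forking per launch.
void launch_kernel(const std::function<void(int)>& block_fn, int grid_size) {
  std::vector<std::thread> workers;
  workers.reserve(grid_size);
  for (int bid = 0; bid < grid_size; ++bid)  // fork: one worker per block
    workers.emplace_back(block_fn, bid);
  for (auto& t : workers)                    // join: wait for all blocks
    t.join();
}
```

Here block_fn stands for the MPMD function produced by the kernel transformation, with the block index passed in place of blockIdx.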
7 Evaluation
In this section, we try to answer the following questions:
•
Can hierarchical collapsing successfully bring higher coverage? (Section 7.1)
•
How does the performance of hierarchical collapsing compare with flat collapsing on kernels that do not use warp-level features? (Section 7.2.1)
•
What techniques can be used to generate high-performance CPU programs? (Sections 7.2.2–7.2.3)
•
What is the kernel performance compared with existing frameworks that also support CUDA on CPUs? (Section 7.3)
Although several frameworks support executing CUDA on CPUs, most of them were developed long ago and are no longer actively maintained [12, 18, 55]. We select two frameworks to compare against: HIP-CPU [5] and DPC++ [52]. HIP-CPU, developed by AMD, supports executing HIP programs on CPUs, while DPC++, developed by Intel, supports executing SYCL [3] on CPUs. To support CUDA programs, we use HIPIFY [6] and DPCT [19] to translate CUDA into HIP and SYCL, respectively. Detailed information about these frameworks is given in Section 8. The environment setting is described below:
We use the CUDA SDK samples to evaluate coverage for the different frameworks, and the Rodinia, Hetero-mark, and Crystal benchmarks to evaluate performance. Rodinia and Hetero-mark consist of classical HPC applications (e.g., Gaussian elimination, AES encryption). The Crystal benchmark is a GPU database benchmark that contains 13 high-performance implementations of SQL queries. The evaluation mainly focuses on the generated kernel functions for the CPU backend, not on the runtime system (the CPU multi-thread runtime system). However, for most CUDA kernels, the workloads are highly related to the input data values, which affect memory locality and branch behavior. Thus, to make our evaluation close to real cases, instead of using dummy input and evaluating the execution time of a single kernel launch and a single GPU block, we use the original inputs and runtime configurations (e.g., grid size, block size) provided by the benchmarks.
7.1 Coverage
In this section, we verify that using hierarchical collapsing results in higher coverage. Table 1 analyzes the examples in CUDA SDK samples 10.1 that do not require special hardware support (e.g., tensor cores, unified memory). As shown in the table, the existing frameworks can automatically support at most 21 kernels (coverage = 68%); the failing kernels use new CUDA features. COX supports 28 kernels (coverage = 90%).
Table 1.

Kernel name                 Features                  DPCT  HIPIFY  COX
initVectors                                           ✓     ✓       ✓
gpuDotProduct               warp cooperative group    ✗     ✗       ✓
gpuSpMV                                               ✓     ✓       ✓
r1_div_x                                              ✓     ✓       ✓
a_minus                                               ✓     ✓       ✓
gpuConjugateGradient        grid sync                 ✗     ✗       ✗
multigpuConjugateGradient   multi grid sync           ✗     ✗       ✗
MatrixMulCUDA                                         ✓     ✓       ✓
matrixMul                                             ✓     ✓       ✓
copyp2p                                               ✓     ✓       ✓
reduce0                     block cooperative group   ✓     ✓       ✓
reduce1                     block cooperative group   ✓     ✓       ✓
reduce2                     block cooperative group   ✓     ✓       ✓
reduce3                     block cooperative group   ✓     ✓       ✓
reduce4                     warp cooperative group    ✗     ✗       ✓
reduce5                     warp cooperative group    ✗     ✗       ✓
reduce6                     warp cooperative group    ✗     ✗       ✓
shfl_intimage_rows          warp shuffle              ✓     ✗       ✓
shfl_vertical_shfl          warp shuffle              ✓     ✗       ✓
shfl_scan_test              warp shuffle              ✗*    ✗       ✓
uniform_add                                           ✓     ✓       ✓
reduce                      warp cooperative group    ✗     ✗       ✓
reduceFinal                 warp cooperative group    ✗     ✗       ✓
simpleKernel                                          ✓     ✓       ✓
VoteAnyKernel1              warp vote                 ✓     ✗       ✓
VoteAllKernel2              warp vote                 ✓     ✗       ✓
VoteAnyKernel3              warp vote                 ✓     ✗       ✓
spinWhileLessThanone                                  ✓     ✓       ✓
matrixMultiplyKernel                                  ✓     ✓       ✓
vectorAdd                                             ✓     ✓       ✓
filter_arr                  activated thread sync     ✗     ✗       ✗
coverage                                              68%   52%     90%

Table 1. Coverage of COX Compared to Other Frameworks
The CUDA features supported by DPCT, HIPIFY, and COX are also shown in Figure 7.
Fig. 7.
Fig. 7. The CUDA features supported by HIPIFY, DPCT, and COX .
Although COX achieves the highest coverage by using hierarchical collapsing, three kernels still cannot be supported. gpuConjugateGradient and multigpuConjugateGradient rely on synchronization between different grids and devices, which utilize the grid cooperative group and the multi-grid cooperative group, respectively. filter_arr uses a dynamic cooperative group: it dynamically groups all activated threads. As discussed in Section 2.2.2, all of these features would have to be supported at the runtime level: frameworks should schedule threads accordingly at runtime, and each thread can only know whether it is activated during runtime. Supporting these runtime features is left for future work.
We also evaluate the three frameworks on the Rodinia, Hetero-mark, and Crystal benchmarks. On all of them, COX achieves higher coverage than the other frameworks. However, on Rodinia and Hetero-mark, the higher coverage is not the result of hierarchical collapsing, as none of these CUDA kernels use warp-level functions; rather, because DPCT and HIPIFY use source-to-source translation, they cannot handle complex C++ constructs (e.g., complex macros). Crystal contains 13 examples, three of which use warp-level features. COX supports all 13, while HIPIFY only supports the remaining 10 that do not use warp-level features. DPC++ does not support atomicCAS functions on CPUs and thus cannot support any of the Crystal applications. The evaluation results for these real benchmarks are reported in Section 7.3.
7.2 Performance on Classical Tasks
In this section, we evaluate and analyze performance on classical tasks, which are simple enough to analyze under different configurations.
7.2.1 Flat collapsing vs. hierarchical collapsing.
Compared with flat collapsing, hierarchical collapsing achieves higher coverage by using nested loops. However, for CUDA programs that do not use warp-level features, the nested loops are unnecessary. In this section, we compare the performance of flat collapsing and hierarchical collapsing on classical linear operations that do not use warp-level features (Figure 8). In all configurations, programs generated by hierarchical collapsing are slower than those generated by flat collapsing, because the nested loops involve more computation and instructions and also make the programs harder for compilers to optimize. Thus, COX applies hierarchical collapsing only to CUDA kernels that use warp-level features and applies flat collapsing to the rest.
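The difference between the two transformations can be sketched on a simple vector-add kernel. The sketch below is a simplified illustration, not COX's actual output; the function names and the 64-thread/32-lane configuration are chosen for the example:

```cpp
#include <cassert>
#include <vector>

constexpr int kBlockSize = 64;  // threads per block in this example
constexpr int kWarpSize = 32;   // CUDA warp width

// Flat collapsing: the kernel body c[tid] = a[tid] + b[tid] is wrapped in a
// single loop over every thread of the block.
void vector_add_flat(const float* a, const float* b, float* c) {
  for (int tid = 0; tid < kBlockSize; ++tid)
    c[tid] = a[tid] + b[tid];
}

// Hierarchical collapsing: an outer loop over warps and an inner loop over
// lanes. The warp boundary between the two loop levels is where warp-level
// functions can be inserted, but the extra nesting adds loop overhead and
// complicates compiler optimization for kernels that never need it.
void vector_add_hier(const float* a, const float* b, float* c) {
  for (int warp = 0; warp < kBlockSize / kWarpSize; ++warp)
    for (int lane = 0; lane < kWarpSize; ++lane) {
      int tid = warp * kWarpSize + lane;
      c[tid] = a[tid] + b[tid];
    }
}
```

Both loops compute the same result; the only difference is the loop structure that the later compiler passes must optimize.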
Fig. 8. The slowdown of programs generated by hierarchical collapsing compared with programs generated by flat collapsing.
7.2.2 Variable vs. Hardcoding.
COX supports two modes: The variable mode generates CPU programs that keep the CUDA runtime configurations (e.g., grid size, block size) as variables, while the hardcoding mode replaces these configuration variables with constants. Although the programs generated by both modes are forwarded to LLVM's optimizer with the \( -O3 \) flag, there is an obvious difference between them, especially when compiling complicated kernels. Figure 9 shows the speedup of hardcoding mode over variable mode. The two modes differ little for the VectorAdd kernel, which is simple enough to be vectorized by compiler optimizations even when the block size is not provided at compile time. For more complicated kernels, however, hardcoding mode generates programs with higher performance. The drawback is that hardcoding mode requires recompilation whenever the CUDA runtime configuration changes, which incurs extra compilation time. In the following sections, the variable mode is used for comparison with other frameworks.
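The effect of hardcoding can be illustrated with a small sketch (the function names and the block size of 256 are hypothetical): when the block size is a compile-time constant, the thread loop has a known trip count, so the compiler can unroll and vectorize it without the runtime checks and scalar epilogue needed for an unknown bound.

```cpp
#include <cassert>

// Variable mode: the block size arrives at runtime, so the trip count of the
// thread loop is unknown at compile time; the vectorizer must guard the loop
// and emit a scalar remainder for trip counts not divisible by the vector width.
float dot_variable(const float* a, const float* b, int block_size) {
  float sum = 0.0f;
  for (int tid = 0; tid < block_size; ++tid) sum += a[tid] * b[tid];
  return sum;
}

// Hardcoding mode: the same loop with the launch configuration baked in as a
// constant. The trip count (256) is known, so the compiler can fully unroll
// and vectorize -- at the cost of recompiling whenever the configuration changes.
constexpr int kBlockSize = 256;
float dot_hardcoded(const float* a, const float* b) {
  float sum = 0.0f;
  for (int tid = 0; tid < kBlockSize; ++tid) sum += a[tid] * b[tid];
  return sum;
}
```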
Fig. 9. Compared with variable mode, hardcoding mode generates faster programs.
7.2.3 SIMD Instructions.
For CPU programs, SIMD instructions are necessary to achieve high performance [2, 9, 24, 45]. Table 2 shows the warp vote function execution time with and without AVX. With AVX instructions, around 10\( \times \) speedup is achieved for both functions; the benefit comes from executing fewer instructions and branches.
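The instruction and branch counts in Table 2 are intuitive from how a warp vote lowers to CPU code. A scalar fallback loops and branches once per lane, whereas a vector implementation packs the 32 lane predicates into one bitmask (what an AVX movemask instruction does in a single step for its lanes) and reduces the vote to one integer comparison. The following is a scalar sketch of the mask-based idea, not COX's actual AVX implementation:

```cpp
#include <cassert>
#include <cstdint>

// Branch-per-lane fallback: one loop iteration and one branch per warp lane.
bool vote_any_scalar(const bool* pred) {
  for (int lane = 0; lane < 32; ++lane)
    if (pred[lane]) return true;
  return false;
}

// Mask-based version: pack the 32 lane predicates into a single 32-bit mask.
// With AVX this packing is done by a movemask instruction; here it is written
// as a plain loop for portability.
uint32_t pack_mask(const bool* pred) {
  uint32_t mask = 0;
  for (int lane = 0; lane < 32; ++lane)
    mask |= uint32_t(pred[lane]) << lane;
  return mask;
}

// Once the predicates live in a mask, each vote is one branch-free comparison.
bool vote_any_mask(uint32_t mask) { return mask != 0; }
bool vote_all_mask(uint32_t mask) { return mask == 0xFFFFFFFFu; }
```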
Table 2. Both Functions Gain around 10\( \times \) Speedup When Using AVX Instructions

Function | Metric       | w/ AVX        | w/o AVX
vote any | time (µs)    | 0.241         | 2.542
         | instructions | 1,447,901,852 | 23,472,339,251
         | branches     | 100,110,593   | 4,260,452,162
vote all | time (µs)    | 0.236         | 2.992
         | instructions | 1,384,476,021 | 29,552,745,486
         | branches     | 100,108,177   | 5,220,517,219
7.3 Performance on Real Benchmarks
The runtime configurations and kernel execution times for COX, DPC++, and HIP-CPU are recorded in Table 3. Since DPC++ does not support the atomicCAS instruction on CPUs, it cannot support any of the applications in Crystal.
Table 3. Runtime Configuration and Total Kernel Execution Time (ms) on Hetero-mark, Rodinia, and Crystal Benchmarks

Benchmark   | Application    | Kernel launches | Grid size | Block size | COX     | DPC++   | HIP-CPU
Hetero-mark | aes            | 1      | 16,384 | 64  | 500.3   | 169.7   | 519.3
            | bs             | 2,048  | 32     | 64  | 99.743  | 65.5    | 136.9
            | ep             | 40     | 511    | 64  | 7038.79 | 59.3    | 2415
            | fir            | 2,048  | 64     | 64  | 151.4   | 89.55   | 227.7
            | ga             | 256    | 16     | 64  | 2585.31 | 71.24   | 1858
            | kmeans         | 144    | 1,563  | 64  | 2991.6  | 141     | 1196.5
            | pr             | 1,000  | 32     | 64  | 330.632 | 88.02   | 168.7
            | hist           | 1      | 128    | 64  | 228.87  | 221.359 | 193.8
Crystal     | q11            | 10,000 | 1      | 32  | 259.525 | —       | —
            | q12            | 10,000 | 1      | 32  | 302.632 | —       | —
            | q13            | 10,000 | 1      | 32  | 298.223 | —       | —
            | q21            | 40,000 | 1      | 32  | 443.308 | —       | 1180.49
            | q22            | 40,000 | 1      | 128 | 1003.57 | —       | 2057.56
            | q23            | 40,000 | 1      | 128 | 923.145 | —       | 1984.86
            | q31            | 40,000 | 1      | 128 | 1165.66 | —       | 2174.03
            | q32            | 40,000 | 1      | 128 | 1112.99 | —       | 2112.73
            | q33            | 40,000 | 1      | 128 | 1184.54 | —       | 2102.14
            | q34            | 40,000 | 1      | 128 | 1093.68 | —       | 2056.16
            | q41            | 50,000 | 1      | 128 | 1428.59 | —       | 2659.43
            | q42            | 50,000 | 1      | 128 | 1497.65 | —       | 2745.15
            | q43            | 50,000 | 1      | 128 | 1387.41 | —       | 2648.31
Rodinia     | hotspot3D      | 100    | 1,024  | 256 | 699     | 131     | 893.8
            | particlefilter | 1      | 157    | 128 | 121.776 | —       | 32.884
            | bpnn_forward   | 1      | 4,096  | 256 | 14.948  | 61.576  | —
            | bfs            | 24     | 1,954  | 512 | 39.726  | 50.458  | 272.7
            | btree          | 1      | 10,000 | 256 | 11.181  | 20.9    | —
            | gaussian       | 414    | 1      | 512 | 28.35   | 56.8    | 200.805
            | hotspot        | 1      | 1,849  | 256 | 5.91    | 63.512  | 33.773
            | srad           | 4      | 16,384 | 256 | 82.347  | 150     | 933.18
            | pathfinder     | 5      | 463    | 256 | 33      | 42.3    | 168.72
            | nn             | 1      | 5,120  | 256 | 3.542   | 41.8    | 18.64

("—" marks applications a framework does not support.)
The framework with the lowest kernel execution time is marked in red.
7.3.1 COX vs. HIP-CPU.
To execute CUDA on a CPU, COX maps GPU threads to iterations of the for-loops generated by hierarchical collapsing, whereas HIP-CPU maps GPU threads to fibers (see Section 8 for more detail). HIP-CPU's mapping strategy is more straightforward and easier to implement. However, CUDA kernels are typically lightweight, which suits throughput-oriented architectures; wrapping these lightweight kernels with for-loops produces programs with heavier per-thread workloads that better suit latency-oriented architectures (CPUs) equipped with SIMD instructions. Besides, although fibers have lightweight context switching, the overhead is still larger than that of iterating through a for-loop. Thus, in most cases, COX is faster than HIP-CPU. The higher performance is the result of higher SIMD utilization and lower thread-switching overhead, which lead to fewer executed instructions (Table 4). In all Crystal applications and some Rodinia and Hetero-mark applications (gaussian, nn, fir), COX utilizes more SIMD instructions, which leads to fewer executed instructions.
Table 4. Profiling Data for COX and HIP-CPU

Benchmark   | Application    | Metric     | COX         | HIP-CPU
Crystal     | q21            | sse-packed | 802548      | 859188
            |                | avx128     | 239890      | 0
            |                | avx256     | 599725      | 0
            |                | # of inst  | 571967850   | 1123266901
            | q31            | sse-packed | 825341      | 117468
            |                | avx128     | 239914      | 0
            |                | avx256     | 599785      | 0
            |                | # of inst  | 580012183   | 1168657481
            | q41            | sse-packed | 1049937     | 243728
            |                | avx128     | 299862      | 0
            |                | avx256     | 749655      | 0
            |                | # of inst  | 730220536   | 1478500813
Rodinia     | gauss          | sse-packed | 2706        | 1704
            |                | avx128     | 828         | 414
            |                | avx256     | 2070        | 414
            |                | # of inst  | 1668761     | 3450999
            | nn             | sse-packed | 313         | 47
            |                | avx128     | 2           | 2
            |                | avx256     | 5           | 2
            |                | # of inst  | 10385       | 18597
            | particlefilter | sse-packed | 78972949    | 565778475
            |                | avx128     | 10584731    | 289510335
            |                | avx256     | 3392        | 4592
            |                | # of inst  | 1.02885E+11 | 41436192956
Hetero-mark | ep             | sse-packed | 93609       | 171036
            |                | avx128     | 1206        | 1446
            |                | avx256     | 13100       | 13715
            |                | # of inst  | 1.48391E+11 | 33645862337
            | fir            | sse-packed | 12806668    | 13069300
            |                | avx128     | 25194       | 3436
            |                | avx256     | 56210       | 10830
            |                | # of inst  | 13803614808 | 14015215887
            | pr             | sse-packed | 2599873     | 2724917
            |                | avx128     | 2360574     | 2350348
            |                | avx256     | 862755      | 841493
            |                | # of inst  | 39217359547 | 14129291883
            | hist           | sse-packed | 63425       | 97148
            |                | avx128     | 1527        | 2259
            |                | avx256     | 22024785    | 22026540
            |                | # of inst  | 12102193872 | 3942375563
            |                | atomic     | 39187287    | 39202294
However, in some cases, HIP-CPU is faster than COX. In PR, the major kernel workload is irregular memory access, which cannot benefit from SIMD instructions. As this application launches one thousand kernels, the runtime framework is the critical component, which is beyond the scope of this article. particlefilter and EP contain heavy CUDA kernels (Code 6) that can be optimized with SIMD instructions. HIP-CPU relies on GCC 11 to auto-vectorize the source code, while COX applies LLVM passes on the transformed LLVM IR for auto-vectorization; the transformed IR is too complex to be fully optimized. Thus, in these cases, HIP-CPU has higher SIMD utilization and achieves higher performance. For the hist application, the critical component is atomic instructions. Since COX and HIP-CPU execute a similar number of atomic instructions, they achieve close performance.
Code 6. CUDA kernels in EP and Particlefilter.
7.3.2 COX vs. DPC++.
As DPC++ is not fully open source, we cannot fully analyze it and compare it with COX; we can only draw inferences from profiling data. From the profiling data in Table 5, DPC++ executes more atomic instructions in all cases, which may be generated by the DPC++ runtime system for CPU thread scheduling. For the gaussian and nn applications, both COX and DPC++ achieve high SIMD utilization, but DPC++ issues many more atomic instructions and memory accesses, so it is slower than COX. For EP and FIR, DPC++ achieves much higher SIMD utilization, resulting in many fewer executed instructions; thus, DPC++ is much faster than COX. Other applications (e.g., aes and kmeans) involve a large number of blocks, and the kernel workloads are relatively light compared with the runtime overhead. In these situations, performance mainly depends on how the runtime system implements kernel launch and synchronization. Further explanation of DPC++ would require reverse engineering, which is beyond the scope of this article.
Table 5. Profiling Data for COX and DPC++

Application | Metric     | COX         | DPC++
gauss       | mem atomic | 236054      | 4533963
            | mem        | 652226992   | 934676835
            | sse-packed | 475790      | 26797775
            | avx128     | 88155       | 6047607
            | avx256     | 90072       | 6350822
            | # of inst  | 1768929320  | 1489909537
nn          | mem atomic | 10506926    | 27954105
            | mem        | 1952231817  | 2551670355
            | sse-packed | 4034949     | 16555799
            | avx128     | 10519077    | 16726252
            | avx256     | 30403929    | 38158606
            | # of inst  | 4137101986  | 5547944991
ep          | mem atomic | 36128       | 14992385
            | mem        | 68025621670 | 1269791602
            | sse-packed | 93609       | 18724231
            | avx128     | 1206        | 9662776
            | avx256     | 13100       | 92355635
            | # of inst  | 1.48391E+11 | 2936592756
fir         | mem atomic | 845405      | 23656931
            | mem        | 6658160838  | 2514792478
            | sse-packed | 12806668    | 42218234
            | avx128     | 25194       | 25188562
            | avx256     | 56210       | 41781477
            | # of inst  | 13803614808 | 6283526400
8 Related Work
CPUs are MPMD architectures, while GPUs are SPMD architectures. Although users can naively execute each GPU thread on a CPU thread, due to the limited parallelism in CPUs, the system can only execute around 100 CPU threads simultaneously, far fewer than a GPU. Thus, to match the thread count of a GPU, the CPU has to create more threads than it can execute simultaneously, which incurs a large amount of thread context-switching overhead. Two approaches address this issue. The first is to accelerate context switching on the CPU. Some researchers extend the CPU architecture to accelerate context switching [11]; these hardware-level extensions are beyond the scope of this article. At the software level, Reference [15] proposes lightweight threading to accelerate context switching: a context switch only stores and reloads a few registers while keeping the stack memory. Most modifications are at the runtime level, so users can directly use the original GPU source code. As reported in Reference [54], the AMD CPU OpenCL implementation is based on this technology. However, even with these optimizations, there is still significant context-switching overhead, around 10 ns per switch.
Thus, another direction is to increase the workload of each CPU thread: instead of executing a single GPU thread, each CPU thread executes all GPU threads within a block. This mechanism yields two benefits. First, it increases per-thread execution time enough that context-switching overhead becomes negligible. Second, with more workload in a single thread, there are more opportunities for optimization (e.g., vectorization and loop transformation).
This mechanism has several names: microthreading [53], thread aggregation [58], thread-fusion [12], region-based serialization [54], loop chunking [51], and kernel serialization [8]. In this article, we give this mechanism a new name: flat collapsing. In References [53, 55], the authors propose wrapping an SPMD kernel with loops whose trip count equals the block size; each loop iteration then simulates one GPU thread within a block, and a CPU thread is mapped to a GPU block. An important technical detail is supporting synchronization instructions: To maintain correctness, compilers must wrap the instructions before and after a synchronization instruction into separate loops. A similar technique is discussed in Reference [51], which utilizes loop transformations (e.g., loop strip-mining, interchange, distribution, unswitching) to transform SPMD execution models with synchronization into task-parallel execution models. The authors of Reference [58] propose improved static analysis to vectorize the generated loop programs and another algorithm that wraps the original kernels with loops while avoiding the additional synchronization points of previous works. In some GPU architectures, such as NVIDIA GPUs, there is an implicit lock step within a group of threads; the author of Reference [16] proposes transformations that detect these implicit warp-level synchronizations and maintain them during transformation. The authors of Reference [54] propose using C Extensions for Array Notation to further accelerate the generated CPU programs with SIMD execution and better spatial locality.
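The barrier-handling detail above can be sketched as follows: a kernel with a __syncthreads() between two statements is split at the barrier into two thread loops, so every thread finishes the first region before any thread enters the second. This is a simplified illustration with a hypothetical 4-thread block, not the exact output of any of the cited compilers:

```cpp
#include <cassert>

constexpr int kBlockSize = 4;  // illustrative block size

// CUDA-style kernel (pseudocode):
//   shared[tid] = in[tid];
//   __syncthreads();
//   out[tid] = shared[(tid + 1) % blockDim.x];
//
// Flat collapsing splits the body at the barrier into two loops: the first
// loop runs region A for every thread, then the second loop runs region B.
// Sequential execution of the two loops reproduces the barrier semantics
// without any real synchronization primitive.
void kernel_collapsed(const int* in, int* out) {
  int shared[kBlockSize];  // stands in for block-shared memory
  for (int tid = 0; tid < kBlockSize; ++tid)  // region A (before barrier)
    shared[tid] = in[tid];
  // __syncthreads() boundary: all of region A has completed here.
  for (int tid = 0; tid < kBlockSize; ++tid)  // region B (after barrier)
    out[tid] = shared[(tid + 1) % kBlockSize];
}
```

Without the split, thread 0's region B could read shared[1] before thread 1 had written it, which is exactly the race the barrier prevents on a GPU.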
Several projects execute CUDA on non-NVIDIA devices. In the early days, NVIDIA provided an emulation framework [1] to execute CUDA on a CPU, in which each thread within a GPU block is executed by a CPU thread. Horus [14] is another emulator; it parses and executes NVIDIA PTX instructions on CPU devices. These emulators are intended for debugging rather than performance. In MCUDA [55], the authors use source-to-source translation to translate CUDA to C with flat collapsing. Ocelot [12] uses the same mechanism but converts at the PTX level instead of performing source-to-source translation, to avoid recompiling. MapCG [18] is a hybrid computing framework that uses source-to-source translation to translate CUDA kernels to C programs. The author of Reference [28] proposes another hybrid computing framework, based on Ocelot, that translates GPU programs at the PTX level. Cumulus [8] uses Clang to parse CUDA programs and modifies them at the AST level; it is mainly concerned with CUDA runtime support and, for the compilation part, reuses the transformation in MCUDA. Instead of directly translating CUDA/PTX to CPU executables, other projects utilize the portability of other front-end languages. The authors of References [17, 49] propose source-to-source translation from CUDA to OpenCL, while References [44, 48] implement the translation on LLVM IR. The DPC++ Compatibility Tool [19] and HIPIFY [6] translate CUDA to the source languages of Intel and AMD devices, respectively.
Most related works only focus on supporting old CUDA features. However, the rapid evolution of GPU hardware and software stacks brings many new features that are important for high performance, such as warp-level collectives, unified memory, and CUDA Graphs. Achieving high coverage of these new features is an ongoing effort. The researchers in Reference [42] propose using explicit barriers and memory exchanges to support warp shuffle on OpenMP, which shares the same insight as COX.
OpenCL is an open standard that allows executing SPMD programs on MPMD architectures. POCL [23] is an open-source OpenCL implementation with a CPU backend; to support SPMD programs on CPUs, it implements flat collapsing at the LLVM IR level. The authors of Reference [26] also apply flat collapsing to OpenCL but with a different method of inserting extra synchronization and finding the Parallel Regions, which results in fewer extra synchronization barriers. However, this method does not extend to hierarchical Parallel Regions and thus cannot be used to support warp-level features. Reference [27] proposes another OpenCL implementation that mainly focuses on supporting OpenCL programs on multi-device clusters with heterogeneous devices.
8.1 DPC++
DPC++ [52] is a programming language designed by Intel for data-parallel programming, and Intel provides the oneAPI framework to execute DPC++ programs on Intel CPUs and GPUs. To support CUDA on Intel devices, Intel provides DPCT [19], a tool that migrates CUDA programs to DPC++ programs. DPC++, like CUDA, provides an interface for SPMD programs. Currently, there is little documentation of how the oneAPI framework executes DPC++ programs on CPUs. From the Intel community [20], we infer that it also implements the flat collapsing mechanism. From the evaluation in Section 7.3, we find that the major optimizations in DPC++'s CPU backend target applications with a large number of kernel launches and a large number of blocks. DPC++ only supports Intel CPUs/GPUs; thus, it likely also contains compiler optimizations customized for Intel CPU architectures.
8.2 HIP-CPU
HIP [4] is a GPU programming language designed by AMD for AMD GPUs. The design philosophy of HIP is to stay close enough to CUDA that developers can easily translate CUDA programs to HIP, and AMD provides HIPIFY [6] to automate this translation. HIP-CPU [5] is a framework developed by AMD to execute HIP on CPUs. It uses a different mapping mechanism: Instead of mapping GPU threads to loop iterations, it maps GPU threads to CPU fibers. Although this mechanism does not generate explicit loops, the fibers form implicit loops through context switching. Since it has no concept of a CUDA warp, this transformation can also be regarded as a subclass of flat collapsing.
The SPMD-to-MPMD transformation in DPC++, HIP-CPU, and COX is shown in Figure 10.
Fig. 10. The SPMD-to-MPMD transformations implemented by DPC++, HIP-CPU, and COX.
9 Conclusion
COX is a framework that supports executing CUDA on CPUs. In this article, we focus on the compilation part of the framework. In particular, as existing frameworks cannot support CUDA warp-level functions, we propose hierarchical collapsing to support these functions on CPUs. With hierarchical collapsing, COX achieves higher coverage on the CUDA 10.1 SDK and Crystal benchmarks than existing frameworks. We also compare the performance of the generated kernel programs with existing frameworks and find that it varies greatly across applications. Using profiling tools, we find that although hierarchical collapsing achieves comparable or even higher performance in most applications, for some applications the generated programs are too complex to be fully optimized by compilers. Future work will provide compiler optimizations to increase SIMD instruction utilization in programs generated by hierarchical collapsing.
Acknowledgment
We thank Blaise Tine, Jeffrey Young, Jiashen Cao, Jun Chen, and Bhanu Garg, as well as the hardware support from NSF CCRI #2016701 and Intel.
Footnotes
1. Warp is now officially a part of the programming model instead of being a microarchitecture concept.
2. This example is a simplified version of reduction_kernel.cu in the CUDA 10.1 SDK.
3. A latch is a node in the loop that has an edge to the loop header. An exiting edge is an edge from inside the loop to a node outside of the loop. The source of such an edge is called an exiting block; its target is an exit block [29].
4. To keep the article brief, we move the proof into the Appendix.
Jose-Ignacio Agulleiro and Jose-Jesus Fernandez. 2015. Tomo3D 2.0—Exploitation of advanced vector extensions (AVX) for 3D reconstruction. J. Struct. Biol. 189, 2 (2015), 147–152.
Aksel Alpay and Vincent Heuveline. 2020. SYCL beyond OpenCL: The architecture, current state and future direction of hipSYCL. In Proceedings of the International Workshop on OpenCL.
Abu Asaduzzaman, Alec Trent, S. Osborne, C. Aldershof, and Fadi N. Sibai. 2021. Impact of CUDA and OpenCL on parallel and distributed computing. In Proceedings of the 8th International Conference on Electrical and Electronics Engineering (ICEEE’21). IEEE, 238–242.
Vera Blomkvist Karlsson. 2021. Cumulus - translating CUDA to sequential C++: Simplifying the process of debugging CUDA programs. Dissertation. KTH ROYAL INSTITUTE OF TECHNOLOGY.
Shuai Che, Michael Boyer, Jiayuan Meng, David Tarjan, Jeremy W. Sheaffer, Sang-Ha Lee, and Kevin Skadron. 2009. Rodinia: A benchmark suite for heterogeneous computing. In Proceedings of the IEEE International Symposium on Workload Characterization (IISWC’09). IEEE, 44–54.
Gregory Diamos, Andrew Kerr, Sudhakar Yalamanchili, and Nathan Clark. 2010. Ocelot: A dynamic optimization framework for bulk-synchronous applications in heterogeneous systems. In Proceedings of the 19th International Conference on Parallel Architectures and Compilation Techniques (PACT’10). IEEE, 353–364.
Amr S. Elhelw and Sreepathi Pai. 2020. Horus: A modular GPU emulator framework. In Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS’20). IEEE, 104–106.
Jayanth Gummaraju, Ben Sander, Laurent Morichetti, Benedict R. Gaster, Michael Houston, and Bixia Zheng. 2010. Twin peaks: A software platform for heterogeneous computing on general-purpose and graphics processors. In Proceedings of the 19th International Conference on Parallel Architectures and Compilation Techniques (PACT’10). IEEE, 205–215.
Ziyu Guo, Eddy Zheng Zhang, and Xipeng Shen. 2011. Correctly treating synchronizations in compiling fine-grained spmd-threaded programs for cpu. In Proceedings of the International Conference on Parallel Architectures and Compilation Techniques. IEEE, 310–319.
Chuntao Hong, Dehao Chen, Wenguang Chen, Weimin Zheng, and Haibo Lin. 2010. MapCG: Writing parallel program portable between CPU and GPU. In Proceedings of the 19th International Conference on Parallel Architectures and Compilation Techniques. 217–226.
Pekka Jääskeläinen, Carlos Sánchez de La Lama, Erik Schnetter, Kalle Raiskila, Jarmo Takala, and Heikki Berg. 2015. pocl: A performance-portable OpenCL implementation. Int. J. Parallel Program. 43, 5 (2015), 752–785.
Hwancheol Jeong, Sunghoon Kim, Weonjong Lee, and Seok-Ho Myung. 2012. Performance of SSE and AVX instruction sets. In Proceedings of the 30th International Symposium on Lattice Field Theory.
Ralf Karrenberg and Sebastian Hack. 2011. Whole-function vectorization. In Proceedings of the International Symposium on Code Generation and Optimization (CGO’11). 141–150.
Ralf Karrenberg and Sebastian Hack. 2012. Improving performance of OpenCL on CPUs. In Proceedings of the International Conference on Compiler Construction. Springer, 1–20.
Jungwon Kim, Sangmin Seo, Jun Lee, Jeongho Nah, Gangwon Jo, and Jaejin Lee. 2012. SnuCL: An OpenCL framework for heterogeneous CPU/GPU clusters. In Proceedings of the 26th ACM International Conference on Supercomputing. 341–352.
Changmin Lee, Won Woo Ro, and Jean-Luc Gaudiot. 2014. Boosting CUDA applications with CPU–GPU hybrid computing. Int. J. Parallel Program. 42, 2 (2014), 384–404.
Saeed Maleki, Yaoqing Gao, Maria J. Garzar, Tommy Wong, David A. Padua, et al. 2011. An evaluation of vectorizing compilers. In Proceedings of the International Conference on Parallel Architectures and Compilation Techniques. IEEE, 372–382.
Lifeng Nai, Yinglong Xia, Ilie G Tanase, Hyesoon Kim, and Ching-Yung Lin. 2015. GraphBIG: Understanding graph computing in the context of industrial solutions. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC’15). IEEE, 1–12.
Dorit Nuzman and Richard Henderson. 2006. Multi-platform auto-vectorization. In International Symposium on Code Generation and Optimization (CGO’06). IEEE.
Atmn Patel, Shilei Tian, Johannes Doerfert, and Barbara Chapman. 2021. A virtual GPU as developer-friendly OpenMP offload target. In Proceedings of the 50th International Conference on Parallel Processing Workshop. 1–7.
Hugh Perkins. 2016. Cltorch: A hardware-agnostic backend for the torch deep neural network library, based on opencl. arXiv preprint arXiv:1606.04884 (2016).
Hugh Perkins. 2017. CUDA-on-CL: A compiler and runtime for running NVIDIA CUDA C++ 11 applications on OpenCL 1.2 Devices. In Proceedings of the 5th International Workshop on OpenCL. 1–4.
Matt Pharr and William R Mark. 2012. A SPMD compiler for high-performance CPU programming. In Proceedings of the Innovative Parallel Computing: Foundations & Applications of GPU, Manycore, and Heterogeneous Systems.
Vasileios Porpodas. 2017. Supergraph-slp auto-vectorization. In Proceedings of the 26th International Conference on Parallel Architectures and Compilation Techniques (PACT’17). IEEE, 330–342.
Ruobing Han, Blaise Tine, Jaewon Lee, Jaewoong Sim, and Hyesoon Kim. 2021. Supporting CUDA for an extended RISC-V GPU architecture. In Proceedings of the 5th Workshop on Computer Architecture Research with RISC-V (CARRV'21).
Paul Sathre, Mark Gardner, and Wu-chun Feng. 2019. On the portability of cpu-accelerated applications via automated source-to-source translation. In Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region. 1–8.
Anil Shanbhag, Samuel Madden, and Xiangyao Yu. 2020. A study of the fundamental performance characteristics of GPUs and CPUs for database analytics. In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD’20). 1617–1632.
Jun Shirako, Jisheng M. Zhao, V. Krishna Nandivada, and Vivek N. Sarkar. 2009. Chunking parallel loops in the presence of synchronization. In Proceedings of the 23rd International Conference on Supercomputing. 181–192.
André Silveira, Rafael Bohrer Avila, Marcos E. Barreto, and Philippe Olivier Alexandre Navaux. 2000. DPC++: Object-oriented programming applied to cluster computing. In Proceedings of the International Conference on Parallel and Distributed Processing Techniques and Applications (PDPTA’00).
John A. Stratton, Vinod Grover, Jaydeep Marathe, Bastiaan Aarts, Mike Murphy, Ziang Hu, and Wen-mei W Hwu. 2010. Efficient compilation of fine-grained SPMD-threaded programs for multicore CPUs. In Proceedings of the 8th Annual IEEE/ACM International Symposium on Code Generation and Optimization. 111–119.
John A. Stratton, Hee-Seok Kim, Thoman B. Jablin, and Wen-Mei W. Hwu. 2013. Performance Portability in Accelerated Parallel Kernels. Center for Reliable and High-Performance Computing.
John A. Stratton, Sam S. Stone, and W. Hwu Wen-mei. 2008. MCUDA: An efficient implementation of CUDA kernels for multi-core CPUs. In Proceedings of the International Workshop on Languages and Compilers for Parallel Computing. Springer, 16–30.
Yifan Sun, Xiang Gong, Amir Kavyan Ziabari, Leiming Yu, Xiangyu Li, Saoni Mukherjee, Carter McCardwell, Alejandro Villegas, and David Kaeli. 2016. Hetero-mark, a benchmark suite for CPU-GPU collaborative computing. In Proceedings of the IEEE International Symposium on Workload Characterization (IISWC’16). IEEE, 1–10.
Yao Zhang, Mark Sinclair, and Andrew A. Chien. 2013. Improving performance portability in OpenCL programs. In Proceedings of the International Supercomputing Conference. Springer, 136–150.