As CUDA has become the de facto programming language for data-parallel applications such as high-performance computing and machine learning, running CUDA on other platforms has become a compelling option. Although several efforts have attempted to support CUDA on devices other than NVIDIA GPUs, due to the extra translation steps, their support always lags a few years behind CUDA's latest features. In particular, the new CUDA programming model exposes the warp concept in the programming language, which greatly changes how CUDA code should be mapped to CPU programs. In this article, hierarchical collapsing, which correctly supports CUDA warp-level functions on CPUs, is proposed. To verify hierarchical collapsing, we build a framework, COX, that supports executing CUDA source code on the CPU backend. With hierarchical collapsing, 90% of the kernels in the CUDA SDK samples can be executed on CPUs, much higher than in previous works (68%). We also evaluate performance with benchmarks of real applications and show that hierarchical collapsing generally generates CPU programs with comparable or even higher performance than previous projects.
1 Introduction
The high computing power of GPUs has fostered a strong software ecosystem of GPU programs. Although there are other choices for GPU programming, such as the Heterogeneous-compute Interface for Portability (HIP) [4], OpenMP, and DPC++ [19], in recent years CUDA has remained dominant. In the deep learning realm, as concluded in Reference [43], before 2014 all popular frameworks (e.g., Caffe, TensorFlow, Torch, and MXNet) supported only CUDA for the GPU backend. Although these frameworks have now started to support non-NVIDIA GPUs, due to NVIDIA's high share of the GPU market, most researchers still tend to develop and release GPU programs written in CUDA [7]. In the multimedia realm, a survey of video editing applications [30] shows that CUDA is supported by 14 of the 17 applications, while OpenCL is supported by only 9 of them.
Unfortunately, despite the popularity of CUDA programs, NVIDIA GPUs remain the main hardware platform for running them. Although there have been several efforts to run CUDA on non-NVIDIA GPUs, they lack support for newer CUDA features. There are several challenges in running CUDA on other platforms. One of the biggest is converting Single Program Multiple Data (SPMD) programs into programs for architectures that are not SPMD-friendly. In the SPMD programming model, the same kernel is executed by many threads at runtime, and the GPU is built for throughput-oriented processing by supporting many threads (or warps). However, other architectures often do not have that many threads, so CUDA kernels need to be converted efficiently to use fewer threads. Supporting CUDA with fewer threads is an essential component of running CUDA on X86, ARM, RISC-V, or Intel GPUs, which have fewer hardware threads than NVIDIA GPUs. We call this mechanism the SPMD-to-MPMD (Multiple Program Multiple Data) transformation.
Several existing projects support running GPU programs on CPUs [1, 6, 11, 12, 13, 15, 19, 21, 23, 26, 37, 53, 55]. Most apply compiler-level transformations to translate GPU programs into a form suitable for CPUs. These projects use the same granularity of transformation: a CPU thread is responsible for executing all the workload within a CUDA block (or OpenCL work-group). This CUDA-block-to-CPU-thread mapping is considered optimal based on the following observations. First, the transformed programs require only a few CPU threads (one per GPU block of the original CUDA program); fewer CPU threads lead to less context switching. Second, in CUDA programs, memory accesses within a CUDA block typically have good locality, so wrapping these accesses into a CPU thread may also bring spatial/temporal locality for CPU caches [54]. Last, threads within a block have similar computation patterns, which makes them amenable to optimization with Single Instruction Multiple Data (SIMD) instructions [24, 26] on modern CPU architectures, with further optimizations in [25, 31, 34, 35, 36, 46, 47]. Other projects apply optimizations at the hardware level [11] or system level [15] to support executing GPU programs on CPUs; these are beyond the scope of this article.
The transformation is shown in Figure 1(b). For a given GPU kernel, the translator first splits it into regions according to synchronization instructions. Each region is then wrapped with a for-loop whose iteration count equals the GPU block size. As this transformation uses single-layer for-loops to wrap each region, we name it flat collapsing. Because flat collapsing wraps the original program with single-layer loops, all threads within the block are mapped to iterations of the for-loops. After mapping, there is no difference among threads within a block except that they have different values of the loop iteration index. This transformation was proposed based on early CUDA programming models and cannot easily support several important features of recent CUDA programming models. One significant change is the warp-level programming model in CUDA. This new feature is critical for achieving high performance. One of its most important consequences is that threads within a CUDA warp should no longer be treated the same as threads in other CUDA warps. Thus, in this article, hierarchical collapsing is proposed to support these warp-level CUDA features on CPUs. Throughout the article, the focus is on the CUDA programming model; however, the same techniques also apply to other SPMD programming models (e.g., OpenCL [32], HIP [4], and DPC++ [52]).
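As a concrete illustration, the effect of flat collapsing on a hypothetical two-region kernel can be sketched in plain C++ (the kernel body and names below are illustrative, not taken from Figure 1):

```cpp
#include <cassert>
#include <vector>

// Hypothetical GPU kernel, split into two regions by __syncthreads():
//   statements-A:  tmp[threadIdx.x] = in[threadIdx.x] * 2.0f;
//   __syncthreads();
//   statements-B:  out[threadIdx.x] = tmp[0] + tmp[threadIdx.x];
// Flat collapsing wraps each region in one single-layer loop over the
// block size and replaces threadIdx.x with the loop variable tx.
void kernel_cpu(float* out, const float* in, int b_size) {
  std::vector<float> tmp(b_size);        // shared memory becomes a per-block array
  for (int tx = 0; tx < b_size; ++tx)    // loop for the first region (statements-A)
    tmp[tx] = in[tx] * 2.0f;
  // the __syncthreads() barrier separates the two generated loops
  for (int tx = 0; tx < b_size; ++tx)    // loop for the second region (statements-B)
    out[tx] = tmp[0] + tmp[tx];
}
```

Because the barrier sits between the two loops, every simulated thread finishes statements-A before any simulated thread starts statements-B, matching the GPU semantics.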
Fig. 1.
Fig. 1. The programming model for input CUDA SPMD and output CPU programs by flat collapsing and hierarchical collapsing .
The main contributions of this article are as follows:
•
Propose hierarchical collapsing for the SPMD-to-MPMD transformation and implement it with LLVM passes. The new transformation ensures correctness for GPU programs that use warp-level functions;
•
Extend the Parallel Region concept [23] into the Hierarchical Parallel Region to provide correct translation when GPU programs contain warp-level functions;
•
Evaluate the SPMD-to-MPMD transformation on real applications and compare the coverage and performance with existing frameworks.
To support executing CUDA on CPUs, besides the SPMD-to-MPMD transformation, there is another technical issue that needs to be solved. CUDA provides runtime APIs (e.g., cudaMalloc and kernel launch) to communicate between the host and the device; these APIs must be replaced or re-implemented for the CPU backend. This CUDA runtime migration is beyond the scope of this article, and we only briefly discuss it in Section 6.
To verify the hierarchical collapsing, we build COX, a framework that supports executing CUDA source code on CPU backends. COX integrates the hierarchical collapsing proposed in this article and also implements a runtime system to use pthreads to execute the transformed kernels.
2 Background
2.1 Running SPMD Programming Models on CPUs
The basic mechanism to support SPMD programming models on CPUs is to map a GPU block/work-group to a CPU thread and iterate over the threads of the block/work-group using loops [12, 23, 25, 26, 55]. Figure 1(a) and (b) show the input and output of this basic process. The transformation has different names in different projects: microthreading [53], thread aggregation [58], thread-fusion [12], region-based serialization [54], loop chunking [51], and kernel serialization [8]. In this article, this transformation is called flat collapsing, as it uses single-layer loops to represent all threads within a block.
When a GPU program contains synchronization primitives (e.g., __syncthreads()), the code before/after these instructions should be wrapped in different loops. Below are some of the terminologies used in References [23, 26, 55] to support flat collapsing in general cases:
•
Parallel Region (PR): A Region is a group of consecutive instructions. A Parallel Region is a group of instructions that must be executed by all threads within a block before proceeding to the next region. In CUDA programs, the instructions before/after __syncthreads belong to different PRs. For example, in Figure 1, statements-A and statements-B form two PRs.
•
Implicit barrier: Unlike explicit barriers declared by programmers (e.g., __syncthreads), flat collapsing inserts implicit barriers that are necessary to define the Parallel Regions. For simple functions that do not contain barriers (e.g., __syncthreads) inside conditional statements, implicit barriers are not needed. However, when barriers are present in conditional statements, the situation becomes more complicated. For example, to transform a CUDA kernel that has a barrier within an if–then construct, COX needs to insert implicit barriers into the original program so the correct PRs can be formed in subsequent steps.
The definition of the Parallel Region in previous flat collapsing projects cannot support warp-level features. Flat collapsing generates a single-layer loop for each PR to simulate all threads within a block; this coarse-grain simulation cannot distinguish threads in different CUDA warps. In this article, an extended definition, the hierarchical Parallel Region, is proposed and used to support warp-level features (Section 5.5).
2.2 Warp-level Programming Features
2.2.1 Warp-level Collectives.
CUDA provides a list of warp-level collective functions [39], which are necessary for high-performance reduction. Two of these functions that are commonly used in existing benchmarks are introduced below.
•
Warp shuffle: In early CUDA versions, although most GPUs had local memory and global memory to support data exchange between threads, there was no efficient way to exchange data within a warp. To efficiently exchange data stored in registers, CUDA provides a series of warp shuffle instructions. When a warp shuffle instruction is invoked, a thread sends its local data to another thread in the same warp.
•
Warp vote: Beyond exchanging data among threads within a warp, warp vote instructions can directly perform logical reductions (e.g., all, any) on local variables, controlled by the mask argument. These features, used together with warp shuffle and cooperative groups, are necessary to implement high-performance reduction kernels.
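The semantics of these collectives, as described above, can be modeled sequentially in C++ (a behavioral sketch only; the edge-of-warp handling in the shuffle model, where lanes past the end keep their own value, is an assumption of this sketch, and real GPUs execute these operations in hardware):

```cpp
#include <cassert>
#include <cstdint>

// Model of a warp vote "all": logical AND-reduction of pred across the
// lanes selected by mask (warp size assumed to be 32).
bool warp_all_model(uint32_t mask, const int pred[32]) {
  for (int lane = 0; lane < 32; ++lane)
    if ((mask >> lane) & 1u)
      if (!pred[lane]) return false;  // any selected lane voting 0 fails "all"
  return true;
}

// Model of a shuffle-down: each lane receives the value held by
// lane + delta; in this model, lanes past the end keep their own value.
void warp_shfl_down_model(const int in[32], int out[32], int delta) {
  for (int lane = 0; lane < 32; ++lane)
    out[lane] = (lane + delta < 32) ? in[lane + delta] : in[lane];
}
```

Emulating exactly this per-lane behavior is what the CPU translation in Section 5.2 has to reproduce with loops and thread-local buffers.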
2.2.2 Cooperative Group.
In early CUDA programming models, there are only two explicit groups of threads: the block and the grid. Users cannot easily organize a small group of threads within a block as a sub-group. NVIDIA introduced a new concept, the cooperative group, in CUDA 9.0. The corresponding instructions allow users to group threads within a block, and the group can further be used for data exchange. There are two kinds of grouping strategies: static and dynamic. For static grouping, whether a thread belongs to a group is known at compile time (e.g., group the threads with indices 0 and 1), while for dynamic grouping it is only known at runtime (e.g., group all activated threads).
3 Motivation
Section 2.1 introduces the concepts of the PR and the implicit barrier. This section discusses, with examples, the limitations of these concepts and how to extend them. Assume the input CUDA kernel shown in Code 1, with block size b_size. The code accumulates the variable val within the first warp and stores the accumulated value in the first thread of that warp.
Code 1.
Code 1. Input GPU reduction kernel.
The warp shuffle function (shfl_down_sync) contains a warp-level barrier: each thread within a warp has to first calculate the value of val; after all threads in the warp complete the calculation, they invoke shfl_down_sync at the same time (see Section 5.2 for details). In Code 1, this warp-level barrier is inside a branch of an if–then construct (if(threadIdx.x < 32)); thus, only the first warp will reach the warp-level barrier, assuming the warp size of 32 used in NVIDIA GPUs.
With flat collapsing, Code 1 would become Code 2. Here are the steps:
•
Split the code into PRs according to the barriers and wrap each PR with a single-layer for-loop. In Code 2, as shfl_down_sync contains a barrier, three PRs are separately wrapped with single-layer for-loops at line 2, line 5, and line 8 (note that the loop at line 4 comes from the source input and is not generated by flat collapsing);
•
Replace uses of threadIdx.x with the loop iteration variable tx;
•
Replicate variables (e.g., val) that are used by more than one PR into array form.
Unfortunately, these steps are insufficient to generate a correct CPU program for this GPU kernel. The key reason is that shfl_down_sync contains a warp-level barrier, which cannot easily be represented by single-layer loops.
For NVIDIA GPUs with compute capability lower than 3.0, there are only block-level barriers, so it can be safely assumed that, for each barrier instruction, all or none of the threads within the block reach the barrier. Thus, flat collapsing wraps the PRs before/after the barrier with single-layer loops of b_size iterations. In Code 2, these generated loops are located in the branch of an if–then construct. With this transformation, if it is known at runtime that the barrier will be reached, then the control flow reaches this branch and executes the generated for-loops. In other words, flat collapsing assumes that all threads within the BLOCK always reach the same barriers. However, in Code 1, only the threads within the first WARP should reach the warp barrier, which violates the assumption of flat collapsing. To get the correct result, the transformer needs to insert extra instructions into line 6 and line 9 of Code 2 (shown in gray).
These transformations are quite complicated even for the demo example, not to mention implementing them in compilers for all possible CFGs (control flow graphs). The source of the complexity is that flat collapsing uses single-layer loops to represent all threads within a block, which cannot easily represent the CUDA warp concept. Based on the above analysis, hierarchical collapsing is proposed to produce Code 3. The concept is also illustrated in Figure 1(c). Code 3 has two nested loops: the outer loop with induction variable wid (line 4) operates at the block level and is called the inter-warp loop, while the inner loops with induction variable tx (lines 5, 7, 12, and 14) operate at the warp level and are called intra-warp loops. The inserted loops are shown in gray text in Code 3.
For flat collapsing, to control the different control flows of different warps, if–then constructs must be inserted into the transformed code. For hierarchical collapsing, however, only a simple loop peeling (line 10 in Code 3) is needed. Owing to this low complexity, hierarchical collapsing can easily be implemented and integrated into compilers. In COX, hierarchical collapsing is implemented as a new LLVM pass that automatically transforms the GPU kernels.
Code 2.
Code 2. CPU warp shuffle program generated by flat collapsing.
In the given example, although only flag[0] is used, the transformed program also calculates the other elements of flag. This is because these calculations may have side effects; to guarantee correctness, they have to be executed even if their outputs are not needed.
Code 3.
Code 3. CPU warp shuffle program by hierarchical collapsing.
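A hedged sketch of the nested inter/intra-warp loop structure, written as plain C++ and applied to the reduction of Code 1, is shown below (function and variable names are illustrative; the in-place update is safe here only because each shuffle step reads and writes disjoint lane ranges, whereas the general translation buffers values as in Section 5.2):

```cpp
#include <cassert>
#include <vector>

// Warp size 32; b_size is assumed to be a multiple of 32.
int reduce_first_warp(const std::vector<int>& input, int b_size) {
  std::vector<int> val(input);                   // replicated local variable, one slot per thread
  for (int wid = 0; wid < b_size / 32; ++wid) {  // inter-warp loop
    if (wid * 32 < 32) {                         // loop peeling: lane 0 of the warp decides for all lanes
      for (int offset = 16; offset > 0; offset /= 2)
        for (int tx = 0; tx < 32; ++tx)          // intra-warp loop per shuffle step
          if (tx + offset < 32)
            val[wid * 32 + tx] += val[wid * 32 + tx + offset];  // shuffle-down + add
    }
  }
  return val[0];                                 // accumulated value of the first warp
}
```

Only the first warp enters the peeled branch, mirroring the if(threadIdx.x < 32) guard in Code 1, while every other warp skips the warp-level barrier entirely.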
4 COX
We build a framework, COX, to support executing CUDA source code on X86 with high performance. An overview of COX is shown in Figure 2. In the first step, the CUDA kernel code is compiled to Intermediate Representation (NVVM IR) by Clang. After the SPMD-to-MPMD transformation (hierarchical collapsing), the NVVM IR is translated to LLVM IR in MPMD form, and the LLVM backend emits an object file. The object file is linked with built-in libraries and migrated host code to generate executable files for X86. In this article, we mainly discuss the kernel SPMD-to-MPMD transformation. Although COX is implemented for X86, it can easily be migrated to other architectures.
Fig. 2.
Fig. 2. The overview of COX framework. In this article, we only discuss the CUDA kernel transformation.
5 Kernel Transformation
5.1 Overview of Kernel Transformation
In this section, the details of applying hierarchical collapsing to a simple CUDA function (Code 4) are introduced. This function contains a CUDA warp vote function. First, each thread calculates its local variable (tx%2). Then, threads within a warp combine their local variables with the all operation: if all local variables are non-zero, the reduction returns true; otherwise, it returns false.
Figure 3 shows the pipeline of applying the hierarchical collapsing transformation to Code 4. First, warp-level functions are replaced with built-in functions defined in the runtime library, as shown in Step 1 (Section 5.2). In Step 2, the implicit barriers are identified and inserted (Section 5.3). Then, in Steps 3 to 5, hierarchical Parallel Regions are identified, and intra/inter-warp loops are generated accordingly (Section 5.4). The generated LLVM IR is compiled and linked with host programs and runtime libraries to generate a CPU-executable file.
Fig. 3.
Fig. 3. Steps of NVVM to LLVM-IR transformer.
Code 4.
Code 4. CUDA Warp Vote example.
5.2 Support Warp-level Functions
In GPU architectures, when threads within a warp invoke warp functions, the GPU performs internal communication to accumulate and/or exchange the local variables within the warp. To support these features on CPUs, the accumulation and communication need to be performed explicitly. The implementation of the warp-vote function on the CPU backend is shown in Code 5. In this section, we introduce the details of this transformation.
At initialization, an array warp_vote of length 32 is allocated. The warp_vote array should be stored in CPU thread-local memory, as a CPU thread is used to simulate a GPU block. A GPU warp vote instruction is translated into the following CPU instructions: for the threads within a warp, each thread first stores its local variable into a unique position in warp_vote; after all the elements are set, the result of the warp vote can easily be computed. The function warp_all is defined in a runtime library. To utilize the computation resources of X86, warp_all is implemented with AVX instructions; the benefits of AVX are evaluated in Section 7.2.3. Warp shuffle is supported in a similar way.
Code 5.
Code 5. Implement CUDA Warp Vote on CPU.
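A scalar stand-in for this translation is sketched below (COX implements warp_all with AVX; a portable loop is shown instead, and the function names are illustrative):

```cpp
#include <array>
#include <cassert>

// Each simulated lane writes its predicate into a thread-local slot; the
// buffer is thread-local because one CPU thread simulates one GPU block.
static thread_local std::array<int, 32> warp_vote;

// Scalar replacement for COX's AVX warp_all: AND-reduce the 32 slots.
bool warp_all_cpu() {
  for (int lane = 0; lane < 32; ++lane)
    if (!warp_vote[lane]) return false;  // any zero vote fails "all"
  return true;
}

// The intra-warp loop produced by hierarchical collapsing fills the slots,
// then the vote is reduced once the warp barrier is reached.
bool vote_all_example(const int pred[32]) {
  for (int tx = 0; tx < 32; ++tx)        // intra-warp loop
    warp_vote[tx] = pred[tx];
  return warp_all_cpu();
}
```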
5.3 Insert Implicit Barriers
In Steps 3, 4, and 5, hierarchical collapsing needs barrier information to identify the Parallel Regions and generate intra/inter-warp loops. Thus, it is important to insert implicit barriers, which do not appear in the input GPU code but are necessary for identifying the Parallel Regions. Implicit barriers originate from barriers in conditional statements. An example is shown in Figure 4. The input CUDA kernel (Figure 4(a)) has a barrier in a for-loop construct. As there is a barrier in the conditional statement, implicit barriers are inserted (Figure 4(b)) to guide the generation of intra/inter-warp loops in future steps. Finally, according to the barriers, two pairs of nested for-loops are generated (Figure 4(c)). Note that all transformations in COX are done at the LLVM IR level; this source-level example is only used for explanation.
Fig. 4.
Fig. 4. An example of implicit barriers needed for identifying PRs.
To make hierarchical collapsing work, extra block-level barriers are inserted at the beginning of the entry block and at the end of the exit block. The remaining implicit barriers are needed when there are barriers inside conditional statements (e.g., if–then constructs, for-loop constructs). As CUDA is a high-level, flexible language, even a single concept can generate quite different CFGs. For example, the loop concept can be implemented by different CFGs, such as a do-while-loop, while-loop, or for-loop. Fortunately, with existing LLVM transformations, COX can convert the input CFGs to canonical formats, and we only need to focus on these canonical formats in the discussion below. Here we list some important features of the canonical format:
•
Each branch instruction has only two successors; most input CFGs already have this feature except the CFGs that use switch-case constructs.
•
All loops are in canonical format such that (1) they all have pre-headers; (2) each loop has only a single latch, in other words, a single backedge; and (3) the loop headers dominate all exit blocks.
COX invokes the LLVM loopSimplify and lowerSwitch passes to obtain canonical-format CFGs. After the transformation to canonical format, the only two kinds of conditional statements remaining in the CFGs are if–then constructs and for-loop constructs.
Barriers in if–then construct. The CFG of a classical if–then construct is shown on the left side of Figure 5(a). There is a barrier in the block if.body. According to Reference [41], for a block/warp barrier, none or all of the threads within the block/warp reach the barrier. Thus, COX can safely apply loop peeling to the CFG, using the branch directions of the first thread as the branch directions of all other threads. In other words, COX peels off the first thread to evaluate the branch direction, and the rest of the threads within the warp/block simply follow the same direction (see Code 3 for a loop-peeling example). The result after inserting implicit barriers and splitting blocks is shown on the right side of Figure 5(a). Several details are worth mentioning:
Fig. 5.
Fig. 5. After transformation, the inserted barriers are shown in bold. The PRs identified in further step are also shown in the figure.
•
insert implicit barriers with the same type as the barrier in if.body. In the example, there is a warp barrier in if.body; thus, COX also inserts warp barriers as implicit barriers;
•
after transformation, all blocks will be wrapped with intra-warp loops, except if.cond, which is used for loop peeling;
•
if.cond should contain only a single conditional-branch instruction and it should not have any side effects. In Figure 5(a), all computation instructions are put into if.head, which are executed b_size times, as the original GPU program does.
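The loop-peeling pattern described by these rules can be sketched at the source level as follows (an illustrative kernel body, not COX's IR-level output; the snapshot of data[0] models the visibility guarantee of the barrier):

```cpp
#include <cassert>
#include <vector>

// if.head runs for every simulated thread; if.cond holds only the branch
// and is evaluated once with tx = 0; the remaining simulated threads
// follow the same direction, which is safe under the all-or-none barrier
// assumption.
void peeled_if(std::vector<int>& data, int limit) {
  int b_size = (int)data.size();
  std::vector<int> local(b_size);
  for (int tx = 0; tx < b_size; ++tx)      // if.head: every thread's computation
    local[tx] = data[tx] + 1;
  if (local[0] < limit) {                  // if.cond: peeled, evaluated for tx = 0 only
    for (int tx = 0; tx < b_size; ++tx)    // PR before the barrier in if.body
      data[tx] = local[tx] * 2;
    int base = data[0];                    // barrier: all prior writes are now visible
    for (int tx = 0; tx < b_size; ++tx)    // PR after the barrier in if.body
      data[tx] += base;
  }
}
```

Note that if.cond performs no computation of its own, matching the rule that it contains only a single conditional branch with no side effects.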
The detailed algorithm for inserting implicit barriers derived from barriers in an if–then construct is described in Algorithm 1. COX has to do some additional checking to avoid infinite loops caused by for-loop constructs in CFGs. For simplicity, this checking part is not shown in Algorithm 1.
Barriers in for-loop construct. The canonical loops contain single latch blocks and loop headers that dominate all exit blocks. COX inserts implicit barriers before/after the branch instructions (the back edge of the loop). Figure 5(b) shows an example of inserting implicit barriers for a for-loop construct that contains a block barrier. Just as with the if–then construct, all of these inserted implicit barriers (shown in bold in the figure) should have the same type as the barrier in for.header (a block barrier in this example).
5.4 Split Blocks before/after Each Barrier
The instructions before/after a barrier should be wrapped with different intra/inter-warp loops. Thus, in this step, COX splits the blocks that contain barriers. See Step 3 in Figure 3 for an example.
5.5 Hierarchical Parallel Region
After inserting the implicit warp/block barriers, loops are generated to wrap the instructions between barriers. The instructions between two consecutive barriers are called a Parallel Region and should be wrapped into the same loops. Compared with flat collapsing, which only uses single-layer loops to wrap PRs, hierarchical collapsing uses nested loops: inter-warp loops and intra-warp loops. Thus, two kinds of Parallel Regions are required: warp-level Parallel Regions and block-level Parallel Regions. These two kinds of PRs are wrapped with intra- and inter-warp loops, respectively. Since a warp-level PR is always a subset of a block-level PR (a GPU warp is always within a GPU block), the new concept is called the Hierarchical Parallel Region. An example of a Hierarchical Parallel Region is shown in Figure 6. The remaining steps find the block/warp-level PRs and wrap them with inter/intra-warp loops. Algorithm 2 is used to find the set of warp-level PRs. The algorithm for finding block-level PRs is very similar, except that it concerns only block barriers. COX has to find the PRs for the warp level and block level sequentially: COX first finds all warp-level PRs and generates intra-warp loops to wrap them; then, COX finds the block-level PRs in the new CFG and wraps them with inter-warp loops.
Fig. 6.
Fig. 6. Due to the warp barrier in block1, there are two warp-level PRs ({ block1},{block2}) and a single block-level PR ({block1, block2}).
5.6 Wrap PR with For-Loop
In this step, COX wraps the warp/block-level PRs with intra/inter-warp loops (Figure 3(e) and (f)). Although this step is quite straightforward, it requires a correctness argument: after inserting the intra/inter-warp loops, each instruction from the input GPU kernel is executed b_size times (where b_size is the block size), except for the instructions used for loop peeling.
Finally, after generating the intra/inter-warp loops, some local variables need to be replicated: local variables that are used in several warp-level PRs but only in a single block-level PR are replicated as arrays of length 32, while local variables that are used across different block-level PRs are replicated as arrays whose length equals the block size.
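The replication rule can be sketched as a simple data-layout decision (the struct and field names are illustrative):

```cpp
#include <cassert>
#include <vector>

// A variable live across warp-level PRs (but within one block-level PR)
// needs one slot per lane: length 32. A variable live across block-level
// PRs needs one slot per thread in the block: length b_size.
struct ReplicatedLocals {
  int warp_scope[32];              // crosses warp-level PRs only
  std::vector<int> block_scope;    // crosses block-level PRs
  explicit ReplicatedLocals(int b_size) : warp_scope{}, block_scope(b_size) {}
};
```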
5.7 Limitation of Hierarchical Collapsing
Both flat collapsing and hierarchical collapsing are compile-time transformations; thus, only static features are addressed. The latest CUDA version supports several dynamic features. For example, for warp-level collective operations, users can assign only a sub-group of threads within the warp to perform the collective operation, where the sub-group is specified by a mask argument at runtime. For the cooperative group, users can also group all activated threads into a group at runtime. Neither flat collapsing nor hierarchical collapsing can easily support these dynamic features. However, although these dynamic features provide more flexibility, they can be harmful to performance, as they may incur warp divergence. Thus, most high-performance implementations [33, 40] use static warp-level collectives and cooperative groups to avoid warp divergence. In this article, only these static use cases are of concern. For the same reason, only aligned barriers [38] are considered; in other words, for a block/warp barrier, it is assumed that all or none of the threads within a block/warp reach the barrier.
6 Runtime System
Besides CUDA kernel programs, to execute CUDA programs on CPUs, CUDA host functions also have to be migrated. To support executing CUDA host programs on CPUs, the CUDA runtime APIs, which involve memory management and kernel execution, need to be implemented on CPUs. As both host and device are CPUs, CUDA malloc and memcpy are replaced by C malloc and memcpy. COX uses pthread fork/join to replace kernel launch/synchronization in CUDA. There are several potential optimizations for the runtime system, such as using a thread pool instead of fork/join for kernel launch and synchronization, or using regular expressions or LLVM transformations to automatically generate COX host programs from CUDA source code. These optimizations are beyond the scope of this article and are open for future research.
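A minimal fork/join launch in the spirit of this runtime can be sketched as follows (COX uses pthreads; std::thread is used here for brevity, and the names are illustrative):

```cpp
#include <cassert>
#include <functional>
#include <thread>
#include <vector>

// One CPU thread runs the transformed (MPMD) kernel for each GPU block;
// joining the threads plays the role of kernel synchronization. A thread
// pool, the optimization mentioned above, would reuse workers across
// launches instead of forking per launch.
void launch_kernel(const std::function<void(int)>& block_fn, int grid_size) {
  std::vector<std::thread> workers;
  workers.reserve(grid_size);
  for (int bid = 0; bid < grid_size; ++bid)  // fork: one worker per block
    workers.emplace_back(block_fn, bid);
  for (auto& t : workers)                    // join: wait for all blocks
    t.join();
}
```

Here block_fn stands for the MPMD function produced by the kernel transformation, with the block index passed in place of blockIdx.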
7 Evaluation
In this section, we try to answer the following questions:
•
Can hierarchical collapsing successfully bring higher coverage? (Section 7.1)
•
How does the performance of hierarchical collapsing compare with flat collapsing on kernels that do not use warp-level features? (Section 7.2.1)
•
What techniques can be used to generate high-performance CPU programs? (Sections 7.2.2–7.2.3)
•
What is the kernel performance compared with existing frameworks that also support CUDA on CPUs? (Section 7.3)
Although several frameworks support executing CUDA on CPUs, most of them were developed long ago and are no longer actively maintained [12, 18, 55]. We select two frameworks to compare against: HIP-CPU [5] and DPC++ [52]. HIP-CPU, developed by AMD, supports executing HIP programs on CPUs, while DPC++, developed by Intel, supports executing SYCL [3] on CPUs. To support CUDA programs, we use HIPIFY [6] and DPCT [19] to translate CUDA into HIP and SYCL, respectively. Detailed information about these frameworks is given in Section 8. The environment setting is described below:
We use the CUDA SDK samples to evaluate coverage for the different frameworks, and the Rodinia, Hetero-mark, and Crystal benchmarks to evaluate performance. Rodinia and Hetero-mark consist of classical HPC applications (e.g., Gaussian elimination, AES encryption). The Crystal benchmark is a GPU database benchmark that contains 13 high-performance implementations of SQL queries. The evaluation mainly focuses on the generated kernel functions for the CPU backend, not on the runtime system (the CPU multi-thread runtime system). However, for most CUDA kernels, the workloads are highly related to the input data values, which affect memory locality and branch behavior. Thus, to make our evaluation close to real cases, instead of using dummy input and evaluating the execution time of a single kernel launch and a single GPU block, we use the original inputs and runtime configurations (e.g., grid size, block size) provided by the benchmarks.
7.1 Coverage
In this section, we verify that using hierarchical collapsing results in higher coverage. Table 1 analyzes the examples in CUDA SDK samples 10.1 that do not require special hardware support (e.g., tensor cores, unified memory). As shown in the table, the existing frameworks can automatically support at most 21 kernels (coverage = 68%); the failing kernels use new CUDA features. COX supports 28 kernels (coverage = 90%).
Table 1.

Kernel name                 Features                  DPCT  HIPIFY  COX
initVectors                                           ✓     ✓       ✓
gpuDotProduct               warp cooperative group    ✗     ✗       ✓
gpuSpMV                                               ✓     ✓       ✓
r1_div_x                                              ✓     ✓       ✓
a_minus                                               ✓     ✓       ✓
gpuConjugateGradient        grid sync                 ✗     ✗       ✗
multigpuConjugateGradient   multi grid sync           ✗     ✗       ✗
MatrixMulCUDA                                         ✓     ✓       ✓
matrixMul                                             ✓     ✓       ✓
copyp2p                                               ✓     ✓       ✓
reduce0                     block cooperative group   ✓     ✓       ✓
reduce1                     block cooperative group   ✓     ✓       ✓
reduce2                     block cooperative group   ✓     ✓       ✓
reduce3                     block cooperative group   ✓     ✓       ✓
reduce4                     warp cooperative group    ✗     ✗       ✓
reduce5                     warp cooperative group    ✗     ✗       ✓
reduce6                     warp cooperative group    ✗     ✗       ✓
shfl_intimage_rows          warp shuffle              ✓     ✗       ✓
shfl_vertical_shfl          warp shuffle              ✓     ✗       ✓
shfl_scan_test              warp shuffle              ✗*    ✗       ✓
uniform_add                                           ✓     ✓       ✓
reduce                      warp cooperative group    ✗     ✗       ✓
reduceFinal                 warp cooperative group    ✗     ✗       ✓
simpleKernel                                          ✓     ✓       ✓
VoteAnyKernel1              warp vote                 ✓     ✗       ✓
VoteAllKernel2              warp vote                 ✓     ✗       ✓
VoteAnyKernel3              warp vote                 ✓     ✗       ✓
spinWhileLessThanone                                  ✓     ✓       ✓
matrixMultiplyKernel                                  ✓     ✓       ✓
vectorAdd                                             ✓     ✓       ✓
filter_arr                  activated thread sync     ✗     ✗       ✗
coverage                                              68%   52%     90%

Table 1. Coverage of COX Compared to Other Frameworks
The CUDA features supported by DPCT, HIPIFY, and COX are also shown in Figure 7.
Fig. 7.
Fig. 7. The CUDA features supported by HIPIFY, DPCT, and COX .
Although COX achieves the highest coverage by using hierarchical collapsing, three kernels still cannot be supported. gpuConjugateGradient and multigpuConjugateGradient rely on synchronization between different grids and devices, which utilize the grid cooperative group and the multi-grid cooperative group, respectively. filter_arr uses a dynamic cooperative group: it dynamically groups all activated threads. As discussed in Section 2.2.2, all of these features would have to be supported at the runtime level: frameworks should schedule threads accordingly at runtime, and each thread can only know whether it is activated during runtime. Supporting these runtime features is left for future work.
We also evaluate the three frameworks on the Rodinia, Hetero-mark, and Crystal benchmarks. On all of them, COX achieves higher coverage than the other frameworks. However, on Rodinia and Hetero-mark, the higher coverage is not the result of hierarchical collapsing, as none of these CUDA kernels use warp-level functions; rather, because DPCT and HIPIFY use source-to-source translation, they cannot handle complex C++ constructs (e.g., complex macros). Crystal contains 13 examples, three of which use warp-level features. COX supports all 13, while HIPIFY only supports the remaining 10 that do not use warp-level features. DPC++ does not support atomicCAS functions on CPUs and thus cannot support any of the Crystal applications. The evaluation results for these real benchmarks are reported in Section 7.3.
7.2 Performance on Classical Tasks
In this section, we evaluate and analyze performance on classical tasks, which are simple enough to analyze under different configurations.
7.2.1 Flat collapsing vs. hierarchical collapsing.
Compared with flat collapsing, hierarchical collapsing achieves higher coverage by using nested loops. However, for CUDA programs that do not use warp-level features, the nested loops are unnecessary. In this section, we compare the performance of flat collapsing and hierarchical collapsing on classical linear operations that do not use warp-level features (Figure 8). In all configurations, programs generated by hierarchical collapsing are slower than those generated by flat collapsing, because the nested loops involve more computation and instructions and also make the programs harder for compilers to optimize. Thus, COX applies hierarchical collapsing only to CUDA kernels that use warp-level features and applies flat collapsing to the rest.
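The difference between the two transformations can be sketched on a simple vector-add kernel. The sketch below is a simplified illustration, not COX's actual output; the function names and the 64-thread/32-lane configuration are chosen for the example:

```cpp
#include <cassert>
#include <vector>

constexpr int kBlockSize = 64;  // threads per block in this example
constexpr int kWarpSize = 32;   // CUDA warp width

// Flat collapsing: the kernel body c[tid] = a[tid] + b[tid] is wrapped in a
// single loop over every thread of the block.
void vector_add_flat(const float* a, const float* b, float* c) {
  for (int tid = 0; tid < kBlockSize; ++tid)
    c[tid] = a[tid] + b[tid];
}

// Hierarchical collapsing: an outer loop over warps and an inner loop over
// lanes. The warp boundary between the two loop levels is where warp-level
// functions can be inserted, but the extra nesting adds loop overhead and
// complicates compiler optimization for kernels that never need it.
void vector_add_hier(const float* a, const float* b, float* c) {
  for (int warp = 0; warp < kBlockSize / kWarpSize; ++warp)
    for (int lane = 0; lane < kWarpSize; ++lane) {
      int tid = warp * kWarpSize + lane;
      c[tid] = a[tid] + b[tid];
    }
}
```

Both loops compute the same result; the only difference is the loop structure that the later compiler passes must optimize.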
Fig. 8. The slowdown of programs generated by hierarchical collapsing compared with programs generated by flat collapsing.
7.2.2 Variable vs. Hardcoding.
COX supports two modes: The variable mode generates CPU programs that keep the CUDA runtime configurations (e.g., grid size, block size) as variables, while the hardcoding mode replaces these configuration variables with constants. Although the programs generated by both modes are forwarded to LLVM's optimizer with the \( -O3 \) flag, there is an obvious difference between them, especially when compiling complicated kernels. Figure 9 shows the speedup of hardcoding mode over variable mode. The two modes differ little for the VectorAdd kernel, which is simple enough to be vectorized by compiler optimizations even when the block size is not provided at compile time. For more complicated kernels, however, hardcoding mode generates programs with higher performance. The drawback is that hardcoding mode requires recompilation whenever the CUDA runtime configuration changes, which incurs extra compilation time. In the following sections, the variable mode is used for comparison with other frameworks.
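The effect of hardcoding can be illustrated with a small sketch (the function names and the block size of 256 are hypothetical): when the block size is a compile-time constant, the thread loop has a known trip count, so the compiler can unroll and vectorize it without the runtime checks and scalar epilogue needed for an unknown bound.

```cpp
#include <cassert>

// Variable mode: the block size arrives at runtime, so the trip count of the
// thread loop is unknown at compile time; the vectorizer must guard the loop
// and emit a scalar remainder for trip counts not divisible by the vector width.
float dot_variable(const float* a, const float* b, int block_size) {
  float sum = 0.0f;
  for (int tid = 0; tid < block_size; ++tid) sum += a[tid] * b[tid];
  return sum;
}

// Hardcoding mode: the same loop with the launch configuration baked in as a
// constant. The trip count (256) is known, so the compiler can fully unroll
// and vectorize -- at the cost of recompiling whenever the configuration changes.
constexpr int kBlockSize = 256;
float dot_hardcoded(const float* a, const float* b) {
  float sum = 0.0f;
  for (int tid = 0; tid < kBlockSize; ++tid) sum += a[tid] * b[tid];
  return sum;
}
```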
Fig. 9. Compared with variable mode, hardcoding mode generates faster programs.
7.2.3 SIMD Instructions.
For CPU programs, SIMD instructions are necessary to achieve high performance [2, 9, 24, 45]. Table 2 shows the warp vote function execution time with and without AVX. With AVX instructions, around 10\( \times \) speedup is achieved for both functions; the benefit comes from executing fewer instructions and branches.
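The instruction and branch counts in Table 2 are intuitive from how a warp vote lowers to CPU code. A scalar fallback loops and branches once per lane, whereas a vector implementation packs the 32 lane predicates into one bitmask (what an AVX movemask instruction does in a single step for its lanes) and reduces the vote to one integer comparison. The following is a scalar sketch of the mask-based idea, not COX's actual AVX implementation:

```cpp
#include <cassert>
#include <cstdint>

// Branch-per-lane fallback: one loop iteration and one branch per warp lane.
bool vote_any_scalar(const bool* pred) {
  for (int lane = 0; lane < 32; ++lane)
    if (pred[lane]) return true;
  return false;
}

// Mask-based version: pack the 32 lane predicates into a single 32-bit mask.
// With AVX this packing is done by a movemask instruction; here it is written
// as a plain loop for portability.
uint32_t pack_mask(const bool* pred) {
  uint32_t mask = 0;
  for (int lane = 0; lane < 32; ++lane)
    mask |= uint32_t(pred[lane]) << lane;
  return mask;
}

// Once the predicates live in a mask, each vote is one branch-free comparison.
bool vote_any_mask(uint32_t mask) { return mask != 0; }
bool vote_all_mask(uint32_t mask) { return mask == 0xFFFFFFFFu; }
```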
Table 2. Both Functions Gain around 10\( \times \) Speedup When Using AVX Instructions

Function | Metric       | w/ AVX        | w/o AVX
vote any | time (µs)    | 0.241         | 2.542
         | instructions | 1,447,901,852 | 23,472,339,251
         | branches     | 100,110,593   | 4,260,452,162
vote all | time (µs)    | 0.236         | 2.992
         | instructions | 1,384,476,021 | 29,552,745,486
         | branches     | 100,108,177   | 5,220,517,219
7.3 Performance on Real Benchmarks
The runtime configurations and kernel execution times for COX, DPC++, and HIP-CPU are recorded in Table 3. Since DPC++ does not support the atomicCAS instruction on CPUs, it cannot support any of the applications in Crystal.
Table 3. Runtime Configuration and Total Kernel Execution Time (ms) on Hetero-mark, Rodinia, and Crystal Benchmarks

Benchmark   | Application    | Kernel launches | Grid size | Block size | COX     | DPC++   | HIP-CPU
Hetero-mark | aes            | 1      | 16,384 | 64  | 500.3   | 169.7   | 519.3
            | bs             | 2,048  | 32     | 64  | 99.743  | 65.5    | 136.9
            | ep             | 40     | 511    | 64  | 7038.79 | 59.3    | 2415
            | fir            | 2,048  | 64     | 64  | 151.4   | 89.55   | 227.7
            | ga             | 256    | 16     | 64  | 2585.31 | 71.24   | 1858
            | kmeans         | 144    | 1,563  | 64  | 2991.6  | 141     | 1196.5
            | pr             | 1,000  | 32     | 64  | 330.632 | 88.02   | 168.7
            | hist           | 1      | 128    | 64  | 228.87  | 221.359 | 193.8
Crystal     | q11            | 10,000 | 1      | 32  | 259.525 | —       | —
            | q12            | 10,000 | 1      | 32  | 302.632 | —       | —
            | q13            | 10,000 | 1      | 32  | 298.223 | —       | —
            | q21            | 40,000 | 1      | 32  | 443.308 | —       | 1180.49
            | q22            | 40,000 | 1      | 128 | 1003.57 | —       | 2057.56
            | q23            | 40,000 | 1      | 128 | 923.145 | —       | 1984.86
            | q31            | 40,000 | 1      | 128 | 1165.66 | —       | 2174.03
            | q32            | 40,000 | 1      | 128 | 1112.99 | —       | 2112.73
            | q33            | 40,000 | 1      | 128 | 1184.54 | —       | 2102.14
            | q34            | 40,000 | 1      | 128 | 1093.68 | —       | 2056.16
            | q41            | 50,000 | 1      | 128 | 1428.59 | —       | 2659.43
            | q42            | 50,000 | 1      | 128 | 1497.65 | —       | 2745.15
            | q43            | 50,000 | 1      | 128 | 1387.41 | —       | 2648.31
Rodinia     | hotspot3D      | 100    | 1,024  | 256 | 699     | 131     | 893.8
            | particlefilter | 1      | 157    | 128 | 121.776 | —       | 32.884
            | bpnn_forward   | 1      | 4,096  | 256 | 14.948  | 61.576  | —
            | bfs            | 24     | 1,954  | 512 | 39.726  | 50.458  | 272.7
            | btree          | 1      | 10,000 | 256 | 11.181  | 20.9    | —
            | gaussian       | 414    | 1      | 512 | 28.35   | 56.8    | 200.805
            | hotspot        | 1      | 1,849  | 256 | 5.91    | 63.512  | 33.773
            | srad           | 4      | 16,384 | 256 | 82.347  | 150     | 933.18
            | pathfinder     | 5      | 463    | 256 | 33      | 42.3    | 168.72
            | nn             | 1      | 5,120  | 256 | 3.542   | 41.8    | 18.64

("—" marks applications a framework does not support.)
The framework with the lowest kernel execution time is marked in red.
7.3.1 COX vs. HIP-CPU.
To execute CUDA on a CPU, COX maps GPU threads to iterations of the for-loops generated by hierarchical collapsing, whereas HIP-CPU maps GPU threads to fibers (see Section 8 for more detail). HIP-CPU's mapping strategy is more straightforward and easier to implement. However, CUDA kernels are typically lightweight, which suits throughput-oriented architectures; wrapping these lightweight kernels with for-loops produces programs with heavier per-thread workloads that better suit latency-oriented architectures (CPUs) equipped with SIMD instructions. Besides, although fibers have lightweight context switching, the overhead is still larger than that of iterating through a for-loop. Thus, in most cases, COX is faster than HIP-CPU. The higher performance is the result of higher SIMD utilization and lower thread-switching overhead, which lead to fewer executed instructions (Table 4). In all Crystal applications and some Rodinia and Hetero-mark applications (gaussian, nn, fir), COX utilizes more SIMD instructions, which leads to fewer executed instructions.
Table 4. Profiling Data for COX and HIP-CPU

Benchmark   | Application    | Metric     | COX         | HIP-CPU
Crystal     | q21            | sse-packed | 802548      | 859188
            |                | avx128     | 239890      | 0
            |                | avx256     | 599725      | 0
            |                | # of inst  | 571967850   | 1123266901
            | q31            | sse-packed | 825341      | 117468
            |                | avx128     | 239914      | 0
            |                | avx256     | 599785      | 0
            |                | # of inst  | 580012183   | 1168657481
            | q41            | sse-packed | 1049937     | 243728
            |                | avx128     | 299862      | 0
            |                | avx256     | 749655      | 0
            |                | # of inst  | 730220536   | 1478500813
Rodinia     | gauss          | sse-packed | 2706        | 1704
            |                | avx128     | 828         | 414
            |                | avx256     | 2070        | 414
            |                | # of inst  | 1668761     | 3450999
            | nn             | sse-packed | 313         | 47
            |                | avx128     | 2           | 2
            |                | avx256     | 5           | 2
            |                | # of inst  | 10385       | 18597
            | particlefilter | sse-packed | 78972949    | 565778475
            |                | avx128     | 10584731    | 289510335
            |                | avx256     | 3392        | 4592
            |                | # of inst  | 1.02885E+11 | 41436192956
Hetero-mark | ep             | sse-packed | 93609       | 171036
            |                | avx128     | 1206        | 1446
            |                | avx256     | 13100       | 13715
            |                | # of inst  | 1.48391E+11 | 33645862337
            | fir            | sse-packed | 12806668    | 13069300
            |                | avx128     | 25194       | 3436
            |                | avx256     | 56210       | 10830
            |                | # of inst  | 13803614808 | 14015215887
            | pr             | sse-packed | 2599873     | 2724917
            |                | avx128     | 2360574     | 2350348
            |                | avx256     | 862755      | 841493
            |                | # of inst  | 39217359547 | 14129291883
            | hist           | sse-packed | 63425       | 97148
            |                | avx128     | 1527        | 2259
            |                | avx256     | 22024785    | 22026540
            |                | # of inst  | 12102193872 | 3942375563
            |                | atomic     | 39187287    | 39202294
However, in some cases, HIP-CPU is faster than COX. In PR, the major kernel workload is irregular memory access, which cannot benefit from SIMD instructions. As this application launches one thousand kernels, the runtime framework is the critical component, which is beyond the scope of this article. particlefilter and EP contain heavy CUDA kernels (Code 6) that can be optimized with SIMD instructions. HIP-CPU relies on GCC 11 to auto-vectorize the source code, while COX applies LLVM passes on the transformed LLVM IR for auto-vectorization; the transformed IR is too complex to be fully optimized. Thus, in these cases, HIP-CPU has higher SIMD utilization and achieves higher performance. For the hist application, the critical component is atomic instructions. Since COX and HIP-CPU execute a similar number of atomic instructions, they achieve close performance.
Code 6. CUDA kernels in EP and Particlefilter.
7.3.2 COX vs. DPC++.
As DPC++ is not fully open source, we cannot fully analyze it and compare it with COX; we can only draw inferences from profiling data. From the profiling data in Table 5, DPC++ executes more atomic instructions in all cases, which may be generated by the DPC++ runtime system for CPU thread scheduling. For the gaussian and nn applications, both COX and DPC++ achieve high SIMD utilization, but DPC++ issues many more atomic instructions and memory accesses, so it is slower than COX. For EP and FIR, DPC++ achieves much higher SIMD utilization, resulting in many fewer executed instructions; thus, DPC++ is much faster than COX. Other applications (e.g., aes and kmeans) involve a large number of blocks, and the kernel workloads are relatively light compared with the runtime overhead. In these situations, performance mainly depends on how the runtime system implements kernel launch and synchronization. Further explanation of DPC++ would require reverse engineering, which is beyond the scope of this article.
Table 5. Profiling Data for COX and DPC++

Application | Metric     | COX         | DPC++
gauss       | mem atomic | 236054      | 4533963
            | mem        | 652226992   | 934676835
            | sse-packed | 475790      | 26797775
            | avx128     | 88155       | 6047607
            | avx256     | 90072       | 6350822
            | # of inst  | 1768929320  | 1489909537
nn          | mem atomic | 10506926    | 27954105
            | mem        | 1952231817  | 2551670355
            | sse-packed | 4034949     | 16555799
            | avx128     | 10519077    | 16726252
            | avx256     | 30403929    | 38158606
            | # of inst  | 4137101986  | 5547944991
ep          | mem atomic | 36128       | 14992385
            | mem        | 68025621670 | 1269791602
            | sse-packed | 93609       | 18724231
            | avx128     | 1206        | 9662776
            | avx256     | 13100       | 92355635
            | # of inst  | 1.48391E+11 | 2936592756
fir         | mem atomic | 845405      | 23656931
            | mem        | 6658160838  | 2514792478
            | sse-packed | 12806668    | 42218234
            | avx128     | 25194       | 25188562
            | avx256     | 56210       | 41781477
            | # of inst  | 13803614808 | 6283526400
8 Related Work
CPUs are MPMD architectures, while GPUs are SPMD architectures. Although users can naively execute each GPU thread on a CPU thread, due to the limited parallelism in CPUs, the system can only execute around 100 CPU threads simultaneously, far fewer than a GPU. Thus, to match the thread count of a GPU, the CPU has to create more threads than it can execute simultaneously, which incurs a large amount of thread context-switching overhead. Two approaches address this issue. The first is to accelerate context switching on the CPU. Some researchers extend the CPU architecture to accelerate context switching [11]; these hardware-level extensions are beyond the scope of this article. At the software level, Reference [15] proposes lightweight threading to accelerate context switching: a context switch only stores and reloads a few registers while keeping the stack memory. Most modifications are at the runtime level, so users can directly use the original GPU source code. As reported in Reference [54], the AMD CPU OpenCL implementation is based on this technology. However, even with these optimizations, there is still significant context-switching overhead, around 10 ns per switch.
Thus, another direction is to increase the workload of each CPU thread: instead of executing a single GPU thread, each CPU thread executes all GPU threads within a block. This mechanism yields two benefits. First, it increases per-thread execution time enough that context-switching overhead becomes negligible. Second, with more workload in a single thread, there are more opportunities for optimization (e.g., vectorization and loop transformation).
This mechanism has several names: microthreading [53], thread aggregation [58], thread-fusion [12], region-based serialization [54], loop chunking [51], and kernel serialization [8]. In this article, we give this mechanism a new name: flat collapsing. In References [53, 55], the authors propose wrapping an SPMD kernel with loops whose trip count equals the block size; each loop iteration then simulates one GPU thread within a block, and a CPU thread is mapped to a GPU block. An important technical detail is supporting synchronization instructions: To maintain correctness, compilers must wrap the instructions before and after a synchronization instruction into separate loops. A similar technique is discussed in Reference [51], which utilizes loop transformations (e.g., loop strip-mining, interchange, distribution, unswitching) to transform SPMD execution models with synchronization into task-parallel execution models. The authors of Reference [58] propose improved static analysis to vectorize the generated loop programs and another algorithm that wraps the original kernels with loops while avoiding the additional synchronization points of previous works. In some GPU architectures, such as NVIDIA GPUs, there is an implicit lock step within a group of threads; the author of Reference [16] proposes transformations that detect these implicit warp-level synchronizations and maintain them during transformation. The authors of Reference [54] propose using C Extensions for Array Notation to further accelerate the generated CPU programs with SIMD execution and better spatial locality.
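The barrier-handling detail above can be sketched as follows: a kernel with a __syncthreads() between two statements is split at the barrier into two thread loops, so every thread finishes the first region before any thread enters the second. This is a simplified illustration with a hypothetical 4-thread block, not the exact output of any of the cited compilers:

```cpp
#include <cassert>

constexpr int kBlockSize = 4;  // illustrative block size

// CUDA-style kernel (pseudocode):
//   shared[tid] = in[tid];
//   __syncthreads();
//   out[tid] = shared[(tid + 1) % blockDim.x];
//
// Flat collapsing splits the body at the barrier into two loops: the first
// loop runs region A for every thread, then the second loop runs region B.
// Sequential execution of the two loops reproduces the barrier semantics
// without any real synchronization primitive.
void kernel_collapsed(const int* in, int* out) {
  int shared[kBlockSize];  // stands in for block-shared memory
  for (int tid = 0; tid < kBlockSize; ++tid)  // region A (before barrier)
    shared[tid] = in[tid];
  // __syncthreads() boundary: all of region A has completed here.
  for (int tid = 0; tid < kBlockSize; ++tid)  // region B (after barrier)
    out[tid] = shared[(tid + 1) % kBlockSize];
}
```

Without the split, thread 0's region B could read shared[1] before thread 1 had written it, which is exactly the race the barrier prevents on a GPU.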
Several projects execute CUDA on non-NVIDIA devices. In the early days, NVIDIA provided an emulation framework [1] to execute CUDA on a CPU, in which each thread within a GPU block is executed by a CPU thread. Horus [14] is another emulator; it parses and executes NVIDIA PTX instructions on CPU devices. These emulators are intended for debugging rather than performance. In MCUDA [55], the authors use source-to-source translation to translate CUDA to C with flat collapsing. Ocelot [12] uses the same mechanism but converts at the PTX level instead of performing source-to-source translation, to avoid recompiling. MapCG [18] is a hybrid computing framework that uses source-to-source translation to translate CUDA kernels to C programs. The author of Reference [28] proposes another hybrid computing framework, based on Ocelot, that translates GPU programs at the PTX level. Cumulus [8] uses Clang to parse CUDA programs and modifies them at the AST level; it is mainly concerned with CUDA runtime support and, for the compilation part, reuses the transformation in MCUDA. Instead of directly translating CUDA/PTX to CPU executables, other projects utilize the portability of other front-end languages. The authors of References [17, 49] propose source-to-source translation from CUDA to OpenCL, while References [44, 48] implement the translation on LLVM IR. The DPC++ Compatibility Tool [19] and HIPIFY [6] translate CUDA to the source languages of Intel and AMD devices, respectively.
Most related works only focus on supporting old CUDA features. However, the rapid evolution of GPU hardware and software stacks brings many new features that are important for high performance, such as warp-level collectives, unified memory, and CUDA Graphs. Achieving high coverage of these new features is an ongoing effort. The researchers in Reference [42] propose using explicit barriers and memory exchanges to support warp shuffle on OpenMP, which shares the same insight as COX.
OpenCL is an open standard that allows executing SPMD programs on MPMD architectures. POCL [23] is an open-source OpenCL implementation with a CPU backend; to support SPMD programs on CPUs, it implements flat collapsing at the LLVM IR level. The authors of Reference [26] also apply flat collapsing to OpenCL but with a different method of inserting extra synchronization and finding the Parallel Regions, which results in fewer extra synchronization barriers. However, this method does not extend to hierarchical Parallel Regions and thus cannot be used to support warp-level features. Reference [27] proposes another OpenCL implementation that mainly focuses on supporting OpenCL programs on multi-device clusters with heterogeneous devices.
8.1 DPC++
DPC++ [52] is a programming language designed by Intel for data-parallel programming, and Intel provides the oneAPI framework to execute DPC++ programs on Intel CPUs and GPUs. To support CUDA on Intel devices, Intel provides DPCT [19], a tool that migrates CUDA programs to DPC++ programs. DPC++, like CUDA, provides an interface for SPMD programs. Currently, there is little documentation of how the oneAPI framework executes DPC++ programs on CPUs. From the Intel community [20], we infer that it also implements the flat collapsing mechanism. From the evaluation in Section 7.3, we find that the major optimizations in DPC++'s CPU backend target applications with a large number of kernel launches and a large number of blocks. DPC++ only supports Intel CPUs/GPUs; thus, it likely also contains compiler optimizations customized for Intel CPU architectures.
8.2 HIP-CPU
HIP [4] is a GPU programming language designed by AMD for AMD GPUs. The design philosophy of HIP is to stay close enough to CUDA that developers can easily translate CUDA programs to HIP, and AMD provides HIPIFY [6] to automate this translation. HIP-CPU [5] is a framework developed by AMD to execute HIP on CPUs. It uses a different mapping mechanism: Instead of mapping GPU threads to loop iterations, it maps GPU threads to CPU fibers. Although this mechanism does not generate explicit loops, the fibers form implicit loops through context switching. Since it has no concept of a CUDA warp, this transformation can also be regarded as a subclass of flat collapsing.
The SPMD-to-MPMD transformation in DPC++, HIP-CPU, and COX is shown in Figure 10.
Fig. 10. The SPMD-to-MPMD transformations implemented by DPC++, HIP-CPU, and COX.
9 Conclusion
COX is a framework that supports executing CUDA on CPUs. In this article, we focus on the compilation part of the framework. In particular, as existing frameworks cannot support CUDA warp-level functions, we propose hierarchical collapsing to support these functions on CPUs. With hierarchical collapsing, COX achieves higher coverage on the CUDA 10.1 SDK and Crystal benchmarks than existing frameworks. We also compare the performance of the generated kernel programs with existing frameworks and find that it varies greatly across applications. Using profiling tools, we find that although hierarchical collapsing achieves comparable or even higher performance in most applications, for some applications the generated programs are too complex to be fully optimized by compilers. Future work will provide compiler optimizations to increase SIMD instruction utilization in programs generated by hierarchical collapsing.
Acknowledgment
We thank Blaise Tine, Jeffrey Young, Jiashen Cao, Jun Chen, and Bhanu Garg, as well as the hardware support from NSF CCRI #2016701 and Intel.
Footnotes
1. Warp is now officially a part of the programming model instead of being a microarchitecture concept.
2. This example is a simplified version of reduction_kernel.cu in the CUDA 10.1 SDK.
3. A latch is a node in the loop that has an edge to the loop header. An exiting edge is an edge from inside the loop to a node outside of the loop. The source of such an edge is called an exiting block; its target is an exit block [29].
4. To keep the article brief, we move the proof into the Appendix.
Jose-Ignacio Agulleiro and Jose-Jesus Fernandez. 2015. Tomo3D 2.0—Exploitation of advanced vector extensions (AVX) for 3D reconstruction. J. Struct. Biol. 189, 2 (2015), 147–152.
Aksel Alpay and Vincent Heuveline. 2020. SYCL beyond OpenCL: The architecture, current state and future direction of hipSYCL. In Proceedings of the International Workshop on OpenCL.
Abu Asaduzzaman, Alec Trent, S. Osborne, C. Aldershof, and Fadi N. Sibai. 2021. Impact of CUDA and OpenCL on parallel and distributed computing. In Proceedings of the 8th International Conference on Electrical and Electronics Engineering (ICEEE’21). IEEE, 238–242.
Vera Blomkvist Karlsson. 2021. Cumulus - translating CUDA to sequential C++: Simplifying the process of debugging CUDA programs. Dissertation. KTH ROYAL INSTITUTE OF TECHNOLOGY.
Shuai Che, Michael Boyer, Jiayuan Meng, David Tarjan, Jeremy W. Sheaffer, Sang-Ha Lee, and Kevin Skadron. 2009. Rodinia: A benchmark suite for heterogeneous computing. In Proceedings of the IEEE International Symposium on Workload Characterization (IISWC’09). IEEE, 44–54.
Gregory Diamos, Andrew Kerr, Sudhakar Yalamanchili, and Nathan Clark. 2010. Ocelot: A dynamic optimization framework for bulk-synchronous applications in heterogeneous systems. In Proceedings of the 19th International Conference on Parallel Architectures and Compilation Techniques (PACT’10). IEEE, 353–364.
Amr S. Elhelw and Sreepathi Pai. 2020. Horus: A modular GPU emulator framework. In Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS’20). IEEE, 104–106.
Jayanth Gummaraju, Ben Sander, Laurent Morichetti, Benedict R. Gaster, Michael Houston, and Bixia Zheng. 2010. Twin peaks: A software platform for heterogeneous computing on general-purpose and graphics processors. In Proceedings of the 19th International Conference on Parallel Architectures and Compilation Techniques (PACT’10). IEEE, 205–215.
Ziyu Guo, Eddy Zheng Zhang, and Xipeng Shen. 2011. Correctly treating synchronizations in compiling fine-grained spmd-threaded programs for cpu. In Proceedings of the International Conference on Parallel Architectures and Compilation Techniques. IEEE, 310–319.
Chuntao Hong, Dehao Chen, Wenguang Chen, Weimin Zheng, and Haibo Lin. 2010. MapCG: Writing parallel program portable between CPU and GPU. In Proceedings of the 19th International Conference on Parallel Architectures and Compilation Techniques. 217–226.
Pekka Jääskeläinen, Carlos Sánchez de La Lama, Erik Schnetter, Kalle Raiskila, Jarmo Takala, and Heikki Berg. 2015. pocl: A performance-portable OpenCL implementation. Int. J. Parallel Program. 43, 5 (2015), 752–785.
Hwancheol Jeong, Sunghoon Kim, Weonjong Lee, and Seok-Ho Myung. 2012. Performance of SSE and AVX instruction sets. In Proceedings of the 30th International Symposium on Lattice Field Theory.
Ralf Karrenberg and Sebastian Hack. 2011. Whole-function vectorization. In Proceedings of the International Symposium on Code Generation and Optimization (CGO’11). 141–150.
Ralf Karrenberg and Sebastian Hack. 2012. Improving performance of OpenCL on CPUs. In Proceedings of the International Conference on Compiler Construction. Springer, 1–20.
Jungwon Kim, Sangmin Seo, Jun Lee, Jeongho Nah, Gangwon Jo, and Jaejin Lee. 2012. SnuCL: An OpenCL framework for heterogeneous CPU/GPU clusters. In Proceedings of the 26th ACM International Conference on Supercomputing. 341–352.
Changmin Lee, Won Woo Ro, and Jean-Luc Gaudiot. 2014. Boosting CUDA applications with CPU–GPU hybrid computing. Int. J. Parallel Program. 42, 2 (2014), 384–404.
Saeed Maleki, Yaoqing Gao, Maria J. Garzar, Tommy Wong, David A. Padua, et al. 2011. An evaluation of vectorizing compilers. In Proceedings of the International Conference on Parallel Architectures and Compilation Techniques. IEEE, 372–382.
Lifeng Nai, Yinglong Xia, Ilie G Tanase, Hyesoon Kim, and Ching-Yung Lin. 2015. GraphBIG: Understanding graph computing in the context of industrial solutions. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC’15). IEEE, 1–12.
Dorit Nuzman and Richard Henderson. 2006. Multi-platform auto-vectorization. In International Symposium on Code Generation and Optimization (CGO’06). IEEE.
Atmn Patel, Shilei Tian, Johannes Doerfert, and Barbara Chapman. 2021. A virtual GPU as developer-friendly OpenMP offload target. In Proceedings of the 50th International Conference on Parallel Processing Workshop. 1–7.
Hugh Perkins. 2016. Cltorch: A hardware-agnostic backend for the torch deep neural network library, based on opencl. arXiv preprint arXiv:1606.04884 (2016).
Hugh Perkins. 2017. CUDA-on-CL: A compiler and runtime for running NVIDIA CUDA C++ 11 applications on OpenCL 1.2 Devices. In Proceedings of the 5th International Workshop on OpenCL. 1–4.
Matt Pharr and William R Mark. 2012. A SPMD compiler for high-performance CPU programming. In Proceedings of the Innovative Parallel Computing: Foundations & Applications of GPU, Manycore, and Heterogeneous Systems.
Vasileios Porpodas. 2017. Supergraph-slp auto-vectorization. In Proceedings of the 26th International Conference on Parallel Architectures and Compilation Techniques (PACT’17). IEEE, 330–342.
Ruobing Han, Blaise Tine, Jaewon Lee, Jaewoong Sim, and Hyesoon Kim. 2021. Supporting CUDA for an extended RISC-V GPU architecture. In Proceedings of the 5th Workshop on Computer Architecture Research with RISC-V (CARRV'21).
Paul Sathre, Mark Gardner, and Wu-chun Feng. 2019. On the portability of cpu-accelerated applications via automated source-to-source translation. In Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region. 1–8.
Anil Shanbhag, Samuel Madden, and Xiangyao Yu. 2020. A study of the fundamental performance characteristics of GPUs and CPUs for database analytics. In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD’20). 1617–1632.
Jun Shirako, Jisheng M. Zhao, V. Krishna Nandivada, and Vivek N. Sarkar. 2009. Chunking parallel loops in the presence of synchronization. In Proceedings of the 23rd International Conference on Supercomputing. 181–192.
André Silveira, Rafael Bohrer Avila, Marcos E. Barreto, and Philippe Olivier Alexandre Navaux. 2000. DPC++: Object-oriented programming applied to cluster computing. In Proceedings of the International Conference on Parallel and Distributed Processing Techniques and Applications (PDPTA’00).
John A. Stratton, Vinod Grover, Jaydeep Marathe, Bastiaan Aarts, Mike Murphy, Ziang Hu, and Wen-mei W Hwu. 2010. Efficient compilation of fine-grained SPMD-threaded programs for multicore CPUs. In Proceedings of the 8th Annual IEEE/ACM International Symposium on Code Generation and Optimization. 111–119.
John A. Stratton, Hee-Seok Kim, Thoman B. Jablin, and Wen-Mei W. Hwu. 2013. Performance Portability in Accelerated Parallel Kernels. Center for Reliable and High-Performance Computing.
John A. Stratton, Sam S. Stone, and W. Hwu Wen-mei. 2008. MCUDA: An efficient implementation of CUDA kernels for multi-core CPUs. In Proceedings of the International Workshop on Languages and Compilers for Parallel Computing. Springer, 16–30.
Yifan Sun, Xiang Gong, Amir Kavyan Ziabari, Leiming Yu, Xiangyu Li, Saoni Mukherjee, Carter McCardwell, Alejandro Villegas, and David Kaeli. 2016. Hetero-mark, a benchmark suite for CPU-GPU collaborative computing. In Proceedings of the IEEE International Symposium on Workload Characterization (IISWC’16). IEEE, 1–10.
Yao Zhang, Mark Sinclair, and Andrew A. Chien. 2013. Improving performance portability in OpenCL programs. In Proceedings of the International Supercomputing Conference. Springer, 136–150.