
COX: Exposing CUDA Warp-level Functions to CPUs

Published: 16 September 2022
    Abstract

    As CUDA has become the de facto programming language for data-parallel applications such as high-performance computing and machine learning, running CUDA on other platforms has become a compelling option. Although several efforts have attempted to support CUDA on devices other than NVIDIA GPUs, due to the extra translation steps, their support always lags a few years behind CUDA’s latest features. In particular, the new CUDA programming model exposes the warp concept in the programming language, which greatly changes the way CUDA code should be mapped to CPU programs. In this article, hierarchical collapsing, which correctly supports CUDA warp-level functions on CPUs, is proposed. To verify hierarchical collapsing, we build a framework, COX, that supports executing CUDA source code on the CPU backend. With hierarchical collapsing, 90% of the kernels in the CUDA SDK samples can be executed on CPUs, much higher than in previous works (68%). We also evaluate the performance with benchmarks for real applications and show that hierarchical collapsing generally generates CPU programs with performance comparable to or even higher than previous projects.

    1 Introduction

    The high computing power of GPUs has fostered a strong software ecosystem built on GPU programs. Although there are other choices for GPU programming, such as the Heterogeneous-compute Interface for Portability (HIP) [4], OpenMP, and DPC++ [19], in recent years CUDA has remained dominant. In the realm of deep learning, as concluded in Reference [43], before 2014 all popular frameworks (e.g., Caffe, Tensorflow, Torch, and MXNET) only supported CUDA for the GPU backend. Although these frameworks have started to support non-NVIDIA GPUs, due to NVIDIA’s high share of the GPU market, most researchers still tend to develop and release their GPU programs written in CUDA [7]. In the multimedia realm, a survey of video editing applications [30] shows that CUDA is supported by 14 of the 17 applications, while OpenCL is only supported by 9 of them.
    Unfortunately, despite the popularity of CUDA programs, NVIDIA GPUs remain the main hardware platform to run them. Although there have been several efforts to run CUDA on non-NVIDIA devices, they lack support for newer CUDA features. There are several challenges in running CUDA on other platforms. One of the biggest challenges is converting Single Program Multiple Data (SPMD) programs to programs for non-SPMD-friendly architectures. In the SPMD programming model, the same kernel is executed by many threads at runtime, and the GPU is built for throughput-oriented processing by supporting many threads (or warps). However, other architectures often do not have that many threads. Thus, CUDA kernel programs need to be converted efficiently to use fewer threads. Supporting CUDA with fewer threads is an essential component for running CUDA on X86, ARM, RISC-V, or Intel GPUs, which have fewer hardware threads than NVIDIA GPUs. We call this mechanism the SPMD-to-MPMD transformation (Multiple Program Multiple Data (MPMD)).
    Several existing projects support running GPU programs on CPUs [1, 6, 11, 12, 13, 15, 19, 21, 23, 26, 37, 53, 55]. Most apply compiler-level transformations to translate GPU programs into a form suitable for CPUs. These projects use the same transformation granularity: a CPU thread is responsible for executing all the workload within a CUDA block (or OpenCL work-group). The CUDA-block-to-CPU-thread mapping is considered optimal based on the following observations. First, the transformed programs only require a few CPU threads (the number of GPU blocks in the original CUDA programs). Fewer CPU threads lead to less context switching. Second, in CUDA programs, the memory accesses within a CUDA block typically have good locality. Thus, wrapping these memory accesses into a CPU thread can also bring spatial/temporal locality for CPU caches [54]. Last, threads within a block have similar computation patterns, which makes them amenable to optimization with Single Instruction Multiple Data (SIMD) instructions [24, 26] in modern CPU architectures, with further optimizations in [25, 31, 34, 35, 36, 46, 47]. Other projects apply optimizations at the hardware level [11] or system level [15] to support executing GPU programs on CPUs, which are beyond the scope of this article.
    The transformation is shown in Figure 1(b). For a given GPU kernel, the translator first splits it into different regions according to synchronization instructions. Each region is then wrapped in a for-loop whose iteration count equals the GPU block size. As this transformation uses single-layer for-loops to wrap each region, we name it flat collapsing. As flat collapsing wraps the original programs with single-layer loops, all threads within the block are mapped to iterations of the for-loops. After mapping, there is no difference among threads within a block except that they have different values of the loop iteration index. This transformation was proposed based on early CUDA programming models and cannot easily support several important features of recent CUDA programming models. One significant change is the warp-level programming model in CUDA.1 This new feature is critical for achieving high performance. One of its most important implications is that threads within a CUDA warp should not be treated the same as threads in other CUDA warps. Thus, in this article, hierarchical collapsing is proposed to support these warp-level CUDA features on CPUs. Throughout the article, the focus is on the CUDA programming model. However, the same techniques would also apply to other SPMD programming models (e.g., OpenCL [32], HIP [4], and DPC++ [52]).
    Fig. 1.
    Fig. 1. The programming model for input CUDA SPMD and output CPU programs by flat collapsing and hierarchical collapsing .
    The main contributions of this article are as follows:
    Propose hierarchical collapsing for the SPMD-to-MPMD transformation and implement it with LLVM passes. The new transformation ensures correctness for GPU programs that use warp-level functions;
    Extend the Parallel Region concept [23] into the Hierarchical Parallel Region to provide the correct translation when the GPU programs have warp-level functions;
    Evaluate the SPMD-to-MPMD transformation on real applications and compare the coverage and performance with existing frameworks.
    To support executing CUDA on CPUs, besides the SPMD-to-MPMD transformation, another technical issue needs to be solved. CUDA provides runtime APIs (e.g., cudaMalloc and kernel launch) to communicate between the host and the device; these APIs have to be replaced or re-implemented for the CPU backend. The CUDA runtime migration is beyond the scope of this article, and we only briefly discuss it in Section 6.
    To verify hierarchical collapsing, we build COX, a framework that supports executing CUDA source code on CPU backends. COX integrates the hierarchical collapsing proposed in this article and also implements a runtime system that uses pthreads to execute the transformed kernels.

    2 Background

    2.1 Running SPMD Programming Models on CPUs

    The basic mechanism to support SPMD programming models on CPUs is to map a GPU block/work-group to a CPU thread and iterate over the threads of the block/work-group using loops [12, 23, 25, 26, 55]. Figure 1(a) and (b) show the input and output of this basic process. This transformation has different names in different projects: microthreading [53], thread aggregation [58], thread-fusion [12], region-based serialization [54], loop chunking [51], and kernel serialization [8]. In this article, this transformation is called flat collapsing, as it uses single-layer loops to represent all threads within a block.
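    As a minimal illustration of flat collapsing (a hand-written sketch under the mapping described above, not COX’s actual output; vecAdd_block and its parameters are hypothetical names), a vector-add kernel and its loop-wrapped CPU counterpart might look as follows:

```cpp
// Input CUDA kernel (SPMD): every thread handles one element.
__global__ void vecAdd(const float *a, const float *b, float *c) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  c[i] = a[i] + b[i];
}

// Sketch of the flat-collapsed CPU version: one CPU thread executes the
// whole block by iterating over the former threadIdx.x values.
void vecAdd_block(const float *a, const float *b, float *c,
                  int block_idx, int b_size) {
  for (int tx = 0; tx < b_size; ++tx) {   // single-layer loop over the block
    int i = block_idx * b_size + tx;
    c[i] = a[i] + b[i];
  }
}
```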
    When a GPU program contains synchronization primitives (e.g., __syncthreads()), the code before/after these instructions should be wrapped by different loops. Below are some of the terminologies used in References [23, 26, 55] to support flat collapsing in general cases:
    Parallel Region (PR): A Region is a group of consecutive instructions. A Parallel Region is a group of instructions that must be executed by all threads within a block before proceeding to the next region. In CUDA programs, the instructions before/after __syncthreads belong to different PRs. For example, in Figure 1, statements-A and statements-B form two PRs.
    Implicit barrier: Unlike explicit barriers that are declared by programmers (e.g., __syncthreads), flat collapsing inserts implicit barriers that are necessary to define the Parallel Regions. For simple functions that do not contain barriers (e.g., __syncthreads) inside conditional statements, implicit barriers are not needed. However, when barriers are present in conditional statements, the situation becomes more complicated. For example, to transform a CUDA kernel that has a barrier within an if–then construct, COX needs to insert implicit barriers into the original program so the correct PRs can be formed in subsequent steps.
    The definition of Parallel Region used by previous flat collapsing projects cannot support warp-level features. Flat collapsing generates a single-layer loop for each PR to simulate all threads within a block. This coarse-grained simulation cannot distinguish threads in different CUDA warps. In this article, an extended definition, the hierarchical Parallel Region, is proposed and used to support warp-level features (Section 5.5).

    2.2 Warp-level Programming Features

    2.2.1 Warp-level Collectives.

    CUDA provides a list of warp-level collective functions [39], which are necessary for high-performance reduction. Two of these functions that are commonly used in existing benchmarks are introduced below.
    Warp shuffle: In early CUDA versions, although most GPUs have shared (local) memory and global memory to support data exchange between threads, there was no efficient way to exchange data within a warp. To efficiently exchange data stored in registers, CUDA provides a series of warp shuffle instructions. When a warp shuffle instruction is invoked, a thread sends its local data to another thread in the same warp.
    Warp Vote: Beyond exchanging data among threads within a warp, warp vote instructions can directly perform logical reductions (e.g., all, any) over the local variables, controlled by the mask argument. These features, used together with warp shuffle and the cooperative group, are necessary to implement high-performance reduction kernels.
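    As a usage sketch (standard CUDA intrinsics; the kernel below and its buffer names are illustrative, not taken from the article), a warp-level sum reduction and a warp vote look roughly like this:

```cuda
// Assumes blockDim.x is a multiple of 32 and all lanes are active.
__global__ void warpCollectivesDemo(const int *in, int *warp_sums,
                                    int *warp_all_positive) {
  int lane = threadIdx.x % 32;
  int warp = threadIdx.x / 32;
  int val  = in[threadIdx.x];

  // Warp shuffle: each step pulls a partial sum from a lane `offset` positions away.
  for (int offset = 16; offset > 0; offset /= 2)
    val += __shfl_down_sync(0xffffffff, val, offset);
  if (lane == 0) warp_sums[warp] = val;          // lane 0 now holds the warp's sum

  // Warp vote: true only if every lane's input is positive.
  warp_all_positive[warp] = __all_sync(0xffffffff, in[threadIdx.x] > 0);
}
```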

    2.2.2 Cooperative Group.

    In early CUDA programming models, there are only two explicit groups of threads: the block and the grid. Users cannot easily organize a smaller group of threads within a block as a sub-group. NVIDIA introduced a new concept, the cooperative group, in CUDA 9.0. The corresponding API allows users to group threads within a block, and the resulting group can further be used for data exchange. There are two kinds of grouping strategies: static and dynamic. For static grouping, whether a thread belongs to a group is known at compile time (e.g., group the threads with index 0 and 1), while for dynamic grouping, it can only be known at runtime (e.g., group all activated threads).
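    A short sketch of the static grouping API (standard cooperative-groups calls; the kernel and buffer names are illustrative):

```cuda
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

__global__ void tileReduce(const int *in, int *warp_sums) {
  cg::thread_block block = cg::this_thread_block();
  // Static grouping: tiles of 32 threads (one warp), known at compile time.
  cg::thread_block_tile<32> tile = cg::tiled_partition<32>(block);

  int val = in[block.thread_rank()];
  // Tile-scoped shuffle: same idea as __shfl_down_sync, issued on the group.
  for (int offset = tile.size() / 2; offset > 0; offset /= 2)
    val += tile.shfl_down(val, offset);
  if (tile.thread_rank() == 0)
    warp_sums[block.thread_rank() / 32] = val;
}
```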

    3 Motivation

    Section 2.1 introduced the concepts of the PR and the implicit barrier. This section discusses, with examples, the limitations of these concepts and how to extend them. Assume the input CUDA kernel shown in Code 1,2 with block size b_size. The code accumulates the variable val within the first warp and stores the accumulated value in the first thread of this warp.
    Code 1.
    Code 1. Input GPU reduction kernel.
    The warp shuffle function (shfl_down_sync) contains a warp-level barrier: each thread within a warp first has to calculate the value of val; after all threads in the warp complete the calculation, they invoke shfl_down_sync at the same time (see Section 5.2 for details). In Code 1, this warp-level barrier is inside a branch of an if–then construct (if(threadIdx.x < 32)); thus, only the first warp will reach the warp-level barrier, assuming the warp size of 32 in NVIDIA GPUs.
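    Code 1 itself appears only as a figure; the following sketch, reconstructed from the description above (and simplified from reduction_kernel.cu, as footnote 2 notes), conveys its structure:

```cuda
// Sketch of Code 1: warp-level reduction restricted to the first warp.
__global__ void reduce(const float *data, float *result) {
  float val = data[threadIdx.x];
  if (threadIdx.x < 32) {                                  // only the first warp
    for (int offset = 16; offset > 0; offset /= 2)
      val += __shfl_down_sync(0xffffffff, val, offset);    // contains a warp-level barrier
    if (threadIdx.x == 0) *result = val;                   // thread 0 holds the warp's sum
  }
}
```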
    With flat collapsing, Code 1 would become Code 2. Here are the steps:
    split the code into PRs according to barriers and wrap them with single-layer for-loops. In Code 2, as shfl_down contains a barrier, three PRs are separately wrapped with single-layer for-loops in line 2, line 5, and line 8 (note that the loop in line 4 comes from the source input and is not generated by flat collapsing);
    replace the uses of threadIdx.x with the loop iteration variable tx;
    replicate variables (e.g., val) used by more than one PR into array form.
    Unfortunately, these steps are insufficient to generate a correct CPU program for this GPU kernel. The key reason is that shfl_down_sync contains a warp-level barrier, which cannot easily be represented by single-layer loops.
    For NVIDIA GPUs with compute capability lower than 3.0, there are only block-level barriers, so it can be safely assumed that, for each barrier instruction, all or none of the threads within the block reach the barrier. Thus, flat collapsing wraps the PRs before/after the barrier with single-layer loops of b_size iterations. In Code 2, these generated loops are located in the branch of an if–then construct. With this transformation, if it is known at runtime that this barrier will be reached, then the control flow will reach this branch and execute the generated for-loops. In other words, flat collapsing assumes that all threads within the BLOCK always reach the same barriers. However, in Code 1, only the threads within the first WARP should reach the warp barrier. This violates the assumption of flat collapsing. To get the correct result, the transformer needs to insert extra instructions into line 6 and line 9 in Code 2 (shown in gray).
    These transformations are quite complicated, even for the demo example, not to mention implementing them in compilers for all possible CFGs (control flow graphs). The source of the complexity is that flat collapsing uses single-layer loops to represent all threads within a block, which cannot easily represent the CUDA warp concept. Based on the above analysis, hierarchical collapsing is proposed to produce Code 3. The concept is also illustrated in Figure 1(c). Code 3 has two levels of nested loops: the outer loop with induction variable wid (line 4) is for the block level and is called the inter-warp loop, and the inner loops with induction variable tx (lines 5, 7, 12, and 14) are for the warp level and are called intra-warp loops. The inserted loops are shown in gray text in Code 3.
    For flat collapsing, to control the different control flows of different warps, if–then constructs must be inserted into the transformed code. However, for hierarchical collapsing, only a simple loop peeling (line 10 in Code 3) is needed. Owing to this low complexity, hierarchical collapsing can easily be implemented and integrated into compilers. In COX, hierarchical collapsing is implemented as a new LLVM pass that automatically transforms the GPU kernels.
    Code 2.
    Code 2. CPU warp shuffle program generated by flat collapsing.
    In the given example, although only flag[0] is used, the transformed program also calculates the other elements of flag, because these calculations may have side effects. To guarantee correctness, they have to be executed even if their outputs are not needed.
    Code 3.
    Code 3. CPU warp shuffle program by hierarchical collapsing.
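    Since Code 3 is shown only as a figure, a rough source-level sketch of the structure it describes (inter-warp loop, intra-warp loops, loop peeling on the warp’s first thread, and an emulated shfl_down) is given below; the array bound and loop bodies are simplifications, not COX’s actual output:

```cpp
// Sketch of the hierarchically collapsed CPU code for the Code 1 kernel.
void reduce_block(const float *data, float *result, int b_size) {
  float val[1024];                                 // replicated local; assumes b_size <= 1024
  for (int wid = 0; wid < b_size / 32; ++wid) {    // inter-warp loop
    for (int tx = 0; tx < 32; ++tx)                // intra-warp loop
      val[wid * 32 + tx] = data[wid * 32 + tx];
    if (wid * 32 < 32) {                           // loop peeling: the warp's first thread
                                                   //   decides the branch for the whole warp
      for (int offset = 16; offset > 0; offset /= 2)
        for (int tx = 0; tx < 32; ++tx)            // intra-warp loop emulating shfl_down + add
          if (tx + offset < 32)
            val[wid * 32 + tx] += val[wid * 32 + tx + offset];
      for (int tx = 0; tx < 32; ++tx)              // intra-warp loop
        if (wid * 32 + tx == 0) *result = val[wid * 32 + tx];
    }
  }
}
```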

    4 COX

    We build a framework, COX, to support executing CUDA source code on X86 with high performance. The overview of COX is shown in Figure 2. In the first step, the CUDA kernel code is compiled to Intermediate Representation (NVVM IR) by Clang. After the SPMD-to-MPMD transformer (hierarchical collapsing), the NVVM IR is translated to LLVM IR in MPMD form, and the LLVM backend emits an object file. The object file is linked with built-in libraries and the migrated host code to generate an executable file for X86. In this article, we mainly discuss the kernel SPMD-to-MPMD transformer. Although COX is implemented for X86 instructions, it can easily be migrated to other architectures.
    Fig. 2.
    Fig. 2. The overview of COX framework. In this article, we only discuss the CUDA kernel transformation.

    5 Kernel Transformation

    5.1 Overview of Kernel Transformation

    In this section, the details of applying hierarchical collapsing to a simple CUDA function (Code 4) are introduced. This function contains a CUDA warp vote function. First, each thread calculates its local variable (tx%2). Then, threads within a warp combine their local variables with the all operation: if all local variables are non-zero, the reduction returns true; otherwise, it returns false.
    Figure 3 shows the pipeline of applying hierarchical collapsing transformation on Code 4. First, warp-level functions are replaced with built-in functions defined in the runtime library, as shown in Step 1 (Section 5.2). In Step 2, the implicit barriers are identified and inserted (Section 5.3). Then, in Steps 3 to 5, hierarchical Parallel Regions are identified, and intra/inter-warp loops are generated accordingly (Section 5.4). The generated LLVM IR will be compiled and linked with host programs and runtime libraries to generate a CPU-executable file.
    Fig. 3.
    Fig. 3. Steps of NVVM to LLVM-IR transformer.
    Code 4.
    Code 4. CUDA Warp Vote example.
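    Code 4 is shown only as a figure; a sketch of such a kernel, reconstructed from the description above (the kernel name and output buffer are assumptions), is:

```cuda
// Sketch of Code 4: each thread computes tx % 2, then the warp votes
// whether all lanes hold a non-zero value.
__global__ void voteAllDemo(int *out) {
  int tx = threadIdx.x;
  int pred = tx % 2;                        // 0 for even lanes, 1 for odd lanes
  out[tx] = __all_sync(0xffffffff, pred);   // evaluates to false: not all lanes are non-zero
}
```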

    5.2 Support Warp-level Functions

    In GPU architectures, when threads within a warp invoke warp functions, the GPU performs internal communication to accumulate and/or exchange the local variables within the warp. To support these features on CPUs, the accumulation and communication need to be performed explicitly. The implementation of the warp-vote function for the CPU backend is shown in Code 5. In this section, we introduce the details of this transformation.
    During initialization, an array warp_vote of length 32 is allocated. warp_vote is stored in CPU thread-local memory, as a CPU thread is used to simulate a GPU block. A GPU warp vote instruction is translated to the following CPU instructions: for the threads within a warp, each thread first stores its local variable into a unique position of warp_vote. After all the elements are set, the result of the warp vote can easily be computed. The function warp_all is defined in a runtime library. To utilize the computation resources of X86, warp_all is implemented with AVX instructions. The benefits of AVX are evaluated in Section 7.2.3. Warp shuffle is supported in a quite similar way.
    Code 5.
    Code 5. Implement CUDA Warp Vote on CPU.
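    Code 5 appears only as a figure; the sketch below illustrates the shape of this translation for the Code 4 kernel (warp_vote, warp_all, and vote_all_block mirror the names used above, but the bodies are simplified; COX’s runtime helper uses AVX, as discussed in Section 7.2.3):

```cpp
// One vote slot per lane of the simulated warp, kept in CPU thread-local storage
// because each CPU thread simulates one GPU block.
static thread_local int warp_vote[32];

// Scalar stand-in for the runtime-library helper.
static bool warp_all(const int *votes) {
  for (int i = 0; i < 32; ++i)
    if (votes[i] == 0) return false;
  return true;
}

// Collapsed form of the warp-vote kernel in Code 4.
void vote_all_block(int *out, int b_size) {
  for (int wid = 0; wid < b_size / 32; ++wid) {   // inter-warp loop
    for (int tx = 0; tx < 32; ++tx)               // each lane publishes its predicate
      warp_vote[tx] = (wid * 32 + tx) % 2;
    bool res = warp_all(warp_vote);               // reduce once per warp
    for (int tx = 0; tx < 32; ++tx)               // each lane reads the warp-wide result
      out[wid * 32 + tx] = res;
  }
}
```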

    5.3 Insert Implicit Barriers

    In Steps 3, 4, and 5, hierarchical collapsing needs barrier information to identify the Parallel Regions and generate intra/inter-warp loops. Thus, it is important to insert implicit barriers that do not appear in the input GPU code but are necessary for identifying the Parallel Regions. The implicit barriers stem from barriers in conditional statements. An example is shown in Figure 4. The input CUDA kernel (Figure 4(a)) has a barrier in a for-loop construct. As there is a barrier in the conditional statement, implicit barriers are inserted (Figure 4(b)) to guide the generation of intra/inter-warp loops in later steps. Finally, according to the barriers, two pairs of nested for-loops are generated (Figure 4(c)). Note that all transformations in COX are done at the LLVM IR level; this source-code-level example is only used for explanation.
    Fig. 4.
    Fig. 4. An example of implicit barriers needed for identifying PRs.
    To make hierarchical collapsing work, extra block-level barriers are inserted at the beginning of the entry block and at the end of the exit block. The remaining implicit barriers are needed when there are barriers inside conditional statements (e.g., if–then constructs, for-loop constructs). As CUDA is a high-level, flexible language, even a single concept can generate quite different CFGs. For example, the loop concept can be implemented by different CFGs, such as a do-while-loop, a while-loop, or a for-loop. Fortunately, with existing LLVM transformations, COX can convert the input CFGs to canonical formats, and we only need to focus on these canonical formats in the discussion below. Here we list some important features of the canonical format:
    Each branch instruction has only two successors; most input CFGs already have this feature except the CFGs that use switch-case constructs.
    All loops are in canonical format such that (1) they all have pre-headers; (2) each loop has only a single latch, in other words, a single backedge; and (3) the loop headers dominate all exit blocks.
    COX invokes the LLVM loopSimplify and lowerSwitch passes to obtain canonical-format CFGs. After being transformed to the canonical format, the only two kinds of conditional statements remaining in the CFGs are the if–then construct and the for-loop construct.
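    For reference, the sketch below shows one possible way to drive these two stock passes with the LLVM 10 legacy pass manager (the header locations and the surrounding driver code are assumptions; COX invokes the passes inside its own pipeline):

```cpp
#include "llvm/IR/LegacyPassManager.h"
#include "llvm/IR/Module.h"
#include "llvm/Transforms/Utils.h"   // createLowerSwitchPass, createLoopSimplifyPass

// Canonicalize a module's CFGs before barrier analysis.
void canonicalizeForCollapsing(llvm::Module &M) {
  llvm::legacy::PassManager PM;
  PM.add(llvm::createLowerSwitchPass());   // every branch ends up with two successors
  PM.add(llvm::createLoopSimplifyPass());  // pre-headers, single latches, dedicated exits
  PM.run(M);
}
```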
    Barriers in if–then construct. The CFG of a classical if–then construct is shown on the left side of Figure 5(a). There is a barrier in the block if.body. According to Reference [41], for a block/warp barrier, none or all of the threads within the block/warp reach this barrier. Thus, COX can safely apply loop peeling to the CFG, which uses the branch direction of the first thread as the branch direction of all other threads. In other words, COX peels the first thread to evaluate the branch direction, and the rest of the threads within the warp/block simply follow the same direction. See Code 3 for a loop peeling example. The result after inserting implicit barriers and splitting blocks is shown on the right side of Figure 5(a). Several details are worth mentioning:
    Fig. 5.
    Fig. 5. After transformation, the inserted barriers are shown in bold. The PRs identified in further step are also shown in the figure.
    insert implicit barriers with the same type as the barrier in if.body. In the example, there is a warp barrier in if.body; thus, COX also inserts warp barriers as implicit barriers;
    after transformation, all blocks will be wrapped with intra-warp loops, except if.cond, which is used for loop peeling;
    if.cond should contain only a single conditional-branch instruction and it should not have any side effects. In Figure 5(a), all computation instructions are put into if.head, which are executed b_size times, as the original GPU program does.
    The detailed algorithm for inserting implicit barriers derived from barriers in an if–then construct is described in Algorithm 1. COX has to do some additional checking to avoid infinite loops caused by for-loop constructs in CFGs. For simplicity, this checking part is not shown in Algorithm 1.
    Barriers in for-loop construct. The canonical loops contain single latch blocks and loop headers that dominate all exit blocks.3 COX inserts implicit barriers before/after the branch instructions (back edge of the loop). Figure 5(b) shows the example of inserting implicit barriers for a for-loop construct that contains a block barrier. Just as with an if–then construct, all these inserted implicit barriers (shown by bold text in the figure) should have the same type as the barriers in for.header (block barrier in the example).
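    To make the barrier-in-conditional case concrete, here is a source-level sketch in the spirit of Figure 4 (COX operates on LLVM IR; the kernels, names, and the exact placement of implicit barriers below are illustrative only):

```cpp
// Input: a block barrier inside a for-loop construct.
__global__ void before(float *a, float *b, int n) {
  for (int i = 0; i < n; ++i) {
    a[threadIdx.x] = b[threadIdx.x] + i;                    // statements before the barrier
    __syncthreads();
    b[threadIdx.x] = a[(threadIdx.x + 1) % blockDim.x];     // statements after the barrier
  }
}

// After implicit-barrier insertion and hierarchical collapsing: the code before
// and after the barrier lands in two separate pairs of nested intra/inter-warp loops,
// while the (thread-uniform) loop control is peeled into a serial outer loop.
void after(float *a, float *b, int n, int b_size) {
  for (int i = 0; i < n; ++i) {
    for (int wid = 0; wid < b_size / 32; ++wid)   // inter-warp loop, PR 1
      for (int tx = 0; tx < 32; ++tx)             // intra-warp loop, PR 1
        a[wid * 32 + tx] = b[wid * 32 + tx] + i;
    for (int wid = 0; wid < b_size / 32; ++wid)   // inter-warp loop, PR 2
      for (int tx = 0; tx < 32; ++tx)             // intra-warp loop, PR 2
        b[wid * 32 + tx] = a[(wid * 32 + tx + 1) % b_size];
  }
}
```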

    5.4 Split Blocks before/after Each Barrier

    The instructions before/after a barrier should be wrapped with different intra/inter-warp loops. Thus, in this step, COX splits the blocks that have barriers inside. See Step 3 in Figure 3 for an example.

    5.5 Hierarchical Parallel Region

    After inserting implicit warp/block barriers, loops are generated to wrap the instructions between barriers. The instructions between two consecutive barriers form a Parallel Region, which should be wrapped in the same loops. Compared with flat collapsing, which only uses single-layer loops to wrap PRs, hierarchical collapsing uses nested loops: the inter-warp loop and the intra-warp loop. Thus, two kinds of Parallel Regions are required: warp-level Parallel Regions and block-level Parallel Regions. These two kinds of PRs are wrapped with intra- and inter-warp loops, respectively. A warp-level PR is always a subset of a block-level PR (a GPU warp is always within a GPU block); thus, the new concept is called the Hierarchical Parallel Region. An example of a Hierarchical Parallel Region is shown in Figure 6. The rest of the steps find the block/warp-level PRs and wrap them with inter/intra-warp loops. Algorithm 2 is used to find the set of warp-level PRs. The algorithm for finding the block-level PRs is very similar, except that it only concerns block barriers. COX has to find the PRs for the warp level and the block level sequentially: COX first finds all warp-level PRs and generates intra-warp loops to wrap them. Then, COX finds the block-level PRs in the new CFG and wraps them with inter-warp loops.
    Fig. 6.
    Fig. 6. Due to the warp barrier in block1, there are two warp-level PRs ({ block1},{block2}) and a single block-level PR ({block1, block2}).

    5.6 Wrap PR with For-Loop

    In this step, COX wraps warp/block-level PRs with intra/inter-warp loops (Figure 3(e) and (f)). Although this step is quite straightforward, its correctness needs to be proven:4 after inserting intra/inter-warp loops, each instruction from the input GPU kernel is executed b_size times (b_size is the block size), except the instructions used for loop peeling.
    Finally, after generating the intra/inter-warp loops, some local variables need to be replicated: local variables that are used in several warp-level PRs but only in a single block-level PR are replicated as arrays of length 32, while local variables that are used across different block-level PRs are replicated as arrays whose length equals the block size.
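    As an illustration of this replication rule (hypothetical variable names, not COX output):

```cpp
// Suppose the GPU kernel has
//   float partial = ...;  /* warp barrier  */  ... partial ...  // crosses warp-level PRs only
//   float total   = ...;  /* block barrier */  ... total ...    // crosses block-level PRs
// After collapsing, each simulated thread needs its own copy:
float partial[32];     // one slot per lane: shared across warp-level PRs of the same warp
float total[1024];     // one slot per thread in the block (assumes block size <= 1024)
```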

    5.7 Limitation of Hierarchical Collapsing

    Both flat collapsing and hierarchical collapsing are compile-time transformations. Thus, only static features are addressed. The latest CUDA versions support several dynamic features. For example, for warp-level collective operations, users can assign only a sub-group of threads within the warp to perform the collective operation; the sub-group is specified by a mask argument at runtime. For the cooperative group, users can also group all activated threads into a group at runtime. Neither flat collapsing nor hierarchical collapsing can easily support these dynamic features. However, although these dynamic features provide more flexibility, they can be harmful to performance, as they may incur warp divergence. Thus, most high-performance implementations [33, 40] use static warp-level collectives and cooperative groups to avoid warp divergence. In this article, only these static use cases are of concern. For the same reason, only aligned barriers [38] are of concern. In other words, for a block/warp barrier, it is assumed that all or none of the threads within a block/warp reach this barrier.

    6 Runtime System

    Besides the CUDA kernel programs, to execute CUDA programs on CPUs, the CUDA host functions also need to be migrated. To support executing CUDA host programs on CPUs, the CUDA runtime APIs, which involve memory management and kernel execution, need to be implemented on CPUs. As both host and device are CPUs, CUDA malloc and memcpy are replaced by C malloc and memcpy. COX uses pthread fork/join to replace kernel launch/synchronization in CUDA. There are several potential optimizations for the runtime system, such as using a thread pool instead of fork/join for kernel launching and synchronization, and using regular expressions or LLVM transformations to automatically generate COX host programs from the CUDA source code. These optimizations are beyond the scope of this article and are open for future research.
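    A minimal sketch of this host-side mapping (the helper names coxMalloc, coxMemcpy, and coxLaunch are hypothetical; the real COX runtime differs in detail and, as noted above, could use a thread pool instead of fork/join):

```cpp
#include <pthread.h>
#include <cstdlib>
#include <cstring>
#include <vector>

// cudaMalloc/cudaMemcpy degenerate to plain malloc/memcpy when host and device are the same CPU.
void *coxMalloc(size_t bytes) { return std::malloc(bytes); }
void coxMemcpy(void *dst, const void *src, size_t bytes) { std::memcpy(dst, src, bytes); }

struct BlockTask {
  void (*block_fn)(void *, int);   // collapsed kernel: executes one whole GPU block
  void *args;
  int block_idx;
};

static void *runBlock(void *p) {
  BlockTask *t = static_cast<BlockTask *>(p);
  t->block_fn(t->args, t->block_idx);
  return nullptr;
}

// Kernel launch becomes a pthread fork; kernel synchronization becomes a join.
void coxLaunch(void (*block_fn)(void *, int), void *args, int grid_size) {
  std::vector<pthread_t> threads(grid_size);
  std::vector<BlockTask> tasks(grid_size);
  for (int b = 0; b < grid_size; ++b) {
    tasks[b] = {block_fn, args, b};
    pthread_create(&threads[b], nullptr, runBlock, &tasks[b]);
  }
  for (int b = 0; b < grid_size; ++b)
    pthread_join(threads[b], nullptr);     // plays the role of cudaDeviceSynchronize
}
```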

    7 Evaluation

    In this section, we try to answer the following questions:
    Can hierarchical collapsing successfully bring higher coverage? (Section 7.1)
    How does the performance of hierarchical collapsing compare with flat collapsing on kernels that do not use warp-level features? (Section 7.2.1)
    What techniques can be used to generate high-performance CPU programs? (Sections 7.2.2 and 7.2.3)
    What is the kernel performance compared with existing frameworks that also support CUDA on CPUs? (Section 7.3)
    Although several frameworks support executing CUDA on CPUs, most of them were developed years ago and are no longer actively maintained [12, 18, 55]. We select two frameworks to compare against: HIP-CPU [5] and DPC++ [52]. HIP-CPU is developed by AMD and supports executing HIP programs on CPUs, while DPC++ is developed by Intel and supports executing SYCL [3] on CPUs. To support CUDA programs, we use HIPIFY [6] and DPCT [19] to translate CUDA into HIP and SYCL, respectively. Detailed information about these frameworks is given in Section 8. The environment settings are described below:
    Software: Ubuntu 18.04, LLVM 10.0, gcc 7.5.0 (for DPC++ and COX), gcc 11.1.0 (for HIP-CPU), CUDA 10.1, DPC++ 2022.0.0, HIPIFY [6] (commit: 8389fa1), HIP-CPU (commit: 56f559c);
    Hardware (for classical linear algebra tasks): 1 socket * Intel Xeon Silver 4210 CPU;
    Hardware (for real benchmarks tasks): 1 socket * 11th Generation Intel Core i7 Processors;
    Benchmarks: CUDA SDK samples 10.1, Hetero-mark [56], Crystal [50], Rodinia [10];
    Profiling tool: Intel SDE [22].
    We use the CUDA SDK samples to evaluate the coverage of the different frameworks, while we use the Rodinia, Hetero-mark, and Crystal benchmarks to evaluate performance. Rodinia and Hetero-mark consist of classical HPC applications (e.g., Gaussian elimination, AES encryption). The Crystal benchmark is a GPU database benchmark that contains 13 high-performance implementations of SQL queries. The evaluation mainly focuses on the generated kernel functions for the CPU backend, not on the runtime system (the CPU multi-thread runtime system). However, for most CUDA kernels, the workloads are highly related to the input data values, which affect memory locality and branch behavior. Thus, to make our evaluation close to real cases, instead of using dummy inputs and evaluating the execution time of a single kernel launch and a single GPU block, we use the original inputs and runtime configurations (e.g., grid size, block size) provided by the benchmarks.

    7.1 Coverage

    In this section, we verify that using hierarchical collapsing results in higher coverage. Table 1 analyzes the kernels in the CUDA SDK samples 10.1 that do not require special hardware support (e.g., tensor cores, unified memory). As shown in the table, the existing frameworks can only automatically support at most 21 kernels (coverage = 68%); the failed kernels use new CUDA features. COX supports 28 kernels (coverage = 90%).
    Table 1.
    Kernel name (CUDA feature used):
    initVectors
    gpuDotProduct (warp cooperative group)
    gpuSpMV
    r1_div_x
    a_minus
    gpuConjugateGradient (grid sync)
    multigpuConjugateGradient (multi-grid sync)
    MatrixMulCUDA
    matrixMul
    copyp2p
    reduce0 (block cooperative group)
    reduce1 (block cooperative group)
    reduce2 (block cooperative group)
    reduce3 (block cooperative group)
    reduce4 (warp cooperative group)
    reduce5 (warp cooperative group)
    reduce6 (warp cooperative group)
    shfl_intimage_rows (warp shuffle)
    shfl_vertical_shfl (warp shuffle)
    shfl_scan_test (warp shuffle*)
    uniform_add
    reduce (warp cooperative group)
    reduceFinal (warp cooperative group)
    simpleKernel
    VoteAnyKernel1 (warp vote)
    VoteAllKernel2 (warp vote)
    VoteAnyKernel3 (warp vote)
    spinWhileLessThanone
    matrixMultiplyKernel
    vectorAdd
    filter_arr (activated thread sync)
    Coverage: DPCT 68%, HIPIFY 52%, COX 90%
    Table 1. Coverage of COX Compared to Other Frameworks
    *Enabled by manual code migration [57].
    The CUDA features supported by DPCT, HIPIFY, and COX are also shown in Figure 7.
    Fig. 7.
    Fig. 7. The CUDA features supported by HIPIFY, DPCT, and COX .
    Although COX achieves the highest coverage by using hierarchical collapsing, three kernels still cannot be supported. gpuConjugateGradient and multiGpuConjugateGradient rely on synchronization across grids and devices, which use the grid cooperative group and the multi-grid cooperative group, respectively. filter_arr uses a dynamic cooperative group: it dynamically groups all activated threads. As discussed in Section 2.2.2, all of these features have to be supported at the runtime level: frameworks should schedule threads accordingly at runtime, and a thread can only know whether it is activated at runtime. Supporting such runtime features is left for future work.
    We also evaluate the three frameworks on the Rodinia, Hetero-mark, and Crystal benchmarks. On all of them, COX achieves higher coverage than the other frameworks. However, for Rodinia and Hetero-mark, the higher coverage is not a result of hierarchical collapsing, as none of their CUDA kernels use warp-level functions. The reason is that, as DPCT and HIPIFY use source-to-source translation, they cannot handle complex C++ use cases (e.g., complex macros). Crystal contains 13 examples, three of which use warp-level features. COX supports all 13 examples, while HIPIFY only supports the remaining 10 examples that do not use warp-level features. DPC++ does not support atomicCAS functions on CPUs; thus, it cannot support any of the applications in Crystal. The evaluation results for these real benchmarks are reported in Section 7.3.

    7.2 Performance on Classical Tasks

    In this section, we evaluate and analyze the performance on classical tasks, which are simple to analyze with different configurations.

    7.2.1 Flat collapsing vs. hierarchical collapsing.

    Compared with flat collapsing, hierarchical collapsing achieves higher coverage by using nested loops. However, for CUDA programs that do not use warp-level features, nested loops are unnecessary. In this section, we compare the performance of flat collapsing and hierarchical collapsing on classical linear algebra operations that do not use warp-level features (Figure 8). In all configurations, programs generated by hierarchical collapsing are slower than those generated by flat collapsing. This is because the nested loops involve more computation and more instructions, and also make the programs harder for compilers to optimize. Thus, COX only applies hierarchical collapsing to CUDA kernels that use warp-level features; for the remaining kernels, COX applies flat collapsing instead.
    Fig. 8.
    Fig. 8. The slow-down rate of programs generated by hierarchical collapsing compared with programs generated by flat collapsing .

    7.2.2 Variable vs. Hardcoding.

    COX supports two modes: the variable mode generates CPU programs that keep the CUDA runtime configurations (e.g., grid size, block size) as variables, while the hardcoding mode replaces all these configuration variables with constants. Although the programs generated by both modes are forwarded to LLVM’s optimizer with the -O3 flag, they show an obvious difference, especially when compiling complicated kernels. Figure 9 shows the speedup of the hardcoding mode compared with the variable mode. The two modes show a relatively small difference for the VectorAdd kernel, as it is quite simple and can easily be vectorized by compiler optimizations even when the block size is not provided at compile time. However, for more complicated kernels, the hardcoding mode generates programs with higher performance. On the other hand, the hardcoding mode requires recompiling whenever the CUDA runtime configurations change, which incurs extra compilation time. In the following sections, the variable mode is used for comparison with other frameworks.
    Fig. 9.
    Fig. 9. Compared with variable mode, hardcoding can generate faster programs.
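    As an illustration of the difference between the two modes (a hypothetical snippet, not COX’s generated IR):

```cpp
// Variable mode: the block size stays a runtime parameter, so the loop trip
// count is unknown at compile time.
void saxpy_block(float *a, const float *b, int b_size) {
  for (int tx = 0; tx < b_size; ++tx) a[tx] += 2.0f * b[tx];
}

// Hardcoding mode: the launch configuration is baked in as a constant, so the
// compiler can fully unroll and vectorize the loop.
constexpr int kBlockSize = 256;
void saxpy_block_hardcoded(float *a, const float *b) {
  for (int tx = 0; tx < kBlockSize; ++tx) a[tx] += 2.0f * b[tx];
}
```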

    7.2.3 SIMD Instructions.

    For CPU programs, SIMD instructions are necessary to achieve high performance [2, 9, 24, 45]. The warp vote function execution times with and without AVX are shown in Table 2. With AVX instructions, around a 10× speedup is achieved for both functions. The benefit comes from fewer instructions and branches.
    Table 2.
    Function    Metric         w/ AVX           w/o AVX
    vote any    time (µs)      0.241            2.542
                instructions   1,447,901,852    23,472,339,251
                branches       100,110,593      4,260,452,162
    vote all    time (µs)      0.236            2.992
                instructions   1,384,476,021    29,552,745,486
                branches       100,108,177      5,220,517,219
    Table 2. Both Functions Gain around a 10× Speedup When Using AVX Instructions
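    For reference, the sketch below shows one possible AVX2 formulation of the warp_all helper from Section 5.2 (standard intrinsics; COX’s actual runtime-library implementation may differ):

```cpp
#include <immintrin.h>

// 32 lanes of int32 votes span four 256-bit vectors; the warp votes "all"
// exactly when no lane holds zero.
static bool warp_all_avx(const int *votes) {
  __m256i any_zero = _mm256_setzero_si256();
  for (int i = 0; i < 32; i += 8) {
    __m256i v = _mm256_loadu_si256(reinterpret_cast<const __m256i *>(votes + i));
    __m256i is_zero = _mm256_cmpeq_epi32(v, _mm256_setzero_si256());  // all-ones where lane == 0
    any_zero = _mm256_or_si256(any_zero, is_zero);
  }
  // _mm256_testz_si256(a, a) returns 1 iff a is all zeros, i.e., no lane was zero.
  return _mm256_testz_si256(any_zero, any_zero);
}
```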

    7.3 Performance on Real Benchmarks

    The runtime configuration and kernel execution time for executing on COX, DPC++, and HIP-CPU are recorded in Table 3. Since DPC++ does not support the atomicCAS instruction on CPUs, it cannot support any applications in Crystal.
    Table 3.
    Benchmark | Applications | No. of kernel launches | Grid size | Block size | COX | DPC++ | HIP-CPU
    Hetero-markaes116,38464500.3169.7519.3
    bs2,048326499.74365.5136.9
    ep40511647038.7959.32415
    fir2,0486464151.489.55227.7
    ga25616642585.3171.241858
    kmeans1441,563642991.61411196.5
    pr1,0003264330.63288.02168.7
    hist112864228.87221.359193.8
    crystalq1110,000132259.525  
    q1210,000132302.632  
    q1310,000132298.223  
    q2140,000132443.308 1180.49
    q2240,00011281003.57 2057.56
    q2340,0001128923.145 1984.86
    q3140,00011281165.66 2174.03
    q3240,00011281112.99 2112.73
    q3340,00011281184.54 2102.14
    q3440,00011281093.68 2056.16
    q4150,00011281428.59 2659.43
    q4250,00011281497.65 2745.15
    q4350,00011281387.41 2648.31
    Rodiniahotspot3D1001,024256699131893.8
    particlefilter1157128121.776 32.884
    bpnn_forward14,09625614.94861.576 
    bfs241,95451239.72650.458272.7
    btree110,00025611.18120.9 
    gaussian414151228.3556.8200.805
    hotspot11,8492565.9163.51233.773
    srad416,38425682.347150933.18
    pathfinder54632563342.3168.72
    nn15,1202563.54241.818.64
    Table 3. Runtime Configuration and Total Kernel Execution Time (ms) on Hetero-mark, Rodinia, and Crystal Benchmarks
    The framework with the lowest kernel execution time for each application is marked in red.

    7.3.1 COX vs. HIP-CPU.

    To execute CUDA on a CPU, COX maps GPU threads to iterations of the for-loops generated by hierarchical collapsing. HIP-CPU, in contrast, maps GPU threads to fibers (see Section 8 for more detail). HIP-CPU’s mapping strategy is more straightforward and easier to implement. However, the original CUDA programs usually consist of lightweight kernels, which are friendly to throughput-oriented architectures. Wrapping these lightweight kernels with for-loops produces programs with heavier per-thread workloads that are more friendly to latency-oriented architectures (CPUs) with SIMD instructions. Besides, although fibers have lightweight context switching, the overhead is still larger than that of iterating through a for-loop. Thus, in most cases, COX is faster than HIP-CPU. The higher performance results from higher utilization of SIMD instructions and lower thread-switching overhead, which lead to fewer executed instructions (Table 4). In all applications in Crystal and some applications in Rodinia and Hetero-mark (gaussian, nn, fir), COX utilizes more SIMD instructions, which leads to fewer executed instructions.
    Table 4.
    Benchmark | Application | Metrics | COX | HIP-CPU
    crystalq21sse-packed802548859188
    avx1282398900
    avx2565997250
    # of inst5719678501123266901
    q31sse-packed825341117468
    avx1282399140
    avx2565997850
    # of inst5800121831168657481
    q41sse-packed1049937243728
    avx1282998620
    avx2567496550
    # of inst7302205361478500813
    Rodiniagausssse-packed27061704
    avx128828414
    avx2562070414
    # of inst16687613450999
    nnsse-packed31347
    avx12822
    avx25652
    # of inst1038518597
    particlefiltersse-packed78972949565778475
    avx12810584731289510335
    avx25633924592
    # of inst1.02885E+1141436192956
    Hetero-markepsse-packed93609171036
    avx12812061446
    avx2561310013715
    # of inst1.48391E+1133645862337
    firsse-packed1280666813069300
    avx128251943436
    avx2565621010830
    # of inst1380361480814015215887
    prsse-packed25998732724917
    avx12823605742350348
    avx256862755841493
    # of inst3921735954714129291883
    histsse-packed6342597148
    avx12815272259
    avx2562202478522026540
    # of inst121021938723942375563
    atomic3918728739202294
    Table 4. Profiling Data for COX and HIP-CPU
    However, in some cases, HIP-CPU is faster than COX. In pr, the major kernel workload is irregular memory access, which cannot benefit from SIMD instructions. As this application involves one thousand kernel launches, the runtime framework is the critical component, which is beyond the scope of this article. particlefilter and ep contain heavy CUDA kernels (Code 6) that can be optimized with SIMD instructions. HIP-CPU relies on GCC 11 to auto-vectorize the source code, while COX applies LLVM passes to the transformed LLVM IR for auto-vectorization. The transformed LLVM IR is too complex to be fully optimized. Thus, in these cases, HIP-CPU has higher SIMD utilization and achieves higher performance. For the hist application, the critical component is atomic instructions. Since COX and HIP-CPU execute a similar number of atomic instructions, they achieve close performance.
    Code 6.
    Code 6. CUDA kernels in EP and Particlefilter.

    7.3.2 COX vs. DPC++.

    As DPC++ is not fully open source, we cannot fully analyze it and compare it with COX; we can only draw inferences from profiling data. From the profiling data in Table 5, in all cases DPC++ executes more atomic instructions, which may be generated by the DPC++ runtime system for CPU thread scheduling. For the gaussian and nn applications, both COX and DPC++ achieve high SIMD utilization. As DPC++ issues many more atomic instructions and memory accesses in these two applications, DPC++ is slower than COX. As for ep and fir, DPC++ achieves much higher SIMD utilization, resulting in many fewer executed instructions; thus, DPC++ is much faster than COX. Other applications (e.g., aes and kmeans) involve a large number of blocks, and the kernel workloads are relatively small compared with the runtime overhead. In these situations, the performance mainly depends on how the runtime system implements kernel launch and synchronization. Further explanation of DPC++ would require reverse engineering, which is beyond the scope of this article.
    Table 5.
    Application | Metrics | COX | DPC++
    gaussmem atomic2360544533963
    mem652226992934676835
    sse-packed47579026797775
    avx128881556047607
    avx256900726350822
    # of inst17689293201489909537
    nnmem atomic1050692627954105
    mem19522318172551670355
    sse-packed403494916555799
    avx1281051907716726252
    avx2563040392938158606
    # of inst41371019865547944991
    epmem atomic3612814992385
    mem680256216701269791602
    sse-packed9360918724231
    avx12812069662776
    avx2561310092355635
    # of inst1.48391E+112936592756
    firmem atomic84540523656931
    mem66581608382514792478
    sse-packed1280666842218234
    avx1282519425188562
    avx2565621041781477
    # of inst138036148086283526400
    Table 5. Profiling Data for COX and DPC++

    8 Related Work

    The CPU architecture is MPMD, while the GPU architecture is SPMD. Although users can naively execute each GPU thread with a CPU thread, due to the limited parallelism in CPUs, the system can only execute around 100 CPU threads simultaneously, far less than the parallelism of the GPU architecture. Thus, to match the number of GPU threads, the CPU has to create more threads than it can actually execute simultaneously, which incurs a large amount of thread context-switching overhead. Two approaches address this issue. The first is to accelerate context switching on the CPU. Some researchers extend the CPU architecture to accelerate context switching [11]; these hardware-level extensions are beyond the scope of this article. At the software level, Reference [15] proposes using lightweight threading to accelerate context switching: a context switch only stores and reloads a few registers while maintaining the stack memory. Most modifications are at the runtime level, and users can directly use the original GPU source code. As reported in Reference [54], the AMD CPU OpenCL implementation is based on this technology. However, even with these optimizations, there is still significant context-switching overhead, around 10 ns per switch.
    Thus, another direction has been explored: increasing the workload of each CPU thread. Instead of executing a single GPU thread, each CPU thread executes all GPU threads within a block. This mechanism has two benefits. First, it increases each CPU thread’s execution time so much that the context-switching overhead becomes negligible. Second, with more workload in a single thread, there are more opportunities for optimization (e.g., vectorization and loop transformations).
    This mechanism has several different names: microthreading [53], thread aggregation [58], thread-fusion [12], region-based serialization [54], loop chunking [51], and kernel serialization [8]. In this article, this mechanism is given a new name: flat collapsing. In References [53, 55], the authors propose wrapping an SPMD kernel with loops whose size equals the block size. Thus, each loop iteration simulates a GPU thread within a block, and a CPU thread is mapped to a GPU block. An important technical detail is supporting synchronization instructions: compilers should wrap the instructions before/after a synchronization instruction into separate loops to maintain correctness. A similar technique is also discussed in Reference [51], which utilizes loop transformations (e.g., loop strip-mining, interchange, distribution, unswitching) to transform SPMD execution models with synchronization into task-parallel execution models. The authors of Reference [58] propose improved static analysis to vectorize the generated loop programs to improve performance, and also propose another algorithm that wraps the original kernels with loops while avoiding the additional synchronization points of previous works. In some GPU architectures, such as NVIDIA GPUs, there is an implicit lock step within a group of threads. The author of Reference [16] proposes transformations to detect these implicit warp-level synchronizations and maintain them during transformation. The authors of Reference [54] propose using C Extensions for Array Notation to further accelerate the generated CPU programs with SIMD execution and better spatial locality.
    Several projects have been proposed to execute CUDA on non-NVIDIA devices. In the early days, NVIDIA provided an emulation framework [1] to execute CUDA on a CPU, in which each thread within a GPU block is executed by a CPU thread. Horus [14] is another emulator; it supports parsing and executing NVIDIA PTX instructions on CPU devices. These emulators are built for debugging rather than for performance. In MCUDA [55], the authors also use source-to-source translation to translate CUDA to C with flat collapsing. Ocelot [12] uses the same mechanism, but instead of source-to-source translation, it converts at the PTX level to avoid recompiling. MapCG [18] is a hybrid computing framework that uses source-to-source translation to translate CUDA kernels to C programs. The author of Reference [28] proposes another framework for hybrid computing, based on Ocelot, that translates GPU programs at the PTX level. Cumulus [8] uses Clang to parse CUDA programs and modifies them at the AST level. Cumulus is mainly concerned with CUDA runtime support; for the compilation part, it reuses the transformation in MCUDA. Instead of directly translating CUDA/PTX to CPU executable files, other projects utilize the portability of other front-end languages. The authors of References [17, 49] propose using source-to-source translation to translate CUDA to OpenCL. Instead of source-to-source translation, References [44, 48] implement the translation at the LLVM IR level. The DPC++ Compatibility Tool [19] and HIPIFY [6] are tools that translate CUDA to the source languages of Intel and AMD devices.
    Most related works only focus on supporting old CUDA features. However, the rapid evolution of GPU hardware and software stacks brings many new features that are important for achieving high performance, such as warp-level collectives, unified memory, and CUDA Graphs. Achieving high coverage of these new features is an ongoing project. The researchers in Reference [42] propose using explicit barriers and memory exchanges to support warp shuffle on OpenMP, which shares the same insight as COX.
    OpenCL is an open standard that allows executing SPMD programs on MPMD architectures. POCL [23] is an open-source OpenCL implementation that supports a CPU backend. To support SPMD programs on CPUs, POCL implements flat collapsing at the LLVM IR level. The authors of Reference [26] also propose using flat collapsing for OpenCL, but with a different method of inserting extra synchronization and finding the Parallel Regions, which results in fewer extra synchronization barriers. However, this method is not extensible to Hierarchical Parallel Regions; thus, it cannot be used to support warp-level features. In Reference [27], another OpenCL implementation is proposed that mainly focuses on supporting OpenCL programs on multi-device clusters with heterogeneous devices.

    8.1 DPC++

    DPC++ [52] is a programming language designed by Intel for data-parallel programming. Intel also provides the oneAPI framework to execute DPC++ programs on Intel CPUs and GPUs. To support CUDA on Intel devices, Intel provides DPCT [19], a tool that migrates CUDA programs to DPC++ programs. DPC++, like CUDA, provides an interface for SPMD programs. Currently, there is little documentation on how the oneAPI framework executes DPC++ programs on CPUs. From the Intel community [20], we can infer that it also implements the flat collapsing mechanism. From the evaluation in Section 7.3, we find that the major optimizations in DPC++ for the CPU backend target applications with a large number of kernel launches and a large number of blocks. DPC++ only supports Intel CPUs/GPUs; thus, it likely also contains compiler optimizations customized for Intel CPU architectures.

    8.2 HIP-CPU

    HIP [4] is a GPU programming language designed by AMD for AMD GPUs. The design philosophy of HIP is to keep it close enough to CUDA that developers can easily translate CUDA programs to HIP. AMD also provides HIPIFY [6] to automatically translate CUDA to HIP. HIP-CPU [5] is a framework developed by AMD to support executing HIP on CPUs. This framework uses a different mapping mechanism: instead of mapping GPU threads to loop iterations, it maps GPU threads to CPU fibers. Although this mechanism does not generate explicit loops, the fibers form implicit loops through context switching. Since there is no concept of a CUDA warp, this transformation is also regarded as a sub-class of flat collapsing.
    The SPMD-to-MPMD transformation in DPC++, HIP-CPU, and COX is shown in Figure 10.
    Fig. 10.
    Fig. 10. The SPMD-to-MPMD transformations implemented by DPC++, HIP-CPU, and COX .

    9 Conclusion

    COX is a framework that supports executing CUDA on CPUs. In this article, we focus on the compilation part of this framework. In particular, as existing frameworks cannot support CUDA warp-level functions, we propose hierarchical collapsing to support these functions on CPUs. With hierarchical collapsing, COX achieves higher coverage on the CUDA 10.1 SDK and Crystal benchmarks than existing frameworks. We also compare the performance of the generated kernel programs with existing frameworks and find that performance varies greatly across applications. We use profiling tools to analyze the programs and find that, although hierarchical collapsing achieves comparable or even higher performance in most applications, for some applications the generated programs are too complex to be fully optimized by compilers. Future work will provide compiler optimizations to increase SIMD instruction utilization for programs generated by hierarchical collapsing.

    Acknowledgment

    We thank Blaise Tine, Jeffrey Young, Jiashen Cao, Jun Chen, and Bhanu Garg, and we acknowledge hardware support from NSF CCRI #2016701 and Intel.

    Footnotes

    1
    Warp is now officially a part of the programming model instead of being a microarchitecture concept.
    2
    This example is a simplified version of reduction_kernel.cu in CUDA 10.1 SDK.
    3
    A latch is a node in the loop that has an edge to the loop header. An exiting edge is an edge from inside the loop to a node outside of the loop. The source of such an edge is called an exiting block, its target is an exit block [29].
    4
    To keep the article brief, we move the proof to the Appendix.

    Supplementary Material

    3554736-app (3554736-app.pdf)
    Supplementary material

    References

    [1]
    Compiling and Executing CUDA Programs in Emulation Mode. Retrieved from https://developer.nvidia.com/cuda-toolkit.
    [2]
    Jose-Ignacio Agulleiro and Jose-Jesus Fernandez. 2015. Tomo3D 2.0—Exploitation of advanced vector extensions (AVX) for 3D reconstruction. J. Struct. Biol. 189, 2 (2015), 147–152.
    [3]
    Aksel Alpay and Vincent Heuveline. 2020. SYCL beyond OpenCL: The architecture, current state and future direction of hipSYCL. In Proceedings of the International Workshop on OpenCL.
    [7]
    Abu Asaduzzaman, Alec Trent, S. Osborne, C. Aldershof, and Fadi N. Sibai. 2021. Impact of CUDA and OpenCL on parallel and distributed computing. In Proceedings of the 8th International Conference on Electrical and Electronics Engineering (ICEEE’21). IEEE, 238–242.
    [8]
    Vera Blomkvist Karlsson. 2021. Cumulus - Translating CUDA to sequential C++: Simplifying the process of debugging CUDA programs. Dissertation. KTH Royal Institute of Technology.
    [9]
    Berenger Bramas. 2017. Fast sorting algorithms using AVX-512 on intel knights landing. arXiv preprint arXiv:1704.08579, 305 (2017), 315.
    [10]
    Shuai Che, Michael Boyer, Jiayuan Meng, David Tarjan, Jeremy W. Sheaffer, Sang-Ha Lee, and Kevin Skadron. 2009. Rodinia: A benchmark suite for heterogeneous computing. In Proceedings of the IEEE International Symposium on Workload Characterization (IISWC’09). IEEE, 44–54.
    [11]
    Kuan-Chung Chen and Chung-Ho Chen. 2018. Enabling SIMT execution model on homogeneous multi-core system. ACM Trans. Arch. Code Optim. 15, 1 (2018), 1–26.
    [12]
    Gregory Diamos, Andrew Kerr, Sudhakar Yalamanchili, and Nathan Clark. 2010. Ocelot: A dynamic optimization framework for bulk-synchronous applications in heterogeneous systems. In Proceedings of the 19th International Conference on Parallel Architectures and Compilation Techniques (PACT’10). IEEE, 353–364.
    [13]
    G. F. Diamos, A. R. Kerr, S. Yalamanchili, and N. Clark. 2010. Ocelot: A dynamic optimization framework for bulk-synchronous applications in heterogeneous systems. In Proceedings of the 19th International Conference on Parallel Architectures and Compilation Techniques (PACT’10), Vol. 10. 353–364.
    [14]
    Amr S. Elhelw and Sreepathi Pai. 2020. Horus: A modular GPU emulator framework. In Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS’20). IEEE, 104–106.
    [15]
    Jayanth Gummaraju, Ben Sander, Laurent Morichetti, Benedict R. Gaster, Michael Houston, and Bixia Zheng. 2010. Twin peaks: A software platform for heterogeneous computing on general-purpose and graphics processors. In Proceedings of the 19th International Conference on Parallel Architectures and Compilation Techniques (PACT’10). IEEE, 205–215.
    [16]
    Ziyu Guo, Eddy Zheng Zhang, and Xipeng Shen. 2011. Correctly treating synchronizations in compiling fine-grained spmd-threaded programs for cpu. In Proceedings of the International Conference on Parallel Architectures and Compilation Techniques. IEEE, 310–319.
    [17]
    Matt J. Harvey and Gianni De Fabritiis. 2011. Swan: A tool for porting CUDA programs to OpenCL. Comput. Phys. Commun. 182, 4 (2011), 1093–1099.
    [18]
    Chuntao Hong, Dehao Chen, Wenguang Chen, Weimin Zheng, and Haibo Lin. 2010. MapCG: Writing parallel program portable between CPU and GPU. In Proceedings of the 19th International Conference on Parallel Architectures and Compilation Techniques. 217–226.
    [23]
    Pekka Jääskeläinen, Carlos Sánchez de La Lama, Erik Schnetter, Kalle Raiskila, Jarmo Takala, and Heikki Berg. 2015. pocl: A performance-portable OpenCL implementation. Int. J. Parallel Program. 43, 5 (2015), 752–785.
    [24]
    Hwancheol Jeong, Sunghoon Kim, Weonjong Lee, and Seok-Ho Myung. 2012. Performance of SSE and AVX instruction sets. In Proceedings of the 30th International Symposium on Lattice Field Theory.
    [25]
    Ralf Karrenberg and Sebastian Hack. 2011. Whole-function vectorization. In Proceedings of the International Symposium on Code Generation and Optimization (CGO’11). 141–150.
    [26]
    Ralf Karrenberg and Sebastian Hack. 2012. Improving performance of OpenCL on CPUs. In Proceedings of the International Conference on Compiler Construction. Springer, 1–20.
    [27]
    Jungwon Kim, Sangmin Seo, Jun Lee, Jeongho Nah, Gangwon Jo, and Jaejin Lee. 2012. SnuCL: An OpenCL framework for heterogeneous CPU/GPU clusters. In Proceedings of the 26th ACM International Conference on Supercomputing. 341–352.
    [28]
    Changmin Lee, Won Woo Ro, and Jean-Luc Gaudiot. 2014. Boosting CUDA applications with CPU–GPU hybrid computing. Int. J. Parallel Program. 42, 2 (2014), 384–404.
    [29]
    LLVM. 2013. LLVM Loop Terminology. Retrieved from https://llvm.org/docs/LoopTerminology.html.
    [31]
    Saeed Maleki, Yaoqing Gao, Maria J. Garzar, Tommy Wong, David A. Padua, et al. 2011. An evaluation of vectorizing compilers. In Proceedings of the International Conference on Parallel Architectures and Compilation Techniques. IEEE, 372–382.
    [32]
    Aaftab Munshi. 2009. The OpenCL specification. In Proceedings of the IEEE Hot Chips 21 Symposium (HCS’09). IEEE, 1–314.
    [33]
    Lifeng Nai, Yinglong Xia, Ilie G. Tanase, Hyesoon Kim, and Ching-Yung Lin. 2015. GraphBIG: Understanding graph computing in the context of industrial solutions. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC’15). IEEE, 1–12.
    [34]
    Dorit Nuzman and Richard Henderson. 2006. Multi-platform auto-vectorization. In International Symposium on Code Generation and Optimization (CGO’06). IEEE.
    [35]
    Dorit Nuzman, Ira Rosen, and Ayal Zaks. 2006. Auto-vectorization of interleaved data for SIMD. ACM SIGPLAN Not. 41, 6 (2006), 132–143.
    [36]
    Dorit Nuzman and Ayal Zaks. 2006. Autovectorization in GCC—Two years later. In Proceedings of the GCC Developers Summit. 145–158.
    [37]
    [40]
    NVIDIA. 2021. CUB. Retrieved from https://nvlabs.github.io/cub/.
    [42]
    Atmn Patel, Shilei Tian, Johannes Doerfert, and Barbara Chapman. 2021. A virtual GPU as developer-friendly OpenMP offload target. In Proceedings of the 50th International Conference on Parallel Processing Workshop. 1–7.
    [43]
    Hugh Perkins. 2016. cltorch: A hardware-agnostic backend for the Torch deep neural network library, based on OpenCL. arXiv preprint arXiv:1606.04884 (2016).
    [44]
    Hugh Perkins. 2017. CUDA-on-CL: A compiler and runtime for running NVIDIA CUDA C++ 11 applications on OpenCL 1.2 Devices. In Proceedings of the 5th International Workshop on OpenCL. 1–4.
    [45]
    Matt Pharr and William R. Mark. 2012. A SPMD compiler for high-performance CPU programming. In Proceedings of the Innovative Parallel Computing: Foundations & Applications of GPU, Manycore, and Heterogeneous Systems.
    [46]
    Vasileios Porpodas. 2017. SuperGraph-SLP auto-vectorization. In Proceedings of the 26th International Conference on Parallel Architectures and Compilation Techniques (PACT’17). IEEE, 330–342.
    [47]
    Ira Rosen, Dorit Nuzman, and Ayal Zaks. 2007. Loop-aware SLP in GCC. In Proceedings of the GCC Developers Summit.
    [48]
    Ruobing Han, Blaise Tine, Jaewon Lee, Jaewoong Sim, and Hyesoon Kim. 2021. Supporting CUDA for an extended RISC-V GPU architecture. In Proceedings of the 5th Workshop on Computer Architecture Research with RISC-V (CARRV’21).
    [49]
    Paul Sathre, Mark Gardner, and Wu-chun Feng. 2019. On the portability of CPU-accelerated applications via automated source-to-source translation. In Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region. 1–8.
    [50]
    Anil Shanbhag, Samuel Madden, and Xiangyao Yu. 2020. A study of the fundamental performance characteristics of GPUs and CPUs for database analytics. In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD’20). 1617–1632.
    [51]
    Jun Shirako, Jisheng M. Zhao, V. Krishna Nandivada, and Vivek N. Sarkar. 2009. Chunking parallel loops in the presence of synchronization. In Proceedings of the 23rd International Conference on Supercomputing. 181–192.
    [52]
    André Silveira, Rafael Bohrer Avila, Marcos E. Barreto, and Philippe Olivier Alexandre Navaux. 2000. DPC++: Object-oriented programming applied to cluster computing. In Proceedings of the International Conference on Parallel and Distributed Processing Techniques and Applications (PDPTA’00).
    [53]
    John A. Stratton, Vinod Grover, Jaydeep Marathe, Bastiaan Aarts, Mike Murphy, Ziang Hu, and Wen-mei W. Hwu. 2010. Efficient compilation of fine-grained SPMD-threaded programs for multicore CPUs. In Proceedings of the 8th Annual IEEE/ACM International Symposium on Code Generation and Optimization. 111–119.
    [54]
    John A. Stratton, Hee-Seok Kim, Thomas B. Jablin, and Wen-mei W. Hwu. 2013. Performance Portability in Accelerated Parallel Kernels. Center for Reliable and High-Performance Computing.
    [55]
    John A. Stratton, Sam S. Stone, and Wen-mei W. Hwu. 2008. MCUDA: An efficient implementation of CUDA kernels for multi-core CPUs. In Proceedings of the International Workshop on Languages and Compilers for Parallel Computing. Springer, 16–30.
    [56]
    Yifan Sun, Xiang Gong, Amir Kavyan Ziabari, Leiming Yu, Xiangyu Li, Saoni Mukherjee, Carter McCardwell, Alejandro Villegas, and David Kaeli. 2016. Hetero-mark, a benchmark suite for CPU-GPU collaborative computing. In Proceedings of the IEEE International Symposium on Workload Characterization (IISWC’16). IEEE, 1–10.
    [57]
    Yuhsiang M. Tsai, Terry Cojean, and Hartwig Anzt. 2021. Porting a sparse linear algebra math library to Intel GPUs. arXiv:2103.10116 [cs.DC].
    [58]
    Yao Zhang, Mark Sinclair, and Andrew A. Chien. 2013. Improving performance portability in OpenCL programs. In Proceedings of the International Supercomputing Conference. Springer, 136–150.
