Article

An Evaluation of Directive-Based Parallelization on the GPU Using a Parboil Benchmark

School of Electrical Engineering, University of Belgrade, Bulevar Kralja Aleksandra 73, 11000 Belgrade, Serbia
* Author to whom correspondence should be addressed.
Electronics 2023, 12(22), 4555; https://doi.org/10.3390/electronics12224555
Submission received: 14 September 2023 / Revised: 29 October 2023 / Accepted: 1 November 2023 / Published: 7 November 2023

Abstract

Heterogeneous architectures consisting of both central processing units and graphics processing units are common in contemporary computer systems. For that reason, several programming models have been developed to exploit the available parallelism, such as the low-level CUDA and OpenCL and the directive-based OpenMP and OpenACC. In this paper, we explore and evaluate the applicability of OpenACC, a directive-based programming model for GPUs. We focus both on the performance and on the programming effort needed to parallelize existing sequential algorithms for GPU execution. The evaluation is based on the benchmark suite Parboil, which consists of 11 mini-applications from different scientific domains, both compute- and memory-bound. The results show that mini-apps parallelized with OpenACC can achieve significant speedups over sequential implementations and, in some cases, even outperform CUDA implementations. Furthermore, the programming effort is lower than with low-level models such as CUDA and OpenCL, because a majority of the work is left to the compiler and, overall, the code needs less restructuring.

1. Introduction

Graphics processing units (GPUs) have been used for general purpose computation for almost two decades. Due to their specific, manycore architecture, they offer massive parallelism and computational power compared to multicore central processing units (CPUs). Contemporary GPUs significantly surpass CPUs in terms of available bandwidth and raw computational power, with a large portion of the transistors devoted to computing units rather than caches and control logic [1].
There are a number of commercial and scientific domains where they prove to be very efficient, yielding significant speedups and offering good scalability [2,3,4,5,6]. It is believed that heterogeneous architectures, combining CPUs and GPUs, will be prevalent in the future, both in high performance computing (HPC) and consumer electronics domains [7], with the GPU performance steadily increasing [8].
Low-level programming interfaces are mostly used for GPU programming. Notable examples include the vendor-specific Compute Unified Device Architecture (CUDA) used for NVIDIA GPUs, the ROCm compute platform for AMD GPUs, and the Open Computing Language (OpenCL), an open standard for programming heterogeneous systems. These programming models offer control over different aspects of code execution. The programmer is responsible both for program correctness and for optimization, explicitly managing the thread execution configuration, data movement, synchronization, etc. Although very good performance-wise, those programming models lack the productivity of high-level programming languages, and writing and optimizing program code consumes a significant amount of time [9]. The other approach to parallelizing sequential code for GPU execution is to use programming models with a directive-based approach. The two most notable examples are OpenMP (Open Multi-Processing) and OpenACC (Open Accelerators). OpenMP is a well-known, proven programming model for multicore CPUs. The programmer uses directives to annotate regions of code that should be automatically parallelized by the compiler. The main sources of parallelism in sequential codes are loops; to allow parallel execution, loop iterations should be independent. Based on ideas from the OpenMP directive-based programming model for multicore CPUs, the OpenACC programming model for GPUs was introduced in 2011. OpenMP itself introduced basic support for GPUs in version 4.0 in 2013, while more advanced support was added later in versions 4.5 and 5.0.
In the OpenACC programming model, a set of compiler directives is used to annotate sequential code regions that should be offloaded for execution on the GPU. Basically, compiler directives are used to offload loops for GPU execution and manage the data movement between CPU and GPU. Although such an approach cannot always yield the performance of low-level programming models, it can obtain a reasonably good performance with much less programming effort. Also, performance portability remains high, as not all hardware details are exposed to the programmer. Many programmers find the directive-based approach useful, as it preserves the flexibility of running parallel code both on the CPUs and the GPUs, while also avoiding vendor-specific language lock-in [10].
Studies such as [11,12] suggest that OpenACC offers a more descriptive programming model than OpenMP, which is described as prescriptive. In OpenACC, much of the parallelization work is performed automatically by the compiler, while OpenMP requires the programmer to specify more precisely how to parallelize the code. Several studies also consider OpenACC a more mature programming model for GPUs in terms of compiler and platform support [11,13]. The descriptive nature of the programming model and the maturity of the whole OpenACC ecosystem were the main reasons we chose to evaluate OpenACC in this work.
In this paper, we re-implemented the applications from the Parboil GPU benchmark suite [14] using OpenACC to evaluate the effectiveness of directive-based parallelization on the GPU. The Parboil benchmark consists of eleven computing applications from various scientific domains, including image processing, graph theory, biomolecular simulation, fluid dynamics, and astronomy. The applications were carefully selected to represent diverse workloads typical of high-performance computing and to exhibit different algorithmic approaches, memory access patterns, and scalability issues. To the best of our knowledge, no complete OpenACC implementation of the Parboil benchmark is available in the open literature. Our main focus was on the effort needed to apply this programming model to the given applications and on the performance itself. Some sequential codes require certain transformations to produce correct parallel programs. In addition, to improve the results, several optimizations were performed on the sequential code after the directives were applied. We primarily focused on comparing the existing CUDA and OpenCL implementations with the new OpenACC implementations. The main contributions of this paper and its additions to the field are summarized as follows.
  • We implemented applications from the Parboil benchmark suite using OpenACC and made them publicly available for further research.
  • We evaluated OpenACC implementations of the applications from the Parboil benchmark suite and compared them to their CUDA and OpenCL counterparts.
  • We discussed the programming effort needed to parallelize the application on the GPU using the directive-based programming model compared to low-level programming models.
  • We proposed several recommendations and optimizations that can be used to utilize the OpenACC programming model more efficiently.
Related work, our motivation, the choice of benchmark applications, and the results are discussed in the rest of the paper. The paper is organized as follows. In Section 2, we review papers and results relevant to the topic at hand. In Section 3, we briefly describe the OpenACC programming model and its applications in GPU computing. We also discuss benchmark suites for GPU computing in general. Section 4 describes the Parboil benchmark suite and its applications, which we parallelized via OpenACC. Implementation details are given in Section 5. Results and discussion are presented in Section 6 and Section 7. Section 8 briefly concludes the paper and gives directions for future work.

2. Related Work

The OpenACC programming model has been used to parallelize scientific applications in numerous fields. OpenACC has been extensively used to parallelize simulations in nuclear physics [10,15]. A hybrid MPI/OpenACC implementation of an iterative eigensolver for many-body calculations can be found in [10]. The implementation was derived from a previous OpenMP-based solution. The authors note that architectural differences between the CPU and GPU changed the way OpenACC directives were inserted into the original code compared to the OpenMP implementation, especially for sparse matrix–vector multiplications with multiple vector operations. They report that the GPU implementation achieved a significant speedup over the CPU implementation. Another example is the radiation transport mini-application presented in [15], based on a combined MPI (Message Passing Interface)/OpenACC implementation. It performs slightly better than the CUDA implementation of the same algorithm. The study [12] presented a novel methodology for implementing scientific applications with MPI, OpenMP, and OpenACC. The authors demonstrated the semi-automatic methodology on four real-life applications in the domains of physics and materials science. The OpenACC implementation slightly outperformed CUDA Fortran in one of the examples used.
Benchmarking multicore CPU and manycore GPU architectures has been the focus of numerous research efforts in the past. Most of the benchmark suites are based on the concept of mini-applications (mini-apps) [16]. Mini-apps are parts of much larger, real-life applications from various domains that contain only the core algorithms rather than complete solutions from their respective fields. Still, they retain much of the performance characteristics of the original applications, but they are easier to maintain and experiment with, and they make it easier to explore the parameter space defined by the choice of hardware platform, programming model, runtime environment, compiler, etc.
One of the first benchmark suites for GPU computing was Rodinia [17]. Rodinia introduced mini-apps based on the dwarf application taxonomy from Berkeley. In version 3.1, it consists of 23 different mini-apps implemented in CUDA, OpenCL, and OpenMP. Another notable example is the Parboil benchmark [14], which consists of 11 different applications in sequential and several parallelized forms, together with a well-developed testing framework in Python. The NAS Parallel benchmark is a standard benchmark suite for parallel computing based on MPI and OpenMP. The NAS benchmark consists of five kernels and three mini-applications. However, the GPU version of the suite, based on CUDA, became available only recently [18], and directive-based implementations for the GPU are not available.
Performance comparisons of the OpenACC and CUDA implementations of different applications have been studied in the open literature [19,20,21]. However, most of the studies focused on one type of application, such as computational fluid dynamics (CFD) solvers [19,21], molecular modeling code [20], and a combustion modelling simulation [22]. Four mini-apps from physics and materials science domains were ported to OpenACC and OpenMP and evaluated on several heterogeneous supercomputing systems in [11]. Those mini-apps utilized common methods found in the domain of high performance computing. The experiences from the study suggest that porting efforts are dependent on compiler maturity in implementing both sets of directives on a given platform.
The OpenMP, OpenACC, and Kokkos implementations of a complete Navier–Stokes flow solver mini-application were compared in [13]. Navier–Stokes flow solvers are frequently used for computational fluid dynamics problems. A performance portability study [23] evaluated the OpenMP, OpenACC, Kokkos, and RAJA programming models with a newly developed metric based on the data reported in 324 case studies from the open literature. It reports performance portability in over 80% of the cases and indicates no significant differences between architectures and compilers. An evaluation of the performance of three HPC-style mini-applications written in OpenCL and SYCL is given in [24]. SYCL is a newly introduced C++-based parallel programming model for implementing single-source applications running on heterogeneous platforms. The results show that the SYCL implementations achieve performance similar to a direct OpenCL implementation.
A performance comparison of the OpenMP, OpenACC, and CUDA programming models on an NVIDIA TESLA V100 GPU is presented in [25]. The authors note that the OpenMP and OpenACC compilers were able to produce efficient parallel codes in simple cases, such as matrix multiplication. For more complex numerical simulations, the OpenACC/OpenMP solutions are up to 80% slower in some cases compared to the CUDA solution. The General Plasmon Pole (GPP) kernel application is studied in [26] using OpenMP, OpenACC, and CUDA. While focusing predominantly on the OpenMP CPU and GPU implementations of the GPP code, the study also compares both frameworks regarding register pressure and data movement. However, the CUDA implementation outperforms both OpenMP and OpenACC by a factor of two for the GPP application. The OpenACC and CUDA implementations of 19 kernels from 10 mini-applications, chosen mainly from the Rodinia benchmark suite, were evaluated in [27]. Various optimizations were employed to improve performance. The results show that the PGI compiler produces slightly less performant code compared to the hand-written CUDA code. On the other hand, the authors observed much faster data movements with an increased number of memory copy operations.
The SPEC ACCEL 1.2 benchmark is one of the latest efforts to evaluate the performance of different accelerator platforms. It comprises 13 applications implemented in OpenCL, OpenACC, and OpenMP 4.5. It was carefully examined on several different supercomputing platforms using closed-source (PGI) and open-source (GCC) compilers [28]. The study suggests that OpenACC compilers are functionally more mature than their OpenMP 4.5 counterparts. However, access to the SPEC ACCEL 1.2 benchmark is limited, as it is a relatively high-cost commercial product. The Cactus suite [29] presents a novel top-down approach to GPU benchmarking using 10 complex, real-life applications. The authors state that the workloads in their benchmark suite exhibit more complex and diverse execution behavior while covering a wider set of computational problems than previous solutions.
Since its introduction in 2011, OpenACC has mostly been backed by an industry consortium led by Cray, CAPS, NVIDIA, and PGI. That affected the development of the compiler infrastructure needed for the successful adoption of the standard, as well as portability to other architectures. Although there have been several academic efforts, such as accULL [30], OpenUH [31], and the Omni compiler infrastructure [32], the PGI compilers have been the most mature and widely available compilers with extensive OpenACC support for NVIDIA GPUs. The alternatives include the GCC compiler, which supports OpenACC from version 10, and Clacc, based on Clang and LLVM [33]. However, those OpenACC compiler implementations are not as stable, bug-free, and well documented, as reported in several studies, such as [28,34,35], which indicate that the PGI compiler has the highest conformance with the OpenACC standard.

3. OpenACC Programming Model

OpenACC is an open standard for directive-based parallel programming on heterogeneous systems. It is used for offloading regions of code for execution on accelerator devices, such as GPUs, but also CPUs and FPGA boards. It is implemented on all major parallel platforms, including CPUs and GPUs of different vendors, operating systems, and compiler infrastructures. Supported programming languages include C, C++, and Fortran, which allows easy porting of existing applications via the annotation of source codes implemented in those languages [36]. Similar to OpenMP directives, which use the sentinel #pragma omp, OpenACC directives for C/C++ use the sentinel #pragma acc to annotate the code. The current version of the specification is 3.3.
OpenACC offers an implicit, descriptive programming model [10]. The programmer annotates code segments that should be executed in parallel on the accelerator device. The programmer is responsible for defining the execution context, which includes parallel loops and dependencies, as well as the data involved and their movement. The compiler and the runtime environment use that information at compilation and execution time, respectively, to offload the code efficiently onto the accelerator. In that sense, parallelization can be seen as assisted, rather than fully automated.
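To illustrate this annotation style, the following minimal sketch (a generic SAXPY loop, not one of the Parboil mini-apps) shows how a single directive, together with data clauses, is enough to offload an independent loop:
  /* A generic SAXPY kernel offloaded with a single OpenACC directive.
     The copyin/copy clauses describe the required data movement. */
  void saxpy(int n, float a, const float *restrict x, float *restrict y)
  {
      #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
      for (int i = 0; i < n; ++i) {
          y[i] = a * x[i] + y[i];
      }
  }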
Generally, the OpenACC execution model shares similarities with the NVIDIA CUDA low-level programming model. The CPU, which serves as a host, configures and launches compute kernels that use a large number of parallel threads. To support different GPU architectures, three levels of parallelism are available in the OpenACC execution model: gang, worker, and vector [37]. These are mapped to the corresponding hardware units depending on the hardware used. This mapping is usually performed by the compiler, but the programmer has some degree of control over the whole process. The execution model assumes that the GPU contains multiple processing units running in parallel that can efficiently perform vector-like (SIMD) operations. In the case of NVIDIA GPUs, the processing unit is a streaming multiprocessor. The gang is mapped to a thread block, the worker represents a warp of threads, and the vector corresponds to a CUDA thread.
Two slightly different directives are used to create an OpenACC parallel region. The first one is the parallel construct, which is used to annotate work-sharing loops. The other one is the kernels directive, which allows assembling kernels that contain more than one loop [38]. The OpenACC parallel directive is similar to the OpenMP construct of the same name. It allows more explicit, user-defined parallelism, offering good fine-tuning prospects and control over the parallelization process. As a result, only one kernel will be generated and executed at launch time with a predetermined number of gangs and workers per gang.
The kernels construct offers more flexibility, as it gives the compiler more room for optimizations. It defines a code region that can be divided into a series of kernels. Typically, the execution configuration of each loop turned into a kernel can differ, as the kernels do not have to use the same number of workers and gangs. Clauses are used to control the level of parallelism in the parallel and kernels directives. However, those clauses should be used with caution to avoid possible performance-related issues on different platforms. Like the OpenMP for directive, the loop construct is available to specify which loops are to be parallelized. It is used for work-sharing purposes.
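The difference between the two constructs can be sketched on a generic example (not taken from the benchmark): with parallel loop the programmer asserts that the loop is safe to parallelize, while inside a kernels region the compiler analyzes the loops and may generate a separate kernel for each:
  void parallel_vs_kernels(int n, float a, const float *restrict x,
                           float *restrict y, float *restrict z)
  {
      /* parallel construct: the programmer asserts that the loop is parallel;
         a single kernel is generated with a predetermined configuration. */
      #pragma acc parallel loop copyin(x[0:n]) copyout(y[0:n])
      for (int i = 0; i < n; ++i)
          y[i] = a * x[i];

      /* kernels construct: the compiler analyzes the region and may emit a
         separate kernel for each loop, each with its own configuration. */
      #pragma acc kernels copyin(y[0:n]) copyout(z[0:n])
      {
          for (int i = 0; i < n; ++i)
              z[i] = y[i] + 1.0f;
          for (int i = 0; i < n; ++i)
              z[i] = z[i] * z[i];
      }
  }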
Similar to CUDA and OpenCL programming languages, OpenACC does not offer direct synchronization or data sharing among gangs. The main reason for such assumptions is to support the scalable execution of code on architectures with different numbers of processing units. It means that only workers within the same gang are able to share the data, affecting the way parallelism in the code has to be mapped in the kernels [39].
Available memory spaces are usually physically separated in heterogeneous CPU/GPU architectures. Data transfers are needed to feed the data to the device and collect the results. The OpenACC specification offers support for memory management and memory transfers. There are clauses and directives, such as copyin, copyout, and update, to control the data movement and define data regions. To minimize the frequency of data transfers between the CPU and the GPU, OpenACC offers programmers the data directive. This directive allows programmers to create data regions in which memory transfers are executed before the parallel regions, thus reducing the number of transfers in cases where multiple parallel regions are executed on the same data.
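A minimal sketch of such a data region is given below (a generic iterative smoothing loop with assumed array names, not one of the benchmark codes); the arrays stay resident on the device across all parallel regions, so only one transfer in and one transfer out occur:
  /* The data region keeps a and b resident on the device across all
     iterations; data moves to the device once and back once. */
  void jacobi_sweeps(int n, int steps, float *restrict a, float *restrict b)
  {
      #pragma acc data copy(a[0:n]) create(b[0:n])
      {
          for (int s = 0; s < steps; ++s) {
              #pragma acc parallel loop present(a[0:n], b[0:n])
              for (int i = 1; i < n - 1; ++i)
                  b[i] = 0.5f * (a[i - 1] + a[i + 1]);

              #pragma acc parallel loop present(a[0:n], b[0:n])
              for (int i = 1; i < n - 1; ++i)
                  a[i] = b[i];
          }
      }   /* a is copied back to the host here */
  }
Without the enclosing data region, each parallel loop would trigger its own transfers of a and b in both directions.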
Multiple memory spaces are a source of performance-related problems in heterogeneous systems. Generally, the capacity of the GPU memory is smaller than that of the CPU memory. For that reason, large datasets are usually split into chunks which are constantly transferred to the device. In such cases, memory bandwidth limits the performance of the code executed on the device. It is therefore important to reuse data on the device as much as possible via the hardware- and software-managed caches on the device [40]. OpenACC offers the cache directive for such purposes. It has been shown that proper usage of the cache directive can lead to performance close to that of hand-written CUDA code [40].
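The following sketch shows the typical use of the cache directive on a simple one-dimensional stencil (a generic example with assumed array names, not code from the benchmark):
  /* A simple 1D smoothing loop; the cache directive asks the compiler to
     stage the reused neighbourhood of `in` into fast on-chip memory. */
  void smooth(int n, const float *restrict in, float *restrict out)
  {
      #pragma acc parallel loop copyin(in[0:n]) copy(out[0:n])
      for (int i = 1; i < n - 1; ++i) {
          #pragma acc cache(in[i - 1:3])
          out[i] = 0.25f * in[i - 1] + 0.5f * in[i] + 0.25f * in[i + 1];
      }
  }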
Further performance optimizations can be achieved with the directives and clauses added in the OpenACC 2.0 specification. The atomic directive adds support for atomic updates of global variables, similar to the feature present in CUDA via intrinsic functions. The tile clause of the loop directive can be used to control the locality of loop nests. Several other features are available to the programmer to control data movement, asynchronous execution, and so on. The OpenACC specification has been constantly improving. Specification 3.2 introduced a number of new features [35], such as enhanced support for C++ lambda functions, array reduction operators, and new constructs. New features have also been proposed by the research community, such as the concept of static graphs [41].
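The atomic directive and the tile clause can be sketched as follows (generic examples with assumed names and sizes: a 256-bin histogram update and a blocked matrix transpose):
  /* Atomic update: a 256-bin histogram where many threads may hit the
     same bin; tile clause: a blocked out-of-place matrix transpose. */
  void atomic_and_tile(int n, const int *restrict keys, int *restrict hist,
                       int m, const float *restrict a, float *restrict b)
  {
      #pragma acc parallel loop copyin(keys[0:n]) copy(hist[0:256])
      for (int i = 0; i < n; ++i) {
          #pragma acc atomic update
          hist[keys[i] & 255]++;   /* masking keeps the index inside the 256 bins */
      }

      #pragma acc parallel loop tile(32, 32) copyin(a[0:m*m]) copyout(b[0:m*m])
      for (int i = 0; i < m; ++i)
          for (int j = 0; j < m; ++j)
              b[j * m + i] = a[i * m + j];
  }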

4. Parboil Benchmark Suite

The Parboil benchmark suite consists of eleven throughput-oriented mini-applications useful for studying GPUs and similar computing architectures [14]. The benchmarks are selected from several key scientific and commercial fields in high-performance computing, such as fluid dynamics, molecular simulation, image processing and medical imaging, linear algebra, and graph processing. Table 1 provides an overview of the mini-apps in the Parboil suite with short descriptions based on [14].
Parboil provides versions of the code with varying levels of optimization. The baseline, OpenMP, OpenCL, and CUDA implementations are available for all mini-apps, while for some of them, additional versions optimized for particular architectures are given. This benchmark architecture enables programmers and compiler developers to evaluate source and compiler optimizations on different architectures. The mini-apps are complemented by a comprehensive testing framework written in Python, which eases the evaluation of parallelized codes and their comparison, both for correctness and for performance. Each mini-app is accompanied by datasets used for testing, which consist of inputs and expected (golden) outputs. Our choice of the Parboil benchmark was mostly based on the fact that it represents a comprehensive framework for developing, testing, and comparing the results of parallel applications using different programming models.
Breadth-First Search (BFS) is a graph traversal algorithm used in many graph-related problems. It is interesting because of its irregular data access patterns. CUTCP computes the short-range component of the electrostatic potential field produced by charged atoms distributed throughout a volume [42]. The histogram application represents a classic operation in which the numbers of occurrences of output values are calculated based on the input data set. LBM is a widely used method for solving systems of partial differential equations [6]. MRI-Gridding implements one of the preprocessing steps in magnetic resonance imaging; each input sample is interpolated onto the grid points using the Kaiser–Bessel function. MRI-Q represents one of the phases in MRI image reconstruction. The Sum of Absolute Differences (SAD) is a method used in the full-search motion estimation algorithm of the H.264/AVC encoder.
SGEMM is a frequently used operation in numerical linear algebra codes and a very well-studied example in the HPC domain. It is part of the BLAS (Basic Linear Algebra Subprograms) library [43] and its numerous descendants. Sparse matrix–vector multiplication (SpMV) is used in many iterative solvers. It exhibits both regular and irregular data access patterns. The Stencil mini-app represents a partial differential equation solver based on an iterative Jacobi method on a 3D structured grid. The Two-Point Angular Correlation Function (TPACF) is a measure of the distribution of massive bodies in space. Essentially, it is a variant of the histogram application, as it calculates the histogram of angular distances between all pairs of given bodies.

5. Implementation Details

This section describes the implementation details of each benchmark using the OpenACC programming model. The focus is on specific differences between the OpenACC implementations and the ones provided in the Parboil benchmark. More detailed explanations of each benchmark can be found in the original Parboil technical report [14].
One recurring problem that we faced during the implementation phase was the inability to parallelize some of the applications using the atomic directives provided by the OpenACC framework. However, working on a CUDA platform allowed us to use CUDA-style atomics within OpenACC parallel regions, which solved the problem. The only requirement is that these functions are declared with the #pragma acc routine seq directive before their usage. If this declaration is omitted, the compiler assumes that these are user-defined functions and the compilation of the application fails. In addition, some applications leveraged CUDA memory operations, like cudaMemset, to obtain better performance.
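Following the description above, such a declaration might look as shown below; this is only a sketch for an NVIDIA target compiled with the NVIDIA/PGI compiler, and the counter example around it uses assumed names (atomicAdd itself is the CUDA intrinsic):
  /* Make the CUDA intrinsic visible to OpenACC device code by declaring it
     as a sequential device routine (NVIDIA target, NVIDIA/PGI compiler). */
  #pragma acc routine seq
  extern int atomicAdd(int *address, int val);

  void count_positive(int n, const int *restrict data, int *restrict counter)
  {
      #pragma acc parallel loop copyin(data[0:n]) copy(counter[0:1])
      for (int i = 0; i < n; ++i) {
          if (data[i] > 0)
              atomicAdd(&counter[0], 1);   /* CUDA-style atomic inside an OpenACC region */
      }
  }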

5.1. Breadth-First Search (BFS)

The OpenACC implementation of BFS is based on the CUDA implementation presented in [44]. In each iteration, threads process the queue representing the current "frontier" of nodes and form a new queue representing the "frontier" for the next iteration. For this application to work correctly, the comparison of the current and the calculated cost of visitation, as well as the update of the node color, need to be atomic. These are far too complex for the OpenACC atomic directives and require CUDA-style atomics for their implementation. In addition, as in the CUDA version, all necessary buffers are transferred to the GPU only once, before the parallel region, using the data directive to minimize the memory footprint. The code of the OpenACC implementation is given in Listing 1. The variables input_queue and output_queue represent the current and the next "frontier" of nodes to visit in the current and the next iteration, respectively.
Listing 1. OpenACC implementation of the BFS application.
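Listing 1 is reproduced as an image in the published article; purely as an indicative sketch of the approach described above (not the authors' code), one BFS level could be expressed roughly as follows, where the graph layout, the queue_tail counter, and the atomicMin/atomicExch/atomicAdd declarations are assumptions:
  #pragma acc routine seq
  extern int atomicMin(int *address, int val);
  #pragma acc routine seq
  extern int atomicExch(int *address, int val);
  #pragma acc routine seq
  extern int atomicAdd(int *address, int val);

  enum { WHITE = 0, GRAY = 1 };

  /* One BFS level: expand the current frontier (input_queue) into the next
     one (output_queue). All buffers are assumed to already reside on the
     device thanks to an enclosing data region, hence default(present). */
  void bfs_level(int frontier_size, const int *restrict input_queue,
                 int *restrict output_queue, int *restrict queue_tail,
                 const int *restrict edge_offsets, const int *restrict edges,
                 int *restrict cost, int *restrict color)
  {
      #pragma acc parallel loop default(present)
      for (int q = 0; q < frontier_size; ++q) {
          int node = input_queue[q];
          #pragma acc loop seq
          for (int e = edge_offsets[node]; e < edge_offsets[node + 1]; ++e) {
              int nbr = edges[e];
              int new_cost = cost[node] + 1;
              /* Atomically keep the smaller cost; enqueue each neighbour at
                 most once by atomically flipping its colour. */
              int old_cost = atomicMin(&cost[nbr], new_cost);
              if (new_cost < old_cost && atomicExch(&color[nbr], GRAY) == WHITE)
                  output_queue[atomicAdd(queue_tail, 1)] = nbr;
          }
      }
  }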

5.2. Cutoff-Limited Coulombic Potential (CUTCP)

Cutoff-limited Coulombic Potential has three different implementations using the OpenACC programming model. The first and simplest one (labeled OPENACC_BASE) parallelizes the provided baseline implementation, which uses geometric hashing for each of the cells. Parallelization is achieved by adding the necessary OpenACC directives above the main loop, which iterates over all grid cells and the atoms in those cells.
The remaining two implementations are based on the idea presented in [45]. The atoms are first distributed into bins based on their position in the grid. Each bin represents a chunk of space and can hold a limited number of atoms. The overflow atoms that did not fit into these bins are processed with the provided algorithm on the CPU, whereas the atoms in the bins are processed on the GPU concurrently. This concurrent execution is achieved using the async and wait directives. The first of these solutions (labeled OPENACC_ATOM) iterates over the bins and updates the grid cells affected by the atoms in those bins. The second (labeled OPENACC_LATTICE) iterates over the grid cells and reads the atoms from the bins that are within the cutoff radius.
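The async/wait overlap described above can be sketched as follows (a generic example with assumed names, not the actual CUTCP code): the GPU kernel is queued asynchronously, the CPU processes its share in the meantime, and wait synchronizes before the results are combined:
  /* The GPU part is queued on async queue 1; the CPU part (e.g., the
     overflow atoms in CUTCP) runs concurrently on the host; wait(1)
     synchronizes before the results are used. */
  void overlap_cpu_gpu(int n_gpu, int n_cpu,
                       float *restrict gpu_part, float *restrict cpu_part)
  {
      #pragma acc parallel loop async(1) copy(gpu_part[0:n_gpu])
      for (int i = 0; i < n_gpu; ++i)
          gpu_part[i] = gpu_part[i] * gpu_part[i];

      for (int i = 0; i < n_cpu; ++i)           /* host work overlaps the GPU kernel */
          cpu_part[i] = cpu_part[i] * cpu_part[i];

      #pragma acc wait(1)
  }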

5.3. Histogram (HISTO)

We provided two histogram implementations using the OpenACC programming model. In both cases, the OpenACC data directive is used to reduce the number of data transfers between the CPU and the GPU. There are a total of two memory transfers: one from the CPU to the GPU (the image) and one from the GPU to the CPU (the histogram). The first implementation (labeled OPENACC_BASE) is based on the provided CPU implementation, whereas the second (labeled OPENACC_BINS) is based on the idea presented in [46]. In both implementations, the cudaMemset function is used to initialize the histogram to increase performance. The OPENACC_BASE implementation is shown in Listing 2.
Listing 2. OPENACC_BASE implementation of the HISTO application.
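Listing 2 is reproduced as an image in the published article; as an indicative sketch only (not the authors' code), a histogram kernel of this kind, with the bins initialized via cudaMemset on the device copy, might look roughly as follows, where all names are assumptions and the input values are assumed to be valid bin indices:
  #include <cuda_runtime.h>

  /* Histogram with the bins kept on the device for the whole computation;
     the device copy of histo is zeroed directly with cudaMemset. */
  void histogram(int img_size, const unsigned int *restrict img,
                 int histo_size, unsigned int *restrict histo)
  {
      #pragma acc data copyin(img[0:img_size]) copyout(histo[0:histo_size])
      {
          #pragma acc host_data use_device(histo)
          {
              cudaMemset(histo, 0, histo_size * sizeof(unsigned int));
          }

          #pragma acc parallel loop present(img[0:img_size], histo[0:histo_size])
          for (int i = 0; i < img_size; ++i) {
              #pragma acc atomic update
              histo[img[i]]++;
          }
      }
  }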

5.4. Lattice Boltzmann Method Simulation in Fluid Dynamics (LBM)

LBM is a memory-bound application. Therefore, the main objective in the case of LBM is to reduce the number of global memory accesses on the GPU and the number of data transfers between the CPU and the GPU. The reduced number of data transfers is achieved with the data directive. To reduce the number of global memory accesses on the GPU, we modified the original algorithm to use the same memory access pattern as the provided CUDA implementation. This pattern consists of reading all necessary values once per iteration and storing them in temporary variables, which are later used for all necessary calculations. However, this pattern required the modification of the three main loops.

5.5. MRI Cartesian Gridding (Gridding)

This application has only one implementation which uses the OpenACC programming model. It optimizes the original implementation by applying directives over the loop which iterates over the samples. Each thread is given one sample to process.

5.6. MRI Non-Cartesian Q Matrix Calculation (MRI-Q)

This application has two OpenACC implementations. The main goal of both is to avoid the usage of atomic operations. In the first implementation (labeled as OPENACC_INTERCHANGE), this is achieved by interchanging the two main loops. However, only the outer loop can be parallelized because parallelizing both would require the use of atomic directives. The second implementation (labeled as OPENACC_REDUCTION) goes a step further by allocating additional memory. This enables the usage of the collapse directive on the two main loops, which results in a higher level of parallelism because the result of each iteration is stored in a different memory location. However, this approach requires an additional step. The result has to be calculated with the OpenACC reduction directive. The code for the OPENACC_INTERCHANGE implementation is given in Listing 3.
Listing 3. OPENACC_INTERCHANGE implementation of the MRI-Q application.
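Listing 3 is reproduced as an image in the published article; as an indicative sketch only (not the authors' code), the interchanged loop structure might look roughly as follows, with variable names following the usual MRI-Q formulation:
  #include <math.h>

  #define PIx2 6.2831853071795864f

  /* The loop over output elements is the outer, parallel loop; the loop
     over k-space samples runs sequentially in each thread, so the partial
     sums accumulate in private registers and no atomics are needed. */
  void compute_q(int numK, int numX,
                 const float *restrict kx, const float *restrict ky,
                 const float *restrict kz, const float *restrict x,
                 const float *restrict y, const float *restrict z,
                 const float *restrict phiMag,
                 float *restrict Qr, float *restrict Qi)
  {
      #pragma acc parallel loop copyin(kx[0:numK], ky[0:numK], kz[0:numK], \
              x[0:numX], y[0:numX], z[0:numX], phiMag[0:numK]) \
              copyout(Qr[0:numX], Qi[0:numX])
      for (int i = 0; i < numX; ++i) {
          float qr = 0.0f, qi = 0.0f;
          #pragma acc loop seq
          for (int k = 0; k < numK; ++k) {
              float expArg = PIx2 * (kx[k] * x[i] + ky[k] * y[i] + kz[k] * z[i]);
              qr += phiMag[k] * cosf(expArg);
              qi += phiMag[k] * sinf(expArg);
          }
          Qr[i] = qr;
          Qi[i] = qi;
      }
  }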

5.7. Sum of Absolute Differences (SAD)

SAD has only one implementation which uses the OpenACC programming model. OpenACC directives are used to parallelize the baseline implementation. The only change to the original code is the elimination of a function that calculates the result for a single block. This was done to expose more consecutive loops that can be collapsed using the OpenACC collapse directive.

5.8. SGEMM

The SGEMM application has two implementations. The first (labeled OPENACC_TILE) uses the tile directive, which enables the tiling of the two outermost loops. The final value of a cell is calculated using the reduction directive. The second implementation (labeled OPENACC_CACHE) is based on the idea of Volkov and Demmel presented in [47]. A part of one matrix is cached using the cache directive and then used to calculate the values of the cells relevant to it. In addition, the two main loops are collapsed to achieve a greater level of parallelism. The code for the OPENACC_TILE implementation is given in Listing 4.
Listing 4. OPENACC_TILE implementation of the SGEMM application.
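Listing 4 is reproduced as an image in the published article; as an indicative sketch only (not the authors' code), the tiled variant might look roughly as follows, where the row-major layout and tile size are assumptions (the published version additionally uses a reduction directive for the inner product):
  /* Each thread computes one element of C; the tile clause blocks the
     (i, j) nest into 16x16 tiles for locality. Row-major layout assumed. */
  void sgemm(int m, int n, int p, float alpha, float beta,
             const float *restrict A,   /* m x p */
             const float *restrict B,   /* p x n */
             float *restrict C)         /* m x n */
  {
      #pragma acc parallel loop tile(16, 16) \
              copyin(A[0:m*p], B[0:p*n]) copy(C[0:m*n])
      for (int i = 0; i < m; ++i) {
          for (int j = 0; j < n; ++j) {
              float sum = 0.0f;
              for (int k = 0; k < p; ++k)
                  sum += A[i * p + k] * B[k * n + j];
              C[i * n + j] = alpha * sum + beta * C[i * n + j];
          }
      }
  }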

5.9. Sparse Matrix-Dense Vector Multiplication (SpMV)

The main hurdle of this application is its irregular memory access pattern. Therefore, to achieve peak performance, the OpenACC implementation of this application includes loop unrolling like that present in the CUDA implementation. Moreover, the usage of the data directive reduces the number of memory transfers between the CPU and the GPU.

5.10. Stencil

The Stencil application has been implemented using the collapse directive, which merges the three main loops to achieve the highest level of parallelism. Furthermore, the data directive is used to reduce the number of data transfers between the CPU and the GPU.
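A minimal sketch of such a collapsed stencil sweep inside a data region is given below; the 7-point update, the coefficient names c0 and c1, and the grid layout are assumptions in the spirit of the benchmark, not the authors' code:
  /* 7-point Jacobi sweeps over a 3D grid; the data region keeps both grids
     on the device for all iterations and the three spatial loops are
     collapsed into one parallel iteration space. */
  void stencil(int nx, int ny, int nz, int iters, float c0, float c1,
               float *A0, float *Anext)
  {
      #pragma acc data copy(A0[0:nx*ny*nz], Anext[0:nx*ny*nz])
      for (int t = 0; t < iters; ++t) {
          #pragma acc parallel loop collapse(3) default(present)
          for (int k = 1; k < nz - 1; ++k)
              for (int j = 1; j < ny - 1; ++j)
                  for (int i = 1; i < nx - 1; ++i) {
                      int c = i + nx * (j + ny * k);
                      Anext[c] = c1 * (A0[c - 1]       + A0[c + 1]
                                     + A0[c - nx]      + A0[c + nx]
                                     + A0[c - nx * ny] + A0[c + nx * ny])
                               - A0[c] * c0;
                  }
          /* swap the grids for the next sweep (the final result ends up in
             A0 or Anext depending on the parity of iters) */
          float *tmp = A0; A0 = Anext; Anext = tmp;
      }
  }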

5.11. Two-Point Angular Correlation Function (TPACF)

The OpenACC implementation of the TPACF application reads all files with data points and transfers their content to the GPU before the main part of the application. In addition, the three main loops are merged using the collapse directive to achieve a higher level of parallelism.

6. Results

For our experiments, we used the eight-core 11th Gen Intel(R) Core(TM) i7-11700K with 128 GB RAM. This CPU has a unified 16 MB L3 cache shared by all sixteen contexts executing on the eight cores. Each core has a private 4 MB L2 cache. Finally, private 256 KB instruction L1i and 384 KB data L1d caches connect to each core’s load/store units. A single NVIDIA GeForce RTX 3080 Ti GPU with 12 GB RAM was used to run GPU applications. The codes were offloaded to the GPU using the OpenACC programming model. The reference OpenMP, OpenCL, and CUDA implementations from the original Parboil benchmark suite were used for comparison purposes. All experiments were carried out on the datasets originally provided in the Parboil benchmark suite.
The performance of both sequential and parallel programs is usually measured in terms of execution time, typically the wall-clock time of the observed code section. However, to allow for an easier comparison of results across platforms, performance is usually presented in the form of the speedup (S_p). The speedup is defined as the ratio of the time required by the sequential algorithm to solve a problem (T_s) to the time required by the parallel algorithm using p processors to solve the same problem (T_p), as in the following formula:
S_p = T_s / T_p
Each of the following charts depicts the speedup of the various implementations relative to the baseline CPU implementation from the Parboil benchmark suite. The baseline implementation is compiled using GCC 11.3.0, whereas all other implementations are compiled using the NVIDIA HPC SDK 22.1, which employs the high-performance PGCC compiler. In addition, all implementations except the baseline are compiled with the most aggressive optimizations. The execution times are intentionally omitted because they depend on the platform on which the applications were executed.

6.1. Breadth First Search (BFS)

The BFS application is tested with four different datasets. As can be seen from Figure 1, none of the programming models (OpenMP, CUDA, OpenCL, and OpenACC) provides a faster implementation than the baseline. This is a direct consequence of the nature of the BFS algorithm. Among these, the CUDA and OpenCL implementations, explained in [44], give the best results. The reason for this is the smaller number of GPU kernel invocations in the CUDA and OpenCL implementations compared to the OpenACC implementation. However, the fastest implementation is the sequential one compiled with the PGCC compiler, which is, again, a consequence of the nature of the BFS algorithm, with its irregular memory access pattern and high load imbalance.
It should be noted that efficient BFS implementations in CUDA require significant code and data structure restructuring. For example, dynamic parallelism and specific graph layouts were used in [48] to reduce overheads, and these are not available in directive-based programming models.

6.2. Cutoff-Limited Coulombic Potential (CUTCP)

The CUTCP application is tested with two different datasets. OPENACC_BASE represents the basic version of the algorithm explained in Section 5, whereas OPENACC_ATOM and OPENACC_LATTICE represent the implementations in which the atoms are distributed in the bins. The former iterates over the atom bins, whereas the latter iterates over all grid cells.
The best results were achieved with the OPENACC_BASE implementation. The OPENACC_ATOM implementation with bins that cover 4 units of space and store up to 16 atoms also gave good results; however, the speedup is smaller than the one achieved with the OPENACC_BASE implementation, as shown in Figure 2. Such results can be explained by the density of the atoms: if the atoms are densely packed, not all of them will be placed in the atom bins during geometric hashing. Therefore, there will be many overflow atoms that need to be processed on the CPU, and the speedup will be less than ideal.

6.3. Histogramming Operation (HISTO)

The original CUDA, OpenCL, and OpenMP implementations gave exceedingly poor results. Therefore, we replaced them with newer and better implementations. The CUDA and OpenCL implementations optimize the basic algorithm using atomic functions for the bin updates. The OpenMP implementation uses the idea presented in [46]. The same idea is used for the OPENACC_BINS implementation, whereas the OPENACC_BASE implementation optimizes the basic algorithm.
The HISTO application is tested using two different datasets. From Figure 3, we can see that the greatest increase in performance is achieved using the OpenCL programming model, even though it represents the most basic optimization of the original algorithm. It is interesting to note that the OpenACC programming model achieves better performance when using the original algorithm rather than the one presented in [46]. We concluded that the reason for this is a smaller memory footprint. Lastly, we noticed that the OpenMP implementation gave very poor results. Its execution time for the large dataset is much higher than the baseline execution time, which is why it is barely visible in Figure 3. We concluded that the reason for this is a significantly higher number of atomic updates, which reduces the overall throughput.

6.4. Lattice Boltzmann Method Simulation in Fluid Dynamics (LBM)

The LBM application is tested using two different datasets. As can be seen from Figure 4, the best performance is achieved using the OpenCL programming model. However, the OpenACC programming model gives exceedingly good results, with only a negligible performance decrease caused by the overhead of the programming model itself.
It should be noted that high speedups of GPU-based implementations of the LBM application have been reported in the open literature. For example, four different OpenCL implementations of the LBM application were tested on eight different commodity configurations, including CPUs and GPUs from major vendors, in [6]. Depending on the implementation and configuration, the authors observed speedups in the range of 80 to 900 times over the sequential implementation.

6.5. MRI Cartesian Gridding (Gridding)

The provided CUDA implementation performed even worse than the OpenMP implementation. The reason behind this performance drop is that the provided CUDA version was optimized for an outdated GPU architecture (Fermi). Therefore, a new implementation, which optimizes the original sequential code, was provided. As can be seen from Figure 5, OpenACC and CUDA provide similar speedups, with CUDA having an insignificant advantage because of the overhead generated by the OpenACC programming model.

6.6. MRI Non-Cartesian Q Matrix Calculation (MRI-Q)

In Figure 6, the last two bars, labeled OPENACC_INTERCHANGE and OPENACC_REDUCTION, represent the speedups of the MRI-Q implementations described in Section 5. OPENACC_REDUCTION does not give very good results because of the additional memory usage and the subsequent transfers. It can also be seen that the OpenCL and OPENACC_INTERCHANGE implementations give the best results. The slight performance advantage of the OpenCL implementation over OPENACC_INTERCHANGE is due to the overhead introduced by the OpenACC programming model. It should be noted that the OpenCL implementation significantly outperforms the CUDA implementation due to the loop unrolling optimization applied to the code.

6.7. Sum of Absolute Differences (SAD)

The SAD application is tested using two different datasets. In the case of the default dataset, OpenMP gives better results because it does not include data transfers from the CPU to the GPU, as presented in Figure 7. In the case of the large dataset, the CUDA implementation gives the best results; however, the OpenACC programming model is not far behind. As in the previous case, the overhead introduced by the programming model accounts for the drop in performance.

6.8. SGEMM

The application is tested using two different datasets. The bars labeled OPENACC_TILE in Figure 8 represent the speedup of the implementation with the tile and reduction directives, and the bars labeled OPENACC_CACHE represent the implementation with the cache directive. In the case of the small dataset, OpenMP gives the best results, as it does not include data transfers from the CPU to the GPU. In the case of the large dataset, the OPENACC_TILE version gives the best results.
However, results from the open literature, such as [25], show that hand-optimized CUDA versions of the SGEMM application can achieve almost 2× higher throughput, especially for larger matrix dimensions. The authors also note weaker scaling of both the OpenMP and OpenACC implementations as the input matrix size increases, which indicates that the compiler is not able to generate efficient code without particular knowledge of the specific GPU architecture.

6.9. Sparse Matrix-Dense Vector Multiplication (SpMV)

The SpMV application is tested using three different datasets. The results are shown in Figure 9. Loop unrolling gives very good results; however, the application still contains random memory accesses due to the usage of the JDS (Jagged Diagonal Storage) sparse matrix format. The CPU handles this kind of memory access pattern better, which is why the OpenMP implementation gives better results for the small and medium datasets. In the case of the large dataset, throughput plays a much bigger role and, for this reason, all implementations have similar results. CUDA and OpenCL give slightly better results because of the overhead introduced by OpenMP and OpenACC.

6.10. Stencil

The Stencil application is tested using two different datasets. The results are shown in Figure 10. In the case of the small dataset, the OpenMP implementation gives slightly better results than the CUDA, OpenCL, or OpenACC implementations because of the memory transfers to and from the GPU. However, in the case of the default dataset, CUDA, OpenCL, and OpenACC significantly outperform OpenMP because of the size of the dataset, with the OpenACC implementation giving slightly better results. It should be noted that the CUDA and OpenCL implementations are optimized to perform fewer memory accesses. However, the number of cores available in modern GPU hardware is large enough to achieve similar results without significant code modifications. Therefore, even though the OpenACC implementation is not optimized for better memory access, it gives similar results because of the large number of threads executed in parallel.

6.11. Two-Point Angular Correlation Function (TPACF)

This application is tested using three different datasets. As can be seen from Figure 11, the CUDA and OpenCL implementations have the greatest speedup. However, it should be noted that the real difference in the execution times of the CUDA, OpenCL, and OpenACC implementations is not as big as depicted in the figure. The speedup is calculated relative to the baseline execution time, and in the case of TPACF this time is significantly greater than the execution times of the CUDA, OpenCL, and OpenACC implementations. Therefore, a small difference in the execution times of these programming models produces a big difference in speedup.

7. Discussion

After an overall evaluation of the OpenACC programming model, we can conclude that it has a lot of merits. Low-level programming models like CUDA and OpenCL, which give access to many hardware-specific mechanisms, do indeed achieve the best performance in most cases. The differences between implementations using these programming models are insignificant, except in the cases where applications rely heavily on optimizations.
However, these programming models require a lot of knowledge about the underlying hardware and its features. In addition, the implementation itself can be very long and challenging, as in most cases it requires significant code restructuring. On the other hand, OpenACC hides all hardware specifics behind its directives and allows users to parallelize their code with minimal effort. However, there are a few things that need to be considered before using a programming model like OpenACC.
Firstly, like CUDA and OpenCL, OpenACC does not contribute much to applications with irregular control flow (many conditional jumps). As can be seen from the BFS application, the solution in these cases is to use a proficient compiler that better utilizes the underlying CPU features. Furthermore, these programming models also have a small impact on applications with irregular memory access patterns [44] or strided memory accesses [21]; a good example is the SpMV application. However, unlike the problem of irregular control flow, a large amount of data can hide this caveat.
Secondly, some of the applications require a certain degree of code modification. Compute-bound applications, like CUTCP and MRI-Q, only required simple loop interchanges. In the case of memory-bound applications, the modifications are more involved. For applications like LBM, the code needed to be modified to achieve a different memory access pattern and improve memory coalescing, similar to [27]. Applications like SAD and TPACF required a degree of code inlining; the aim of this inlining was to group more loops so that the collapse directive could be used to achieve a higher level of parallelism. A similar approach is used for sparse matrix multiplication in [10]. Finally, applications like SpMV, with an irregular memory access pattern, required loop unrolling.
Thirdly, some of the applications cannot be implemented without low-level CUDA features, or they benefit from them. For example, BFS cannot be correctly implemented using the atomic directives provided by the OpenACC programming model. To solve this problem, CUDA-style atomics were used, as we executed our code on an NVIDIA GPU. Furthermore, applications like HISTO benefit from the CUDA utility functions. Rather than entering a parallel region for memory initialization, we used the cudaMemset function, which is faster. However, this choice can have an impact on portability, since the availability of atomic operations depends on the platform itself. Low-level programming models share similar constructs under different names; thus, portability issues can be easily fixed in most cases. Moreover, lightweight portability layers now exist for major architectures, such as AMD HIP, which allow low-level programming functions to be used across different platforms, e.g., NVIDIA and AMD GPUs.
Finally, it should be noted that typical parallelization of an application via OpenACC requires significantly less programming time and code restructuring, as also noted in other studies in the field [12]. Also, it allows us to maintain a similar codebase with the sequential application, which is beneficial in the case of further application development. Moreover, as stated in [10], avoiding model-specific and hardware-specific features enables portability between different platforms.
Performance-wise, some of the applications, such as CUTCP, SGEMM, MRI-Q, and Stencil, show slightly better performance with the OpenACC implementations in this study. As pointed out in [12], such results can be attributed to the limited effort put into optimizing the native CUDA implementations, which in turn requires massive effort from experienced users.
Previous studies [25] also indicated that the performance gap between CUDA and OpenCL on one side and OpenMP and OpenACC on the other can increase with the complexity of the code and the size of the inputs. This is in accordance with the findings of this study, which show potential for further compiler optimizations. The maturity of different compilers in implementing directives and optimizations is also highlighted as a problem and a topic for future research in [11], especially for more complex, real-life applications.

8. Conclusions

In this paper, we evaluated the OpenACC programming model on the Parboil benchmark suite. Our aim was to evaluate OpenACC in terms of performance and applicability. All applications within this benchmark suite were implemented using this programming model and afterwards tested using the provided datasets. Our conclusion is that OpenACC provides similar, and in some cases even better, performance than traditional programming models like CUDA and OpenCL. For that reason, it can serve as a good starting point for the parallelization of scientific applications on the GPU. In addition, it does not demand deep knowledge of the underlying hardware features, since they are hidden by the OpenACC directives.
Further work includes expanding the benchmark suite with new applications, like the ones included in the Rodinia suite. The results from these applications can provide further insight and help us better understand the merits and limitations of the OpenACC programming model. Another interesting direction for research is a comprehensive comparison of high-level programming models, such as OpenACC, OpenMP, Kokkos, RAJA, and SYCL, using different compilers and architectures. The emphasis should be both on applicability to different types of problems and on compiler infrastructure and architectural support. In addition, the OpenACC programming model itself can be implemented and tested on other platforms and accelerators, such as AMD graphics processors and FPGA boards, in order to provide more portability and overcome some hardware-specific issues.

Author Contributions

Conceptualization, M.M.; methodology, M.M.; software, J.Đ.; validation, J.Đ. and M.M.; formal analysis, J.Đ.; investigation, J.Đ. and M.M.; resources, M.M.; data curation, J.Đ.; writing—original draft preparation, J.Đ. and M.M.; writing—review and editing, J.Đ. and M.M.; visualization, J.Đ.; supervision, M.M.; project administration, M.M.; funding acquisition, M.M. All authors have read and approved the final manuscript.

Funding

This work was partially financially supported by the Ministry of Science, Technological Development and Innovation of the Republic of Serbia under contract number: 451-03-47/2023-01/200103.

Data Availability Statement

All applications and datasets within the Parboil benchmark suite are available at the following link: http://impact.crhc.illinois.edu/parboil/parboil.aspx (accessed on 29 October 2023). All OpenACC implementations are available at the following link: https://github.com/jovan-djukic/parboil (accessed on 29 October 2023).

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
BFS	Breadth-First Search
BLAS	Basic Linear Algebra Subprograms
CFD	Computational Fluid Dynamics
CPU	Central Processing Unit
CUDA	Compute Unified Device Architecture
CUTCP	Cutoff-Limited Coulombic Potential
FPGA	Field Programmable Gate Array
GCC	GNU Compiler Collection
GPP	General Plasmon Pole
GPU	Graphics Processing Unit
HISTO	Histogramming
HPC	High Performance Computing
JDS	Jagged Diagonal Storage
LBM	Lattice-Boltzmann Method
MPI	Message Passing Interface
MRI-Gridding	MRI Cartesian Gridding
MRI-Q	MRI Non-Cartesian Q Matrix Calculation
OpenACC	Open Accelerators
OpenCL	Open Computing Language
OpenMP	Open Multi-Processing
PGI	Portland Group, Inc.
RAM	Random Access Memory
SAD	Sum of Absolute Differences
SGEMM	Dense Single-Precision General Matrix Multiplication
SpMV	Sparse Matrix-Dense Vector Multiplication
TPACF	Two-Point Angular Correlation Function

References

  1. Mišić, M.J.; Đurđević, Đ.M.; Tomašević, M.V. Evolution and trends in GPU computing. In Proceedings of the 2012 35th International Convention MIPRO, Opatija, Croatia, 21–25 May 2012; pp. 289–294. [Google Scholar]
  2. Navarro, C.A.; Hitschfeld-Kahler, N.; Mateu, L. A survey on parallel computing and its applications in data-parallel problems using GPU architectures. Commun. Comput. Phys. 2014, 15, 285–329. [Google Scholar] [CrossRef]
  3. Wang, H.; Peng, H.; Chang, Y.; Liang, D. A survey of GPU-based acceleration techniques in MRI reconstructions. Quant. Imaging Med. Surg. 2018, 8, 196. [Google Scholar] [CrossRef] [PubMed]
  4. Tran, H.N.; Cambria, E. A survey of graph processing on graphics processing units. J. Supercomput. 2018, 74, 2086–2115. [Google Scholar] [CrossRef]
  5. Świrydowicz, K.; Darve, E.; Jones, W.; Maack, J.; Regev, S.; Saunders, M.A.; Thomas, S.J.; Peleš, S. Linear solvers for power grid optimization problems: A review of GPU-accelerated linear solvers. Parallel Comput. 2022, 111, 102870. [Google Scholar] [CrossRef]
  6. Tekic, J.; Tekic, P.; Rackovic, M. Performance Comparison of Different OpenCL Implementations of LBM Simulation on Commodity Computer Hardware. Adv. Electr. Comput. Eng. 2022, 22, 69–76. [Google Scholar] [CrossRef]
  7. Mittal, S.; Vetter, J.S. A Survey of CPU-GPU Heterogeneous Computing Techniques. ACM Comput. Surv. 2015, 47. [Google Scholar] [CrossRef]
  8. Sun, Y.; Agostini, N.B.; Dong, S.; Kaeli, D. Summarizing CPU and GPU design trends with product data. arXiv 2019, arXiv:1911.11313. [Google Scholar]
  9. Yu, X.; Wang, H.; Feng, W.C.; Gong, H.; Cao, G. cuART: Fine-grained algebraic reconstruction technique for computed tomography images on GPUs. In Proceedings of the 2016 16th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid), Cartagena, Colombia, 16–19 May 2016; pp. 165–168. [Google Scholar] [CrossRef]
  10. Maris, P.; Yang, C.; Oryspayev, D.; Cook, B. Accelerating an iterative eigensolver for nuclear structure configuration interaction calculations on GPUs using OpenACC. J. Comput. Sci. 2022, 59, 101554. [Google Scholar] [CrossRef]
  11. Vergara Larrea, V.G.; Budiardja, R.D.; Gayatri, R.; Daley, C.; Hernandez, O.; Joubert, W. Experiences in porting mini-applications to OpenACC and OpenMP on heterogeneous systems. Concurr. Comput. Pract. Exp. 2020, 32, e5780. [Google Scholar] [CrossRef]
  12. Aldinucci, M.; Cesare, V.; Colonnelli, I.; Martinelli, A.R.; Mittone, G.; Cantalupo, B.; Cavazzoni, C.; Drocco, M. Practical parallelization of scientific applications with OpenMP, OpenACC and MPI. J. Parallel Distrib. Comput. 2021, 157, 13–29. [Google Scholar] [CrossRef]
  13. Eichstädt, J.; Vymazal, M.; Moxey, D.; Peiró, J. A comparison of the shared-memory parallel programming models OpenMP, OpenACC and Kokkos in the context of implicit solvers for high-order FEM. Comput. Phys. Commun. 2020, 255, 107245. [Google Scholar] [CrossRef]
  14. Stratton, J.A.; Rodrigues, C.; Sung, I.J.; Obeid, N.; Chang, L.W.; Anssari, N.; Liu, G.D.; Hwu, W.M.W. Parboil: A revised benchmark suite for scientific and commercial throughput computing. Cent. Reliab. High Perform. Comput. 2012, 127, 27. [Google Scholar]
  15. Searles, R.; Chandrasekaran, S.; Joubert, W.; Hernandez, O. MPI+ OpenACC: Accelerating radiation transport mini-application, minisweep, on heterogeneous systems. Comput. Phys. Commun. 2019, 236, 176–187. [Google Scholar] [CrossRef]
  16. Crozier, P.S.; Thornquist, H.K.; Numrich, R.W.; Williams, A.B.; Edwards, H.C.; Keiter, E.R.; Rajan, M.; Willenbring, J.M.; Doerfler, D.W.; Heroux, M.A. Improving Performance via Mini-Applications; Technical report; Sandia National Laboratories (SNL): Albuquerque, NM, USA; Livermore, CA, USA, 2009. [Google Scholar]
  17. Che, S.; Boyer, M.; Meng, J.; Tarjan, D.; Sheaffer, J.W.; Lee, S.H.; Skadron, K. Rodinia: A benchmark suite for heterogeneous computing. In Proceedings of the IEEE International Symposium on Workload Characterization (IISWC’09), Austin, TX, USA, 4–6 October 2009; pp. 44–54. [Google Scholar] [CrossRef]
  18. Araujo, G.; Griebler, D.; Rockenbach, D.A.; Danelutto, M.; Fernandes, L.G. NAS Parallel Benchmarks with CUDA and beyond. Softw. Pract. Exp. 2023, 53, 53–80. [Google Scholar] [CrossRef]
  19. Hoshino, T.; Maruyama, N.; Matsuoka, S.; Takaki, R. CUDA vs OpenACC: Performance case studies with kernel benchmarks and a memory-bound CFD application. In Proceedings of the 2013 13th IEEE/ACM International Symposium on Cluster, Cloud, and Grid Computing, Delft, The Netherlands, 13–16 May 2013; pp. 136–143. [Google Scholar] [CrossRef]
  20. Krommydas, K.; Scogland, T.R.; Feng, W.C. On the programmability and performance of heterogeneous platforms. In Proceedings of the 2013 International Conference on Parallel and Distributed Systems, Seoul, Republic of Korea, 15–18 December 2013; pp. 224–231. [Google Scholar] [CrossRef]
  21. Vincent, J.; Gong, J.; Karp, M.; Peplinski, A.; Jansson, N.; Podobas, A.; Jocksch, A.; Yao, J.; Hussain, F.; Markidis, S.; et al. Strong scaling of OpenACC enabled Nek5000 on several GPU based HPC systems. In Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region, Kobe, Japan, 12–14 January 2022; pp. 94–102. [Google Scholar] [CrossRef]
  22. Levesque, J.M.; Sankaran, R.; Grout, R. Hybridizing S3D into an exascale application using OpenACC: An approach for moving to multi-petaflops and beyond. In Proceedings of the SC’12: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, Washington, DC, USA, 10–16 November 2012; pp. 1–11. [Google Scholar] [CrossRef]
  23. Marowka, A. On the performance portability of OpenACC, OpenMP, Kokkos and RAJA. In Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region, Kobe, Japan, 12–14 January 2022; pp. 103–114. [Google Scholar] [CrossRef]
  24. Deakin, T.; McIntosh-Smith, S. Evaluating the performance of HPC-style SYCL applications. In Proceedings of the International Workshop on OpenCL, Bristol, UK, 10–12 May 2020; pp. 1–11. [Google Scholar]
  25. Khalilov, M.; Timoveev, A. Performance analysis of CUDA, OpenACC and OpenMP programming models on TESLA V100 GPU. J. Phys. Conf. Ser. 2021, 1740, 012056. [Google Scholar] [CrossRef]
  26. Gayatri, R.; Yang, C.; Kurth, T.; Deslippe, J. A case study for performance portability using OpenMP 4.5. In Proceedings of the Accelerator Programming Using Directives: 5th International Workshop, WACCPD 2018, Dallas, TX, USA, 11–17 November 2018; Springer: Berlin/Heidelberg, Germany, 2019; pp. 75–95. [Google Scholar] [CrossRef]
  27. Li, X.; Shih, P.C. Performance comparison of cuda and openacc based on optimizations. In Proceedings of the 2018 2nd High Performance Computing and Cluster Technologies Conference, Beijing, China, 22–24 June 2018; pp. 53–57. [Google Scholar] [CrossRef]
  28. Boehm, S.; Pophale, S.; Vergara Larrea, V.G.; Hernandez, O. Evaluating performance portability of accelerator programming models using SPEC ACCEL 1.2 benchmarks. In Proceedings of the High Performance Computing: ISC High Performance 2018 International Workshops, Frankfurt/Main, Germany, 28 June 2018; Revised Selected Papers 33. Springer: Berlin/Heidelberg, Germany, 2018; pp. 711–723. [Google Scholar] [CrossRef]
  29. Naderan-Tahan, M.; Eeckhout, L. Cactus: Top-down GPU-compute benchmarking using real-life applications. In Proceedings of the 2021 IEEE International Symposium on Workload Characterization (IISWC), Storrs, CT, USA, 7–9 November 2021; pp. 176–188. [Google Scholar] [CrossRef]
  30. Reyes, R.; López-Rodríguez, I.; Fumero, J.J.; De Sande, F. accULL: An OpenACC implementation with CUDA and OpenCL support. In Proceedings of the European Conference on Parallel Processing, Rhodes Islands, Greece, 27–31 August 2012; pp. 871–882. [Google Scholar]
  31. Tian, X.; Xu, R.; Chapman, B. OpenUH: Open Source OpenACC Compiler; University of Houston: Houston, TX, USA, 2014. [Google Scholar]
  32. Tabuchi, A.; Nakao, M.; Sato, M. A source-to-source OpenACC compiler for CUDA. In Proceedings of the Euro-Par 2013: Parallel Processing Workshops: BigDataCloud, DIHC, FedICI, HeteroPar, HiBB, LSDVE, MHPC, OMHI, PADABS, PROPER, Resilience, ROME, and UCHPC 2013, Aachen, Germany, 26–27 August 2013; Revised Selected Papers 19. Springer: Berlin, Germany, 2014; pp. 178–187. [Google Scholar]
  33. Denny, J.E.; Lee, S.; Vetter, J.S. Clacc: Translating openacc to openmp in clang. In Proceedings of the 2018 IEEE/ACM 5th Workshop on the LLVM Compiler Infrastructure in HPC (LLVM-HPC), Dallas, TX, USA, 12 November 2018; pp. 18–29. [Google Scholar]
  34. Barba, D.; Gonzalez-Escribano, A.; Llanos, D.R. TORMENT OpenACC2016: A benchmarking tool for OpenACC compilers. In Proceedings of the 2017 25th Euromicro International Conference on Parallel, Distributed and Network-based Processing (PDP), St. Petersburg, Russia, 6–8 March 2017; pp. 246–250. [Google Scholar]
  35. Jarmusch, A.; Liu, A.; Munley, C.; Horta, D.; Ravichandran, V.; Denny, J.; Friedline, K.; Chandrasekaran, S. Analysis of Validating and Verifying OpenACC Compilers 3.0 and Above. In Proceedings of the 2022 Workshop on Accelerator Programming Using Directives (WACCPD), Dallas, TX, USA, 13–18 November 2022; pp. 1–10. [Google Scholar]
  36. OpenACC-Standard.org. The OpenACC Application Programming Interface, Version 3.3. Available online: https://www.openacc.org/specification (accessed on 3 July 2023).
  37. Farber, R. Chapter 1-From serial to parallel programming using OpenACC. In Parallel Programming with OpenACC; Farber, R., Ed.; Morgan Kaufmann: Boston, MA, USA, 2017; pp. 1–28. [Google Scholar] [CrossRef]
  38. Lebacki, B.; Wolfe, M.; Miles, D. The PGI Fortran and C99 OpenACC Compilers. In Proceedings of the Cray User Group, Stuttgart, Germany, 29 April–3 May 2012; p. 42. [Google Scholar]
  39. Mišić, M.J.; Dašić, D.D.; Tomašević, M.V. An analysis of OpenACC programming model: Image processing algorithms as a case study. Telfor J. 2014, 6, 53–58. [Google Scholar] [CrossRef]
  40. Lashgar, A.; Baniasadi, A. Openacc cache directive: Opportunities and optimizations. In Proceedings of the 2016 Third Workshop on Accelerator Programming Using Directives (WACCPD), Salt Lake City, UT, USA, 14 November 2016; pp. 46–56. [Google Scholar] [CrossRef]
  41. Toledo, L.; Valero-Lara, P.; Vetter, J.S.; Peña, A.J. Towards Enhancing Coding Productivity for GPU Programming Using Static Graphs. Electronics 2022, 11, 1307. [Google Scholar] [CrossRef]
  42. Hardy, D.J.; Stone, J.E.; Vandivort, K.L.; Gohara, D.; Rodrigues, C.; Schulten, K. Chapter 4-Fast Molecular Electrostatics Algorithms on GPUs. In GPU Computing Gems Emerald Edition; Wen-Mei, W.H., Ed.; Applications of GPU Computing Series; Morgan Kaufmann: Boston, MA, USA, 2011; pp. 43–58. [Google Scholar] [CrossRef]
  43. Blackford, L.S.; Petitet, A.; Pozo, R.; Remington, K.; Whaley, R.C.; Demmel, J.; Dongarra, J.; Duff, I.; Hammarling, S.; Henry, G.; et al. An updated set of basic linear algebra subprograms (BLAS). ACM Trans. Math. Softw. 2002, 28, 135–151. [Google Scholar] [CrossRef]
  44. Luo, L.; Wong, M.; Hwu, W.M. An effective GPU implementation of breadth-first search. In Proceedings of the 47th Design Automation Conference, Anaheim, CA, USA, 13–18 July 2010; pp. 52–55. [Google Scholar] [CrossRef]
  45. Rodrigues, C.I.; Hardy, D.J.; Stone, J.E.; Schulten, K.; Hwu, W.M.W. GPU Acceleration of Cutoff Pair Potentials for Molecular Modeling Applications. In Proceedings of the 5th Conference on Computing Frontiers, Ischia, Italy, 5–7 May 2008; pp. 273–282. [Google Scholar] [CrossRef]
  46. Ikeda, K.; Ino, F.; Hagihara, K. An OpenACC Optimizer for Accelerating Histogram Computation on a GPU. In Proceedings of the 2016 24th Euromicro International Conference on Parallel, Distributed, and Network-Based Processing (PDP), Heraklion, Greece, 17–19 February 2016; pp. 468–477. [Google Scholar] [CrossRef]
  47. Volkov, V.; Demmel, J.W. Benchmarking GPUs to tune dense linear algebra. In Proceedings of the SC’08: The 2008 ACM/IEEE Conference on Supercomputing, Austin, TX, USA, 15–21 November 2008; pp. 1–11. [Google Scholar] [CrossRef]
  48. Tödling, D.; Winter, M.; Steinberger, M. Breadth-first search on dynamic graphs using dynamic parallelism on the gpu. In Proceedings of the 2019 IEEE High Performance Extreme Computing Conference (HPEC), Waltham, MA, USA, 24–26 September 2019; pp. 1–7. [Google Scholar] [CrossRef]
Figure 1. Speedup of BFS.
Figure 2. Speedup of CUTCP.
Figure 3. Speedup of HISTO.
Figure 4. Speedup of LBM.
Figure 5. Speedup of MRI Gridding.
Figure 6. Speedup of MRI-Q.
Figure 7. Speedup of SAD.
Figure 8. Speedup of SGEMM.
Figure 9. Speedup of SpMV.
Figure 10. Speedup of Stencil.
Figure 11. Speedup of TPACF.
Table 1. Parboil Benchmark Suite Applications.

Application | Short Description | Type
BFS | Breadth-First Search graph traversal algorithm | Memory bound
CUTCP | Cutoff-limited Coulombic Potential physics simulation | Compute bound
Histo | Histogramming operation | Memory bound
LBM | Lattice-Boltzmann Method simulation in fluid dynamics | Memory bound
MRI-Gridding | MRI Cartesian gridding, a step in magnetic resonance imaging | Memory bound
MRI-Q | MRI non-Cartesian Q matrix calculation used in image reconstruction | Compute bound
SAD | Sum of Absolute Differences, part of the H.264/AVC video encoder | Memory bound
SGEMM | Dense linear algebra single-precision general matrix multiplication | Memory bound
SpMV | Sparse Matrix-Dense Vector Multiplication for iterative solvers | Memory bound
Stencil | Iterative Jacobi PDE solver of the heat equation on a 3D structured grid | Memory bound
TPACF | Two-Point Angular Correlation Function calculation in astronomy | Memory bound
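
To make the directive-based approach concrete, the following is a minimal OpenACC sketch of a single-precision matrix multiplication in the spirit of the SGEMM kernel listed in Table 1. This is an illustrative example under stated assumptions, not the Parboil source: the matrix size, data clauses, and compiler invocation are choices made for the sketch.

```c
/* Minimal OpenACC sketch of single-precision matrix multiplication
 * (C = alpha*A*B + beta*C), loosely following the SGEMM kernel in Table 1.
 * Illustrative only, not the Parboil implementation; matrix size and
 * pragma clauses are assumptions.
 * Example compilation (NVIDIA HPC SDK): nvc -acc -O3 sgemm_acc.c -o sgemm_acc
 */
#include <stdio.h>
#include <stdlib.h>

#define N 512

static void sgemm(const float *A, const float *B, float *C,
                  float alpha, float beta, int n)
{
    /* Offload the loop nest to the GPU; the two outer loops are collapsed
     * and mapped to gangs/vectors, and the arrays are copied to and from
     * the device by the data clauses. */
    #pragma acc parallel loop collapse(2) \
        copyin(A[0:n*n], B[0:n*n]) copy(C[0:n*n])
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < n; ++j) {
            float sum = 0.0f;
            /* Inner dot product expressed as a reduction. */
            #pragma acc loop reduction(+:sum)
            for (int k = 0; k < n; ++k)
                sum += A[i*n + k] * B[k*n + j];
            C[i*n + j] = alpha * sum + beta * C[i*n + j];
        }
    }
}

int main(void)
{
    float *A = malloc(N * N * sizeof(float));
    float *B = malloc(N * N * sizeof(float));
    float *C = malloc(N * N * sizeof(float));
    for (int i = 0; i < N * N; ++i) { A[i] = 1.0f; B[i] = 2.0f; C[i] = 0.0f; }

    sgemm(A, B, C, 1.0f, 0.0f, N);

    /* With these inputs every element should equal 2*N. */
    printf("C[0] = %f (expected %f)\n", C[0], 2.0f * N);

    free(A); free(B); free(C);
    return 0;
}
```

The same annotated loop compiles and runs as ordinary sequential C when OpenACC support is disabled, which is the low-effort, incremental porting path the evaluation above contrasts with hand-written CUDA kernels.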