1. Introduction
In computational fluids dynamics (CFD) analysis, efficient computation still remains a challenge in long-term, large-scale, and high-resolution simulations. Many methodologies/techniques have been developed to alleviate computing efficiency problems, such as the development of efficient numerical schemes [
1,
2]; model surrogating, i.e., one-dimensional (1D) models surrogating two-dimensional (2D) models [
3,
4] and data-driven models with machine learning techniques surrogating conventional physics-based models [
5,
6]; model couplings, i.e., 1D and 2D model coupling [
7,
8] and 2D and three-dimensional (3D) model coupling [
9]; multi-block algorithm [
10]; and parallel computing [
11,
12,
13,
14,
15,
16,
17,
18,
19,
20].
Since CFD models involve solving a set of highly non-linear partial differential equations (PDEs), their computing efficiency depends largely on numerical solvers, which are designed for different computational mesh systems. Structured meshes and unstructured meshes have been widely used in CFD analysis. For examples, the CCHE2D model [
21] and the Delft 3D model [
22] are based on the structured mesh system, while the MIKE 21 flow model [
23], the HEC-RAS 2D model [
8], the SRH-2D model [
24], and the CCHE2D-hybrid model [
25] use the unstructured system. One big advantage of the structured mesh system is that there are fast numerical solvers available, despite its generation difficulties in complex geometries being more than that of the unstructured mesh system. Among those most widely used solvers, two implicit solvers stand out: the ADI (Alternating Direction Implicit) solver for the tri-diagonal matrix system [
26,
27] and the SIP (Strong Implicit Procedure) [
28] solver for the penta-diagonal matrix system.
The ADI solver is simple and easy to implement, although its implicitness and dependency are not as strong as the SIP solver, and thus its convergence is slower. In fact, the strong dependency of the SIP solver makes it unfriendly to parallelization. In parallel computing, implicit solvers are more difficult to parallelize than explicit solvers because of their strong data dependency. For the ADI solver, based on the Parallel Cyclic Reduction (PCR) algorithm proposed by Hockeny and Jesshope [
29], it has been successfully parallelized and applied to flow simulations in CFD analyses [
16,
17,
30] and heat conduction simulations [
18,
31]. Compared to the ADI solver, the SIP solver is fully implicit and thus computationally more stable. In general, the SIP solver can use a larger time step than the ADI solver. Due to the incomplete LU decomposition, the SIP solver converged much faster than the ADI solver as well.
For the SIP solver, attempts have been made for parallelization. For example, Reeve et al. [
32] attempted to parallelize the SIP solver on multi-CPUs (Central Processing Units) using the MPI (Message Passing Interface) framework. Based on a dependency analysis, they proposed two parallel methods, namely, the Wave Front (WF) method and the Red-Black Checkerboard (RBC) method. Deserno et al. [
33] also analyzed the effects of nodal numbering on the implementation of the SIP solver on a high performance computing (HPC) platform with shared memory multi-processors in detail and proposed a diagonal process strategy, essentially the same as the WF method. Igounet et al. [
34] implemented two variants of the SIP method on the GPU by exploring the possibilities of using shared memory. According to the implementation of the RBC method for the SIP solver for flow simulations, Zhang and Jia [
16] recognized that the RBC was just approximations of the SIP algorithm, so that it can be used only for simple problems. Dufrechou et al. [
35] designed and implemented a self-scheduling procedure to avoid the overhead of CPU–GPU synchronization implied by the so-called hyper-planes strategy (another version of the WF method).
In this study, the parallel ADI solver (P-ADI) based on the PCR algorithm and the parallel SIP (P-SIP) solver based on the WF method were implemented on the GPU with the CUDA Fortran programming technique [
36] and then integrated into a 2D hydrodynamic model, the CCHE2D (Center of Computational Hydroscience and Engineering) model. This study proposed a new mapping strategy for the P-SIP solver, distinguished from the conventional WF method; that is, the matrix coefficients and the solution variables are defined on the WF mesh, and the computation is performed on the WF mesh as well. Two mapping methods of the P-ADI solver and two mapping methods of the P-SIP solvers are validated and compared by examples and applications. It demonstrated that both parallel solvers are capable of significantly improving computing efficiency in complex geometries, such as the flow simulations in the Arkansas River pool 7 with many dikes for navigations, the tidal flow simulations in Lake Pontchartrain, and the sediment transport simulations with wetting-and-drying process in the Vistula River with dikes and the meandering East Fork River.
3. CUDA Programming
Parallel computing is an important technique to improve computing efficiency use hardware. Since NVIDIA released the General-Purpose computing Graphic Process Unit (GPGPU) with the CUDA (Computer Unified Device Architecture) framework in 2006, CUDA programming on the GPU has been successfully applied in many disciplines, including CFD analysis [
16,
17,
19,
20,
30,
39].
In NVIDIVA’s CUDA architecture (NVIDIA, [
40]), the computing units on the GPU are organized in a hierarchical way as block threads (
Figure 2). Each thread in blocks that are laid out in a grid would process data in parallel. Basically, the CUDA programming can be divided into three simple steps: (1) copy data from the host memory (CPU) to the device memory (GPU); (2) the host (CPU) calls kernels that perform parallel computing in the device memory (GPU); and (3) copy data from the device memory (GPU) back to the host memory (CPU).
One of the keys for successful CUDA programming is how to map the problem onto the block-threads grid. For a 2D structured mesh (Imax × Jmax), an intuitive and straightforward mapping is to use Imax blocks with Jmax threads or Jmax blocks with Imax threads, which is equivalent to a 2D loop. However, for implicit numerical solvers, such as the ADI solver and the SIP solver, the mapping would become more complicated.
4. Parallel ADI Solver
When using the ADI method to solve Equation (9), a time level is halved so that in each half, one spatial direction is solved implicitly with the other direction explicitly.
where
n denotes the time level, and
i and
j denote two directions in the computational space.
Equation (10) is a tri-diagonal matrix system, which conventionally is solved by the Thomas algorithm for Tri-diagonal Matrix (TDMA) [
26]. Rewriting Equation (10) in the form of [
A] {
}
= {
d}, one will have
where
d is the source term matrix, and
A is the
Imax ×
Jmax coefficient matrix as follows:
Equation (11) will be used to illustrate the ADI method hereafter in this study.
4.1. TDMA
The TDMA has only two steps. The forward elimination step eliminates the lower diagonal:
where the superscript “new” denotes the updated new values.
The backward substitution step is to obtain the solution:
4.2. Cyclic Reduction
Hockeny [
27] proposed an alternative algorithm to solve the tri-diagonal system, the Cyclic Reduction (CR) algorithm. Similar to the TDMA, there are also two steps in the CR. In the forward reduction step, odd-numbered rows are successively eliminated by reducing a system to a smaller system with a half number of unknowns until a system of two unknowns is formed. For even-numbered row
i, a new equation can be obtained by row operations on
i − 1,
i and
i + 1; thus, one can obtain
where
and
.
The backward substitution step successively updates the other half of the unknowns using previously known values:
As can be seen, compared to the TDMA, the CR has more computations in the forward reduction steps; thus, it is computationally slower. However, it is much parallel-friendlier in the sense that the TDMA demonstrates stronger dependency than the CR. In the TDMA, the new values of c and d at i depend on their adjacent new values at i – 1, while in the CR, the new values of a, b, c, and d are only related to their previous old values.
4.3. Parallel Cyclic Reduction
Hockeny and Jesshope [
29] extended the CR to the Parallel Cyclic Reduction (PCR) due to its parallel-friendly characteristics. The PCR has only the forward reduction step since it halves the system recursively into smaller systems.
Considering that in
i direction, the coefficient matrix
A is
Jmax ×
Imax, while in
j direction, the matrix
A will be
Imax ×
Jmax, Zhang et al. [
31] and Heng [
41] adopted a simple mapping that each block solves one tri-diagonal system and each thread calculates one node, which is denoted by the one-block mapping (OBM) in this study (
Figure 3). Obviously, the OBM is limited by the max number of threads in one block, and the size of the tri-diagonal system to be solved must be less than the square and the cubic of the max thread number in 2D and 3D, respectively.
When using the OBM mapping, in i direction the number of blocks is Jmax, the number of threads is Imax, and the required shared memory will be 4 × Imax × 4 (bytes) for single-precision computation and 4 × Imax × 8 (bytes) for double-precision computation.
To resolve the size limitation problems, Zhang and Jia [
16,
17] proposed the improved OBM method by allowing the calculations of multiple nodes in one thread, denoted by one-block mapping with iterations (OBM-it). In the OBM-it method, each block is solving one tri-diagonal system, but each thread is calculating multiple nodes. In this method, a partition strategy was proposed as follows:
where
mi and
mj are the number of partitions in
i and
j directions and also denote the number of nodes calculated in each thread, and
Ip (
max thread number) and
Jp (
max thread number) denote the number of nodes in each partition in
i and
j directions, respectively.
Figure 4 shows the mapping of the OBM-it method. In
i direction, Thread_Per_Block × Block_Per_Grid =
Jmax ×
, while in
j direction, Block_Per_Grid × Thread_Per_Block =
Imax ×
. When using OBM-it method, in
i direction the number of blocks is
mj ×
Jp, but the number of threads is
Ip, and the required shared memory will be 4 ×
Ip ×
mi × 4 (bytes) for single-precision computation and 4 ×
Ip ×
mi × 8 (bytes) for double-precision computation.
Wei at al. [
18] proposed the multi-block mapping (MBM) strategy that the whole system is mapped to multiple blocks (
mi ×
Ip × mj ×
Jp). When using the MBM method, in
i direction, the number of threads is
Ip, but the number of blocks is (
mi,
mj ×
Jp), and the required shared memory will be 4 ×
Ip × 4 (bytes) for single-precision computation and 4 ×
Ip × 8 (bytes) for double-precision computation. Compared to the OBM-it method, in the MBM method, the partitions (Equation (16)) are mapped onto the blocks and threads simultaneously [
9].
Figure 5 shows the MBM. In
i direction, Thread_Per_Block × Block_Per_Grid =
×
, while in
j direction, Block_Per_Grid × Thread_Per_Block =
×
.
4.4. Relaxation
In practice, in order to improve computation stability and enhance convergences, Zhang and Jia [
16] proposed an iteration process with relaxation (
) for the PCR, in which the coefficients
a,
b, and
c are kept unchanged, but the source term
d is updated with the new solution.
The OBM-it and MBM of the P-ADI solver are illustrated using pseudo Fortran code in
Table 1 and
Table 2 as follows. Note the different calculations of the block number and thread number in these two mappings.
5. Parallel SIP Solver
5.1. SIP Solver for Penta-Diagonal System
There are four steps in the SIP algorithm, which are simply listed as follows:
The first step LU decomposition factorizes matrix
A using
A = LU so that
where
r is a coefficient in the range of [0, 1] for fast convergence.
For the second step, the residual is calculated by
In the forward elimination step, the
V is computed as
and the increment of the solution
can then be computed using
.
The backward substitution step calculates the increment of the solution by
so that the solution can be updated via
.
Finally, check convergence if is small enough. If not, go to the second step.
As can be seen, the SIP algorithm demonstrates strong sequential characteristics. In the
LU decomposition step, Equation (17) defines the
L and
U matrix using the coefficient of
A in a recursive way; that is,
Lw,
LS, and
LP at (
i,
j) rely on
UN and
UE at (
i − 1,
j) and (
i,
j − 1), respectively, and new values of
UN and
UE at (
i,
j) require not only
Lw,
LS, and
LP at (
i,
j) but also their adjacent new values at (
i,
j − 1) and (
i − 1,
j). Similarly, in the forward substitution step described by Equation (19), the new value of
V at (
i,
j) depends on its adjacent values at (
i − 1,
j) and (
i,
j − 1). This dependency is indicated in
Figure 6a from the lower index to the higher index values. In the backward substitution step, on the contrary, the increment calculation described in Equation (20) should start from the opposite corner as shown in
Figure 6b.
5.2. Wave Front Method
As demonstrated by
Figure 6, the diagonal dashed lines can be described as the wave front lines, propagating from the left bottom corner to the right top corner for the LU decomposition step and the forward elimination step of the SIP algorithm but in the opposite direction for the backward substitution step. In other words, along the propagating direction, the mesh nodes on one diagonal line have dependency only on the diagonal line behind.
This observation inspired a general parallel strategy for the SIP solver as follows: the parallel computing can be carried out along the wave fronts (WF) line by line, which is exactly the core concept of the WF method proposed by Reeve et al. [
32] and the diagonal process strategy by Deserno et al. [
33].
For the WF method, the first step is to identify the WF lines. As shown in
Figure 7, for a mesh with the dimension of
Imax ×
Jmax, the number of the WF lines
Nw is
Imax +
Jmax − 1. Along one WF line, there are at least one mesh node and at most
Jmax nodes including the boundary nodes. Excluding the boundary nodes, a WF line with at least one internal or non-boundary mesh node is defined as an effective WF line. Therefore, there are
Imax +
Jmax − 3 effective WF lines, which will be computed line by line on the GPU. As identified by Deserno et al. [
33], along each WF line, the nodal index (
i,
j) satisfies the following condition: (
i +
j − 1) = current WF line number (constant).
Straightforwardly, for one WF line, four kernels can be designed for the LU decomposition step, the residual calculation step, the forward elimination step, and the backward substitution step. For any variable
(=
u,
v,
η,
C,
qb) in Equations (1)–(9), the parallel SIP solver can be used in the following way, demonstrated by the pseudo Fortran code in
Table 3. As can be seen, three kernels are called in an iterative way by going through all wave front lines, and each kernel is called line by line (number of WF lines) by the host (CPU), which may induce a CPU–GPU overhead problem.
To alleviate the above CPU–GPU overheads, the kernels should be designed for all WF lines instead of one WF line on the GPU and then be called only once on CPUs (
Table 4). However, due to the strong dependency of the SIP solver, the calculation for all nodes must follow a fixed consecutive order of the WF lines. Moreover, according to the CUDA programming guide, the order of blocks within a grid and warps within a block are undefined. All these concerns raise a question of mapping strategy for the SIP solver: how to carry out the calculation with a fixed order on the block-thread framework of the GPU. Dufrechou et al. [
35] identified this problem and proposed a synchronization method to alleviate it. In their method, a global array was defined to record the block number for each WF line. In the main kernel, before computing the next WF line, each block must wait actively for the block number of that global array to become zero. That global array will be decreased by each thread block with a block index lower than the current WF line.
5.3. Wave Front Mesh
To fulfill the parallel strategy of the WF method, a mapping needs to be established between the nodes on WF lines and the original computational mesh nodes. Therefore, an auxiliary WF mesh is constructed, through which the computational nodes can be mapped onto the hierarchical block-threads framework of the GPU.
The WF mesh is structured with the size of NW × Wp. Each WF line has Np nodes including the computational nodes and the virtual nodes outside of domain. Then this WF mesh is mapped onto the block-thread framework on the GPU as follows: B blocks per grid and Wp threads per block on GPU. All WF lines are nearly equally distributed to each block, so the number of wave lines in each block is Dw = NW/B. To enhance coalescing in memory accessing, the number of nodes along each WF line is adjusted by the warp size (32) so that Wp = 16 × (Wp − 1)/16. In such a case, in the WF mesh there may be more virtual nodes along each WF line. When Wp is greater than the maximum thread number, iterations will be applied for each thread so that more than one node will be calculated in one thread.
Figure 8 shows two WF meshes: one with effective WF lines only (denoted by WF-e) and the other with all WF lines (denoted by WF-all), which correspond to two different mapping strategies.
In addition to the WF-e mesh being smaller in size than the WF-all mesh, the key difference between the WF-e and the WF-all lies in on which mesh the matrix coefficients and solution variable are defined. In the WF-e, the matrix coefficients and the solution variable are defined on the original computational mesh; the mapping between the WF mesh and the original mesh is limited only for the effective WF lines and needed dynamically through an index converter (an array stores the nodal number of the original mesh) in each thread. In the WF-e, the computation is essentially on the original mesh, which is mapped onto block threads using the WF mesh. The index mapping between the original mesh and the WF mesh in each thread hinders fast data retrieving and storing.
In contrast, in the WF-all, the matrix coefficients and the solution variable are defined on the WF mesh instead, and the mapping between the WF mesh and the original mesh is conducted before the SIP computation. The computation is carried out on the WF at each thread, and after computations, the solution needs to be mapped back onto the original mesh.
Figure 9 shows the mapping between the WF mesh and the original mesh. In a nine-node stencil, as can be seen, the east (
I + 1,
j) and west (
i − 1,
j) nodes on two meshes overlap, but the north node of the original mesh (
i,
j + 1) corresponds to the northeast node of the WF mesh (
iw + 1,
jw + 1), and the south node of the original mesh (
i,
j − 1) corresponds to the southwest node of the WF mesh (
iw − 1,
jw − 1).
Since the computation is carried out on the WF mesh, the SIP algorithm for the original mesh (Equations (17)–(20)) needs modification. Following the mapping in
Figure 9, the SIP algorithm is modified for the WF mesh as follows:
Note the index difference between Equations (17), (19) and (20) and their counterparts, Equations (21)–(23).
For completeness, two mappings of the P-SIP solver, including the kernel for the LU decomposition step (the kernels for the forward elimination step and the backward substitution step are similar), are listed in
Table 5 and
Table 6 using pseudo Fortran code. As can be seen, in the kernel the WF-e needs dynamic mapping from the WF mesh to the OM, and there is one more kernel used to map the solution from WF mesh back to OM in WF-all. This kernel can also be combined into the backward calculation so that the solution mapping can be conducted simultaneously in the backward substitution step.
6. Implementation
In the serial version of the CCHE2D model, most of the variables were defined as 2D or 3D arrays, which are conceptually clear and simple. However, in parallelization, to allow faster access of memories on devices (GPUs), one-dimensional arrays in the device memory were used to store the global variables in order to reduce the amount of memory transfer to a minimum. Specifically, for most of the flow variables defined by 2D arrays (Imax, Jmax), a 1D index is used to map a 2D index (i, j); for the fractional sediment variables defined in 2D arrays (Np, NSize) with Np = Imax × Jmax, a 1D index is used; for the layered bed-composition variables defined in (Np, Nlayer), a 1D index is used.
The CCHE2D model consists of many modules which make it capable of simulating flows of all regimes and sediment transports. It uses a time-marching solution procedure. All modules within the time-marching loop were parallelized. Basically, the parallel computing of the CCHE2D model involves three steps, as mentioned previously: (1) in the “initialization” step, all host (CPU) data and device (GPU) data were declared, initialized, and loaded with input data, and data transfers between host and device were carried out as well. For the P-SIP solver, a mapping between the original mesh and the WF mesh is established. For WF-e, a global array is used to store the index of the original mesh along each WF line (WF → OM), while for WF-all, a global array is used to store the index of the WF mesh for each original mesh node (OM → WF) in order to define matrix coefficients and solution variables on the WF mesh; (2) in the time-marching loop, the host calls kernels to perform computations on the GPU using the device data; and (3) after the time-marching loop is finished, the device data are transferred and copied to the host.
In the implementations, the double-precision floating point data type was used. All the testing examples and application were conducted on a PC with Intel Core i7 (2.93 GHz), 14GB Memory, and a Tesla 2070 GPU card with 48k shared memory. The PGI CUDA-Fortran developed by the Portland Group [
36] was used to parallelize the CCHE2D model. Note that since the Portland Group was purchased by NIVIDIA, the PGI CUDA-Fortran used in the current parallelization is fixed on PGI Visual Fortran 12.1 with the CUDA Toolkit 4.2, which is way behind the latest CUDA Toolkit 12.3. Therefore, this parallel implementation cannot use the latest advanced CUDA capabilities, which may affect the whole performance of the model.
7. Example and Application
7.1. 2D Heat Diffusion
The heat-transfer problem was first used to test the P-ADI solver and the P-SIP solver. In a rectangle domain, the heat diffusion equation as follows is solved:
where
T is temperature, ρ is the density of conduction media,
c is specific heat, and
k is the diffusion coefficient.
The test domain was discretized into 1000 × 1000 nodes, and there are 1999 WF lines including 1995 effective WF lines and 1008 warp-size-adjusted nodes along each WF line. Seven solvers with different configurations, namely, ADI-TDMA on the CPU, SIP for original mesh (SIP-OM) on the CPU, SIP using WF lines (SIP-WF) on the CPU, P-ADI OBM-it on the GPU, P-ADI MBM on the GPU, P-SIP WF-e on the GPU, and P-SIP WF-all on the GPU were compared.
Table 7 lists the CPU running time for these seven configurations, where “s” denotes second, and “m” denotes minute.
As can be seen, on the CPU, the ADI with TDMA is faster than the two SIP configurations, and the SIP-WF using the WF lines is the slowest, which demonstrates how much the mapping between the original mesh and the WF mesh affects the SIP computations. The SIP for the original mesh requires Imax × Jmax iterations for each SIP step, while the WF version needs Jmax × (Imax + Jmax − 5) iterations. When Imax = Jmax, i.e., the current case, the WF version would need almost double the iterations. In addition, when following the WF lines, the dynamic mapping in each iteration step affects the memory access speed significantly.
On the GPU, correspondingly, two P-ADI configurations are faster than two P-SIP configurations. The WF-all is 3.67 times faster than the WF-e and 5.42 times faster than the SIP-OM; the speedup ratio of WF-e is only 1.48 compared to SIP-OM, while the speedup ratio for the P-ADI OBM-it and the P-ADI BMB is 7.89 and 6.96, respectively. For the P-ADI, Wei et al. [
18] reported that the MBM could achieve up to about 70 times speedup when using the single-precision floating point data type.
In this heat-diffusion problem, with the same time step, the ADI solver is faster than the SIP solver due to its lighter computing loads. Computationally, the TDMA has only two steps (two loops in computation) and the SIP has four steps (four loops in computation) with inner iterations to check the residual error. Therefore, the CPU version of ADI is about two times (1.81) faster than that of the SIP.
Similarly, the P-ADI solvers only has two kernels, but the P-SIP solver has four kernels with inner iterations, which partially explains why the P-ADI demonstrated higher efficiency than the P-SIP. With the matrix coefficients and the solution variable defined on the WF mesh, the WF-all performed at a much higher computing speed than the WF-e with the dynamic mapping between the original mesh and the WF mesh in each thread.
7.2. Flow Simulation
The P-ADI and the P-SIP on the GPU were integrated into theCCHE2D model and applied to two flow simulation cases: steady flow in the Arkansas River pool 7 and the unsteady flow in Lake Pontchartrain.
Table 8 and
Table 9 list the CPU running time for the model using both solvers with different configurations, respectively. Note that the listed CPU running times were for the comprehensive performances of the whole model with many other parallel implementations in the entire time marching loop, not purely for the solvers themselves. According to the CPU time comparisons, the P-ADI solver demonstrated higher computing efficiency than the P-SIP solver in both cases, especially for the WF-e mapping method.
7.2.1. Steady Flow Simulation in Arkansas River Pool 7
In the Arkansas River pool 7 (
Figure 10), there are more than 30 dikes serving to improve navigation conditions, which make the flow field complicated. Velocity profiles at six cross sections from mile 147 to mile 141.4, measured by USGS, were selected for validating the simulation flow fields.
Figure 11 compares those simulated velocity profiles by the CPU-serial model with the SIP solver and the GPU-parallel model with the P-ADI and P-SIP solvers to the measurements. Note that two configurations for the P-SIP solver yielded almost the same result, so they are denoted by GPU-SIP in this figure. Similarly, the results from the P-ADI solver are denoted by GPU-ADI. These notations will be applied to other cases hereafter.
Table 10 lists the L
2 error norm defined by
where
N denotes the numerical result and
M denotes the measured one.
As can be seen, all the simulated profiles generally agreed with the measured ones. However, due to the dikes, the flow field is very complicated, which makes it difficult to accurately validate those velocity profiles. Although identical results were obtained, discrepancies between the CPU model and the parallel GPU model can be observed as well, which is expected and mainly due to (1) the different computing capabilities of the CPU and GPU, and (2) the differences between the parallelization coding and the sequential coding when implementing those numerical solvers.
The -ADI solver presented a little oscillation for Y velocity at mile 142.7, which may lie in its alternating implicit characteristics so that it is not as robust as the full implicit P-SIP solver.
7.2.2. Unsteady Flow Simulation in Lake Pontchartrain
Lake Pontchartrain (
Figure 12) is located in southeastern Louisiana and serves not only as a brackish estuary connected to the Gulf of Mexico, Lake Borgne, and Lake Maurepas but also a flood release area for protecting the city of New Orleans.
As shown in
Figure 12, the tidal flow goes through Rigolets and Chef Menteur into the lake. The simulation results for Lake Pontchartrain on CPU and GPU are validated by using (1) the measured water surface elevation at Mandeville and Westend, and (2) the measured velocity at South Lake.
Figure 13 shows the time series profile of the water surface elevation at two different gauging stations: the Mandeville station at the north and the Westend station at the south. All simulated water level profiles captured well the general trend under tidal flow boundary conditions. The P-SIP solver yielded closer results to the CPU results than the P-ADI solver because of the same solver.
As for the velocity components at South Lake station (
Figure 14), generally speaking the basic trend of the measurements was well captured by all simulated velocity profiles, despite their much more frequent fluctuations over time than the water level. Similarly, the P-SIP solver yielded closer results to the CPU results than the P-ADI solver, as observed.
7.3. Sediment Transport Simulations
7.3.1. Steady Sediment Transport in Vistula River
These two parallel solvers (P-ADI and P-SIP) were further implemented in the sediment transport module of CCHE2D-SED. The steady sediment transport simulation in Vistula River and the unsteady sediment transport simulation in East Fork River were used to validate and test these two parallel solvers.
Vistula River is the largest river of Poland and the drainage basin of the Baltic Sea. As shown in
Figure 15, there are 30 dikes in the study reach. In the sediment transport simulation in Vistula River (
Figure 15), there are seven size classes (
Table 11). A constant flow discharge of 570 m
3/s, a suspended-load concentration of 11.4 kg/m
3, and a bed-load transport rate of 0.032 kg/m/s are imposed at the inlet, while at the outlet a constant water surface elevation 10.868 m is maintained. The total simulation time is 10,000 s, and the time step is 2 s.
Figure 16 compares the sediment fields: distributions of bed-load transport rate, suspended-load concentration, and bed change between the serial version on the CPU and the parallel version on the GPU. Identical results from both the CPU and GPU were obtained, but differences due to different computing capabilities of CPU and GPU were also observed.
Table 12 compares the computing efficiency of the P-ADI solver and the P-SIP solver for the steady sediment transport simulation in Vistula River. Again, note that the CPU running time listed in
Table 12 includes all other parallel implementations in the entire time-marching loop. As can be seen, the P-ADI solver is about two times faster than the P-SIP solver with WF-e mapping.
7.3.2. Unsteady Sediment Transport in East Fork River
The unsteady sediment transport simulation in a 3.2 km reach of the East Fork River (
Figure 17) located in Wyoming State selected the simulation period of 2–19 June 1979.
Figure 18 shows the corresponding hydrograph and the stage graph at the inlet and outlet.
In this case, there are nine size classes of sediment mixtures, ranging from 0.088 mm (fine sand) to 32 mm (coarse gravel). Bed sample information was obtained from the USGS technical report [
42] for the selected reach. In this simulation, the total simulation time is 18 days, and the time step is 2 s
Figure 19 compares the simulated sediment flux at the bed trap of the outlet with the measurements. The general trend was well captured by all solvers, but the first peak was over-predicted, the second peak was under-predicted, and again oscillations occurred in the P-ADI solvers. For this unsteady sediment transport simulation, more discrepancies were observed between different solvers. The reason may lie in the complexity of unsteady multi-sized sediment transport that is highly non-linear and depends not only on the solutions of those transport equations but also on empirical/semi-empirical formulas, which tends to create more uncertainties [
25].
Table 13 lists the CPU time used for all solvers. Similar to the Vistula case, the P-ADI has better performances than the P-SIP solver in this unsteady sediment transport simulation as well.
As can be seen, in general, sediment transport simulations are more complicated with more computations than flow simulations. Correspondingly, the speedup ratio decreases for both parallel solvers.