Dynamic load balancing generally improves the simulation performance of a distributed system from several perspectives, as stated before. However, partition optimization and the associated particle relocation may require significant computation and communication time. In addition, different material behaviors may lead to divergent partition results when different workload elements are used. Therefore, in this section we conduct several experiments with multiple material behaviors to evaluate the proposed load balancing algorithms. The results and discussions can help users find the best choice for their simulation objectives.
7.2.1 Sand Injection.
In this experiment, we focus on comparing the behavior of the
static,
dynamic grid, and
DPP methods and analyzing how partitioning frequency influences the performance. As displayed in Figure
8, we design a 4-MPI-rank
sand injection scene, where each rank injects sand from two sourcing points with random velocities pointing toward a shelf (collision object) at the domain center. Throughout the simulation, the sand occasionally splashes and eventually settles, leading to a dynamically varying workload distribution.
Dynamic Partitioning vs. Static Partitioning. For a thorough analysis, we illustrate the detailed timing on all four ranks for
static,
dynamic grid, and
dynamic particle partitions in Figure
14. In addition to the timing of each separate kernel, we also show waiting time, which refers to the duration when faster ranks finish computation/communication and wait for other ranks. We show the data with dynamic partitions performed every 50 steps without loss of generality. In the first row of Figure
14, static partitioning pushes more and more work to the lower ranks (ranks 0-0-0 and 1-0-0) as the sand particles fall to the ground. The upper ranks (ranks 0-1-0 and 1-1-0), on the other hand, contain fewer and fewer particles and thus sit idle, wasting time waiting for the lower ranks. This issue is mitigated when dynamic partitioning is adopted (rows 2–3 in Figure
14).
In this test, some sand particles splash out in the upper sub-domains while the others pile up at the bottom. This uneven particle-per-grid-tile distribution leads to different behaviors of DGP and DPP. When applying DGP, each rank contains a similar number of grid tiles, but the upper ranks must handle more particles. Consequently, more parallel work is required on the upper ranks to perform P2G and G2P, and the lower ranks become idle, especially after frame 90. Under DPP the roles reverse: the lower ranks must handle more of the grid, making the upper ones wait.
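The divergence between the two partitioners can be illustrated with a minimal 1-D sketch. The greedy prefix-sum splitter and the weights below are our own simplification, not the paper's implementation: weighting every active tile uniformly mimics DGP, while weighting tiles by their particle counts mimics DPP.

```python
def split_by_prefix(weights, num_ranks):
    """Split a 1-D array of per-tile weights into num_ranks contiguous
    chunks of (approximately) equal total weight via greedy prefix sums."""
    total = sum(weights)
    target = total / num_ranks
    bounds, acc, rank = [0], 0.0, 1
    for i, w in enumerate(weights):
        acc += w
        if rank < num_ranks and acc >= rank * target:
            bounds.append(i + 1)
            rank += 1
    bounds.append(len(weights))
    return bounds

tiles = 8 * [1]                         # 8 active grid tiles along one axis
particles = [50, 40, 5, 2, 1, 1, 1, 0]  # heavily skewed particle counts

# DGP-like: weight every active tile equally -> equal tile counts per rank.
print(split_by_prefix(tiles, 2))      # -> [0, 4, 8]
# DPP-like: weight tiles by particle count -> the boundary shifts toward
# the dense region, so one rank owns few tiles but many particles while
# the other must handle most of the grid.
print(split_by_prefix(particles, 2))  # -> [0, 1, 8]
```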
Dynamic Partition Frequency. We compare the speedup of
dynamic grid/particle partitioning with different frequencies to static partitioning in Figure
15. This specific simulation takes around 233 steps per frame, and the sand particles are continuously injected until frame 80. We choose partitioning step intervals to be 50, 200, 1000, 2000, and 4000 for testing,
i.e., performing dynamic load balancing roughly every 0.25, 1, 4, 8, and 17 frames.
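These frame intervals follow directly from the measured step rate; a quick conversion (the intervals quoted in the text are approximate):

```python
steps_per_frame = 233  # measured average for this scene

# Convert each partitioning interval from steps to frames.
for interval in (50, 200, 1000, 2000, 4000):
    print(f"every {interval} steps ~ {interval / steps_per_frame:.1f} frames")
```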
As illustrated in Figure
15, all choices achieve over 1.4x speedup and can reach 2.3x in some frame ranges. In theory, the best speedup would be about 2x, as the extreme case is that the lower two ranks handle all the workload while the upper two do nothing but wait. In practice, the speedup can exceed this bound, since rebalancing also reduces the overhead of parallel scheduling, memory access, and communication.
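The 2x bound follows from the step time being dictated by the busiest rank. A toy calculation (our own illustration) for the extreme case described above:

```python
def rebalance_speedup(static_loads):
    """Speedup of a perfectly balanced partition over a static one,
    assuming the step time is set by the most loaded rank."""
    balanced = sum(static_loads) / len(static_loads)
    return max(static_loads) / balanced

# Extreme case: the lower two of four ranks carry the entire workload.
print(rebalance_speedup([0.5, 0.5, 0.0, 0.0]))  # -> 2.0
```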
Overall,
DPP outperforms the grid-based method for the splashing materials. Moreover, each partitioning frequency exhibits a different speedup trend through frames 0–25, 25–80, and 80–150. This indicates that the particle/grid count (problem scale), the material behavior (sourcing, splashing, and falling), and the motion (whether particles move in the same direction as the partition boundaries) all influence the actual performance. Regardless of the partitioning frequency, our dynamic load balancing algorithm always accelerates the simulation compared to the static case, as summarized in Table
1, and it can gain more speedup for large-scale cases that consume more time (after frame 80 when sourcing stops as in Figure
15).
In particular, we observe that DPP per 4000 steps performs better than the other cases after frame 80. There are two possible reasons. First, frequent partition changes prompt immediate particle relocation among ranks and introduce extra particle communication in the following steps, especially when particles and partition boundaries move in the same direction. Second, a nearly perfect particle partition leads to an undesirable grid partition for splashing sand. Delayed partitioning alleviates this situation by assigning more particles to the lower ranks but more grid tiles to the upper ranks, leading to more balanced particle-grid computations across ranks. DGP, however, cannot benefit from this partitioning delay.
7.2.2 Elastic Toys.
This experiment shows how the partition methods behave when the material splashes considerably less. Initially, we assign the same number and types of toys, and thus the same number of particles, to each of the four MPI ranks, and drop them as shown in Figure
9. As the toys fall, particles are communicated to lower ranks once they cross the upper-lower partition interface. Here, the overall acceleration rates (173% for
DGP and 180% for DPP) are similar, and both are close to the theoretical best speedup (200%). In the detailed timing statistics (Figure
16), we notice better acceleration with
DPP before frame 30. This happens because the toys have irregular shapes and random orientations, so some active grid tiles contain only a sharp toy corner with few particles. The lower toys reach the bottom while falling, but the upper ones are still evenly distributed in the sky. As a result, more grid tiles are activated in the upper domain, placing the partition boundary closer to the upper toy group. The toys' falling direction then makes the workload less balanced in the steps prior to the next round of partitioning. One representative frame is shown in the first row of Figure
9.