1 Introduction

In the last decade, the architectural complexity of High Performance Computing (HPC) platforms has increased considerably. To cope with this complexity, programming paradigms are being revisited. Among others, one major trend consists of writing algorithms in terms of task graphs and delegating to a runtime system both the management of data consistency and the orchestration of the actual execution. This paradigm was first intensively studied in the context of dense linear algebra [1,2,3, 6,7,8, 11, 12] and is now a common utility for state-of-the-art libraries such as PLASMA, MAGMA, FLAME, DPLASMA and Chameleon. Dense linear algebra algorithms were indeed excellent candidates for pioneering this direction. First, their regular computational pattern allows one to design very wide task graphs so that many computational units can execute tasks concurrently. Second, the building block operations they rely on, essentially level-three Basic Linear Algebra Subroutines (BLAS), are compute intensive, which makes it possible to split the work into relatively fine-grained tasks while fully benefiting from GPU acceleration. As a result, these algorithms are particularly easy to schedule, in the sense that state-of-the-art greedy scheduling algorithms may achieve performance close to the optimum, including on platforms accelerated with multiple Graphics Processing Units (GPUs). Because sparse direct methods rely on dense linear algebra kernels, a large effort has also been made to turn them into task-based algorithms [4, 9].

In this paper, we tackle another class of algorithms, the Krylov subspace methods, which aim at solving large sparse linear systems of equations of the form \(\mathcal{A} x = b\), where \(\mathcal{A}\) is a sparse matrix. First, these methods compute approximate solutions in a sequence of nested subspaces, which is an intrinsically sequential numerical scheme. Second, their unpreconditioned versions rely exclusively on kernels that are not compute intensive and have irregular memory access patterns, namely Sparse Matrix-Vector products (SpMV) and level-one BLAS, which need very large grain tasks to benefit from GPU acceleration. For these reasons, designing and scheduling Krylov subspace methods on a multi-GPU platform is extremely challenging, especially when relying on a task-based abstraction, which requires delegating part of the control to a runtime system. We discuss this methodological approach in the context of the Conjugate Gradient (CG) algorithm on a shared-memory machine accelerated with multiple GPUs, using the StarPU runtime system [5] to process the designed task graph. The CG solver is a widely used Krylov subspace method and the numerical algorithm of choice for the solution of large linear systems with symmetric positive definite matrices [13].

The objective of this study is not to optimize the performance of CG on an individual GPU, which essentially consists of optimizing the matrix layout in order to speed up SpMV. Nor do we consider reordering the matrix to improve SpMV, or numerical variants of CG that may exhibit different parallel patterns. These three techniques are extremely important but complementary and orthogonal to our work. Instead, we rely on routines from vendor libraries (NVIDIA cuSPARSE and cuBLAS) to implement individual GPU tasks, we assume that the ordering is prescribed (we do not apply any permutation) and we consider the standard formulation of the CG algorithm [13]. The objective is rather to study the opportunity of accelerating CG on multiple GPUs by designing an appropriate task flow where each individual task is processed on one GPU and all available GPUs are exploited to execute these tasks concurrently. We first propose a natural task-based expression of CG and show that such an expression fails to efficiently accelerate CG. We then propose successive improvements of the task flow design to alleviate the synchronizations, exhibit more parallelism (a wider graph) and reduce the volume of exchanges between GPUs.

The rest of the paper is organized as follows. We first propose a natural task-based expression of CG in Sect. 2 and present the experimental setup in Sect. 3. We then show how this baseline task-based expression can be enhanced to efficiently pipeline the execution of the tasks in Sect. 4, and we present a performance analysis of a multi-GPU execution in Sect. 5. Section 6 presents concluding remarks together with preliminary experiments in the fully heterogeneous case.

2 Baseline Sequential Task Flow (STF) Conjugate Gradient Algorithm

In this section, we present a first task-based expression of the CG algorithm, whose pseudo-code is given in Fig. 1a. This algorithm can be divided into two phases: the initialization phase (lines 1–5) and the main iterative loop (lines 6–16). Since the initialization phase is executed only once, in this study we focus only on the iterations of the main loop.

Three types of operations are used in an iteration of the algorithm: SpMV (the sparse matrix-vector product, line 7), scalar operations (lines 9, 13, 14) and level-one BLAS operations (lines 8, 10, 11, 12, 15). In particular, three different level-one BLAS operations are used: the scalar product (dot, lines 8 and 12), linear combination of vectors (axpy, lines 10, 11 and 15) and scaling of a vector by a scalar (scal, line 15). The scal kernel at line 15 is used in combination with an axpy: in terms of BLAS, the operation \(p \leftarrow r + \beta p\) consists of two successive operations, \(p \leftarrow \beta p\) (scal) and then \(p \leftarrow r + p\) (axpy). In our implementation, the combination of these two level-one BLAS operations forms a single task called scale-axpy. The key operation in an iteration is the SpMV (line 7) and its efficiency is thus critical for the performance of the whole algorithm.
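For reference, the sketch below spells out one iteration of this main loop as plain sequential C. It is an illustration of the building blocks only: csr_matrix, spmv_csr, dot, axpy and scal are hypothetical helpers standing in for the cuSPARSE/cuBLAS kernels that the actual tasks call.

```c
/* One iteration of the CG main loop (lines 7-15) as a plain sequential C
 * sketch.  csr_matrix, spmv_csr, dot, axpy and scal are hypothetical helpers
 * standing in for the cuSPARSE/cuBLAS kernels called by the actual tasks. */
#include <stddef.h>

typedef struct {
    size_t n;                 /* matrix order                      */
    const size_t *rowptr;     /* CSR row pointer, rowptr[n] == nnz */
    const size_t *colind;     /* CSR column indices                */
    const double *val;        /* CSR values                        */
} csr_matrix;

static void spmv_csr(const csr_matrix *A, const double *p, double *q) {
    for (size_t i = 0; i < A->n; i++) {
        double s = 0.0;
        for (size_t k = A->rowptr[i]; k < A->rowptr[i + 1]; k++)
            s += A->val[k] * p[A->colind[k]];
        q[i] = s;
    }
}
static double dot(size_t n, const double *x, const double *y) {
    double s = 0.0;
    for (size_t i = 0; i < n; i++) s += x[i] * y[i];
    return s;
}
static void axpy(size_t n, double a, const double *x, double *y) {
    for (size_t i = 0; i < n; i++) y[i] += a * x[i];
}
static void scal(size_t n, double a, double *x) {
    for (size_t i = 0; i < n; i++) x[i] *= a;
}

/* delta holds r.r from the previous iteration; the updated value is returned. */
static double cg_iteration(const csr_matrix *A, double *x, double *r,
                           double *p, double *q, double delta) {
    size_t n = A->n;
    spmv_csr(A, p, q);                    /* line 7:  q <- A p              */
    double alpha = delta / dot(n, p, q);  /* lines 8-9                      */
    axpy(n,  alpha, p, x);                /* line 10: x <- x + alpha p      */
    axpy(n, -alpha, q, r);                /* line 11: r <- r - alpha q      */
    double delta_old = delta;             /* lines 12-14:                   */
    delta = dot(n, r, r);                 /*   delta_new <- r.r             */
    double beta = delta / delta_old;      /*   beta <- delta_new/delta_old  */
    scal(n, beta, p);                     /* line 15 (scale-axpy):          */
    axpy(n, 1.0, r, p);                   /*   p <- r + beta p              */
    return delta;
}
```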

Fig. 1. Conjugate Gradient (CG) linear solver. (Color figure online)

According to our STF programming paradigm, the data need to be decomposed in order to provide opportunities for executing concurrent tasks. We consider a 1D decomposition of the sparse matrix, dividing it into multiple block-rows so that the number of non-zero values per block-row is balanced; the vectors follow the same decomposition.
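A greedy way to compute such boundaries from the CSR row pointer is sketched below; partition_block_rows is an illustrative helper under our assumptions, not the partitioning code of the actual implementation.

```c
/* Greedy 1D decomposition of a CSR matrix into nparts block-rows, balancing
 * the number of non-zero values per block-row.  Illustrative sketch only:
 * rowptr is the CSR row pointer (rowptr[n] == nnz) and block_start must hold
 * nparts + 1 entries; block-row i spans rows [block_start[i], block_start[i+1]). */
#include <stddef.h>

void partition_block_rows(size_t n, const size_t *rowptr,
                          size_t nparts, size_t *block_start)
{
    size_t nnz = rowptr[n];
    size_t row = 0;
    block_start[0] = 0;
    for (size_t i = 1; i < nparts; i++) {
        /* the first i block-rows should cover roughly i/nparts of the non-zeros */
        size_t target = (nnz * i) / nparts;
        while (row < n && rowptr[row + 1] <= target)
            row++;
        block_start[i] = row;
    }
    block_start[nparts] = n;
}
```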

Once the data are decomposed, the tasks that operate on them can be defined. The tasks derived from the main loop of the algorithm in Fig. 1a are shown in Fig. 1b for a matrix divided into six block-rows. Each task is represented by a box, named after the operation executed in that task, and edges represent the dependencies between tasks.

The first instruction executed in the main loop of the algorithm in Fig. 1a is the SpMV. With a 1D decomposition of the matrix into six block-rows, six tasks are submitted for this operation (the green tasks in Fig. 1b): \(q_i \leftarrow A_i p, i \in [1,6]\). These tasks need a copy of the whole vector p (vector p is unpartitioned). However, in order to extract parallelism from the other level-one BLAS operations that use vector p (lines 8 and 15 in Fig. 1a), our programming model requires vector p to be partitioned. The partitioning operation is a blocking call and thus represents a synchronization point in the task flow. Once vector p is partitioned, both vectors p and q are divided into six parts and six dot tasks are submitted. Each dot task accesses \(\alpha \) in read-write mode, which serializes these tasks. This sequence thus introduces a new synchronization in the task flow each time a dot operation is performed. The twelve axpy tasks (six at line 10 and six at line 11) can then all be executed in parallel. Another dot operation is then performed (line 12) and induces another serialization point. After the scalar operations at lines 13 and 14, the last scale-axpy operation of the loop is executed, which updates the vector p. At this stage, the vector is partitioned into six pieces, while an unpartitioned version of p is needed to perform the SpMV tasks of the next iteration. This is done with the unpartition operation which, like the partition operation, is a blocking call. A sketch of the corresponding task submission is given below.
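The skeleton below illustrates how one such iteration could be submitted with StarPU's sequential task flow interface. The codelets (spmv_cl, dot_cl, axpy_cl, scal_div_cl) and the data handles are assumptions standing in for our actual implementation; lines 13–15, the sign handling of line 11 and error checking are omitted for brevity.

```c
/* Skeleton of the task submission for one CG iteration with StarPU's
 * sequential task flow interface (hedged sketch, not the paper's code). */
#include <starpu.h>

#define NB 6  /* number of block-rows */

extern struct starpu_codelet spmv_cl, dot_cl, axpy_cl, scal_div_cl;
extern starpu_data_handle_t A[NB];                  /* block-rows of the matrix */
extern starpu_data_handle_t p, x[NB], r[NB], q[NB];
extern starpu_data_handle_t alpha, delta_new;       /* scalar handles           */
extern struct starpu_data_filter block_filter;      /* e.g. starpu_vector_filter_block */

void submit_cg_iteration(void)
{
    /* line 7: q_i <- A_i p -- each task reads the whole (unpartitioned) p */
    for (int i = 0; i < NB; i++)
        starpu_task_insert(&spmv_cl, STARPU_R, A[i], STARPU_R, p,
                           STARPU_W, q[i], 0);

    /* blocking call: first synchronization point of the iteration */
    starpu_data_partition(p, &block_filter);

    /* line 8: alpha is accessed in read-write mode, so the NB dot tasks
     * serialize (alpha is assumed to be reset to zero beforehand) */
    for (int i = 0; i < NB; i++)
        starpu_task_insert(&dot_cl,
                           STARPU_R, starpu_data_get_sub_data(p, 1, i),
                           STARPU_R, q[i], STARPU_RW, alpha, 0);

    /* line 9: alpha <- delta_new / alpha */
    starpu_task_insert(&scal_div_cl, STARPU_R, delta_new, STARPU_RW, alpha, 0);

    /* lines 10-11: twelve independent axpy tasks */
    for (int i = 0; i < NB; i++) {
        starpu_task_insert(&axpy_cl, STARPU_R, alpha,
                           STARPU_R, starpu_data_get_sub_data(p, 1, i),
                           STARPU_RW, x[i], 0);
        starpu_task_insert(&axpy_cl, STARPU_R, alpha,
                           STARPU_R, q[i], STARPU_RW, r[i], 0);
    }

    /* line 12: second dot (on r), again serialized on its scalar; lines 13-15
     * (beta, delta_old and the scale-axpy tasks) follow the same pattern */

    /* blocking call: rebuild the unpartitioned p for the next SpMV */
    starpu_data_unpartition(p, STARPU_MAIN_RAM);
}
```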

All in all, this task flow contains four synchronization points per iteration: two due to the partition/unpartition operations and two issued from the dot operations. The task flow is also very narrow. Section 4.1 exhibits the induced limitation in terms of pipelining, while Sects. 4.2, 4.3 and 4.4 propose successive improvements that alleviate the synchronizations and design a wider task flow, thus increasing the concurrency and the performance.

3 Experimental Setup

All the tests presented in Sect. 5 have been run on a cache-coherent Non-Uniform Memory Access (ccNUMA) machine with two hexa-core Intel Westmere Xeon X5650 processors, each with 18 GB of RAM, for a total of 36 GB. The machine is equipped with three NVIDIA Tesla M2070 GPUs, each with 6 GB of memory. The task-based CG algorithm proposed in Sect. 2 is implemented on top of StarPU v1.2. We use the opportunity offered by StarPU to control each GPU with a dedicated CPU core. To illustrate our discussion, we consider the matrices presented in Table 1. For all the results presented in this paper, the needed data are prefetched to the target GPUs before the execution and the performance assessment.

Table 1. Overview of the sparse matrices used in this study. The 11pts-256 and 11pts-128 matrices are obtained from a 3D regular grid with an 11-point discretization stencil. The Audi_kw and af_0_k101 matrices come from structural mechanics simulations on irregular 3D finite element meshes.

Scheduling and Mapping Strategy. As discussed in Sect. 2, the task flow derived from the algorithm in Fig. 1a contains four synchronization points per iteration and is very narrow, offering only very limited concurrency. Pipelining this task flow efficiently is thus very challenging. In particular, the dynamic strategies that led to close-to-optimal schedules in dense linear algebra [2] are not well suited here. We have indeed experimented with such a strategy (the Minimum Completion Time (MCT) policy), but all studied variants failed to achieve high performance. We have thus implemented a static scheduling strategy, performing a cyclic mapping of the block-rows on the available GPUs in order to ensure load balancing, as sketched below.
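With StarPU, such a static mapping can be expressed at submission time by pinning each task to a worker through the STARPU_EXECUTE_ON_WORKER argument of starpu_task_insert; the sketch below reuses the hypothetical codelet and handles of the previous sketch.

```c
/* Static cyclic mapping of the block-rows onto the GPU workers, expressed at
 * submission time.  gpu_workers[] holds the StarPU worker ids of the GPUs;
 * the codelet and handles are the assumptions used in the previous sketch. */
#include <starpu.h>

void submit_spmv_cyclic(struct starpu_codelet *spmv_cl,
                        starpu_data_handle_t *A, starpu_data_handle_t p,
                        starpu_data_handle_t *q, int nblocks,
                        const int *gpu_workers, int ngpus)
{
    for (int i = 0; i < nblocks; i++) {
        int worker = gpu_workers[i % ngpus];  /* block-row i -> GPU (i mod ngpus) */
        starpu_task_insert(spmv_cl,
                           STARPU_R, A[i], STARPU_R, p, STARPU_W, q[i],
                           STARPU_EXECUTE_ON_WORKER, worker,
                           0);
    }
}
```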

Building Block Operations. In order to explore the potential parallelism of the CG algorithm, we first study the performance of its building block operations, level-one BLAS and SpMV (Fig. 2). The granularity does not drastically penalize the performance of the SpMV operation; moreover, when three GPUs are used, a speed-up of 2.93 is obtained. On the other hand, vectors with sizes of at least a few million entries are needed to efficiently exploit multiple GPUs.

Fig. 2. Performance of the building block operations used in the CG algorithm. All data is prefetched before execution and performance assessment.

4 Achieving Efficient Software Pipelining

In accordance with the example discussed in Sect. 2, the matrix is split into six block-rows and three GPUs are used. We continue our illustration with the 11pts-128 matrix.

4.1 Assessment of the Proposed Task-Based CG Algorithm

Figure 3 shows the execution trace of one iteration of the task flow (Fig. 1b) derived from the algorithm in Fig. 1a with respect to the mapping proposed in Sect. 3. It can be interpreted as follows. The top black bar represents the state of the CPU RAM memory during the execution. Each GPU is then represented by a pair of bars: the top bar shows the activity of the GPU and the black bar below depicts the state of its memory. When data movements occur between memory nodes, they are highlighted by an arrow from the source to the destination. The activity of a GPU may be in one of three states: active computation (green), idle (red) or active waiting for the completion of a data transfer (purple).

Fig. 3. Execution trace of an iteration with the CG task flow of Fig. 1b using three GPUs. (Color figure online)

An iteration starts with the execution of the SpMV operation (line 7 in Fig. 1a), which corresponds to the time interval [\(t_0,t_1\)] in Fig. 3. Following the cyclic mapping strategy presented in Sect. 3, each GPU is in charge of two SpMV tasks. At time \(t_1\), vector p, which is still unpartitioned, is partitioned into six pieces \(p_i\), \(i \in [1,6]\), with respect to the block-row decomposition of the matrix. This data partitioning operation is a blocking call (see Sect. 2), which means that no other task can be submitted until it completes at time \(t_1\) (the red vertical bar after the SpMV tasks in Fig. 1b). Once vector p is partitioned, the tasks for all remaining operations (lines 8–15) are submitted. The dot tasks are executed sequentially with respect to the cyclic mapping strategy because, as explained in Sect. 2, the scalar \(\alpha \) is accessed in read-write mode. In addition, \(\alpha \) needs to be moved between GPUs between consecutive executions of dot tasks (time interval [\(t_1,t_2\)] in Fig. 3). Once the scalar product at line 8 is computed, the scalar division (line 9) follows, executed on GPU 1 (respecting the task flow in Fig. 1b). The next two instructions (lines 10 and 11) are then executed; before the axpy tasks can start on GPU 2 and GPU 3, however, the new value of \(\alpha \) has to be sent to them (the purple period at \(t_2\) in Fig. 3). The axpy tasks (yellow tasks in Fig. 1b) are then executed in parallel during the period [\(t_2,t_3\)]. The scalar product at line 12 is executed during the time interval [\(t_3,t_4\)], following the same sequence as explained above for line 8. Next, \(\beta \) and \(\delta _{old}\) are computed on GPU 1 at time \(t_4\) in Fig. 3, corresponding to the scalar operations at lines 13 and 14 of the algorithm in Fig. 1a. The tasks related to the last operation of the iteration (the scale-axpy tasks in Fig. 1b) are then processed during the time interval [\(t_4,t_5\)]. When all the new vector blocks \(p_i\) have been computed, the vector p is unpartitioned (red vertical bar after the scale-axpy tasks in Fig. 1b). As explained in Sect. 2, this data unpartition is another synchronization point and may only be executed in the RAM. All blocks \(p_i\) of vector p are thus moved by the runtime system from the GPUs to the RAM during the time interval [\(t_5,t_6\)] in order to build the unpartitioned vector p. This vector is then used to perform the \(q_i \leftarrow A_i \times p\) tasks related to the first instruction of the next iteration (SpMV at line 7). This also explains why the iteration starts with an active waiting of the GPUs (purple parts before time \(t_0\)): vector p is only valid in the RAM and thus needs to be copied to the GPUs.

During the execution of the task flow derived from the algorithm in Fig. 1a (Fig. 1b), the GPUs are thus idle during a large portion of the time (red and purple parts in Fig. 3). In order to pipeline the algorithm more efficiently, we present successive improvements of the task flow design: relieving synchronization points (Sect. 4.2), reducing the communication volume with a data packing mechanism (Sect. 4.3) and relying on a 2D decomposition (Sect. 4.4).

4.2 Relieving Synchronization Points

As an alternative to the sequential execution of the scalar product, each GPU j can compute a local partial sum (\(\alpha ^j\)) and StarPU can then perform a reduction to compute the final value of the scalar (\(\alpha =\sum _{j=1}^{n\_gpus}\alpha ^j\)). Figure 4a illustrates the benefit of this strategy: the computation of the scalar product, during the time interval [\(t_0,t_1\)], is now performed in parallel. Every GPU works on its own local copy of \(\alpha \) and, once they have all finished, the reduction is performed on GPU 1 just after \(t_1\).
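In StarPU, this pattern is expressed by accessing \(\alpha \) in STARPU_REDUX mode and registering reduction methods for its handle. The sketch below shows the mechanism with simple CPU codelets for brevity (the actual implementation uses GPU kernels); dot_cl and the handles are assumptions carried over from the previous sketches.

```c
/* Dot product in reduction mode: each GPU accumulates a partial sum into its
 * own per-worker copy of alpha (STARPU_REDUX access mode); StarPU combines the
 * copies with user-provided initialization and reduction codelets. */
#include <starpu.h>

static void init_zero_cpu(void *buffers[], void *cl_arg)   /* alpha <- 0 */
{
    (void)cl_arg;
    double *a = (double *)STARPU_VARIABLE_GET_PTR(buffers[0]);
    *a = 0.0;
}

static void redux_add_cpu(void *buffers[], void *cl_arg)   /* dst <- dst + src */
{
    (void)cl_arg;
    double *dst       = (double *)STARPU_VARIABLE_GET_PTR(buffers[0]);
    const double *src = (const double *)STARPU_VARIABLE_GET_PTR(buffers[1]);
    *dst += *src;
}

static struct starpu_codelet init_cl = {
    .cpu_funcs = { init_zero_cpu }, .nbuffers = 1, .modes = { STARPU_W },
};
static struct starpu_codelet redux_cl = {
    .cpu_funcs = { redux_add_cpu }, .nbuffers = 2, .modes = { STARPU_RW, STARPU_R },
};

extern struct starpu_codelet dot_cl;  /* partial dot, accumulates into alpha */

void submit_dot_redux(starpu_data_handle_t *p_blocks, starpu_data_handle_t *q,
                      starpu_data_handle_t alpha, int nblocks)
{
    starpu_data_set_reduction_methods(alpha, &redux_cl, &init_cl);
    for (int i = 0; i < nblocks; i++)
        starpu_task_insert(&dot_cl, STARPU_R, p_blocks[i], STARPU_R, q[i],
                           STARPU_REDUX, alpha, 0);
    /* the next task accessing alpha in STARPU_R mode triggers the reduction */
}
```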

Fig. 4. Execution trace of one iteration when the dot is performed in reduction mode (left) and when, in addition, data partitioning and unpartitioning are avoided (right). (Color figure online)

The partition (after instruction 7 of the algorithm in Fig. 1a) and unpartition (after instruction 15) of vector p, which are specific StarPU operations, represent two of the four synchronization points within each iteration. They furthermore induce extra management and data movement costs. First, after instruction 15, each GPU owns a valid part of vector p: once GPU 1 has computed \(p_1\), for instance, StarPU moves \(p_1\) to the RAM and then receives it back. Second, vector p has to be fully assembled in main memory (during the unpartition operation) before a copy of the fully assembled vector can be prefetched back to the GPUs (after time \(t_3\) in Fig. 4a). We have therefore designed another scheme where vector p is kept by StarPU in partitioned form all along the execution, so that partitioning and unpartitioning operations are no longer needed at each iteration. Instead of building and broadcasting the whole unpartitioned vector p, each GPU gathers only the pieces it is missing. This removes the two synchronization points related to the partition and unpartition operations, since they are no longer called, and decreases the overall traffic. Figure 4b illustrates the benefits of this policy. Avoiding the unpartition operation decreases the time required between two successive iterations from 8.8 ms to 6.6 ms. Furthermore, since the partition operation is no longer needed, the corresponding synchronization in the task flow is removed: the corresponding idle time (red part at time \(t_0\) in Fig. 4a) disappears and instructions 7 and 8 are now pipelined (period [\(t_0,t_1\)] in Fig. 4b).

Coming back to Fig. 4a, one may notice that the GPUs are idle for a while just before time \(t_1\) and again just before time \(t_2\). This is due to the reduction that finalizes each dot operation (dot(p, q) at instruction 8 and dot(r, r) at instruction 12, respectively). In the algorithm in Fig. 1a, vector x is only used at lines 10 (in read-write mode) and 6 (in read-only mode). The execution of instruction 10 can thus be moved anywhere within the iteration as long as its other input data, i.e. p and \(\alpha \), still hold the correct values. In particular, instruction 10 can be moved after instruction 12. This delay enables StarPU to overlap the final reduction of the dot occurring at instruction 12 with the computation of vector x: the red part before \(t_2\) in Fig. 4a becomes (partially) green in Fig. 4b. The considered CG formulation does not provide a similar opportunity to overlap the reduction finalizing the dot operation at instruction 8. The corresponding reordering of the task submission is sketched below.
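In the STF model this optimization is only a matter of submission order; a sketch with the same hypothetical codelets and handles as before, with p kept partitioned as p_blocks following Sect. 4.2:

```c
/* Delaying instruction 10: submitting the axpy tasks that update x after the
 * dot on r lets the runtime overlap the reduction finalizing delta_new with
 * the computation of x.  Hypothetical codelets and handles, as before. */
#include <starpu.h>

#define NB 6

extern struct starpu_codelet axpy_cl, dot_cl;
extern starpu_data_handle_t p_blocks[NB], x[NB], r[NB], q[NB], alpha, delta_new;

void submit_tail_of_iteration(void)
{
    for (int i = 0; i < NB; i++)   /* line 11: r_i <- r_i - alpha q_i */
        starpu_task_insert(&axpy_cl, STARPU_R, alpha,
                           STARPU_R, q[i], STARPU_RW, r[i], 0);

    for (int i = 0; i < NB; i++)   /* line 12: delta_new <- r.r (reduction mode) */
        starpu_task_insert(&dot_cl, STARPU_R, r[i], STARPU_R, r[i],
                           STARPU_REDUX, delta_new, 0);

    for (int i = 0; i < NB; i++)   /* line 10, delayed: x_i <- x_i + alpha p_i */
        starpu_task_insert(&axpy_cl, STARPU_R, alpha,
                           STARPU_R, p_blocks[i], STARPU_RW, x[i], 0);
}
```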

4.3 Reducing Communication Volume by Packing Data

By avoiding the data partition and unpartition operations, the broadcast of vector p has been improved (from period \([t_2,t_4]\) in Fig. 4a to period \([t_3,t_4]\) in Fig. 4b), but the communication time remains the main performance bottleneck (time interval \([t_3,t_4]\) in Fig. 4b). This communication volume can be decreased: if a column within the block-row \(A_i\) is zero, then the corresponding entry of p is not involved in the computation of the task \(q_i\leftarrow A_i p\). Therefore, p can be pruned.

We now explain how this pruning can be achieved within a task flow model. Instead of letting StarPU broadcast the whole vector p to every GPU, we define tasks that only transfer the required subset. Before executing the CG iterations, this subset is identified with a symbolic preprocessing step: based on the structure of the block \(A_{i,j}\), we determine which part of \(p_j\) is needed to build \(q_i\). If \(p_j\) is not fully required, we do not transfer it directly; instead, it can be packed into an intermediate piece of data, \(p_{i,j}\). StarPU provides elegant support for implementing such advanced techniques through the definition of new data types, and we rely on that mechanism for implementing this packing scheme. Furthermore, the packing operation may have a non-negligible cost, whereas the values of \(p_{i,j}\) that need to be sent are sometimes almost contiguous. In those cases, it may be worth sending an extra amount of data in order to directly send the contiguous superset of \(p_{i,j}\) ranging from the first to the last index that needs to be transferred. We have implemented such a scheme: a preliminary tuning is performed for each matrix and for each \(p_{i,j}\) block to choose whether \(p_{i,j}\) is packed or transferred as a contiguous range. Although StarPU can perform automatic prefetching, this prefetching is only triggered once all the dependencies of a task are satisfied. In the present context, with the static mapping, this may be too late and further anticipation is worthwhile. We therefore help the runtime system by performing explicit prefetches as early as possible, within the callback of the scale-axpy task. We do the same after the computation of the \(\alpha \) and \(\beta \) scalar values (lines 9 and 13 in Fig. 1a) in order to broadcast them to all GPUs.
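The sketch below illustrates one way to perform this symbolic preprocessing for a block \(A_{i,j}\) stored in CSR format; the 50% density threshold used to choose between the packed and the contiguous transfer is an arbitrary stand-in for the per-block tuning described above.

```c
/* Symbolic preprocessing for one block A_{i,j}: mark which entries of p_j are
 * touched by the SpMV task q_i <- A_{i,j} p_j, then decide whether to send a
 * packed list of values or one contiguous range [first, last].  Illustrative
 * sketch only; the 50% density threshold is an arbitrary stand-in. */
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

typedef struct {
    size_t first, last;   /* contiguous superset [first, last] of needed indices */
    size_t nneeded;       /* number of distinct needed indices                   */
    bool   packed;        /* true: send packed values; false: send [first, last] */
} block_comm_plan;

block_comm_plan analyze_block(size_t nrows, size_t ncols_j,
                              const size_t *rowptr, const size_t *colind)
{
    bool *needed = calloc(ncols_j, sizeof(bool));
    block_comm_plan plan = { .first = ncols_j, .last = 0, .nneeded = 0 };

    for (size_t i = 0; i < nrows; i++)
        for (size_t k = rowptr[i]; k < rowptr[i + 1]; k++) {
            size_t j = colind[k];            /* local column index within the block */
            if (!needed[j]) {
                needed[j] = true;
                plan.nneeded++;
                if (j < plan.first) plan.first = j;
                if (j > plan.last)  plan.last  = j;
            }
        }
    free(needed);

    size_t range = (plan.nneeded == 0) ? 0 : plan.last - plan.first + 1;
    plan.packed = (plan.nneeded * 2 < range);  /* pack only if the range is sparse */
    return plan;
}
```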

Figure 5 depicts the resulting execution trace. The time interval [\(t_3,t_4\)] in Fig. 4b needed for broadcasting the vector p has been reduced to the interval [\(t_0,t_1\)] in Fig. 5. In the rest of the paper, we refer to the full algorithm when all the blocks are transferred entirely and to the packed algorithm when this packing mechanism is used.

Fig. 5. Execution trace when, in addition, the vector p is packed.

Fig. 6. Execution trace when relying on a 2D decomposition of the matrix (left) and performance of the SpMV kernel when the 2D decomposition is applied to the matrix (right).

4.4 2D Decomposition

The 1D decomposition scheme requires that, for each SpMV task, all blocks of vector p (packed or not) are in place before the task starts. In order to be able to overlap the time needed for broadcasting the vector p (time interval [\(t_0,t_1\)] in Fig. 5), a 2D decomposition must be applied to the matrix: the matrix is first divided into block-rows, and the same decomposition is then applied to the columns. As with the 1D decomposition, all the SpMV tasks associated with a given block-row are mapped on the same GPU. Contrary to the 1D decomposition, where we had to wait for the transfer of all missing blocks of the vector p, with the 2D decomposition the transfer of the vector p can be overlapped with the execution of the SpMV tasks whose blocks of the vector p are already available on that GPU. On the other hand, the 2D SpMV tasks yield lower performance than their 1D counterparts (see Figs. 6b and 2b), since they operate at a lower granularity. A sketch of the corresponding task submission is given below.
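The sketch below illustrates the 2D submission, where the SpMV task for block \(A_{i,j}\) only depends on \(p_j\) and partial products are accumulated into \(q_i\); codelets and handles are assumptions, as in the previous sketches.

```c
/* 2D decomposition: the SpMV task for block A_{i,j} only needs p_j, so it can
 * start as soon as p_j is available on the GPU owning block-row i, overlapping
 * the transfers of the other blocks of p.  Partial products accumulate into
 * q_i (assumed to be zeroed beforehand); the tasks of a block-row serialize on
 * q_i but all run on the same GPU anyway. */
#include <starpu.h>

extern struct starpu_codelet spmv_block_cl;  /* q_i <- q_i + A_{i,j} p_j */

void submit_spmv_2d(starpu_data_handle_t *A /* nb x nb blocks, row-major */,
                    starpu_data_handle_t *p_blocks, starpu_data_handle_t *q,
                    int nb, const int *gpu_workers, int ngpus)
{
    for (int i = 0; i < nb; i++) {
        int worker = gpu_workers[i % ngpus];  /* whole block-row i on one GPU */
        for (int j = 0; j < nb; j++)
            starpu_task_insert(&spmv_block_cl,
                               STARPU_R, A[i * nb + j],
                               STARPU_R, p_blocks[j],
                               STARPU_RW, q[i],
                               STARPU_EXECUTE_ON_WORKER, worker,
                               0);
    }
}
```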

The impact of the 2D decomposition is shown in Fig. 6a. While there is no communication during the time interval [\(t_1,t_2\)] in Fig. 5, in Fig. 6a communications are overlapped with the execution of the SpMV tasks. In the rest of the paper, we refer to the 1D or 2D algorithm depending on the data decomposition used. The trade-off between large task granularity (1D) and increased pipelining (2D) is discussed in Sect. 5.

5 Performance Analysis

We now perform a detailed performance analysis of the task-based CG algorithm designed above. We analyze the behavior of our algorithms in terms of speed-up (\(S\)) and parallel efficiency (\(e\)) with respect to the execution on a single GPU. In order to understand the behavior of the proposed algorithms in more detail, we decompose the parallel efficiency into three effects, following the methodology proposed in [10]: the impact on efficiency of operating at a lower granularity (\(e_{granularity}\)), the impact of concurrency on the performance of individual tasks (\(e_{tasks}\)) and the impact of suboptimal task pipelining due to a lack of concurrency (\(e_{pipeline}\)). As shown in [10], the following equality holds:

$$e = e_{granularity} \times e_{tasks} \times e_{pipeline}.$$

We observed that the task efficiency is maximal (\(e_{tasks} = 1\)). Indeed, in a multi-GPU context, the different workers do not interfere with each other (they do not share memory or caches) and hence do not deteriorate the performance of one another. In the sequel, we thus only report on \(e_{granularity}\) and \(e_{pipeline}\).
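Since \(e_{tasks} = 1\) here, the pipeline efficiency follows directly from the measured overall efficiency and granularity efficiency; a trivial helper is shown below, assuming the usual definition \(e = S / n_{gpus}\) and using placeholder numbers rather than measurements from the paper.

```c
/* Decomposition of the parallel efficiency: e = e_granularity * e_tasks *
 * e_pipeline.  Assuming e = S / ngpus, e_pipeline is obtained by division.
 * The numbers in main() are placeholders, not measurements from the paper. */
#include <stdio.h>

static double pipeline_efficiency(double speedup, int ngpus,
                                  double e_granularity, double e_tasks)
{
    double e = speedup / (double)ngpus;   /* overall parallel efficiency */
    return e / (e_granularity * e_tasks);
}

int main(void)
{
    double S = 2.5, e_gran = 0.95;        /* placeholder values          */
    printf("e_pipeline = %.2f\n", pipeline_efficiency(S, 3, e_gran, 1.0));
    return 0;
}
```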

Table 2. Performance (in Gflop/s) of our CG algorithm for the matrix collection presented in Table 1.

Table 2 presents the performance achieved for all matrices in our multi-GPU context; the best performance is shown in bold. The first observation is that, for all matrices, the packed version of our algorithm, where only the needed part of the vector p is broadcast, yields the best performance: broadcasting entire sub-blocks is too expensive and considerably slows down the execution of the CG algorithm. For the matrices that have a regular distribution of the non-zeros, i.e. the 11pts-256, 11pts-128 and af_0_k101 matrices, the 1D algorithm outperforms the 2D algorithm. On the other hand, for the Audi_kw matrix, which has an unstructured pattern, the 2D algorithm, which exhibits more parallelism, yields the best performance.

Table 3. Obtained speed-up (\(S\)), overall efficiency (\(e\)), granularity efficiency (\(e_{granularity}\)) and pipeline efficiency (\(e_{pipeline}\)) for the matrices presented in Table 1 on 3 GPUs.

Table 3 allows for analyzing how the overall efficiency decomposes according to the metrics proposed above. Dividing the 11pts-256 matrix into several block-rows does not induce a penalty on the task granularity (\(e_{granularity} = 0.99 \approx 1\)). Furthermore, thanks to all the improvements of the task flow proposed in Sect. 4, a very high pipeline efficiency is achieved (\(e_{pipeline} = 0.97\)), leading to an overall efficiency of the same (very high) magnitude. For the 11pts-128 matrix, the matrix decomposition induces a similar granularity penalty (\(e_{granularity} = 0.98\)). The slightly lower granularity efficiency is a direct consequence of the matrix order: for smaller matrices, the tasks operate on smaller data and the execution time per task decreases. This makes our algorithm more sensitive to the overhead of the communications induced by the dot products and the broadcast of the vector p, ending up with a lower (but still high) pipeline efficiency (\(e_{pipeline} = 0.86\)). The overall efficiency for this matrix is \(e = 0.84\). This phenomenon is amplified when the matrix order gets lower, as in the case of the af_0_k101 matrix, resulting in a global efficiency of \(e = 0.65\). The Audi_kw matrix yields its best performance with the 2D algorithm (see Sect. 4.4). Although the 2D algorithm requires splitting the matrix into many more blocks, inducing a higher granularity penalty (\(e_{granularity} = 0.87\)), it allows for a better overlap of communication with computation, ensuring that a higher pipeline efficiency (\(e_{pipeline} = 0.91\)) is achieved. With this trade-off, an overall efficiency of \(e = 0.79\) is obtained.

Fig. 7. Traces of the execution of one iteration of the CG algorithm in the heterogeneous case (9 CPU and 3 GPU workers) with different partitioning strategies for the Audi_kw matrix. In (a), the \(nnz\) is balanced per block-row (\(33\,{\upmu }\mathrm{s}\)). In (b), feedback from a history-based performance model is used to partition the matrix (\(16\,{\upmu }\mathrm{s}\)).

6 Towards a Fully Heterogeneous CG Solver

One advantage of relying on task-based programming is that the architecture is fully abstracted. We show here that we can benefit from this design to run on a heterogeneous node composed of all available computational resources. Since the considered platform has 12 CPU cores and three GPUs, and each GPU is driven by a dedicated CPU core, we rely on 9 CPU workers and 3 GPU workers in total.
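This abstraction comes from the codelet: each task can carry both a CPU and a CUDA implementation, and StarPU picks one according to the worker the task is mapped on. A minimal sketch is given below, with a simplified CPU kernel as a stand-in (the actual GPU tasks call cuBLAS/cuSPARSE) and buffers (alpha, x, y) as in the previous sketches.

```c
/* Minimal sketch of a codelet carrying both a CPU and a CUDA implementation:
 * StarPU picks one depending on the worker a task is scheduled on, which is
 * what lets the same task flow run on CPU cores and GPUs. */
#include <starpu.h>

static void axpy_cpu(void *buffers[], void *cl_arg)
{
    (void)cl_arg;
    const double *alpha = (const double *)STARPU_VARIABLE_GET_PTR(buffers[0]);
    size_t n            = STARPU_VECTOR_GET_NX(buffers[1]);
    const double *xv    = (const double *)STARPU_VECTOR_GET_PTR(buffers[1]);
    double *yv          = (double *)STARPU_VECTOR_GET_PTR(buffers[2]);
    for (size_t i = 0; i < n; i++)
        yv[i] += (*alpha) * xv[i];
}

#ifdef STARPU_USE_CUDA
extern void axpy_cuda(void *buffers[], void *cl_arg);  /* e.g. wraps cublasDaxpy */
#endif

struct starpu_codelet axpy_cl = {
    .cpu_funcs  = { axpy_cpu },
#ifdef STARPU_USE_CUDA
    .cuda_funcs = { axpy_cuda },
#endif
    .nbuffers   = 3,
    .modes      = { STARPU_R, STARPU_R, STARPU_RW },
};
```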

Figure 7 presents execution traces of preliminary experiments that rely on two different strategies for balancing the load between the CPU cores and the GPUs. These traces show the ability of task-based programming to exploit heterogeneous platforms. However, they also show that more advanced load balancing strategies need to be designed in order to achieve a better occupancy. This question has not been fully investigated yet and will be addressed in future work.