A Study On Dynamic Load Balancing Algorithms
PC2 - Paderborn Center for Parallel Computing, Universität-GH Paderborn, D-33095 Paderborn, Germany
Phone: +49 5251 603342 Fax: +49 5251 60 3436 email: pc2-team@uni-paderborn.de
A Study on Dynamic
Load Balancing Algorithms 0
R. Lüling, B. Monien, F. Ramme
Abstract
Dynamic load balancing techniques have proved to be the most critical part of an efficient
implementation of various algorithms on large distributed computing systems.
In this paper a classification of dynamic distributed load balancing algorithms for homogeneous
multiprocessor systems is introduced and a general test bed, using a random branch
& bound load-generator, for evaluating load balancing strategies is described. With its help
a number of well known load balancing strategies are compared with two new algorithms
based on the gradient model method. The behavior of all algorithms on various networks
when running different workload patterns is studied.
Our simulations on a reconfigurable transputer system show that all strategies
perform better on networks with small diameter. The measurements indicate that even on
large networks one of the randomized strategies and our extension of the gradient model
method behave very well when simulating data-migration, while under process-migration
another extension of the gradient model method is favored. These new algorithms seem to
be very robust to the kind of workload and therefore well suited for an integration into a
distributed operating system running on large networks.
Keywords: Dynamic Distributed Load-Balancing, Gradient Model, Branch & Bound
1 Introduction
In this paper dynamic load balancing techniques (also referred to as resource sharing, resource scheduling,
job scheduling or task migration methods) in large MIMD multiprocessor systems are studied. In our
case a multiprocessor system consists of autonomous processing elements, which are coupled by a
point-to-point connection network. Processors communicate solely by message passing.
Footnote 0: An Extended Abstract of this report is published in the proceedings of the 3rd IEEE SPDP 91, pp. 686-689.
Recently, distributed computing systems with several hundred powerful processors have been built. To
achieve maximum efficiency on these large systems the workload has to be distributed equally throughout
the network. In general we can distinguish static and dynamic load balancing algorithms. In the case
of a static load balancing policy, a fixed process graph which represents the distributed computation is
mapped onto the interconnection network. In this case the aim is to minimize edge dilation, processor
load differences and edge congestion. For an overview of this work see [13].
If the load situation changes in an unpredictable way, as is the case for many applications, it is
necessary to use a dynamic load balancing strategy which adapts to this changing load situation. To
be efficient for large distributed systems the load balancing algorithm itself should be distributed. In the
past the problem of designing such algorithms was studied by various groups who used two very different
approaches:
The analytical method: In [3, 4, 7, 15, 21] different models were used to analyze the behavior of dynamic
load balancing techniques. Most of these models were based on queuing networks. Because of the
complexity of these models only simple strategies could be analyzed.
The simulation method: A large amount of work has been done in this field over the past
decade (see Proceedings of the Int. Conf. on Distributed Computing Systems for an overview).
The majority of authors connected fewer than 16 processors and used a clique network, realized by a
local area network (LAN), as interconnection structure. Most of the time, the migration of large
packets (process migration) was studied on these networks. Only a few publications consider larger
interconnection networks. See e.g. [2] and [8].
Since many of the published load balancing strategies are only variations of one basic principle, we have
reduced some of the known algorithms to their main features and compared them by simulation. The
network topology is assumed to be homogeneous and in our case consists of up to 324 transputers.
Currently, this is the largest dynamically reconfigurable transputer system in Europe. To study the behavior
in large networks, the transputers were connected to rings of up to 169 processors (large diameter) and
to 8x8, 13x13 and 18x18 processor torus topologies (relatively large diameter, maximum degree of
four).
First measurements made on our latest machine (a partitionable transputer system with 1024 T805
processors) confirmed our expectations.
To compare different applications for load balancing strategies we have examined "process migration"
(relatively long packets with strongly varying properties) and "data migration" (relatively small and
homogeneous packets) separately. These load patterns are generated by a random branch & bound load
generator. In process and data migrations, each load unit is able to generate new load units. The
generating procedure, the scheduling technique and the properties of each load unit can be controlled by
parameters of the generator. This flexibility enables us to use the same generator to model both workload
characteristics.
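The idea that each load unit may spawn further load units can be illustrated with a toy generator. This is only a minimal sketch: the function name `run_generator` and all of its parameters (`branch_prob`, `max_children`, `max_units`) are hypothetical and are not the parameters of the generator used in this report.

```python
import random


def run_generator(seed, branch_prob, max_children, max_units):
    """Count the load units processed by a toy branching load generator.

    All parameters are hypothetical illustrations, not the report's own.
    """
    rng = random.Random(seed)   # seeded, so a run can be repeated exactly
    pending = 1                 # start with a single root load unit
    processed = 0
    while pending > 0 and processed < max_units:
        pending -= 1
        processed += 1
        # Each processed unit may spawn new units, as in branch & bound.
        if rng.random() < branch_prob:
            pending += rng.randint(1, max_children)
    return processed
```

Changing `branch_prob` and `max_children` changes how bursty the generated load is, which is the kind of control the paragraph above attributes to the generator's parameters.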
The report is structured as follows: In section 2 we present our classification of load balancing
strategies. Section 3 introduces our simulation environment and the different load classes used. Several
algorithms known from the literature and implemented in our simulation environment are briefly described
in section 4.1. In section 4.2 we present two new algorithms based on the gradient model method.
Our simulation results are discussed in section 5. Some conclusions, followed by more detailed simulation
results in the appendix, conclude this report.
static, dynamic, adaptable, ...) in a tabular form. We consider this representation to be too complex for
a general discussion, however.
Any dynamic distributed load balancing strategy can be separated into a decision part and a migration
part. In the decision part of the algorithm a decision to migrate or to keep a load unit is made. This
decision can be based on the local load situation and that of the neighboring processors or it depends on
the load situation of any subset of the whole network. In the first case we call it a 'local decision base'
whereas the second case is called 'global decision base'.
In the migration part a load unit is sent to another processor to decrease the load imbalance of the
system. If load units are migrated to direct neighbors only, the strategy has a 'local migration space'.
Otherwise, it is called a strategy with 'global migration space'.
According to this distinction between global and local bases we are introducing a new ordering of load
balancing strategies with respect to the 'decision base' and the 'migration space' of a processor.
Here we distinguish the Local and Global concept for decision and migration activities. A further
distinction is achieved by regarding the initiator of load balancing activities (sender, receiver or combined
(sr)). However, the distinction between sender and receiver is less important, because most of the
load balancing strategies can be formulated in a sender or receiver initiated manner.

                        Decision-base
                        L           G
  Migration-space  G    LDGM_i      GDGM_i
                   L    LDLM_i      GDLM_i

  i ∈ {s, r, sr}
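The labels of this scheme can be generated mechanically from the three attributes. The following is a minimal sketch; the class names `Scope`, `Initiator` and `StrategyClass` are illustrative assumptions and not part of the report.

```python
from dataclasses import dataclass
from enum import Enum


class Scope(Enum):
    LOCAL = "L"
    GLOBAL = "G"


class Initiator(Enum):
    SENDER = "s"
    RECEIVER = "r"
    COMBINED = "sr"


@dataclass(frozen=True)
class StrategyClass:
    decision_base: Scope
    migration_space: Scope
    initiator: Initiator

    def label(self) -> str:
        # <decision base>D<migration space>M_<initiator>, e.g. "GDLM_r"
        return (f"{self.decision_base.value}D"
                f"{self.migration_space.value}M_{self.initiator.value}")


# All 2 x 2 x 3 = 12 classes of the scheme.
all_labels = sorted(
    StrategyClass(d, m, i).label()
    for d in Scope for m in Scope for i in Initiator
)

# The gradient model is described later in the text as a GDLM_r strategy.
gradient_model = StrategyClass(Scope.GLOBAL, Scope.LOCAL, Initiator.RECEIVER)
```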
There are two important characteristics of a distributed computation for the behavior of a load
balancing algorithm. One is the variation of the amount of work associated with a load unit. The other is
the variation of the size of the load units. The first property can be controlled by some parameters of
the load generator. For the second property we distinguish data- and process migration.
In data migration all packets are relatively small and homogeneous in size. Typical examples for this
class of applications are search algorithms (e.g. branch & bound and alpha-beta search) in the area of
artificial intelligence and operations research. We determine the load of one processor as the number of
load units held by this processor.
A common application of process migration is dynamic load balancing within a distributed operating
system. In this case a process consists of program code and corresponding data. This implies that the
packages which have to be migrated are relatively large with strongly varying properties. For our simula-
tion we assume that all tasks are independent, since for most applications the best strategy is to migrate
dependent tasks as one process cluster and to do the whole computation locally afterwards [5].
As an example of this type of strategy we have implemented the global random strategy (g-rnd) which
is similar to the local random algorithm. The only difference to the above strategy is that load units are
now migrated over the whole network to a randomly chosen processor [16].
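The difference between the two random strategies lies only in the set from which the target is drawn. The sketch below shows the target choice alone; the function names are hypothetical, and excluding the sending processor from the global draw is an assumption, since the text does not say whether a processor may pick itself.

```python
import random


def local_random_target(node, neighbors, rng):
    """l-rnd style choice: a random direct neighbor of `node`."""
    return rng.choice(neighbors[node])


def global_random_target(node, num_nodes, rng):
    """g-rnd style choice: a random processor anywhere in the network.

    Assumption: the sender itself is excluded from the draw.
    """
    target = rng.randrange(num_nodes - 1)
    return target if target < node else target + 1


# Example network: a ring of 8 processors (illustrative only).
n = 8
ring_neighbors = {v: [(v - 1) % n, (v + 1) % n] for v in range(n)}
rng = random.Random(0)
local_pick = local_random_target(3, ring_neighbors, rng)
global_pick = global_random_target(3, n, rng)
```

With a global migration space the chosen target may be many hops away, so this variant relies on the system being able to route a load unit to an arbitrary processor.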
The gradient model (GM) method was introduced by Lin and Keller [10]. It belongs to the group of
GDLM_r strategies, because decisions are based on gradient information. Gradients are vectors consisting
of load and distance information of (more or less) all processing elements, which means that each
processor tries to achieve a well approximated global state information of the network. Load units are
always sent to immediate neighbors (local). A processor can be in one of the three states L (low), N
(normal) or H (high) according to the local load situation. Each processor "knows" the outgoing link
which leads on the shortest path to a processor which is in state L. If a processor is in state H it sends
a load unit on this link in the direction of an underloaded processor. If a processor changes its state or
updates its shortest paths to a processor in state L, this state is sent to all direct neighbors.
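The shortest-path information described here can be pictured as a distance value per processor that is relaxed from the neighbors' values. The sketch below computes these distances centrally by repeated relaxation, whereas the real method does so by message exchange; the function names and the cap of diameter + 1 for "no L processor reachable" are assumptions of this illustration.

```python
def proximities(neighbors, state, diameter):
    """Distance from every node to the nearest node in state 'L'.

    Values are capped at diameter + 1 (assumed marker for "none found").
    Every node is assumed to have at least one neighbor.
    """
    cap = diameter + 1
    w = {v: 0 if state[v] == "L" else cap for v in neighbors}
    changed = True
    while changed:                      # relax until nothing changes
        changed = False
        for v in neighbors:
            if state[v] == "L":
                continue
            new = min(cap, 1 + min(w[u] for u in neighbors[v]))
            if new != w[v]:
                w[v] = new
                changed = True
    return w


def next_hop(v, neighbors, w):
    """Outgoing link of v that leads towards the nearest 'L' node."""
    return min(neighbors[v], key=lambda u: w[u])


# Ring of 6 processors: node 0 is underloaded, node 3 is overloaded.
n = 6
ring = {v: [(v - 1) % n, (v + 1) % n] for v in range(n)}
states = {0: "L", 1: "N", 2: "N", 3: "H", 4: "N", 5: "N"}
w = proximities(ring, states, diameter=3)
hop = next_hop(3, ring, w)
```

In this example the overloaded node 3 is three hops from node 0, so its load unit travels over intermediate processors before it reaches the underloaded one.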
b) Bidding-Algorithm (bid)
The Bidding-Algorithm based on [6, 18] is also state-controlled. The number of processing elements
which are able to take load units from a processor in state H depends on the distance between
these processors. The maximum distance of the load receiver and sender is varied dynamically.
Bid replies take account of communication costs.
The basic idea of this algorithm is that a processor in state H tries to migrate a load unit to a
processor with maximal bid value among all processors which have a distance less than d from the
initiating processor. The distance value d is increased (decreased) if the initiator does not receive
enough bids (receives too many bids) for its offered load unit in a fixed time interval which also
depends on d. For a complete description of this algorithm see [6, 18].
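The adaptation of the distance d can be sketched as a small update rule. The thresholds `min_bids` and `max_bids`, the bound `d_max` and the step size of one are hypothetical choices for illustration; the actual rule is given in [6, 18].

```python
def adapt_bid_distance(d, bids_received, min_bids, max_bids, d_max):
    """Widen the bidding radius after too few bids, narrow it after too many."""
    if bids_received < min_bids:
        return min(d + 1, d_max)
    if bids_received > max_bids:
        return max(d - 1, 1)
    return d


def pick_winner(bids):
    """Return the processor with the maximal bid value, or None without bids.

    `bids` maps a processor id to its bid value.
    """
    return max(bids, key=bids.get) if bids else None


# One bidding round with illustrative numbers.
d = adapt_bid_distance(d=2, bids_received=0, min_bids=1, max_bids=3, d_max=5)
winner = pick_winner({3: 0.4, 7: 0.9})
```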
a) Drafting Algorithm (draft)
In this algorithm a processor can be in one of the three states L (low), N (normal) or H (high)
which represent the actual load situation. Each processor maintains a load table which contains
the most recent information of the so called candidate processors. A candidate processor is a
processor from which a load unit may be received. Ni et al. [14] choose only the direct neighbors
as candidate processors. To achieve a better separation from the strategies with local migration
space, we use every processor of the network as a candidate processor. In contrast to [14] a load
unit is allowed to be migrated several times. Every message is extended by the load value of the
sender (piggybacking) which is used to update the values of the load tables. If the local state of
a processor changes significantly, it is broadcast through the network. A migration activity is
initiated by a processor which is in state L. This processor selects one of the processors of its local
load table which is in state H, to migrate a load unit to the initiator.
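The load table and the receiver-initiated request can be sketched as follows. This is an illustration under simplifying assumptions: the class name `DraftingNode` is hypothetical, the table stores the sender's state rather than a numeric load value, and picking the candidate with the lowest id is a placeholder, since the text only says that one processor in state H is selected.

```python
class DraftingNode:
    """One processor's view in a drafting-style protocol (illustrative)."""

    def __init__(self, node_id):
        self.node_id = node_id
        self.state = "N"        # one of "L", "N", "H"
        self.load_table = {}    # candidate id -> last known state

    def on_message(self, sender_id, sender_state):
        # Every message carries the sender's load information (piggybacking).
        self.load_table[sender_id] = sender_state

    def draft_candidate(self):
        """If underloaded, name a candidate in state H to draft a load unit from."""
        if self.state != "L":
            return None
        high = sorted(c for c, s in self.load_table.items() if s == "H")
        return high[0] if high else None    # placeholder selection rule


node = DraftingNode(0)
node.on_message(4, "H")
node.on_message(2, "N")
node.on_message(9, "H")
node.state = "L"
candidate = node.draft_candidate()
```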
If the load inside the areas goes down (more or less suddenly, in a large network) [fig. a], then the
GM starts rebalancing along the flanks [fig. b]. Larger areas of underloaded processing elements are the
consequence of this strategy, which will be overcome by the new algorithm presented in the next section.
[fig. a and fig. b: load level (H, N, L) plotted across the network]
$t - 1$ and $t$ then i) $p_t(i) = w(i) \;\; \forall i \in V$ and
ii) $s_t(i) = w(i) \;\; \forall i \in V$
Proof (sketched):
i) was shown in [10]
ii) if $\nexists\, i \in V$ such that $i$ is in state H then it follows by definition that $s_t(i) = D(G) + 1 = w(i)$.
If there is any $k \in V$ such that $k$ is in state H, then let $i \in V$ be a processing element and $k$ the
nearest processing element in state H with respect to $i$. It can be easily shown by induction on
the length of the shortest path from node $i$ to $k$ that $s_t(k) = w(k)$ holds.
This implies that pressure- and suction-surfaces are well approximated. $\Box$
The following X-GM algorithm is activated on arrival of a message from a neighboring processor or
when the local load situation changes.
Let $\{p_1, \ldots, p_k\}$, $\{s_1, \ldots, s_k\}$ be the pressure and suction values of the $k$ neighbors of a processing
element. Initially we set $p_j := 0$ and $s_j := D(G) + 1$ for all neighbors $j$.
ON event DO
  old.p := p; old.s := s
  CASE local state OF
    L : p := 0; ignore pressure values from neighbors
        s := min{D(G)+1, 1 + min{s_j | 1 <= j <= k}}
    N : p := min{D(G)+1, 1 + min{p_j | 1 <= j <= k}}
        s := 1 + min{s_j | 1 <= j <= k}
        IF (s > D(G)+1) THEN s := D(G)+1
        ELSE IF (max{s_j | 1 <= j <= k} > s) THEN
          send one load unit to neighbor j with maximal suction value s_j
    H : s := 0; p := 1 + min{p_j | 1 <= j <= k}
        IF (p > D(G)+1) THEN p := D(G)+1
        ELSE send one load unit to the neighbor j with minimal pressure value p_j
        IF (max{s_j | 1 <= j <= k} > s) THEN
          send one load unit to neighbor j with maximal suction value s_j
  IF (p <> old.p) THEN send p to all neighbors
  IF (s <> old.s) THEN send s to all neighbors
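For readers who prefer executable notation, the event handler can be transcribed into a pure function. This is a sketch of one reading of the pseudocode, not a reference implementation: the nesting of the IF statements is an assumption (the layout of the original listing is not preserved), the function name `xgm_step` is hypothetical, and sending is modelled by returning the indices of the chosen neighbors.

```python
def xgm_step(state, p, s, nbr_p, nbr_s, diameter):
    """Process one X-GM event for a single processor.

    nbr_p / nbr_s hold the last known pressure / suction values of the
    k neighbors. Returns (p, s, sends, p_changed, s_changed), where
    `sends` lists the neighbor indices that receive one load unit each.
    """
    cap = diameter + 1                      # D(G) + 1
    old_p, old_s = p, s
    sends = []
    if state == "L":
        p = 0                               # neighbors' pressure is ignored
        s = min(cap, 1 + min(nbr_s))
    elif state == "N":
        p = min(cap, 1 + min(nbr_p))
        s = 1 + min(nbr_s)
        if s > cap:
            s = cap
        elif max(nbr_s) > s:
            sends.append(nbr_s.index(max(nbr_s)))
    elif state == "H":
        s = 0
        p = 1 + min(nbr_p)
        if p > cap:
            p = cap
        else:
            sends.append(nbr_p.index(min(nbr_p)))
        if max(nbr_s) > s:
            sends.append(nbr_s.index(max(nbr_s)))
    else:
        raise ValueError(f"unknown state: {state!r}")
    # Changed values would be sent to all neighbors.
    return p, s, sends, p != old_p, s != old_s


# A normal-state processor with two neighbors in a network of diameter 4.
result_n = xgm_step("N", p=0, s=5, nbr_p=[1, 3], nbr_s=[2, 4], diameter=4)
```

In the example the processor adopts pressure 2 and suction 3 and forwards one load unit to the second neighbor, whose suction value is larger than its own.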
5 Simulation Results
To study the behavior of load-balancing strategies in large networks we have to choose the network
topology and the kind of load.
If diameter and degree of a network are constant, we expect that the behavior of networks like the torus
is comparable to networks with more processors and logarithmic diameter (like the De Bruijn network).
So we selected the 8x8, 13x13 and 18x18 torus topologies for simulation purposes. We connected the
processors to rings of up to 169 (13^2) elements to study networks with considerably larger diameter. It
has to be recognized that besides the diameter, the average degree of the network can also influence the
behavior of the strategies.
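To make the difference in diameter concrete, the following sketch builds the ring and torus graphs and measures the largest hop distance from one node by breadth-first search. It assumes a torus with wrap-around links in both dimensions (degree four); since both graph families are vertex-transitive, the eccentricity of a single node equals the diameter. The function names are illustrative.

```python
from collections import deque


def ring(n):
    """Ring of n processors, each linked to its two neighbors."""
    return {v: [(v - 1) % n, (v + 1) % n] for v in range(n)}


def torus(k):
    """k x k torus with wrap-around links (assumed), degree four."""
    return {
        (x, y): [((x - 1) % k, y), ((x + 1) % k, y),
                 (x, (y - 1) % k), (x, (y + 1) % k)]
        for x in range(k) for y in range(k)
    }


def eccentricity(graph, start):
    """Largest hop distance from `start` to any node, via breadth-first search."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        v = queue.popleft()
        for u in graph[v]:
            if u not in dist:
                dist[u] = dist[v] + 1
                queue.append(u)
    return max(dist.values())


diameters = {
    "torus 8x8": eccentricity(torus(8), (0, 0)),
    "torus 13x13": eccentricity(torus(13), (0, 0)),
    "torus 18x18": eccentricity(torus(18), (0, 0)),
    "ring 64": eccentricity(ring(64), 0),
    "ring 169": eccentricity(ring(169), 0),
}
```

Under these assumptions the 169-processor ring has a diameter of 84 hops, against 12 hops for the 13x13 torus with the same number of processors.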
Another decision concerns the kind of load to be simulated. As mentioned in section 3, we considered
data and process migration separately. However, we studied at least one strategy out of every class
introduced in section 2.
To examine the behavior of the algorithms described in section 4, we let each of them run in their
speed-up optimized version on different workloads. To avoid nondeterministic effects which are inherent in
distributed computations, we performed each simulation several times. Only average values were used
for comparison.
[fig.1 (torus) and fig.2 (ring): plots for strategies 0 to 7 on 1 to 324 processors; vertical axis 0 to 100]
From fig.1 (torus topology) we derive that only g-rnd (2), X-GM (3) and GX-GM (7) behave well. The
l-rnd (0) strategy slows down very rapidly, as can be seen in fig.2 (ring topology). g-rnd is able to saturate
the networks by extremely high migration activities (see table 3). However, this is less important when
performing data migration, because the package size is relatively small. From fig.2 (ring topology) we
can derive that a global migration space is a very important criterion in networks with large diameter and
low degree. Nevertheless, strategies like bid (5) or draft (6) which have a global migration space need a
great amount of control communications. That is the main reason why these strategies behave so badly
on large networks. The main drawback of d-N (1) is its local migration space, which often results in a
clustering of low saturated processors.
Summary :
When performing data migration in large networks, the load balancing strategy should have at least one
global component (migration space or decision base) and should not need too many control communica-
tions.
[Diagram: position of g-rnd and GX-GM in the classification scheme (migration-space axis)]
[fig.3 (torus) and fig.4 (ring): plots for strategies 0 to 7 on 1 to 324 processors; vertical axis 0 to 100]
Considering fig.3 (torus), it is remarkable that only d-N (1), X-GM (3) and GX-GM (7) behave well.
All these strategies have a well-directed migration policy while using a moderate amount of control
messages. fig.4 (ring) suggests that d-N will slow down very much if the diameter increases further. This is
due to the restricted decision base which results in load clustering if the network is large enough. Because
g-rnd has a high migration activity and the package size is large when performing process migration, this
strategy now behaves badly. Strategies with synchronized protocol activities like bid (5) or draft (6) also
behave poorly, because the necessary control messages flow slowly through the network and are often
inconsistent with the actual system load.
Summary :
When performing process migration in large networks, the load balancing strategy should have a
well-directed migration policy, so that only a few wrong migration decisions are made. This requirement is
fulfilled by a global decision base, but only if the global load information can be well approximated. A
global migration space or protocols working over long distances seem to be critical, because the edges
of the networks are often blocked by transferring large packages and therefore the protocol information
becomes very inconsistent.
[Diagram: position of GX-GM and X-GM in the classification scheme (migration-space axis)]
As a summary, one can state that g-rnd is well suited when performing data migration on large
networks. It is easy to implement but requires a global routing facility of the system.
d-N is well suited when performing process migration on networks of average size. The protocol is
relatively easy to implement and a global routing facility of the system is not needed.
X-GM behaves very well when performing process migration and quite well when performing data
migration on large networks. So it seems to be relatively robust with respect to the load characteristics.
The X-GM algorithm is easy to implement and is independent of the routing facilities of the system.
These properties make the X-GM strategy best suited for an integration into an environment where the
workload characteristics are unpredictable. An example of such an environment is a distributed operating
system running on a large network.
6 Conclusions
In this paper we have studied dynamic load balancing algorithms using a general purpose simulation
environment running on networks of up to 324 (1024) processors.
We have introduced a new classification scheme for dynamic load balancing algorithms, implemented eight
very different strategies (section 4), and evaluated their behavior on different ring and torus topologies
for data- and process migration.
Our results indicate that the decision about the most suitable load balancing algorithm depends on the
network as well as on the workload characteristics. We were able to make a promising extension of the
gradient model method introduced by Lin and Keller [10].
It has been shown that in large networks and under data migration a random strategy with global
migration space (4.1.2) and our global variant of the extended gradient model (4.2.2) perform well. In
large networks and under process migration our extended gradient model (4.2.1) has the best behavior.
Because this algorithm leads to a high performance relatively independent of the workload characteristics
it seems to be well suited for an integration into an environment with unpredictable workload patterns.
Appendix
Tables one and two present the speed-up measurements (the ratio of sequential and parallel computation
time) when performing data migration (see section 3). The minimum parallel computation time of the
average problem instance is 58.2 seconds.
The values presented are normalized to a zero search-overhead. It can be shown that anomalies of the
distributed branch & bound method [9], which in most cases result in a search-overhead, are correlated
with the complexity of the load balancing algorithm used. By normalizing the results to a zero
search-overhead, it is possible to deduce statements about the behavior of the algorithms for a fixed
computation. Further details are shown in table three.
Table 1: speed-up values, data migration, zero search-overhead, torus topology
Alg.     8x8 (64)   13x13 (169)   18x18 (324)
l-rnd      41.2       111.8         200.6
d-N        54.0        84.5         138.8
g-rnd      60.5       131.7         239.5
X-GM       61.2       139.6         228.9
GM         57.1       124.2         189.0
bid        50.5        94.2         130.4
draft      57.6       100.6          88.1
GX-GM      61.2       139.5         224.8

Table 2: speed-up values, data migration, zero search-overhead, ring topology
Alg.       64      169
l-rnd     31.4     37.2
d-N       42.8     37.9
g-rnd     56.1    102.0
X-GM      48.6     62.8
GM        46.7     45.2
bid       25.7     11.2
draft     51.3     46.6
GX-GM     55.2     73.5
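As a reading aid, the speed-up values can be converted into parallel efficiency, i.e. speed-up divided by the number of processors. Efficiency is not reported in the tables; it is a standard derived quantity computed here from the 18x18 torus column of Table 1, and the variable names are illustrative.

```python
PROCESSORS = 324  # 18x18 torus

# Speed-up values for data migration on the 18x18 torus, copied from Table 1.
speedup_torus_18x18 = {
    "l-rnd": 200.6, "d-N": 138.8, "g-rnd": 239.5, "X-GM": 228.9,
    "GM": 189.0, "bid": 130.4, "draft": 88.1, "GX-GM": 224.8,
}


def efficiency(speedup, processors):
    """Parallel efficiency: fraction of the ideal speed-up that is achieved."""
    return speedup / processors


efficiencies = {
    alg: efficiency(value, PROCESSORS)
    for alg, value in speedup_torus_18x18.items()
}
best = max(efficiencies, key=efficiencies.get)

for alg, eff in sorted(efficiencies.items(), key=lambda item: -item[1]):
    print(f"{alg:6s} {eff:.3f}")
```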
Table 3: data migration, 18x18 (324) torus topology
Alg.     #load.trans      #ctrl   %search
l-rnd        1689913          -      7.15
d-N            85227    1128386     17.31
g-rnd        1740619          -      0.68
X-GM          556162     649225      5.26
GM           1144404     488408     17.44
bid           103560    8155760     20.63
draft          43027   15883144      9.97
GX-GM         648006     668870      4.77
Here #load.trans is the total number of hops load units were migrated, #ctrl is the total number of
control messages sent and %search denotes the search-overhead with respect to the sequential case.
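The protocol cost of each strategy becomes easier to compare when the control messages are related to the migration hops. The ratio below is not part of the report; it is derived from the Table 3 columns, and l-rnd and g-rnd are left out because the table lists no control message count for them.

```python
# (#load.trans, #ctrl) pairs copied from Table 3 (18x18 torus, data migration).
table3 = {
    "d-N": (85227, 1128386),
    "X-GM": (556162, 649225),
    "GM": (1144404, 488408),
    "bid": (103560, 8155760),
    "draft": (43027, 15883144),
    "GX-GM": (648006, 668870),
}

# Control messages sent per hop a load unit was migrated.
ctrl_per_hop = {alg: ctrl / hops for alg, (hops, ctrl) in table3.items()}

for alg, ratio in sorted(ctrl_per_hop.items(), key=lambda item: item[1]):
    print(f"{alg:6s} {ratio:8.2f}")
```

On these numbers bid and draft send far more control messages per migrated hop than the gradient-based strategies, which is consistent with the remark in section 5 that they need a great amount of control communications.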
Tables four and five present the speed-up measurements when performing process migration
(see section 3). The minimum parallel computation time of the average problem instance is
154.2 seconds. Additional information is given by table six.
Table 4: speed-up values, process migration, torus topology
Alg.     8x8 (64)   13x13 (169)   18x18 (324)
l-rnd      31.1       112.9         159.4
d-N        62.8       145.3         221.5
g-rnd      58.1        87.3         122.8
X-GM       60.9       142.6         253.6
GM         49.4        86.7         180.9
bid        53.0       127.2         173.5
draft      59.1       118.6         171.4
GX-GM      59.8       142.1         239.7

Table 5: speed-up values, process migration, ring topology
Alg.       64      169
l-rnd     14.3     14.6
d-N       47.85    44.8
g-rnd     45.9     61.5
X-GM      57.9     72.9
GM        50.0     41.3
bid       32.4     16.4
draft     53.4     65.1
GX-GM     54.1     65.4
References
[1] K.M. Baumgartner, B.W. Wah
A Global Load Balancing Strategy for a Distributed Computer System, Workshop on the Future Trends of Distributed Computing Systems in the 1990s, IEEE Comp. Soc. Press, 1988, pp. 93-102
[2] W. Bodenschatz
Multi-Transputer-Maschine zur parallelen Reduktion von Funktionalsprachen, PARS Workshop 1989, pp. 128-150
[3] T.L. Casavant, J.G. Kuhl
A Formal Model of Distributed Decision-Making and its Application to Distributed Load Balancing, IEEE 6th Int. Conf. on Distributed Computing Systems 1986, pp. 232-239
[4] T.L. Casavant, J.G. Kuhl
Analysis of Three Dynamic Distributed Load Balancing Strategies with Varying Global Information Requirements, IEEE 7th Int. Conf. on Distributed Computing Systems 1987, pp. 185-192
[5] A.K. Ezzat, R.D. Bergeron, J.L. Pokoski
Task Allocation Heuristics for Distributed Computing Systems, IEEE 6th Int. Conf. on Distributed Computing Systems 1986, pp. 337-346
[6] D. Ferguson, Y. Yemini, C. Nikolaou
Microeconomic Algorithms for Load Balancing in Distributed Computer Systems, IEEE 8th Int. Conf. on Distributed Computing Systems 1988, pp. 539-546
[7] C.Y.H. Hsu, J.W.S. Liu
Dynamic Load Balancing Algorithms in Homogeneous Distributed Systems, IEEE 6th Int. Conf. on Distributed Computing Systems 1986, pp. 216-223
[8] L.V. Kale
Comparing the Performance of two Dynamic Load Distribution Methods, Parallel Processing, vol. 1, 1988, pp. 8-12
[9] T.H. Lai, S. Sahni
Anomalies in Parallel Branch and Bound Algorithms, Proc. of the Int. Conf. on Parallel Processing, 1983, pp. 183-190
[10] F.C.H. Lin, R.M. Keller
The Gradient Model Load Balancing Method, IEEE Trans. on Software Engineering 13, 1987, pp. 32-38
[11] R. Lüling, B. Monien
Two Strategies for solving the Vertex Cover Problem on a Transputer Network, 3rd Int. Workshop on Distributed Algorithms 1989, LNCS 392, pp. 160-170
[12] R. Lüling, B. Monien
Load Balancing for Distributed Branch and Bound algorithms, manuscript 1991
[13] B. Monien, H. Sudborough
Embedding one Interconnection Network in Another, Computing Suppl. 7, 1990, pp. 257-282
[14] L.M. Ni, C.W. Xu, T.B. Gendreau
Drafting Algorithm - A Dynamic Process Migration Protocol for Distributed Systems, IEEE 5th Int. Conf. on Distributed Computing Systems 1985, pp. 539-546
[15] S. Pulidas, D. Towsley, J.A. Stankovic
Imbedding Gradient Estimators in Load Balancing Algorithms, IEEE 8th Int. Conf. on Distributed Computing Systems 1988, pp. 482-490
[16] F. Ramme
Lastausgleichsverfahren in Verteilten Systemen, Master Thesis, University of Paderborn, 1990
[17] D.R. Smith
Random Trees and the Analysis of Branch and Bound Procedures, JACM, vol. 31, 1984, pp. 163-188
[18] J.A. Stankovic, I.S. Sidhu
An Adaptive Bidding Algorithm for Processes, Clusters and Distributed Groups, IEEE 4th Int. Conf. on Distributed Computing Systems 1984, pp. 49-59
[19] J.M. Troya, M. Ortega
A study of parallel branch-and-bound algorithms with best-bound-first search, Parallel Computing, vol. 11, 1989, pp. 121-126
[20] O. Vornberger
Load Balancing in a network of Transputers, 2nd Int. Workshop on Distributed Algorithms 1987, pp. 116-126
[21] S. Zhou
A Trace-Driven Simulation Study of Dynamic Load Balancing, IEEE Trans. on Software Engineering, vol. 14, no. 9, 1988, pp. 1327-1341