Algorithms for Self-Organization and Adaptive Service
Placement in Dynamic Distributed Systems

Artur Andrzejak, Sven Graupner, Vadim Kotov, Holger Trinks
Internet Systems and Storage Laboratory
HP Laboratories Palo Alto
HPL-2002-259
September 17th, 2002 *

E-mail: {artur_andrzejak, sven_graupner, vadim_kotov, holger_trinks}@hp.com

Keywords: self-organizing algorithms, adaptive service placement, distributed systems, grid systems

In this paper we consider distributed computing systems which exhibit dynamism due to their scale or inherent design, e.g. the inclusion of mobile components. Prominent examples are Grids: large networks where computing resources can be transparently shared and utilized for solving complex compute tasks.

One of the hard problems in this domain is the resource allocation problem and the related service placement problem. In this paper we discuss distributed and adaptive resource allocation algorithms performed in such dynamic systems. These algorithms assume that no global information about resource availability and service demand can be provided, due to the scale and dynamism.

Interesting aspects of our approaches are the capabilities of self-organization and fault-tolerance. We analyze and "factor out" these capabilities, making them also usable in the setting of other dynamic distributed systems, for example in mobile computing.

* Internal Accession Date Only. Approved for External Publication.

© Copyright Hewlett-Packard Company 2002
Algorithms for Self-Organization and Adaptive Service
Placement in Dynamic Distributed Systems

Artur Andrzejak, Sven Graupner, Vadim Kotov, Holger Trinks

Hewlett-Packard Laboratories, 1501 Page Mill Road, Palo Alto, CA 94304, USA
{artur_andrzejak, sven_graupner, vadim_kotov, holger_trinks}@hp.com

Abstract

In this paper we consider distributed computing systems which exhibit dynamism due to their scale or inherent design, e.g. the inclusion of mobile components. Prominent examples are Grids: large networks where computing resources can be transparently shared and utilized for solving complex compute tasks.

One of the hard problems in this domain is the resource allocation problem and the related service placement problem. In this paper we discuss distributed and adaptive resource allocation algorithms performed in such dynamic systems. These algorithms assume that no global information about resource availability and service demand can be provided, due to the scale and dynamism.

Interesting aspects of our approaches are the capabilities of self-organization and fault-tolerance. We analyze and "factor out" these capabilities, making them also usable in the setting of other dynamic distributed systems, for example in mobile computing.

1 Introduction

Grid computing arose in the early 1990s in the supercomputing community with the goal of making underutilized computing resources easily available for complex computations across geographically distributed sites. The idea of the Grid is to provide transparent and secure access to the resources through a software layer installed on the machines of participating organizations. This layer provides a multitude of functions, including resource virtualization, discovery and search for resources, as well as the management of running applications. In addition to proprietary Grid software, two major software frameworks are in use today: the open-source Globus toolkit [26] and the Grid Engine [24].

A major development in Grids is Dynamic Grid Computing [27]. This research trend focuses on harnessing dynamic resources in the Grid by providing the applications with self-awareness of their changing environment. For example, applications will possess the capability to migrate from site to site during execution, depending on both the changing resource availabilities and their own needs. We envision this trend also as an answer both to the increasing scale of Grids and to the correlated high costs of their manual management. In this paper we anticipate this basic functionality and illustrate how it can be used to increase the degree of automation in Grid systems. Another trend in Grids stressing their dynamic nature is to integrate, develop and use services in a grid environment according to the Open Grid Services Architecture (OGSA) [8] (we will therefore use the terms application and service interchangeably in this paper).

Suitable placement of services or applications on resources is the primary factor for the economic utilization of the underlying resources in such dynamic systems. A good solution to this problem prevents overloading server environments or the communication infrastructure, keeps resource utilization and response times in balance, and achieves higher availability and fault-tolerance. This paper describes and evaluates several algorithms which provide suitable service placement while considering fault-tolerance and self-organization.

As a by-product we study universal approaches and paradigms for controlling large and potentially unstable distributed systems under the aspects of self-organization and fault-tolerance. We believe that the resulting insights are useful as building blocks for a multitude of related problems (e.g. resource revocation) in distributed systems of a dynamic nature, such as those occurring in mobile computing and ubiquitous computing. These elements are partially independent of other aspects of the algorithms and can be "factored out" from the proposed approaches.

Overview of the paper. In Section 2 we discuss several issues related to the management of dynamic distributed systems in more detail. We describe the problems and challenges in this field, and we illustrate the trade-off between algorithm reactiveness and solution quality. A part of Section 2 is devoted to defining functions that evaluate the placements of applications.
The first algorithm we consider, based on the so-called Ant Colony Optimization, is presented in Section 3. This paradigm comes from the study of the behavior of real ants and incorporates elements of machine learning by recording the best partial solutions via a "pheromone". The approach has been applied successfully to a variety of problems, including routing in telecommunication networks, matching problems and the famous Traveling Salesman Problem. Its strengths are high scalability, the possibility of balancing solution time against solution accuracy, and robustness against failures of even large parts of the system.

In Section 4 we discuss an approach taken from the coordination of mobile robots, called the Broadcast of Local Eligibility (BLE). We extend this method to provide better scalability than the original solution and suggest improvements in terms of communication costs by applying gossiping algorithms. While this algorithm is simple and has a short reaction time, the placement it proposes might be far from the optimum. Therefore this algorithm is used mainly for discharging "hot spots", less for optimizing service-to-server assignments.

The algorithm presented in Section 5 combines a notion of intelligent agents, which represent groups of services, with P2P-based overlay networks serving as information services. The advantages of this novel approach are that it exploits the self-organization properties of P2P networks, its high scalability, and the ease of further extensions.

Section 6 discusses two simple algorithms which are easy to implement but cannot be expected to deliver a good placement quality.

In Section 7 we describe related work, while Section 8 is devoted to the conclusion.

2 Management of Dynamic Distributed Systems

2.1 Problem Domain

Balancing demand and supply. A major aspect of grids is to match resource supply with application demand. Resource capacities should also be provided locally to where demands occur, avoiding cross-network traffic. Since demands fluctuate over time and locations, application placements need to be adjusted accordingly, ideally completely automated without human intervention. Such an automated service grid control system then transparently regulates service demands and supplies.
So far, most integrated management systems (in Grids and also in other computing networks) are limited with regard to functioning in virtualized environments across organizational boundaries. Besides automated fail-over techniques in high-availability systems, management systems typically automate monitoring and information collection. Decisions are made by human operators interacting with the management system. Major service capacity adjustments imply manual involvement in hardware as well as in software. Systems need to be adjusted, re-installed and reconfigured, all expensive manual processes.

Centralized versus distributed management. The design of an automatic management system for Grids is closely related to the scale of the managed system and the rate of system changes. In an ideal case, all information about the system state could be collected in a central instance, and as a consequence an optimal placement could be computed (modulo the computational tractability of the problem). However, with increasing scale and rate of system changes, this solution becomes inappropriate. Another problem is fault tolerance.

Instead, we consider distributed algorithms for solving the placement problem. We further strengthen the scalability property by assuming that each individual distributed component of an algorithm has only partial information about the global state of the system. While this assumption leads to reduced communication and increased reactiveness, the obtained placement of services to resources cannot be expected to be optimal, i.e. only heuristic algorithms can work under these assumptions.

Dynamic Distributed Systems. Computational Grids and similar distributed systems are inherently dynamic due to their large scale and complexity. Here by "dynamic" we mean the property of a frequently changing state of resource availability as well as of service requirements. In a system comprising thousands of servers, changes such as server failure, overload or resource revocation might occur every few seconds. Similarly, resource demand will fluctuate over short time intervals.

These effects require adaptation of the system to new conditions on a permanent basis. While it might be possible to manage such a system with an army of human operators, this approach is certainly not economically viable, and it is more error-prone. In our view, automatic management comes into play at this point. We believe that self-organization, fault-tolerance and adaptation to changes in the supply and demand of resources are the key elements to master this challenge at the top level, i.e. the application level.

Self-organization, fault-tolerance and adaptation. The term "self-organization" is not defined precisely in the literature. Intuitively, it describes the ability of a system to organize its components into a working framework without the need for external help or control. For our purposes we will understand self-organization as the capability of adding and removing system parts without the need for reconfiguration or human intervention. This aspect is of particular interest to us, since (non-automatic) management of systems is an essential cost factor and the source of a majority of errors.

The fault tolerance of a system is its ability to recover from transient, and possibly also permanent, failures without human intervention. There is a large body of literature on fault-tolerant systems; however, it is mostly focused on the fault tolerance of system components, and not on the recovery of large and complex distributed systems. The interested reader is referred to [18].

Adaptation to changing demand/supply conditions is closely related to load balancing. Research on this topic has a long history in distributed systems. However, in most cases local ensembles of resources (such as multiprocessors or clusters of workstations) are considered, and stable "laboratory-like" conditions are assumed. In our case we have to meet a multitude of goals, as discussed in Section 2.3; also, the large scale and the dynamics of Grid-like systems make new approaches necessary.

Paradigms for mobile computing and ubiquitous computing. The challenges of dynamic Grid systems bear similarities to the challenges of other highly dynamic (albeit smaller) distributed systems: those occurring in mobile computing or ubiquitous computing. We believe that many of the methods and techniques presented here can become applicable or can give rise to new paradigms in those areas. Additional motivation for this statement is the fact that the Grid is envisioned to comprise mobile computing devices, as stated in the OGSA roadmap [8]. Satyanarayanan points out in his paper [22] that in mobile systems the roles of server and client become blurred at certain times, and mobile entities take both roles depending on the actual system conditions and resource supply. Such a scenario is closely related to a picture of "dynamic mini-Grids" with a need for constant adaptation of the computing loads. In this way, the approaches discussed in this paper become directly applicable.

To facilitate the application of the self-organizing elements and fault-tolerant properties in other domains, we discuss at the end of most sections the "building blocks" for transferring the learned lessons and paradigms.

Basic assumptions. In the remainder of this paper, we assume some lower-level system properties which are necessary for the functionality of the discussed algorithms. Specifically, we assume a basic mechanism which allows a server or another type of resource to join the system and notify its "neighbors" (e.g. resources in the same subnet) about its existence. Such mechanisms are provided in the lower network protocol layers, or by the resource discovery mechanisms in mobile systems. Note that we do not assume any central instance to be notified: informing only the neighbors is sufficient.

Another mechanism we build upon is the ability of each resource to measure its distance (in network hops or similar units) from other resources in the network. This ability enables building "maps" of other resources classified by their distance from a server. While this problem is not yet solved satisfactorily, there are some promising approaches, e.g. in the domain of P2P systems [20].

2.2 Reactiveness and Solution Quality

One of the challenges of the service placement problem is to find algorithms that are both reactive and deliver high-quality solutions at the scale we are dealing with. In practice, the responsiveness of an algorithm must be traded against the quality of a solution. Thus, responsiveness constitutes one parameter of the design space. Another parameter is the type of the control system, ranging from centralized to completely distributed. Since it is unrealistic to find one algorithm which can be parameterized in both dimensions, we look at several approaches covering most of the design space.

Figure 1 summarizes the tradeoffs for algorithms used for decision-making. The first chart symbolizes the dependency between the solution quality and the time needed to find a solution. The second chart shows that centralized algorithms usually do not scale as well as distributed algorithms. The next figure classifies four algorithms in regard to solution quality vs. reactiveness. Since they are part of a control system, the reactiveness of decisions is important. Reactiveness is understood as the time between the detection of an abnormality, for instance a sudden peak demand, and the final computation of a decision how the situation can be dealt with. Three time scales are considered: the "design" stage of an initial service placement, reiterated over longer periods as a long-term adjustment process in the system; a mid-term period for periodic operational adjustments; and a shorter-term period for discharging sudden hot spots.
[Figure 1: Decision-making algorithm tradeoffs. Two charts: solution quality vs. computation time, and scale vs. degree of centralization (centralized vs. distributed).]

[Figure 2: Comparison of four algorithms in terms of accuracy and reactiveness, ranging from accurate/slow to fast/approximating: Integer programming (longer-term service placement), Ant Colony and Agents in Overlay Networks (mid-term operational adjustments), and BLE-based (shorter-term hot spot discharging).]

One approach we pursued is a centralized heuristic algorithm based on integer programming. This algorithm (not discussed in this paper) yields high-quality solutions, but at the cost of longer running time and limited scalability. For improved responsiveness and larger scale, we explore the agent-based and distributed algorithms described below. Such algorithms are composed of several simple decision-making instances, sometimes also referred to as agents. They communicate with each other directly or indirectly in order to approximate a solution. Each decision-making instance has, in general, only partial knowledge of the system. This facilitates the scalability of such approaches. Furthermore, the failure of any of the decision-making instances does not make the overall algorithm fail.

One agent-based approach is based on the Ant Colony Optimization paradigm [6], [23]. This fully distributed algorithm has medium responsiveness and can be used for periodic reassignments of services onto servers.

As an alternative approach, we evaluate an agent system based on a paradigm known as Broadcast of Local Eligibility (BLE), used for the coordination of robot teams [28]. This partially distributed algorithm allows faster rebalancing of the managed services at the price of potentially lower-quality assignments.

Another approach uses more "intelligent" agents moving in the system, guided by a self-organizing overlay network. This fully distributed approach can be parameterized in order to react either fast yet less optimally, or slower but yielding a better-quality solution. It can be used both for fast discharging of hot spots and for mid-term operational adjustments. A comparison of these approaches with respect to the reactiveness/accuracy tradeoff is presented in Figure 2.

2.3 Control Objectives and the Partial Objective Function (POF)

General control objectives. As discussed at the beginning of this section, the goals for optimal placement might vary in general. Therefore, the following algorithms are designed to be generic enough to support new objectives without fundamental changes. However, we focus on only a few aspects to be achieved by control decisions. These are:

1. Balancing the server load such that the utilization of each server is in a desired range.

2. Placing services in such a way that the communication demand among them does not exceed the capacity of the links between the hosting server environments.

3. Minimizing the overall network traffic, aiming to place services with high mutual traffic close to each other on nearby servers (nearby in the sense of a low number of communication hops across nodes).

The Partial Objective Function. We want to be able to compare different placement options in a quantitative way. To this aim we introduce a partial objective function (POF) f_POF, which is derived from a balanced sum of two characteristics. The first one, c_T, is the sum of the traffic costs between the services on each pair of servers, weighted by the distance of these servers. The second one, u_T, is the variance of the processing capacity usage among the servers. This leads to the POF computed by the formula:

$$f_{POF} = \frac{\beta}{\beta + (\alpha \cdot c_T + (1 - \alpha) \cdot u_T)},$$

where $\alpha$ is a balancing factor between 0 and 1, and $\beta$ is a parameter described below. In our setting, both a lower weighted traffic cost and a lower variance are better. This is reflected in the value of the POF, which has a higher "score" for smaller c_T or u_T. Note that the value of f_POF ranges between 0 and 1; $\beta$ must be chosen according to the maximum possible values of c_T and u_T in order to ensure a relatively uniform distribution of the values of the POF.
It is of course possible to exchange each of the above two characteristics. In particular, instead of c_T one might imagine a function which returns the total number of servers used for the placement. Such a function can be implemented in the way described in [4].

Our POF is evaluated for a set V of servers and a set S of services. It is important to note that such a set might not contain all services or all servers in the system. In the case of services this is motivated by the fact that for larger systems we can frequently isolate groups of interdependent services (i.e. services communicating with each other). While it makes sense to consider all services in such a service group for a placement decision, we do not need to consider services outside the group.

The rationale for considering only a few and not all servers is dictated by scalability issues. In large systems, it is simply impossible to take all servers into consideration. The algorithms described in the following select an appropriate subset of the servers from the system in a heuristic fashion. The subsets are then subject to evaluation in the POF.

In the following, we give formal definitions of the characteristics c_T and u_T. We assume a fixed assignment of the services in the set S to the servers in the set V.

For two servers v and v', we designate by c_{v,v'} the estimated total traffic between all services placed on v and all services placed on v', measured in the number of exchanged IP packets. If prox_{v,v'} is the network distance of the servers v and v' (in terms of IP hops), then the total weighted communication cost c_T is given by the formula:

$$c_T = \frac{1}{M} \sum_{v \in V} \sum_{v' \in V} prox_{v,v'} \cdot c_{v,v'},$$

where M is the total number of exchanged IP packets times the maximum distance between two servers in V.

For a server v, let u_v be the fraction of its processing capacity used by all services placed on this server. We assume that u_v is a real number in [0, 1]. Then the variance u_T of these numbers is defined by:

$$u_T = \frac{1}{|V|} \sum_{v \in V} u_v^2 - \left( \frac{1}{|V|} \sum_{v \in V} u_v \right)^2.$$

Necessary conditions of an assignment. An assignment must fulfill certain necessary conditions; for example, we cannot assign a service to a server with insufficient processing capacity. By slightly abusing the notion of an "objective function", we can use f_POF to ensure that such requirements are fulfilled. Specifically, we set the value of the POF to 0 if any of the following conditions is violated (a sketch of the complete evaluation follows the list):

− Each service is placed at exactly one server.

− For each server v, the total processing demand of all services assigned to v is at most the processing capacity of v.

− For each server v, the total storage demand of all services assigned to v is at most the storage capacity of v.

− For each pair (v, v') of servers, the total network traffic between the services hosted on these servers must not exceed the link capacity between v and v'.

− The entries of a so-called affinity/repulsion matrix, if present, are respected; they indicate that a service must not, or must, be placed on a certain server.
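To make these definitions concrete, the following minimal sketch evaluates f_POF for a given assignment, folding in the feasibility rule for the processing-capacity condition (the storage, link and affinity checks would be analogous). The data layout and all parameter values are our illustrative assumptions, not something the paper prescribes.

```python
def evaluate_pof(assignment, prox, traffic, capacity, demand,
                 alpha=0.5, beta=1.0):
    """Sketch of the POF. assignment: service -> server;
    prox[(v, w)]: distance between servers in IP hops (symmetric);
    traffic[(s, t)]: IP packets exchanged by services s and t;
    capacity[v] / demand[s]: processing capacity and demand."""
    servers = set(assignment.values())

    # Feasibility: total processing demand per server must fit its capacity.
    load = {v: 0.0 for v in servers}
    for s, v in assignment.items():
        load[v] += demand[s]
    if any(load[v] > capacity[v] for v in servers):
        return 0.0                                  # infeasible -> POF is 0

    # c_T: inter-server traffic weighted by distance, normalized by
    # M = (total packets exchanged) * (maximum server distance in V).
    max_dist = max((prox[(v, w)] for v in servers for w in servers if v != w),
                   default=1)
    m = sum(traffic.values()) * max_dist or 1
    c_t = sum(prox[(assignment[s], assignment[t])] * pkts
              for (s, t), pkts in traffic.items()
              if assignment[s] != assignment[t]) / m

    # u_T: variance of the fractional utilizations u_v over the servers.
    u = [load[v] / capacity[v] for v in servers]
    mean = sum(u) / len(u)
    u_t = sum(x * x for x in u) / len(u) - mean * mean

    return beta / (beta + alpha * c_t + (1 - alpha) * u_t)
```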
3 Ant-Based Control Algorithm

In the classical Ant Colony Optimization [6], the path taken by an ant on its way between objects (e.g. cities in the Traveling Salesman Problem) represents a possible solution to the optimization problem. In our case, the objects would be both servers and services, and the alternating path would represent an assignment of services to servers. However, this approach is centralized and not really scalable, for the following reasons:

1. The ant must "remember" the whole path it has taken; this information might become very large.

2. The ant must visit all objects on its tour. In a large and dynamic system, this is a serious drawback.

3. Finally, each solution (path) must be evaluated against the others. This requires central knowledge.

3.1 Overview

For these reasons, we evaluate another approach, not common in classical Ant Colony Optimization, yet leading to better scalability. First we give an informal overview of this algorithm.

In our system, for each service s we instantiate a "demon" M_s called the service manager of s. If the service s is not yet placed, or an overload condition has occurred, M_s creates multiple ants ("agents") and sends them out into the server network. Each ant has a service list containing s and the services cooperating with s. For each such service, it knows the current resource requirements; it also knows the current communication requirements among the services in the service list.
The ant travels from one server to another, choosing the servers along the path based on a probability computed locally. In each step, one of the services from the list is assigned to the current server. The path created in this way represents a partial solution to the placement problem as found by this particular ant. When the ant has assigned all the services, it reports its path to the service manager M_s of s and terminates. The manager compares the reported paths using the POF, where the set V of the POF is constituted by the servers visited by this ant, and the set S is the service list of s. This assignment is compared with the current placement of those services. Finally, M_s decides on a possible rearrangement of the placement.

On each server, the ant evaluates the score of the server with respect to each service from its list. For each pair (service, server), this placement score expresses how well this server is suited to host the service. It is also computed using the POF, in the way described below. Furthermore, the ant causes the pheromone table of the current server to be updated. This table contains pheromone scores for certain pairs (service, server). Those are essentially weighted sums of the placement scores of the ants that evaluated this particular (service, server) pair. The table is used to help an ant decide which server to visit next. The server managers of neighboring servers periodically exchange these tables, thus providing a mechanism to disseminate local information across the system.

3.2 Ants, Service Managers and Server Managers

In our algorithm we have three entities that store and manipulate data:

− a service manager M_s of a service s,

− an ant representing s,

− a server manager (corresponding to a single server), which executes the ant code, and maintains and updates the pheromone table of its server.

The data held by a service manager comprises the service list of s, the number of spawned ants and the currently best assignment reported by an ant. A service manager also knows how to evaluate the POF and its value for the current placement of the services in the service list.

An ant is launched with the following data, which are "static" during its lifetime: the service list together with the current demand profiles of each service in the list, and the communication demand profiles between those services. This information is necessary to compute the scores via the POF. The dynamic data carried by an ant are the scores of the already assigned services from the service list, and data about the already visited servers, including link capacities.

Finally, a server manager holds the pheromone table of its server. The structure of the pheromone table is shown in Table 1. For each pair (serviceId, serverId) existing in this table, we record the known pheromone score, the age of this score and the number of ants which contributed to establishing this score.

| serviceId | serverId  | pheromone score | score age (sec) | # ants |
|-----------|-----------|-----------------|-----------------|--------|
| apache-01 | 15.1.64.5 | 0.572           | 95              | 15     |
| apache-01 | 15.1.64.7 | 0.356           | 120             | 9      |
| oracle-02 | 15.1.64.1 | 0.012           | 62              | 12     |
| …         | …         | …               | …               | …      |

Table 1: An example pheromone table.
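Expressed as a data structure, one plausible in-memory layout for Table 1 is a dictionary keyed by the (serviceId, serverId) pair; the field names mirror the table columns, while the representation itself is our assumption:

```python
import time
from dataclasses import dataclass

# A sketch of the pheromone table as a keyed record store.

@dataclass
class PheromoneEntry:
    score: float      # current pheromone score for (serviceId, serverId)
    updated: float    # timestamp of the last update; age = now - updated
    num_ants: int     # number of ants that contributed to this score

table: dict[tuple[str, str], PheromoneEntry] = {
    ("apache-01", "15.1.64.5"): PheromoneEntry(0.572, time.time() - 95, 15),
    ("apache-01", "15.1.64.7"): PheromoneEntry(0.356, time.time() - 120, 9),
    ("oracle-02", "15.1.64.1"): PheromoneEntry(0.012, time.time() - 62, 12),
}
```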
3.3 Functionality of the System Components

In this section we describe in detail the behavior of the entities introduced above.

Service managers. A service manager constantly watches the performance of "its" service and evaluates the current assignment by the POF. On two occasions it spawns ants, starting the process described below:

− If the POF value falls below some critical limit; this corresponds to the occurrence of a "hot spot" (recall that lower POF values indicate worse placements).

− If a certain period of time has passed since the last launch of the ants. The purpose of this step is to periodically "rebalance" the whole system towards an optimal utilization.

The process from the decision to launch ants until its termination includes the following steps (a sketch of this control loop follows the list):

1. Synthesize the ant data described in Section 3.2.

2. Place c_s copies of such an ant in the server network. The placement method and the value of c_s are described in Section 3.6.

3. Collect the assignments and the corresponding scores sent by the ants that have terminated.

4. Once all ants have finished (or a timeout has occurred), compare the reported assignments by the POF and choose the one with the best POF value.

5. If the service s has already been placed, compare the current POF of s and the cooperating services with the one found in Step 4. If the new assignment is better by a threshold t_s (representing the "penalty" for reassigning services to servers), continue with the next step; otherwise, terminate this epoch of ant launching.

6. If s is not placed, or the evaluation in Step 5 led to this step, reassign the services to servers in the following way:

   a. Contact all servers to be used in the new assignment of services and verify that their scores are still (approximately) valid. If this is not the case, start a new epoch of ant launching (i.e. begin from Step 1).

   b. Contact the service managers of all cooperating services and let them stop any running ant-based evaluations.

   c. Contact the servers to be used in the new assignment and let them reserve the required resource capacities.

   d. Install and start the services at their new locations.

   e. When step d is finished, shut down the services in the old placement.

   f. Finally, start the service managers of the newly installed services.
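Schematically, and with every helper name being a hypothetical placeholder for machinery the paper leaves abstract, one such epoch might be rendered as follows:

```python
# A schematic sketch of one ant-launching epoch (Steps 1-6). All methods
# on `mgr` (the service manager) are assumed, hypothetical interfaces.

def run_epoch(mgr, num_ants, threshold):
    ant_data = mgr.synthesize_ant_data()                    # Step 1
    ants = mgr.spawn_ants(ant_data, num_ants)               # Step 2 (c_s copies)
    reports = mgr.collect_reports(ants)                     # Step 3, with timeout
    best = max(reports, key=lambda r: r.pof, default=None)  # Step 4
    if best is None:
        return
    # Step 5: require an improvement of at least `threshold` (t_s).
    if mgr.is_placed() and best.pof <= mgr.current_pof() + threshold:
        return
    # Step 6: re-validate, pause peers, reserve, install, switch over.
    if not mgr.scores_still_valid(best.assignment):         # 6a
        run_epoch(mgr, num_ants, threshold)                 # restart from Step 1
        return
    mgr.pause_peer_evaluations()                            # 6b
    mgr.reserve_capacities(best.assignment)                 # 6c
    mgr.install_services(best.assignment)                   # 6d
    mgr.shutdown_old_placement()                            # 6e
    mgr.start_service_managers(best.assignment)             # 6f
```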
Ants. An ant created by the service manager M_s of a service s "travels" from one server manager to the next one (usually residing on another physical entity). Technically, this is done by contacting the next server manager, transmitting the ant data to it, and initiating the execution of the ant code for this ant instance. The choice of the next server manager is made in the way described below. The ant has the following life cycle after it has arrived at a new server manager:

1. Evaluate, for each service in the service list, the score with regard to this server. This is done via the POF for this server, as described in Section 3.4.

2. Update the pheromone table of the current server by passing the computed scores to the server manager.

3. Choose the service with the highest computed score among the not yet assigned services and remember this assignment.

4. If all services from the internal list have been assigned, report the resulting assignment to the "original" service manager M_s, then terminate.

5. Otherwise, move to the next server manager and continue with 1.

Server managers. Both entities described above have an essentially fixed order of tasks to be executed. By way of contrast, a server manager acts in an asynchronous way, providing "services" to the other two entities. Its roles comprise the following tasks:

1. It provides an environment where the ants are executed. In particular, it can asynchronously receive messages from other server managers which send the ant data. Once this data is received, it executes the locally stored code representing an ant.

2. It lets an ant update the pheromone table with the scores computed for the services in the service list.

3. It maintains the pheromone table by updating the age of the pheromone scores and pruning the table. The last step is necessary because, in the extreme case, the pheromone table could attain a size proportional to the number of servers multiplied by the number of services; this would seriously impede scalability. During the pruning, the oldest entries (except for those regarding the neighboring servers) are removed, until the desired table length is reached.

4. Finally, a server manager periodically sends its own pheromone table to the neighboring servers, keeping the information of the neighbors up to date.

The last function provides a mechanism for the dissemination of local knowledge throughout the system. This narrows the gap between a distributed system, where each participant has only local knowledge, and a centralized system with complete, global knowledge. The length of the time interval between the updates and the size of a pheromone table control the degree of "global knowledge" in the system. An antagonistic trend is the rate of changes in the system and, consequently, the ageing rate of the pheromone. Also, albeit a high degree of this knowledge is very useful for choosing the next server correctly, attaining it costs a lot of resources, mostly network bandwidth and storage for the pheromone tables.
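Putting the pieces together, the behavior of a single ant at each stop can be summarized in a short sketch; `ant` and `server` expose hypothetical helpers standing in for the mechanisms just described:

```python
# One stop of an ant's life cycle (Steps 1-5 above), under assumed interfaces.

def visit(ant, server):
    pending = ant.unassigned_services()
    if not pending:                                # nothing left to place
        ant.report_to_service_manager()
        return
    scores = {svc: ant.pof_score(svc, server)      # Step 1: score each service
              for svc in pending}
    server.update_pheromone_table(scores)          # Step 2
    best = max(scores, key=scores.get)             # Step 3: best-fitting service
    ant.assign(best, server)
    if len(pending) == 1:                          # Step 4: all placed now
        ant.report_to_service_manager()            # -> report to M_s, terminate
        return
    ant.move_to(server.next_server_for(ant))       # Step 5: continue the tour
```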
3.4 Placement Scores and the Pheromone Table

Recall that when an ant reaches a new server manager, it computes placement scores for all services from its list with respect to the current server. For such a pair (service s, server v), this computation is done via the POF as follows. The current server v and the servers already visited by the ant become the set V. Furthermore, s and all already assigned services from the ant's service list constitute the set S. Then the value of the POF is computed for the (partial) assignment of services to servers already chosen by this ant, together with the mapping of s to v. We assume that information about the link capacities between the servers is buffered by the ant or can be obtained from the server manager, if necessary.

Let us now describe how the pheromone tables are updated. Assume that an ant has computed a fresh placement score r for a pair (service, server). If such a pair does not exist in this server manager's table, it is simply inserted, with the pheromone score being equal to the placement score. Otherwise, the new value p' for this pair's pheromone score is computed from the current table entry p and the newly computed placement score r by the formula:

$$p' = \gamma \cdot p + (1 - \gamma) \cdot r.$$

Here $\gamma$ is a parameter between zero and one which determines the degree to which the previous pheromone score value is inherited. Note that the contribution of older scores decreases geometrically with the number of iterations: if the very first ant which visited the node set the pheromone score to p, then after k new ants have reached the server, the contribution of this first ant to the current score of the pair will be only $\gamma^k p$.

We also want to model an effect known from ant colony systems in nature: evaporation of the pheromone. Due to this effect, old and probably outdated information about the affinities of services to servers is removed with time, even if no new ants have arrived at this server. To this aim, a server manager scans through its pheromone table once every T minutes and reduces each score p in the pheromone table according to the formula:

$$p \leftarrow \delta \cdot p,$$

where $\delta$ is an aging factor between 0 and 1 (usually close to 1). If the value of a pheromone score decreases below a certain limit, the pair is removed from the pheromone table in order to save storage resources.
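Both rules are small enough to state directly. The following sketch uses a plain dict of scores (an assumed representation) and illustrative parameter values:

```python
# A sketch of the two pheromone rules: blending a fresh placement score
# into the table, and periodic evaporation with pruning.

def update_score(table, pair, r, gamma=0.8):
    """Blend a fresh placement score r: p' = gamma*p + (1 - gamma)*r."""
    p = table.get(pair)
    table[pair] = r if p is None else gamma * p + (1 - gamma) * r

def evaporate(table, delta=0.95, floor=1e-3):
    """Run every T minutes: p <- delta * p, pruning negligible entries."""
    for pair in list(table):
        table[pair] *= delta
        if table[pair] < floor:
            del table[pair]       # outdated affinity: drop to save storage

table = {}
update_score(table, ("apache-01", "15.1.64.5"), 0.572)  # first ant: insert
update_score(table, ("apache-01", "15.1.64.5"), 0.8)    # later ant: blend
evaporate(table)
```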
3.5 Choosing the Next Server

Pheromone tables are the main decision factor for choosing the next server to be visited by an ant. Since these tables are exchanged by neighboring servers and propagated through the system, an ant has a good chance of finding a pair (service s, server v) in the pheromone table of the current server, where v is a not too distant server and s is a still unassigned service from the service list of this ant. If multiple such pairs have been found, the server of the pair with the highest pheromone score is selected.

However, if no such pair exists, the ant chooses a set of servers from the pheromone table with (1) the most recently updated pheromone scores and (2) the highest pheromone scores. Then a random server from this set is selected as the next host. This approach aims to identify, with high probability, servers with free computational resources.

As an alternative to each of the above cases, we sometimes send an ant to a randomly selected not-too-distant server. The decision for this step is taken with a (small) probability h. Such an addition of "noise" is helpful to prevent the blocking problem and the shortcut problem [23]. The blocking problem occurs if a "popular" path found by many ants can no longer be taken, e.g. due to a server failure. The shortcut problem occurs in a situation where a new assignment of services to servers suddenly becomes possible, for example due to the introduction of new servers into the system. In both cases, the information stored in the pheromone tables might prevent the ants from adapting to the new conditions. A small amount of noise forces the ants to explore alternative routes on a permanent basis.

3.6 Initial Placement of the Ants

The initial placement of the ants is intuitively an important factor for finding good service placements. In our case, the service manager M_s places the ants in the system according to the following schema.

First, it determines N_r "regions" where clusters of ants are placed. The centers of these regions are chosen randomly in the known system area, in such a way that the probability of choosing a center distant from the service manager is smaller than that of choosing a center close to M_s. To this aim, each service manager maintains a (partial) map of the resources in terms of their network location. The resources are categorized by their IP distance d to the service manager. When choosing the center of a region, in the first step the service manager randomly selects a class of resources with a distance d to M_s. Then it decides to continue with this class with probability

$$\frac{1}{(1 + d)^{\theta}},$$

otherwise it chooses again a random class until success; here $\theta$ is a parameter greater than 1. If successful, a random resource of this class is chosen as the center of a new region. According to the findings in [15], this approach ensures that very rare resources can still be discovered, while simultaneously supporting the clustering of services according to the location of their inception.

In each of the regions determined in this way, the service manager spawns N_a ants on resources close to the center of the region. Here an approach similar to the one described above is taken, yet the distances of the created ants from the center of the region are kept smaller by means of increasing $\theta$. Furthermore, ants "repel" each other: if an ant is placed on a certain resource, then M_s will discard all servers within a distance D_r of this resource for further placements.
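The two movement rules of Sections 3.5 and 3.6 can be sketched as below. The dict-of-scores table matches the earlier sketches; for brevity the sketch ignores the "most recently updated" criterion, and all parameter values are illustrative assumptions:

```python
import random

# Sketches of the ant-movement heuristics.

def next_server(table, unassigned, nearby, h=0.05, k=5):
    """Section 3.5: pheromone-guided choice of the next server."""
    if random.random() < h:                    # "noise" jump: counters the
        return random.choice(nearby)           # blocking and shortcut problems
    matches = [(score, srv) for (svc, srv), score in table.items()
               if svc in unassigned]
    if matches:
        return max(matches)[1]                 # best-scored matching pair
    # No matching pair: pick randomly among the k highest-scored entries.
    ranked = sorted(table.items(), key=lambda kv: kv[1], reverse=True)[:k]
    return random.choice(ranked)[0][1] if ranked else random.choice(nearby)

def pick_region_center(resources_by_distance, theta=2.0):
    """Section 3.6: sample a region center, favoring nearby distance
    classes via the acceptance probability 1 / (1 + d) ** theta."""
    classes = list(resources_by_distance)
    while True:
        d = random.choice(classes)                       # candidate class
        if random.random() < 1.0 / (1.0 + d) ** theta:   # accept or retry
            return random.choice(resources_by_distance[d])
```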
3.7 Conclusions for Self-Organization and Fault Tolerance

The presented algorithm has some pleasant features with respect to automating resource management. For example, servers and resources added to the network do not need to inform any central instance of their existence; it is sufficient that only their neighbors learn the new topology (by the mechanism mentioned in Section 2.1). Furthermore, even if the majority of the servers in the system are unavailable or unreachable, our approach is not prevented from working correctly in the remaining part of the system. A further plus is the fact that, by changing the amount of noise in the ant's selection of the next steps along its path, we can adjust the necessary degree of adaptability.

A disadvantage is the fact that the service manager is a single point of failure; if it disappears, the service or a group of them might not recover without human intervention. The reader is referred to Section 5 for a solution to this problem.

Building blocks for other domains. We believe that the idea of pheromone tables deserves some attention in conjunction with agent technology. In classical agent frameworks, communication takes place either directly between agents, or between agents and "agent containers" (i.e. the environment executing them). It would be interesting to exploit models where information between agents can be exchanged in undirected and passive ways, as in the case of pheromone tables. However, we are not yet aware of further applications of this schema.

Another idea worth "extracting" from the above algorithm is the dissemination of information by exchanging it between neighboring servers only. This mechanism, similar to those used in systolic computing, allows blurring the distinction between a situation where only partial information about the system is known at each node, and the scenario in which every node possesses a complete system description. It would be interesting to learn, by theoretical analysis or an empirical study, how frequently information must be exchanged, and how fast the information may expire, for a large part of the system to have accurate information.

4 BLE-Based Control Algorithm

We adapt the concept of the Broadcast of Local Eligibility, used for the coordination of robots [28], to the placement of services. This concept can be used to create highly fault-tolerant and flexible frameworks for the coordination of systems of agents. However, the originally proposed framework has the drawback of limited scalability. To overcome this problem, we use a hierarchical control structure, discussed below.

Decision cycle in a cluster. We consider a cluster of servers with a distinguished server called the cluster head. Each member of the cluster has the ability to broadcast a message to all other members of the cluster. This can be done either directly or via the cluster head. The placement of services in this cluster is periodically re-evaluated by arbitration between peer servers in so-called decision cycles. The time between two cycles is determined by the required responsiveness to fluctuations in server utilization and by the induced communication between cluster members.

In each decision cycle, the following actions take place (a sketch of one cycle is given below):

1. Each server broadcasts the list of services it hosts, with all newly arrived services, and simultaneously updates its list of all services in the cluster.

2. Each server evaluates its own suitability to host each service and sorts the list according to the computed score. The evaluation is done using the POF from Section 2.3. In addition, a service already deployed on a server highly increases the score.

3. Each server broadcasts a list, ordered by scores, of those services the server can host simultaneously without exceeding its capacity.

4. When a server receives a score list from a peer, it compares the scores with its own scores for the services. As a consequence, each server knows whether it is the most eligible one for hosting a particular service.

5. The changes in the service placement are executed. Notice that each server already knows whether it has to install new services or remove current ones. In addition, the cluster head compares the initial list of services with those which will be hosted at the end of this decision cycle. The remaining services are passed on to the next hierarchy level, as explained below.

An important aspect is that the servers do not forget the list of services in the cluster after a decision cycle. In this way we provide fault-tolerance: if a server hosting certain services fails, other servers in the cluster will automatically install the failed services (or the cluster head adds them to the list of unassigned services).
Gossiping algorithms. Note that steps 1 and 3 require all-to-all communication, i.e. each server learns the information from all other servers. This may lead to a problem of communication costs, in terms of the number of messages and the time until all members of a cluster are informed. In infrastructures like Ethernet or wireless LAN, the cost of a broadcast is comparable to sending a targeted message, which partially relieves the situation. The problem becomes more serious if the members of a cluster are geographically distributed or communicate over a switched network.

These communication costs can be reduced using gossiping algorithms [10]. These deterministic, and also randomized [14], algorithms achieve optimal bounds for the number of messages with a low number of communication rounds; for example, the information exchange can be completed in approximately 2 log₂ n steps in the deterministic case, and in roughly log n steps in the randomized case, where n is the number of servers in the cluster. The reader is referred to the literature for a more detailed discussion.

Scalability by a cluster hierarchy. Obviously, the scalability of the above approach is limited by the size of the cluster, the communication capacity within the cluster and the processing capacity of the cluster head.

We propose the following hierarchical approach to extend the scalability. Basically, the cluster heads of the clusters at level k are treated as "normal" members of a cluster at level k+1. However, they compete only for those services which could not be installed in their own cluster (see step 5 above). After a decision round in the cluster at level k+1, these pending services are possibly moved to another peer, which is the cluster head of a cluster at level k. (The cluster head evaluates the eligibility of the servers in its own cluster, not its own eligibility.) In the cluster at level k, these services become part of the list of services to be installed and participate in the normal decision cycles.

The cluster size is essential for the balance between the responsiveness of the system and its flexibility. Identifying a correct hierarchical structure can be done similarly to the clustering algorithms used in sensor networks [7].

4.1 Conclusion: Self-Organization and Fault-Tolerance

The above algorithm has several good properties. In addition to being relatively simple, it ensures the automatic recovery of services without special mechanisms. Also, the size of a cluster can be treated as a parameter for tuning the algorithm's reactiveness (against solution quality); see Section 2.2. A weakness of the algorithm is the fact that the cluster head can become overloaded or even temporarily be a single point of failure (however, the cluster heads forming the cluster at the next level will recover a failed head in their next decision cycle). Another inconvenience of this algorithm is the fact that the hierarchy of clusters must be created externally (i.e. it is not given implicitly by the algorithm), which limits the self-organization of this approach.

Building blocks for fault-tolerance. An interesting quality of the above approach is the implicit fault tolerance, and also the implicit "negotiation" between the resources about their assumed roles (i.e. roles as hosts for applications). This mechanism works due to the fact that information about all required tasks (in our setting, the services to be hosted) and information about the capabilities of the cluster members is known to everybody in a cluster. While this scheme has been exploited successfully in the BLE approach, we think that by extending it to a hierarchical system of clusters, true scalability becomes possible.

5 Agents in Overlay Networks

In this section we describe an approach which combines the advantages of agent technology with the fault-tolerant properties of peer-to-peer (P2P) networks.

Service groups and agents. As discussed in Section 2.3, services frequently build clusters of interdependent entities which do not rely on further services outside the cluster. Such a service group, if not too large, can be treated as one (albeit not atomic) entity in the process of optimization. Therefore we assign to such a service group N_a instances of group agents. Each group agent has the task of walking around in the resource network and evaluating the current server and its neighborhood in regard to the placement of the services in the service group; however, one agent stays on one of the servers which host members of the service group, and evaluates only the current placement.

The evaluation of potential new placements is initiated by retrieving the capacity parameters and utilization data of the current server and its neighboring servers by means of a P2P network described below. This data is then subject to evaluation by the Partial Objective Function from Section 2.3. Periodically, the group agents belonging to the same service group exchange their best scores. If the score of one of them is better than that of the actual placement (also taking into account a penalty for moving services), this group agent initiates a rearrangement of the placement.

A further assignment of a group agent is to provide fault-tolerance for the optimization infrastructure: this is done by constantly watching all other N_a − 1 group agents for being alive; the special group agent staying close to the actually deployed services also watches the health of the services. If one of the group agents fails, it is immediately re-instantiated by the other agents. Also, if one of the services turns out to have failed, an appropriate recovery action is initiated.

It is important to note the difference to the Ant Colony Optimization algorithm presented in Section 3. While both ants and agents use the notion of a service group and carry data about the services in such a group, agents have a different evaluation algorithm compared to ants. While an ant assigns one service to a server in each step, an agent evaluates a possible assignment of all services to the current server and its neighbors in such a step. Furthermore, agents have more "intelligence" and, as opposed to ants, do not die. On the other hand, ants use the pheromone trails to learn the best assignments.

P2P-based overlay networks. Since the evaluation of a new agent placement incurs a lot of effort, the next jump of an agent must be chosen carefully. To this aim, agents are guided by information from an overlay network which provides capacity-related attributes of the servers. In the overlay network described in [3], servers are connected in a P2P manner to achieve fault-tolerance and self-organizing properties (i.e. servers may join and leave without a reconfiguration exercise). The functionality of the network allows range queries over attributes; in our case we are mostly interested in the server processing capacity, the server storage capacity and the density values of these attributes. The density of an attribute is the averaged attribute value over a group of servers whose center is the server which "labels" this density value; thus, a density value is an indicator of the attribute (capacity) in the surroundings of a server. The density values are periodically computed on each server by receiving updates from the surrounding resources.

When deciding about the next server to be visited, an agent first collects the current utilization data from its service group. This demand value determines the range for which the density values are queried. The overlay network responds with a list of servers fulfilling the criteria. The agent sorts them according to their distance and randomly chooses the next server to move to, similarly as described in Section 3.5. Once arrived at the new server, it queries the surrounding servers directly, retrieving their individual attribute values. (If ranges of values are necessary, the overlay network query capability can be used.) This data is then used for the evaluation of the POF.
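As a sketch, this next-jump decision might look as follows; `overlay.range_query`, the helpers on `agent`, and the distance-biased weighting are all assumptions standing in for the overlay network of [3]:

```python
import random

# A sketch of an agent's next jump, guided by the overlay network: query
# for servers whose capacity density covers the group's demand, then
# prefer nearby candidates.

def next_jump(agent, overlay, bias=2.0):
    demand = agent.group_demand()                 # current utilization data
    candidates = overlay.range_query("capacity_density", low=demand)
    if not candidates:
        return None                               # stay on the current server
    candidates.sort(key=agent.distance_to)        # nearest first
    # Distance-biased random choice, in the spirit of Section 3.5.
    weights = [1.0 / (1 + i) ** bias for i in range(len(candidates))]
    return random.choices(candidates, weights=weights, k=1)[0]
```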
5.1 Lessons Learned for Self-Organization and Fault-Tolerance

As opposed to the ACO approach from Section 3, the above algorithm provides full fault-tolerance. Since the agents guard each other, together with the service group, faults of even a majority of the system do not lead to a breakdown of the service group. Another positive aspect is the exploitation of the self-organization properties of the underlying P2P network. A disadvantage of the algorithm is the fact that each agent is a complex entity, which might bind more resources than, e.g., in the case of the Ant Colony Optimization-based algorithm.

Building blocks. The first idea deserving to be transferred into other research domains is the symmetry of the agents in their roles (except for the one which stays at the service group). This simplifies the overall schema and allows a higher degree of fault tolerance. Another noteworthy paradigm is the use of a P2P network as a "lower layer" providing self-organization capabilities to the system elements (and, in our case, also providing the information infrastructure). Such an architecture suggests a layered model, where lower layers provide self-organizing properties used by higher, more complex layers.

6 Two Simple Algorithms

To complement the above three approaches, we discuss in the following two algorithms for service placement characterized by simplicity and statelessness.

6.1 Random / Round Robin (R3) Load Distribution Algorithm

Much like random or round-robin scheduling, this load distribution algorithm pushes load from an overloaded server to a neighbor chosen randomly or in a round-robin fashion. That neighbor may absorb the load if it has the capacity, or it pushes the load further on to another server chosen in the same fashion. Once a place has been found where the load can be absorbed, the actual migration of the load is initiated in the underlying system.

[Figure 3: The R3 Load Distribution Algorithm: (1) push load to a chosen neighbor; (2) the neighbor either accepts the load or pushes it further on; (3) if capacity is found, migrate the actual load.]
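The whole algorithm fits in a few lines. In this sketch the hop limit anticipates the termination discussion below, and all helper names are assumptions:

```python
import random

# A sketch of R3: push load to randomly chosen neighbors until some
# server can absorb it. The stateless walk cannot detect cycles, so a
# hop limit enforces termination.

def r3_place(load, server, neighbors_of, free_capacity, max_hops=16):
    for _ in range(max_hops):
        if free_capacity(server) >= load:
            return server                     # found a host: migrate here
        server = random.choice(neighbors_of(server))   # push further on
    return None                               # hop limit hit, no host found
```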
The advantage of this algorithm is its simplicity and statelessness (efforts to maintain state can be avoided). The disadvantages are its unpredictability, its insufficient (random) convergence and the chance of thrashing.

The termination problem of the algorithm can be addressed by limiting the number of hops. Cycles cannot be avoided, due to the statelessness of the algorithm.

6.2 Simple Greedy Algorithm

Greedy algorithms also represent a simple category of distributed algorithms. A simple greedy algorithm just pushes load on to the least loaded neighbor. Unlike random algorithms, which do not take any information into account, greedy algorithms make use of locally available information, in our case the load conditions on neighboring servers. Servers need to exchange information in order to keep this knowledge up to date. However, total consistency cannot be achieved.

Convergence is better than in the random case. However, since load is pushed only to the most underutilized servers, these servers quickly become utilized, with the danger of becoming overloaded themselves. This causes greedy algorithms to oscillate, with the effect of thrashing service load in the underlying system. For this reason, greedy algorithms strongly depend on the update frequency of the load conditions in the meta-system. They also require a bias between load states in order to defuse the oscillation problem. Termination and cycles can also not be avoided by the algorithm itself; both need to be guaranteed by limiting the number of hops. The algorithm does not guarantee to find a solution.
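The greedy variant differs from R3 only in the choice of the target and in the bias term that damps the oscillation; a sketch under the same assumptions:

```python
# A sketch of the greedy push: move load toward the least loaded
# neighbor, but only if it is lighter by a clear margin (`bias`).
# `reported_load` models the locally exchanged, possibly stale, load data.

def greedy_place(load, server, neighbors_of, reported_load, capacity,
                 bias=0.1, max_hops=16):
    for _ in range(max_hops):
        if reported_load[server] + load <= capacity[server]:
            return server                            # enough headroom here
        best = min(neighbors_of(server), key=lambda v: reported_load[v])
        if reported_load[best] + bias >= reported_load[server]:
            return None                              # no clearly lighter peer
        server = best                                # push the load downhill
    return None                                      # hop limit reached
```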
The algorithms R3 and Greedy make good use of locality by placing load on the closest server they can find. Over a longer period, both algorithms achieve good load balancing. However, fast reactiveness is not guaranteed.

7 Related Work

Work related to the topics of this paper can be classified into three themes: self-organization of distributed systems; resource management in such systems; and distributed constraint solving.

The self-organization of distributed systems includes contributions from P2P-systems research, mobile systems and ubiquitous computing. In most P2P systems, mechanisms which automatically handle joining and leaving nodes (e.g. servers) are inherent parts of the design. Examples include Gnutella, Pastry, Tapestry, Chord and CAN [19]. The OceanStore project [16] exemplifies an application of P2P-based self-organization to resource management; another example is given in [3].

The concept of ad-hoc networks known from mobile computing [21] is another source of paradigms for self-organization. The focus of the research in this area is on protocols for discovering routes between dynamically located nodes. The BARWAN project [5] addresses aspects of self-organization and dynamic adaptation in the domains of mobile networking and ubiquitous computing.

The most prominent project at the edge of self-organization and resource management is IBM's Autonomic Computing vision [13]. This broad collection of projects intends to create systems that are self-configuring, self-healing and self-optimizing. Related to this research thread is the Océano project [12]. It addresses the design and building of a prototype of a scalable infrastructure for a large-scale "computing utility powerplant" that enables multi-customer hosting on a virtualized collection of hardware resources. An undertaking of similar flavor is HP's Utility Data Center project [11], [1].

There is a multitude of activities focused on using computational Grids for sharing distributed supercomputing resources [25], [2]. Examples include the Globus toolkit [26] and Sun's Grid Engine [24]. Although these systems exhibit a mature infrastructure for resource management, the scheduling part still lacks more sophisticated algorithms.

In the field of distributed constraint solving, the most notable thread is the research on Distributed Constraint Satisfaction Problems (DCSPs) [17]. In a DCSP, several computational agents try to collectively solve a connected Constraint Satisfaction Problem. Such a problem consists of a set of variables which take their values from particular domains, and a set of constraints which specify the permitted value combinations. Each agent carries – depending on the strategy – a subset of the variables, or a subset of the values for the variables, and tries to assign values to variables while preserving consistency between the agents. Noteworthy strategies in DCSP are Asynchronous Backtracking, Weak-Commitment Search and Distributed Constrained Heuristic Search [9].

8 Conclusion

The algorithms presented in this paper provide means for the distributed control of resources in dynamic distributed systems such as large Grids or federations of data centers. The approaches exhibit different levels of the tradeoff between reactiveness and solution accuracy, so that not a single algorithm but a suite of them becomes necessary. Interesting aspects of the algorithms are their capabilities of self-organization and fault-tolerance. For each algorithm, we discuss these capabilities with the goal of proposing paradigms usable in other domains, such as mobile computing or ubiquitous computing.

Figure 4 summarizes and classifies the behavior of the algorithms (a comparison to the centralized integer programming approach not discussed in this paper is also provided).

|                      | Integer | Ovl.Agts | Ants | BLE | R3 | Greedy |
|----------------------|---------|----------|------|-----|----|--------|
| scalable             | -       | +        | +    | +   | +  | +      |
| dense graph          | +       | -        | +    | -   | ?  | ?      |
| globally accurate    | +       | -        | ?    | -   | -  | -      |
| fast reactiveness    | -       | +        | -    | +   | -  | -      |
| self-organisation    | -       | +        | +    | +   | ?  | ?      |
| fail-over capability | -       | +        | -    | +   | ?  | ?      |
| extensibility        | +       | +        | -    | -   | +  | +      |
| adaptability         | -       | +        | +    | -   | +  | +      |
| simplicity           | +       | -        | -    | +   | +  | +      |

Figure 4: Classification of control algorithms.

References

[1] A. Andrzejak, S. Graupner, V. Kotov and H. Trinks: Control Architecture for Service Grids in a Federation of Utility Data Centers, HP Labs Technical Report HPL-2002-235, 2002. (HPL technical reports are available at http://lib.hpl.hp.com/techpubs.)

[2] A. Andrzejak, S. Graupner, V. Kotov and H. Trinks: Self-Organizing Control in Planetary-Scale Computing, IEEE International Symposium on Cluster Computing and the Grid (CCGrid), May 21-24, 2002, Berlin.

[3] A. Andrzejak and Z. Xu: Scalable, Efficient Range Queries for Grid Information Services, Second IEEE International Conference on Peer-to-Peer Computing (P2P2002), Linköping, Sweden, 5-7 September 2002.

[4] A. Andrzejak, J. Rolia, and M. Arlitt: Bounding the Resource Savings of Several Utility Computing Models for a Data Center, in preparation, 2002.

[5] E. A. Brewer, R. H. Katz, E. Amir, H. Balakrishnan, Y. Chawathe, A. Fox, S. D. Gribble, T. Hodes, G. Nguyen, V. N. Padmanabhan, M. Stemm, S. Seshan and T. Henderson: A Network Architecture for Heterogeneous Mobile Computing, IEEE Personal Communications Magazine, Oct. 1998.

[6] M. Dorigo, V. Maniezzo and A. Colorni: The Ant System: Optimization by a Colony of Cooperating Agents, IEEE Transactions on Systems, Man, and Cybernetics – Part B, 26(1), 29-41, 1996.

[7] D. Estrin, R. Govindan, J. Heidemann, and S. Kumar: Next Century Challenges: Scalable Coordination in Sensor Networks, Proceedings of MOBICOM, pp. 263-270, Seattle, USA, August 1999.

[8] I. Foster, C. Kesselman, J. M. Nick, and S. Tuecke: The Physiology of the Grid – An Open Grid Services Architecture for Distributed Systems Integration, DRAFT, http://www.globus.org/research/papers/ogsa.pdf, May 2002.

[9] M. Hannebauer: On Proving Properties of Concurrent Algorithms for Distributed CSPs. ** complete **

[10] S. T. Hedetniemi, S. M. Hedetniemi, and A. L. Liestman: A Survey of Broadcasting and Gossiping in Communication Networks, Networks 18: 319-349, 1988.

[11] HP, Utility Data Center, http://www.hp.com/go/hpudc, http://www.hp.com/go/always-on, November 2001.

[12] IBM and University of Berkeley, Océano Project, http://www.research.ibm.com/oceanoproject.

[13] IBM, Autonomic Computing Manifesto, http://www.research.ibm.com/autonomic/manifesto.

[14] A.-M. Kermarrec, L. Massoulié, and A. J. Ganesh: Reliable Probabilistic Communication in Large-Scale Information Dissemination Systems, Microsoft Research Technical Report MSR-TR-2000-105, October 2000.

[15] J. Kleinberg: The Small-World Phenomenon: An Algorithmic Perspective, Cornell Computer Science Technical Report 99-1776, October 1999.

[16] J. Kubiatowicz, D. Bindel, Y. Chen, S. Czerwinski, P. Eaton, D. Geels, R. Gummadi, S. Rhea, H. Weatherspoon, W. Weimer, C. Wells, and B. Zhao: OceanStore: An Architecture for Global-Scale Persistent Storage, ASPLOS '00, MA, USA, 2000.

[17] Q. Y. Luo, P. G. Hendry, and J. T. Buchanan: Comparison of Different Approaches for Solving Distributed Constraint Satisfaction Problems, Research Report RR-93-74, Department of Computer Science, University of Strathclyde, Glasgow, UK, 1993.

[18] E. Marcus and H. Stern: Blueprints for High Availability: Designing Resilient Distributed Systems, John Wiley & Sons, N.Y., 2000.

[19] D. S. Milojicic, V. Kalogeraki, R. Lukose, K. Nagaraja, J. Pruyne, B. Richard, S. Rollins and Z. Xu: Peer-to-Peer Computing, HP Labs Technical Report HPL-2002-57, 2002.
[20] S. Ratnasamy, P. Francis, M. Handley, R. Karp, and S. Shenker: A Scalable Content-Addressable Network, SIGCOMM 2001, San Diego, August 27-31, 2001.

[21] E. M. Royer and C.-K. Toh: A Review of Current Routing Protocols for Ad Hoc Mobile Wireless Networks, IEEE Personal Communications Magazine, Apr. 1999.

[22] M. Satyanarayanan: Fundamental Challenges in Mobile Computing, Symposium on Principles of Distributed Computing, 1996.

[23] R. Schoonderwoerd, O. Holland, J. Bruten, and L. Rothkrantz: Ants for Load Balancing in Telecommunications Networks, Adaptive Behavior 2:169-207, 1996.

[24] Sun Microsystems, The Sun Grid Engine, http://wwws.sun.com/gridware.

[25] The Global Grid Forum, http://www.gridforum.org/.

[26] The Globus Toolkit, http://www.globus.org/toolkit.

[27] The GridLab Project, http://www.gridlab.org.

[28] B. B. Werger and M. Matarić: From Insect to Internet: Situated Control for Networked Robot Teams, to appear in Annals of Mathematics and Artificial Intelligence, 2000.
