A Failure Handling Framework for Distributed Data Mining Services on the Grid
Eugenio Cesario (ICAR-CNR, Rende, Italy, cesario@icar.cnr.it) and Domenico Talia (University of Calabria and ICAR-CNR, Rende, Italy, talia@deis.unical.it)
Abstract: Fault tolerance is an important issue in Grid computing, where many heterogeneous machines are used. In this paper we present a flexible failure handling framework that extends a previously proposed service-oriented architecture for Distributed Data Mining, addressing the requirements for fault tolerance in the Grid. The framework allows users to achieve failure recovery whenever a crash occurs on a Grid node involved in the computation. The implemented framework has been evaluated on a real Grid setting to assess its effectiveness and performance. Keywords: Distributed Data Mining; Fault Tolerance; Grid Computing.
I. INTRODUCTION
Grid computing differs from conventional distributed computing because it focuses on large-scale resource sharing, offers innovative applications, and, in some cases, is geared toward high-performance systems [5]. The driving Grid applications are traditional high-performance applications, such as high-energy particle physics, astronomy and environmental modeling, in which experimental devices create large quantities of data that require scientific analysis. For these reasons, Grids must offer effective support to the implementation and use of data mining and knowledge discovery systems. To achieve such a goal, several distributed data mining systems exploiting the Grid infrastructure have been designed and implemented [18], [14], [9], [2]. In a previous paper [3] we described how some distributed data mining patterns can be implemented as mining services by exploiting the Grid infrastructure. In particular, we defined a distributed architectural model that can be exploited by different distributed mining algorithms deployed as Grid services, for the analysis of dispersed data sources. The implementation of some algorithms on the proposed architecture was presented, together with an experimental evaluation on real data sets. In computing systems, a general problem to take into account is machine failure due to faults in some component, such as a processor, memory, device, cable, or software. A fault is a malfunction, possibly caused by a design error, a manufacturing error, a programming error, physical damage, deterioration in the course of time, and many other causes. Not all faults lead (immediately) to system failures, but they can. In particular, this aspect
becomes relevant in a scenario, like the Grid, where many heterogeneous machines are involved. Developing, deploying, and running applications on such an environment poses significant challenges due to the diverse failures and error conditions encountered during execution. As observed in [12], although the mean time to failure of any single entity in a computational Grid is high, the large number of entities in a Grid (hardware, network, software, grid middleware, core services, etc.) means that a Grid can fail frequently. For example, in [16] the authors studied the failure data of several high performance computing systems operated by Los Alamos National Laboratory (LANL) over nine years. Although failure rates varied from 0.1 to 3 failures per processor per year, systems with 4096 processors averaged as many as 3 failures per day. Thus, although the number of failures per processor is relatively low, the aggregate reliability of a system clearly deteriorates as the number of processors increases. So, the reliability of a computational Grid is a real problem to deal with. In this paper we present a flexible failure handling framework that extends the one proposed in [3], addressing the requirements for fault tolerance in the Grid. The framework allows users to achieve failure recovery whenever a crash occurs on a Grid node involved in the computation. The rest of the paper is organized as follows. Section II describes the original framework, which provides no fault tolerance functionality. Section III gives an overview of the most common techniques for failure handling, as well as some systems exploiting them. Section IV describes a fault-tolerant framework for distributed data mining, as an extension of the one proposed in [3]. Section V discusses the experimental evaluation of the proposed approach. Section VI gives some concluding remarks.
II. BACKGROUND: A SERVICE-ORIENTED ARCHITECTURE FOR DISTRIBUTED DATA MINING ON THE GRID
Typically, in Distributed Data Mining (DDM), data sets are stored in local databases or file systems, hosted by local computers/repositories connected through a computer network. Data mining takes place both locally at each remote site and at a global level, where the local knowledge is fused in order to discover global knowledge.
Figure 1. (Schema: Data Sources 1..N are mined into Local Models 1..N, which are combined into a Global Model.)
One of the most common DDM approaches is reported in Figure 1. The first phase normally involves the analysis of the local dataset at each site; this step infers statistics or local models. In the second step, the locally discovered knowledge is usually transmitted to a merger (or central) site, where the integration/refinement of the distributed local models is performed. These two steps can be repeated several times, until a convergence condition is achieved. It is worth noting that this pattern is very common in many classes of DDM algorithms. Examples of solutions adhering to this design pattern fall into the following categories: clustering ([15], [4], [11]), classification ([8]), association rule and frequent itemset mining ([8]), ensemble learning ([19]), collective data mining ([11]) and meta-learning ([13]).
A. Architecture of the Proposed Framework
In order to provide a service-oriented architecture for the execution of distributed data mining algorithms on the Grid, in [3] we designed the Grid service architectural model shown in Figure 2. It is composed of two Grid Services: the GlobalMiner-WS and the LocalMiner-WS. The overall architecture resembles the DDM schema described above, contemplating an entity acting as coordinator (the GlobalMiner-WS) and a certain number of entities acting as miners (the LocalMiner-WSs) on the local sites. Thus, it is a master/worker architecture, in which the service on the master node arranges the operations performed by the services on the worker nodes. A resource is associated to each service: the GlobalModel Resource to the GlobalMiner-WS and the LocalModel Resource to the LocalMiner-WS. These resources are used to store the state of the services, in this case represented by the computed models (global and local, respectively). Additionally, the resources are also published as topics, so that they can be subscribed to as items of interest for notifications. As can be seen in Figure 2, a Global Algorithm Library (GAL)
and a Local Algorithm Library (LAL) are present on the two types of nodes; these are code libraries providing the mining algorithms to be executed (at the global and local level, respectively). The services have been developed using the Java WSRF library provided by WS-Core, a component of the Globus Toolkit 4 [6]. In the following we describe all the steps composing the whole process, pointing out the details of the interactions between the entities composing the architecture. Let us suppose that a client wants to execute a distributed mining algorithm on a dataset D, which is partitioned in N partitions, {D1, ..., DN}, each one stored on one of the nodes {Node1, ..., NodeN}. A request to the framework to perform a mining process can be divided into three main phases, each one composed of various steps, as described in the following.
Phase 1 - Resource Creation and Notification Subscription. Whenever a Client wants to submit a request, it first invokes the createResource operation of the GlobalMiner-WS to create the GlobalModel Resource (step 1). In turn, the GlobalMiner-WS dispatches this operation in a similar way: for each local site with a running LocalMiner-WS (supposing that the GlobalMiner-WS holds a list of them), it invokes the createResource operation (of the LocalMiner-WS) to create the LocalModel Resource (step 2). At this point, the Client invokes the subscribe method to subscribe a listener (inside the Client) as consumer of notifications on the GlobalModel topic (step 3). Finally, the GlobalMiner-WS invokes the subscribe method to subscribe a listener as consumer of notifications on the LocalModel topic (step 4). As an effect of these subscription steps, as soon as a resource changes, its new value is delivered to its listener.
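To fix ideas, the following Java sketch lists, as plain interfaces, the operations the two services expose across the three phases described in this section (createResource, subscribe, submitGlobalTask/submitLocalTask, destroy). It is only an illustration using the operation names of the text: parameter and return types are assumptions, and the real services are WSRF services whose stubs are generated by the Globus Toolkit, not hand-written interfaces.

```java
// Illustrative sketch only: operation names follow the description in the text;
// parameter and return types are assumptions, not the actual WSRF-generated stubs.
interface GlobalMinerService {
    String createResource();                         // step 1: create the GlobalModel Resource
    void subscribe(NotificationListener listener);   // step 3: listen on the GlobalModel topic
    void submitGlobalTask(String algorithm, String parameters); // step 5: start the distributed task
    void destroy();                                  // step 9: destroy the GlobalModel Resource
}

interface LocalMinerService {
    String createResource();                         // step 2: create the LocalModel Resource
    void subscribe(NotificationListener listener);   // step 4: listen on the LocalModel topic
    void submitLocalTask(String algorithm, String parameters, String datasetPath); // step 6
    void destroy();                                  // step 10: destroy the LocalModel Resource
}

// Callback invoked when a subscribed resource (topic) changes value.
interface NotificationListener {
    void deliver(Object newModelValue);
}
```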
Figure 2. (Architecture of the framework: the Client and its Listener interact with the GlobalMiner-WS through create, subscribe, submitGlobTask and destroy (steps 1, 3, 5, 9); the GlobalMiner-WS holds the GlobalModel Resource and the Global Algorithm Library and interacts with the LocalMiner-WSs on Nodes 1..N through create, subscribe, submitLocTask and destroy (steps 2, 4, 6, 10); each LocalMiner-WS holds a LocalModel Resource, a Local Algorithm Library and the local dataset Di.)
Phase 2 - Task Submission and Results Notification. This phase is the core of the application: the execution of the mining process and the return of the result. The Client invokes the submitGlobalTask operation (step 5). In order to execute the assigned task, the GlobalMiner-WS invokes the submitLocalTask operation (step 6). At this stage, each i-th LocalMiner-WS begins the analysis of its local dataset to infer a local model. Such tasks are executed concurrently on all the Grid sites. As soon as a local computation terminates, the value of the LocalModel Resource is set to the inferred local model. The change in this resource is automatically notified to the listener on the GlobalMiner-WS (step 7); in this way, the local model computed at the i-th local site is delivered to the global site. As soon as all the local models have been delivered to the GlobalMiner-WS, they are integrated. The GlobalMiner-WS then evaluates whether the algorithm has terminated. If it has, the global model computed by the GlobalMiner-WS is stored in the GlobalModel Resource, which is immediately delivered (through the notification mechanism) to the client (step 8). Otherwise, the GlobalMiner-WS asks for further processing, by invoking the submitLocalTask operation once more (step 6) and waiting for the delivery of the results (step 7). This pair of steps (6 and 7) is executed as many times as the GlobalMiner-WS needs, until the computation reaches a convergence condition.
Phase 3 - Resource Destruction. As final steps, as soon as the computation terminates and its results have been notified to the Client (step 8), the Client invokes the destroy operation of the GlobalMiner-WS (step 9), which eliminates the previously created GlobalModel Resource. Similarly, the GlobalMiner-WS asks for the destruction of the LocalModel Resources (step 10).
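The control flow of Phase 2 inside the GlobalMiner-WS can be summarized by the following Java sketch. This is a minimal illustration of the iterative master/worker loop (steps 5-8); the abstract methods are placeholders for the algorithm-specific code provided by the Global Algorithm Library, and the notification mechanism is abstracted as a blocking collection of the N local models.

```java
import java.util.List;

// Minimal sketch of the Phase 2 loop (steps 5-8) run by the GlobalMiner-WS.
// The abstract methods stand for the algorithm-specific code of the
// Global Algorithm Library and for the WSRF notification machinery.
abstract class GlobalMiningLoop {

    abstract void submitLocalTasks();            // step 6: invoke submitLocalTask on every LocalMiner-WS
    abstract List<Object> collectLocalModels();  // step 7: block until all N local models are notified
    abstract Object mergeLocalModels(List<Object> localModels, Object previousGlobalModel);
    abstract boolean hasConverged(Object globalModel);

    Object run() {
        Object globalModel = null;
        do {
            submitLocalTasks();                                   // step 6
            List<Object> localModels = collectLocalModels();      // step 7
            globalModel = mergeLocalModels(localModels, globalModel);
        } while (!hasConverged(globalModel));
        // step 8: the final model is stored in the GlobalModel Resource,
        // whose change is notified to the client listener
        return globalModel;
    }
}
```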
B. Unreliability Aspects
The system described in Section II-A provides no strategy to tolerate faults. In particular, crashes can occur both on the GlobalMiner-WS and on the LocalMiner-WSs, so there are two points where fault tolerance has to be handled. The first service acts both as coordinator of the system and as collector of the results delivered by the local services; for this reason, any failure occurring on it stops the whole computation. The second kind of services, acting as workers in the architecture, execute the local computations. Since the global model can be built only if all the local models are delivered, any crash on a local site (and the consequent non-delivery of its local model to the central site) prevents the GlobalMiner-WS from building a global model. In the rest of the paper we address these issues and propose a solution to deal with them.
III. RELATED WORK
Many fault-tolerance systems have been proposed in the literature. Most of them exploit the two well-known classical Primary-Backup and Active-Replication approaches. In the following, we give a brief description of these two techniques. Then, we cite some systems devoted to handling fault tolerance in Grids. The primary-backup method [20] contemplates the presence of a primary server and a certain number of backup servers. The essential idea is that, at any instant, only the primary is running and does all the work. The client sends a request to the primary, which receives and executes the requested task. If the primary fails, a cut-over from the primary to the backup is handled by a suitable protocol. At this point, the backup acts as the new primary (and a new backup should be activated). Ideally, the cut-over protocol from the primary to the backup requires no active interaction with the client, but just a notification (to the client) that the primary has failed and a newly-appointed primary is running. Similar schemes are widely used in the real world: some examples are in government (the Vice President), aviation (co-pilots), automobiles (spare tires), and diesel-powered electrical generators in hospital operating rooms.
Active replication [20], sometimes referred to as the state machine approach, is a well-known technique for providing fault tolerance using physical redundancy. The strategy contemplates the presence of a certain number of independent replicated servers (replicas). The client sends the invocation to all the replicas, which execute the computation and send back their responses. In other words, the computation is performed by several servers at the same time: if one fails, the remaining servers are executing exactly the same work and are able to finalize the computation and reply to the client. As for the primary-backup strategy, schemes similar to active replication are used in the real world: some examples are in biology (mammals have two eyes, two ears, two lungs, etc.), aircraft (747s have four engines but can fly on three), and sports (multiple referees in case one misses an event). Due to the heterogeneity of the resources and applications, there is no generic Grid protocol supporting fault tolerance mechanisms. For this reason, suitable strategies are designed for the various applications. Some approaches are reported in the following. In [22] the authors deal with the problem of building a reliable and highly-available Grid service by replicating the service on two or more hosts using the primary-backup approach. The design of a primary-backup protocol using OGSI (an early implementation of OGSA [5]) and its implementation, using the Globus Toolkit [6], to evaluate performance implications and tradeoffs is presented. Finally, three different communication mechanisms (OGSI notifications, standard Grid service requests and low-level socket primitives) are compared, concluding that the OGSI model is suitable for building highly available services and makes the task of engineering such services easier. In [21] a novel replication architecture for stateful servers is proposed that offers an integrated solution for fault tolerance and load distribution. The key idea is that, at the same time, each server replica executes client requests and serves as backup for other replicas. A load balancing mechanism and a reconfiguration algorithm, guaranteeing that each replica has the same number of backups in a dynamic environment (i.e., replicas can join or leave at any time), are presented. In [10] the Grid Workflow System (Grid-WFS), a framework for failure handling on the Grid, is proposed. The goal is to provide a high-level workflow structure that allows users to achieve failure recovery in a variety of ways, depending on the requirements and constraints of their applications. The authors identify a set of states (inactive, active, done, failed, exception) for the tasks submitted to the Grid nodes and investigate the use of heartbeats and notifications to communicate state transitions among them. The paper presented in [17] gives a preliminary description of the Grid Enactor and Management Service (GEMS), a system supporting submission, monitoring and restarts of Grid jobs. The goal of the study is to design a
system supporting fine-grained monitoring of individual job processes at the local resource manager, integrated with a centralized server acting as Grid-level job monitor. None of these systems is in the domain of data mining. The system we present here uses a service-oriented approach to fault tolerance similar to the work discussed in [22]. However, there are some differences between the two approaches: ours contemplates an ad-hoc failure detection and recovery service, it is data mining-oriented, and it exploits the notification mechanism to send heartbeats.
IV. A FAULT-TOLERANT FRAMEWORK
This section presents the extension of the framework described in Section II with the goal of making it fault-tolerant. As mentioned in Section II-B, we deal with a two-level fault tolerance strategy: the first level applies to the GlobalMiner-WS, the second to the LocalMiner-WSs. For lack of space, in this paper we focus only on the fault-tolerant GlobalMiner-WS.
A. Fault tolerance on the GlobalMiner-WS
Fault tolerance on the GlobalMiner-WS has been designed by adopting and implementing the general primary-backup strategy. In particular, three main observations justify the choice of this strategy over active replication. First, the primary-backup approach is simpler (during normal operation) because the client has to communicate with only one service (the primary) and not with a whole group of replicas. Second, in practice it requires fewer CPU resources (or machines), because at any instant only one service (the primary) is running. Third, some algorithms exploiting the framework (see [3]) assume that random operations (e.g., initialization) are performed on the GlobalMiner: this means that an active-replication strategy, which strictly requires the operations on the replicas to be deterministic [7], cannot be applied in this case. On the other hand, a drawback of the primary-backup choice is that, in case of failure, the client receives the response with an additional delay, because the cut-over protocol has to be executed. Nevertheless, since the framework is oriented to long-running tasks, the cut-over time should be a very small overhead with respect to the overall time required by the complete task execution. These aspects are analyzed in detail in the experimental section. The proposed fault-tolerant framework supposes the presence of a set of GlobalMiner-WS replicas, of which exactly one is the primary at any time. The others are named backups. The primary-backup strategy [7] contemplates the following general steps:
1) The client sends the invocation to the Primary GlobalMiner-WS.
2) The primary receives the invocation and asks for the local computations; as soon as the results of these computations are returned and the Global Model has been re-computed, the primary sends a model-update message to the backups.
Figure 3. (Fault-tolerant architecture: the Client and its Listener interact with the Primary GlobalMiner-WS (submitGlobTask, step 5; result notification, step 8), which sends heartbeats to the FDR-WSs (step A), checkpoints its GlobalModel Resource to Backup GlobalMiner-WSs 1 and 2 through updateModel (step B), and coordinates the LocalMiner-WSs on Nodes 1..N (submitLocTask) over the local datasets D1..DN.)
3) If the primary crashes during step 2, a new primary is elected among the replicas and takes over the computation.
4) Once the primary has received an acknowledgement of the state update (step 2) from all backups, the response is sent to the client.
Three delicate phases, as in any implementation of a primary-backup mechanism [22], should be analyzed in detail:
- Checkpointing, or transfer of application state. The primary periodically needs to send the changes in the Global Model (its state) to the backups; this basically consists of storing a snapshot of the current application state. In particular, consistency has to be guaranteed among the backup states, i.e., the primary can continue its work (or reply to the client) only when it is known that the backups have applied the state change.
- Failure detection. Crashes of the primary node can be detected by means of a periodic message, i.e. a heartbeat, sent to the backup side; if no message is received for a given time, this can be taken as an indication of a failure on the primary node.
- Recovery phase, or switching to a new primary. Originally, one of the service instances is designated as the primary and the others as backups. After a failure of the primary, the backups agree on a new primary, which restarts the execution from the last checkpointed state. Hereafter, all future requests are directed to and processed by it.
Now, let us describe in detail the architecture and mechanisms implemented in the proposed fault-tolerant framework. It contemplates the presence of r GlobalMiner-WSs, of which one is the Primary GlobalMiner-WS and the other r-1 are the Backup GlobalMiner-WSs. Moreover, the architecture also contemplates r-1 Failure Detection and Recovery WSs (named FDR-WSs), each one associated to a backup service. In the following we describe all the steps composing the whole process, pointing out the details of the interactions between the entities composing the architecture. To avoid any confusion for the reader, the operations are labeled with the same step numbering used in Figure 2; the new steps, implementing the fault tolerance protocol, are identified by letter labels (A, B and C).
Task Execution and Checkpointing. Figure 3 shows the system with the Primary GlobalMiner-WS and 2 Backup GlobalMiner-WSs. After some initialization steps, the client invokes the submitGlobalTask operation of the Primary GlobalMiner-WS, which takes care of executing the task (step 5). As seen in Section II, the global node asks the local nodes for some local elaborations and waits for their responses (step 6). During this time, the Primary GlobalMiner-WS periodically sends a heartbeat message to the FDR-WSs, to communicate that it is alive and correctly running (step A). As soon as all the local computations terminate and the local models are delivered to the global site, the integration of all the local models is performed and the result is stored in the GlobalModel Resource (step 7).
Figure 4. (Recovery scenario: after the Primary GlobalMiner-WS fails, an FDR-WS invokes submitGlobTask on Backup 1, which becomes the new primary; it continues to interact with the Client listener and the LocalMiner-WSs on Nodes 1..N with their datasets D1..DN, checkpoints the GlobalModel Resource to the remaining backup through updateModel, and sends heartbeats to the other FDR-WS.)
As soon as this step terminates, the new GlobalModel is sent to the backups by a synchronous method (step B): this is the checkpoint step of the fault-tolerant strategy. Now, the Primary GlobalMiner-WS can either send the final model to the client by notification (step 8) or ask for further local processing (step 6), waiting for the delivery of the results (step 7). This pair of steps (6 and 7) is executed as many times as the GlobalMiner-WS needs, until the computation reaches a convergence condition. It is worth repeating that, during the execution of these steps, the Primary GlobalMiner-WS sends a periodic heartbeat message to the FDR-WSs, signalling that it is alive.
Crash Detection and Recovery Phase. The scenario described above depicts the system behaviour when no problem occurs. Figure 4 shows the steps performed by the system whenever a crash of the Primary GlobalMiner-WS occurs. A failure is detected by the FDR-WSs when they do not receive a heartbeat message for a certain period of time. As soon as a failure has been detected, a suitable FDR-WS activates a backup as the new Primary GlobalMiner-WS. To do that, the FDR-WS invokes the submitGlobalTask operation (step C) of its associated backup (i.e., the Backup 1 GlobalMiner-WS in Figure 4), which takes control of the computation as the new primary. In particular, it restarts the computation from the last GlobalModel committed by the failed primary. If necessary, it invokes the submitLocalTask operation of the local miners to request local computations and, after a certain time, receives the results by notification (steps 6 and 7), as in the standard scenario. When the new primary has completed the computation, it sends the results to the client by notification (step 8).
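As an illustration of the failure detection logic just described, the following Java sketch shows how an FDR-WS might track heartbeats and promote its associated backup after a silence longer than a timeout. It is a minimal sketch under our own assumptions: the class name, the timeout handling and the Runnable used for the promotion are illustrative, whereas in the real framework heartbeats arrive as WS notifications and the promotion is the submitGlobalTask invocation on the backup (step C).

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

// Minimal sketch of the failure-detection logic of an FDR-WS.
class FailureDetector {
    private final AtomicLong lastHeartbeat = new AtomicLong(System.currentTimeMillis());
    private final long timeoutMillis;
    private final Runnable activateBackup; // e.g. () -> backup.submitGlobalTask(...), step C

    FailureDetector(long timeoutMillis, Runnable activateBackup) {
        this.timeoutMillis = timeoutMillis;
        this.activateBackup = activateBackup;
    }

    // Called whenever a heartbeat message arrives from the primary (step A).
    void heartbeat() {
        lastHeartbeat.set(System.currentTimeMillis());
    }

    // Periodically check whether the primary is still alive.
    void startMonitoring() {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleAtFixedRate(() -> {
            long silence = System.currentTimeMillis() - lastHeartbeat.get();
            if (silence > timeoutMillis) {
                scheduler.shutdown();
                activateBackup.run(); // promote the associated backup to primary
            }
        }, timeoutMillis, timeoutMillis, TimeUnit.MILLISECONDS);
    }
}
```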
It is worth noting that, as soon as the GlobalModel Resource is updated, its value is sent to the remaining backup services (step B). Moreover, to allow the detection of a possible failure of the new primary, it periodically sends a heartbeat message to the other FDR-WSs. Now, let us make some considerations on the protocol described above, giving some more details about it:
- Even though state updates are asynchronous events, they are sent from the primary to the backups by the synchronous updateModel operation; this guarantees that the primary has control over the reception of the updates by the backups (a sketch of this step is given after this list). In fact, for consistency reasons, it is important that the computation goes ahead only if all the backups have stored the last checkpointed global model.
- The FDR-WSs are Web services aimed at receiving heartbeats and at detecting failures. This is normally done by setting a timeout for every message: if no message arrives for too long, a crash is assumed. Moreover, the architecture contemplates this kind of service in order to separate the role of mining, performed by the GlobalMiner-WSs, from the role of crash detector and recovery starter, performed by the FDR-WSs.
- Whenever the role of primary is assigned to a backup, it restarts the computation from the last committed GlobalModel. Since it has no knowledge of the operations performed by the old primary after the last checkpoint and before the failure, the new primary proceeds as follows: if the last checkpointed global model is the final model, it is just delivered to the client; otherwise, if more processing is needed, the new primary submits to the local services the task to be performed (probably the same one submitted by the old primary before its failure).
- The protocol guarantees that any failure is hidden from the client; in fact, no client interaction is requested during the whole process (even if a failure occurs).
- The protocol does not consider the failure of a backup service before its activation as primary; the only new issues would be the detection of backup failures and the integration of new backups into the system. These steps would not interfere with the operation of the surviving system components and their implementation is not complex. Details of how to do this are outside the scope of this paper.
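The following Java sketch illustrates the synchronous checkpoint step (step B) referenced in the first item above. It assumes updateModel behaves as a blocking remote call that returns only after the backup has stored the model; the interface and class names are ours and error handling is omitted, so it is only a sketch of the consistency rule, not the actual service code.

```java
import java.util.List;

// Sketch of the checkpoint step (step B): the primary may proceed only after
// every backup has applied the new global model. updateModel is assumed to be
// a synchronous (blocking) remote invocation on a Backup GlobalMiner-WS.
interface BackupGlobalMiner {
    void updateModel(Object globalModel); // returns once the backup has stored the model
}

class Checkpointer {
    private final List<BackupGlobalMiner> backups;

    Checkpointer(List<BackupGlobalMiner> backups) {
        this.backups = backups;
    }

    // Blocks until all backups hold the last checkpointed global model, so the
    // primary can safely continue the computation or reply to the client.
    void checkpoint(Object globalModel) {
        for (BackupGlobalMiner backup : backups) {
            backup.updateModel(globalModel);
        }
    }
}
```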
Figure 5. Execution time with respect to MTTF, for the fault-tolerant and the original failure-free systems.
V. EXPERIMENTAL EVALUATION
To evaluate the performance and effectiveness of the proposed fault-tolerant architecture, we carried out several experiments considering various scenarios. We have developed the fault-tolerant system described in Section IV as an extension of the one described in Section II. The services have been developed using the Java WSRF library provided by the Globus Toolkit 4 [6]. The services have been deployed on a real Grid, each one running on a dedicated machine. We assume that failures arrive according to a Poisson process with rate λ, as is commonly assumed in the fault tolerance literature [1], [10]. The rate λ is related to the Mean Time To Failure by MTTF = 1/λ, where the Time To Failure (TTF) is a random variable representing the time between adjacent failure arrivals, governed by the well-known exponential distribution.
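As an aside, such an exponential failure model can be simulated with the standard inversion method, as the following Java sketch shows; the class and its use for injecting crash events are our own illustration, since the text does not detail how failures were generated in the experiments.

```java
import java.util.Random;

// Illustrative sketch: times to failure drawn from an exponential distribution
// with rate lambda = 1/MTTF, using the inversion formula TTF = -MTTF * ln(U),
// where U is uniform in (0,1). Names and usage are assumptions, not the paper's code.
class FailureModel {
    private final double mttf;          // mean time to failure (e.g., in hours)
    private final Random random = new Random();

    FailureModel(double mttf) {
        this.mttf = mttf;
    }

    // Sample the time until the next failure event.
    double nextTimeToFailure() {
        return -mttf * Math.log(1.0 - random.nextDouble());
    }
}
```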
We performed the experiments in a scenario where a user wants to submit a mining task to the framework in order to analyze a dataset D. If n local nodes are used, we can assume that n (horizontal) partitions {D1, ..., Dn}, each of size |D|/n, are stored on the local nodes. The client invokes the GlobalMiner-WS, which in turn contacts the n LocalMiner-WSs. According to the schema reported in Figure 3, some GlobalMiner-WS Backups and FDR-WSs are available to guarantee the reliability of the framework. Let us describe how we evaluated the effectiveness of the fault-tolerant framework and the overhead introduced by its mechanisms. As input dataset we chose CoverType1, which contains information about the forest cover type of 580K sites in the United States; each instance corresponds to a site observation and is described by 54 attributes giving information about the main features of the site (e.g., elevation, aspect, slope, etc.). The whole dataset has a size of about 72 MB. The mining algorithm we used is Distributed Expectation Maximization (as described in [3]). Since the initial choice of the model parameters and mixing probabilities influences the behaviour of the computation, identical initial centers, covariance matrices and mixing probabilities have been chosen for a fixed dataset. A first set of experiments was conducted to validate the correctness and effectiveness of the proposed approach. In particular, giving as input to the system a 36 MB subset of the CoverType dataset, we ran the same clustering task (specified above) in three different scenarios. First, we assigned the clustering task to the fault-tolerant system, where crash events occur according to the failure rate λ. Second, the same task was repeated under the hypothesis that no crash events occur (λ = 0). Third, the mining task was assigned to the original system (described in [3]), where no fault tolerance mechanisms are exploited and no failure events can occur. Various tests, varying the failure rate λ, have been executed.
1 http://archive.ics.uci.edu/ml/datasets/Covertype
Figure 6. (Plot: execution time, in seconds, of each iteration versus the iteration number, for MTTF = 100.)
Figure 7. (Plot: percentage contribution of the mining steps and of the fault tolerance steps to the total execution time, for different MTTF values.)
Figure 5 shows the total execution time with respect to varying values of MTTF (the mean time to failure, where MTTF = 1/λ). The total execution time is the time elapsed from the submission of the task by the client (to the GlobalMiner-WS) until the return (by notification) of the final result to it. From the figure, we can notice that the original system (no fault tolerance mechanisms, no failure events) takes around 139 hours to execute the whole task. On the other hand, in the absence of failures the fault-tolerant system takes around 141 hours. It is worth noticing that these two times are not affected by failures, so they are constant for each MTTF value. The overhead of the introduced fault tolerance mechanisms in this case is 2 hours, corresponding to 1.4% of the total execution time. With the exception of MTTF values from 1 to 10, corresponding to very frequent failures (1 failure every 1 to 10 hours), the total execution time settles at around 149 hours. This means that the overhead for recovering from failures is 5 hours, corresponding to 3.47% of the total time. As described in Section IV, whenever a failure occurs, the recovery mechanism reacts by giving control to a new GlobalMiner-WS, which restarts the execution by re-running the iteration that was interrupted by the failure.
Figure 6 shows the execution time of each of the 1000 iterations, when MTTF = 100. In particular, we observe that an iteration takes on average around 500 seconds. The peaks correspond to the failure events: the iterations interrupted during their execution and re-started by the new primary take a longer time than the others. Another interesting issue is to evaluate how much the overhead due to the fault tolerance mechanisms contributes to the total execution time. Figure 7 shows the contribution of the mining steps and of the fault tolerance steps to the total execution time, for the different MTTF values considered. The plot shows that the fault tolerance overhead is very low with respect to the computation time: with the exception of the lowest values of MTTF, it takes around 4% of the total execution time, while the remaining time is taken by the mining computation. A final experiment aimed at analyzing how the number of iterations required by the algorithm to complete its execution affects the total execution time. This is an important aspect, because at the end of each iteration the system makes a checkpoint of the computation status, which directly influences the total execution time. Hence, the higher the number of iterations, the longer the expected total execution time. In order to make a fair comparison, we fixed some configurations (dataset sizes, numbers of iterations) guaranteeing, in the absence of failures, similar total execution times.
Figure 8. Execution time with respect to MTTF, for mining tasks characterized by different numbers of iterations.
Table I
Dataset size   Average iteration time   No. of iterations   Total execution time
72 MB          0.278 hours              500                 138.9 hours
54 MB          0.208 hours              666                 138.9 hours
36 MB          0.139 hours              1000                138.9 hours
By using the configurations reported in Table I, we ran the same algorithm varying the total number of iterations and the input size. Figure 8 shows the execution time with respect to different values of MTTF, for each clustering task characterized by a different number of iterations. We can observe that the smaller the number of iterations, the shorter the execution time. This is due to the fact that the number of checkpoints (one at the end of each iteration, to save the current status) is reduced, and so is the total time spent on them.
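A rough way to read these results, which is our own back-of-the-envelope decomposition rather than a model given above, is to write the expected total execution time as T_total ≈ N_iter * (t_iter + t_ckpt) + N_fail * t_rec, where N_iter is the number of iterations, t_iter the average iteration time, t_ckpt the time of one checkpoint (the synchronous updateModel to the backups), t_rec the time to recover from one failure (detection timeout plus re-running the interrupted iteration), and N_fail ≈ T_total / MTTF the expected number of failures. This decomposition makes explicit the two effects observed in Figures 5 and 8: a larger number of iterations increases the checkpointing term, while a smaller MTTF increases the recovery term.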
REFERENCES
[1] A. Beguelin, E. Seligman and P. Stephan. Application level fault tolerance in heterogeneous networks of workstations. Journal of Parallel and Distributed Computing, vol. 43, no. 2, pp. 147-155, 1997.
[2] M. Cannataro and D. Talia. The Knowledge Grid. Communications of the ACM, vol. 46, no. 1, pp. 89-93, 2003.
[3] E. Cesario and D. Talia. Distributed Data Mining Models as Services on the Grid. In Proc. of the 10th International Workshop on High Performance Data Mining (HPDM 2008), in conjunction with ICDM'08, IEEE, 2008, pp. 409-495.
[4] I. Dhillon and D. Modha. A Data-Clustering Algorithm on Distributed Memory Multiprocessors. In Proc. KDD Workshop on High Performance Knowledge Discovery, 1999.
[5] I. Foster, C. Kesselman, J. Nick and S. Tuecke. The Physiology of the Grid. In Grid Computing: Making the Global Infrastructure a Reality, Berman F., Fox G. and Hey A., Eds., Wiley, 2003, pp. 217-249.
[6] The Globus Toolkit. http://www.globus.org/toolkit/ [26 July 2010].
[7] R. Guerraoui and A. Schiper. Fault-Tolerance by Replication in Distributed Systems. In Proc. of the Conference on Reliable Software Technologies, 1996, pp. 38-57.
[8] L. Hall, N. Chawla and K. W. Bowyer. Combining Decision Trees Learned in Parallel. In Working Notes of the KDD-97 Workshop on Distributed Data Mining, 1997.
[9] J. Hofer and P. Brezany. DIGIDT: Distributed Classifier Construction in the Grid Data Mining Framework GridMiner-Core. In Proc. Workshop on Data Mining and the Grid, 2004.
VI. CONCLUSIONS
Owing to the heterogeneity and complexity of Grids, executing long-running tasks in a reliable way is a challenge. In this paper a mechanism for handling machine failures in a Grid environment has been proposed. In particular, a fault-tolerant extension of the Distributed Data Mining framework for the Grid presented in [3] has been described. An experimental evaluation on a real Grid has shown that the proposed solution is scalable and introduces a low overhead with respect to the total execution time. As future work, we plan to introduce mechanisms for an efficient exploitation of all the machines (primaries and replicas), so that each machine both serves client requests and acts as backup for the other machines. Moreover, a load balancing mechanism could be used to obtain a load-aware assignment of the mining tasks.
[10] S. Hwang and C. Kesselman. GridWorkflow: A Flexible Failure Handling Framework for the Grid. In Proc. of the 12th IEEE International Symposium on High Performance Distributed Computing (HPDC), 2003.
[11] E. Johnson and H. Kargupta. Collective, Hierarchical Clustering from Distributed, Heterogeneous Data. In Large-Scale Parallel KDD Systems, Zaki M. and Ho C., Eds., Springer-Verlag, 2000, pp. 217-249.
[12] G. Kandaswamy, A. Mandal and D. A. Reed. Fault Tolerance and Recovery of Scientific Workflows on Computational Grids. In Proc. of the 2008 Eighth IEEE International Symposium on Cluster Computing and the Grid (CCGrid), 2008, pp. 777-782.
[13] A. L. Prodromidis, P. K. Chan and S. J. Stolfo. Meta-learning in Distributed Data Mining Systems: Issues and Approaches. In Advances in Distributed and Parallel Knowledge Discovery, Kargupta H. and Chan P., Eds., AAAI/MIT Press: Menlo Park, 2000, pp. 81-87.
[14] S. AlSairafi, F. S. Emmanouil, M. Ghanem, N. Giannadakis, Y. Guo, D. Kalaitzopoulos, M. Osmond, A. Rowe, J. Syed and P. Wendel. The Design of Discovery Net: Towards Open Grid Services for Knowledge Discovery. International Journal of High Performance Computing Applications, vol. 17, no. 3, pp. 297-315, 2003.
[15] N. F. Samatova, G. Ostrouchov, A. Geist and A. V. Melechko. RACHET: An Efficient Cover-Based Merging of Clustering Hierarchies from Distributed Datasets. Distributed and Parallel Databases, vol. 11, pp. 157-180, 2002.
[16] B. Schroeder and G. A. Gibson. A large-scale study of failures in high-performance computing systems. In Proc. of the International Conference on Dependable Systems and Networks, 2006.
[17] S. Tadepalli, C. J. Ribbens and S. Varadarajan. GEMS: A job management system for fault tolerant grid computing. In Proc. of the High Performance Computing Symposium, 2004.
[18] D. Talia, P. Trunfio and O. Verta. Weka4WS: A WSRF-Enabled Weka Toolkit for Distributed Data Mining on Grids. In Proc. of the 9th European Conference on Principles and Practice of Knowledge Discovery in Databases, 2005.
[19] P. N. Tan, M. Steinbach and V. Kumar. Introduction to Data Mining. Addison-Wesley, 2006.
[20] A. S. Tanenbaum and M. Van Steen. Distributed Systems: Principles and Paradigms. Prentice Hall, 2002.
[21] H. Wu and B. Kemme. A Unified Framework for Load Distribution and Fault-Tolerance of Application Servers. In Proc. of the 15th International Euro-Par Conference, Springer, 2009, pp. 179-189.
[22] X. Zhang, D. Zagorodnov, M. Hiltunen, K. Marzullo and R. D. Schlichting. Fault-tolerant Grid Services Using Primary-Backup: Feasibility and Performance. In Proc. of the 2004 IEEE International Conference on Cluster Computing, 2004, pp. 105-114.