Scheduling High Performance Data Mining Tasks On A Data Grid Environment
Dipartimento di Informatica, Università Ca' Foscari, Venezia, Italy
Istituto CNUCE, Consiglio Nazionale delle Ricerche (CNR), Pisa, Italy
Dipartimento di Informatica, Università di Pisa, Italy
Abstract. Increasingly, the datasets used for data mining are becoming huge and physically distributed. Since the distributed knowledge discovery process is both data and computation intensive, the Grid is a natural platform for deploying a high performance data mining service. The focus of this paper is on the core services of such a Grid infrastructure. In particular, we concentrate our attention on the design and implementation of specialized Resource Allocation and Execution Management services that are aware of data source locations and of the resource needs of data mining tasks. Allocation and scheduling decisions are taken on the basis of performance cost metrics and models that exploit knowledge about previous executions, and use sampling to acquire estimates of execution behavior.
1 Introduction
In recent years we have observed an explosive growth in the number and size of electronic data repositories. This gave researchers the opportunity to develop effective data mining (DM) techniques for discovering and extracting knowledge from huge amounts of information. Moreover, due to their size, and also to social or legal restrictions that may prevent analysts from gathering data in a single site, datasets are often physically distributed. If we also consider that data mining algorithms are computationally expensive, we can conclude that the Grid [5] is a natural platform for deploying a high performance service for the Parallel and Distributed Knowledge Discovery (PDKD) process. The Grid environment may in fact furnish coordinated resource sharing, collaborative processing, and high performance data mining analysis of the huge amounts of data produced and stored. Since PDKD applications are typically data intensive, one of the main requirements of such a PDKD Grid environment is the efficient management of storage and communication resources.

A significant contribution to supporting data intensive applications is currently being pursued within the Data Grid effort [2], where a data management architecture based on storage systems and metadata management services is provided. The data considered there are produced by several scientific laboratories geographically distributed among several institutions and countries. Data Grid
services are built on top of Globus [4], and simplify the task of managing computations that access distributed and large data sources. The above Data Grid framework shares a core of common requirements with the realization of a PDKD Grid, where the data involved may originate from a larger variety of sources. Even if the Data Grid project is not explicitly concerned with data mining issues, its basic services could be exploited and extended to implement higher level grid services dealing with the process of discovering knowledge from large and distributed data repositories.

Motivated by these considerations, in [10] a specialized grid infrastructure named Knowledge Grid (K-Grid) has been proposed. This architecture was designed to be compatible with lower-level grid mechanisms and also with the Data Grid ones. The authors subdivide the K-Grid architecture into two layers: the core K-Grid and the high level K-Grid services. The former layer refers to services implemented directly on top of generic grid services, the latter refers to services used to describe, develop and execute parallel and distributed knowledge discovery (PDKD) computations on the K-Grid. Moreover, the latter layer offers services to store and analyze the discovered knowledge.

In this paper we adopt the K-Grid architecture [10], and concentrate our attention on its core services, i.e. the Knowledge Directory Service (KDS) and the Resource Allocation and Execution Management (RAEM) service. The KDS extends the basic Globus Metacomputing Directory Service (MDS) [3], and is responsible for maintaining a description of all the data and tools used in the K-Grid. The metadata managed by the KDS are represented as XML documents stored in the Knowledge Metadata Repository (KMR). Metadata regard the following kinds of objects: data source characteristics, data management tools, data mining tools, mined data, and data visualization tools. The metadata representation for output mined data models may also adopt the Predictive Model Markup Language (PMML) [6] standard, which provides the XML specification for several kinds of data mining sources, models, and tools, also granting interoperability among different PMML compliant tools.

The RAEM service provides a specialized broker of Grid resources for PDKD computations: given a user request for performing a DM analysis, the broker takes allocation and scheduling decisions, and builds the execution plan, establishing the sequence of actions that have to be performed in order to prepare execution (e.g., resource allocation, data and code deployment), actually execute the task, and return the results to the user. The execution plan has to satisfy given requirements (such as performance, response time, and mining algorithm) and constraints (such as data locations, available computing power, storage size, memory, network bandwidth and latency). Once the execution plan is built, it is passed to the Grid Resource Management service for execution. Clearly, many different execution plans can be devised, and the RAEM service has to choose the one which maximizes or minimizes some metrics of interest (e.g. throughput, average service time).

In this paper we analyze some of the issues encountered in the design and implementation of an allocation and scheduling strategy for the RAEM service,
i.e. for the broker of the K-Grid architecture. In its decision making process, this service has to exploit a composite performance model which considers the actual status of the Grid, the location of data sources, and the task execution behavior. The broker needs quite detailed knowledge about computation and communication costs to evaluate the profitability of alternative mappings and the related dataset transfers/partitionings. For example, the broker could evaluate when it is profitable to launch a given expensive mining analysis in parallel. Unfortunately, the performance costs of many DM tools depend not only on the size of the data, but also on the specific mining parameters provided by the user. Consider for example Association Rule Mining (ARM) analysis: its complexity depends not only on the size of the input dataset, but also on the user-provided support and confidence thresholds. Moreover, the correlations between the items present in the various transactions of a dataset largely influence the number and the maximal length of the rules found by an ARM tool. Therefore, it becomes difficult to predict in advance either the computational and input/output costs, or the size of the output data.

In order to deal with these issues, we propose to include in the KDS service dynamic information about the performance of the various DM tools over specific data sources. This information can be added as additional metadata associated with datasets, and collected by monitoring previous runs of the various software components on the specific datasets. Unfortunately these metadata may not be available when, for example, a dataset is analyzed for the first time. In the absence of knowledge about costs, the Grid RAEM service would make blind allocation and scheduling decisions. To overcome this problem, we suggest exploiting sampling as a method to acquire preventive knowledge about the rough execution costs of specific, possibly expensive, DM jobs. Sampling has also been suggested as an efficient and effective approach to speed up the data mining process, since in some cases it may be possible to extract accurate knowledge from a sample of a huge dataset [11]. Unfortunately, the accuracy of the mined knowledge depends on the size of the sample in a non-linear way, and determining how much data has to be used is not possible a priori, thus making the approach impractical. In this paper we investigate an alternative use of sampling: in order to forecast the actual execution cost of a given DM algorithm on the whole dataset, we run the same algorithm on a small sample of the dataset. Many DM algorithms exhibit good scalability with respect to the size of the processed dataset, thus making our performance estimate feasible and accurate enough. Moreover, even if a wrong estimate is made, this can only affect the optimal use of the Grid, and not the results of the final DM analysis to be performed on the whole dataset. Besides execution costs, with sampling we can also estimate the size of the mined results, as well as predict the amount of I/O and main memory required. These costs will then feed specific performance models used by the K-Grid scheduler in order to forecast communication overheads, the effect of resource sharing, and the possible gain deriving from parallelism exploitation.

The paper is organized as follows. Section 2 introduces our K-Grid scheduler and presents the cost model on which it is based. In Section 3 we discuss the
methodology used to predict performance by sampling. Section 4 discusses our mapper and the related simulation framework, and reports some preliminary results. Finally, Section 5 draws some conclusions and outlines future work.
2 The K-Grid Scheduler

In general a Grid broker should perform the following actions: (1) discover a number of resources that fit the minimum requirements for the execution; (2) verify permissions for submitting a job on these resources; (3) select the resources that best match the application performance requirements, and schedule the job. Existing brokers for Grids fit more or less the model above. The main differences regard their organization (e.g., centralized, hierarchical, distributed) and the scheduling policy (e.g., we may optimize system throughput or application completion time). Moreover, scheduling algorithms may consider the state of the system as unchanged in the future, i.e. only depending on the decisions taken by the scheduler, or may try to predict possible changes in the system state. In both cases, it is important to know in advance information about task durations under several resource availability constraints and possible resource sharing.

The algorithms used to schedule jobs may be classified as dynamic or static [9]. Dynamic scheduling may be on-line, i.e. when a task is assigned to a machine as soon as it arrives, or batch, i.e. when the arriving tasks are collected into a set that is examined for mapping at pre-scheduled times. On the other hand, static approaches, which exploit very expensive mapping strategies, are usually adopted to map long-running applications. Due to the characteristics of DM jobs, which are often interactive, we believe that the best scheduling policy to be used in the design of a K-Grid scheduler is a dynamic one. In this preliminary study, we thus evaluate the feasibility and benefits of adopting a centralized local on-line scheduler for a Grid organization which includes several clusters connected to the Grid, and may be shared by several Virtual Organizations (VO). This local scheduler will be part of a more complex hierarchical superscheduler for our K-Grid. The only performance measure considered in this work is the completion time of DM jobs, used in order to optimize system throughput, while the constraints regard data access, computing power, memory size, and network bandwidth.

2.1 Task scheduling issues
Before sketching the dynamic scheduling algorithm used for mapping DM jobs, we introduce the issues encountered in scheduling this kind of computations and the simple cost model proposed. Most of the terminology used and the performance model adopted have been inspired by [7,9]. Several decisions have to be taken in order to map and schedule the execution of a given (data mining) task on the Grid. First consider that a DM task $t_i$ is completely defined in terms of the DM analysis requested, the dataset $D_i$ (of size $|D_i|$) to analyze, and the user parameters $u_i$ that specify and affect the analysis
behavior. Let $\kappa_i(D_i)$ be the knowledge model extracted by $t_i$, where $|\kappa_i(D_i)|$ is its size. In general the knowledge model extracted has to be returned to a given site, where further analysis or visualization must be performed. Before discussing in detail the mapping algorithm and the simulation environment, let us make the following assumptions:

- A centralized local scheduler controls the mapping of DM tasks onto a small Grid organization, which is composed of a set $M = \{m_1, \ldots, m_{|M|}\}$ of $|M|$ machines, where a known performance factor $p_j$ is associated with each machine $m_j$. This performance factor measures the relative speed of the various machines. Since in this paper we do not consider node multitasking, we do not take into account possible external machine loads that could affect these performance factors. Moreover, the machines are organized as a set of clusters $CL = \{cl_1, \ldots, cl_{|CL|}\}$, where each cluster $cl_J$ comprises a disjoint set of machines in $M$ interconnected by a high-speed network. In particular, $cl_J = \{m^J_1, \ldots, m^J_{|cl_J|}\}$. Each cluster $cl_J$ is thus a candidate for hosting a parallel implementation of a given DM analysis. The performance factor of a cluster $cl_J$ is $p_J$, which is equal to the factor of the slowest machine in the cluster.

- The code (sequential or parallel) that implements each DM tool is considered to be available at each Grid site. So the mapping issues, i.e. the evaluation of the benefits deriving from the assignment of a task to a given machine, only concern the communication times needed to move input/output data, and also the ready times of machines and communication links.

- On the basis of sampling or historical data we assume that it is possible to estimate $e_i$, defined as the base (normalized) sequential computational cost of task $t_i$, when executed on dataset $D_i$ with user parameters $u_i$. Let $e_{ij} = p_j \cdot e_i$ be the execution time of $t_i$ on machine $m_j$. When an analysis is performed in parallel on a cluster $cl_J$, we assume that, in the absence of load imbalance, task $t_i$ can be executed in parallel with a quasi perfect speedup. In particular, let $e_{iJ}$ be the execution time of task $t_i$ on a cluster $cl_J$, defined as $e_{iJ} = \max_{m^J_t \in cl_J}(e_{it}/|cl_J|) + ovh = \max_{m^J_t \in cl_J}((p_t \cdot e_i)/|cl_J|) + ovh$. The term $ovh$ models the overhead due to the parallelization and the heterogeneity of the cluster. Consider that when a cluster is homogeneous and $e_i$ is large enough, $ovh$ is usually very small.

- A dataset $D_i$ may be centralized, i.e. stored in a single site, or distributed. In the following we will not consider the inherent distribution of datasets, even if we could easily add such a constraint to our framework. So we only assume that a dataset is moved when it is advantageous for reducing the completion time of a job. In particular, a centralized dataset stored in site $h$ can be moved to another site $j$, with a cost that depends on the average network bandwidth $b_{hj}$ between the two sites. For example, $D_i$ can be transferred with a cost of $|D_i|/b_{hj}$. Moving datasets between sites has to be carried out by the replica manager of the lower Grid services, which is also responsible for the coherence of copies. Future accesses to a dataset may take advantage of the existence of
different copies disseminated over the Grid. So, when a task $t_i$ must be mapped, we have to consider that, for each machine, we have to choose the most advantageous copy of a dataset to be moved or accessed.

2.2 Cost model
In the following cost model we assume that each input dataset is initially stored on at least one machine $m_h$, while the knowledge model extracted must be moved to a machine $m_k$. Due to decisions taken by the scheduler, datasets may be replicated onto other machines, or partitioned among the machines composing a cluster.

Sequential execution. Dataset $D_i$ is stored on a single machine $m_h$. Task $t_i$ is sequentially executed on machine $m_j$, and its execution time is $e_{ij}$. The knowledge model extracted, $\kappa_i(D_i)$, must be returned to machine $m_k$. We have to consider the communications needed to move $D_i$ from $m_h$ to $m_j$, and those needed to move the results to $m_k$. Of course, the relative communication costs involved in dataset movements are zeroed if either $h = j$ or $j = k$. The total execution time is thus:

$$E_{ij} = \frac{|D_i|}{b_{hj}} + e_{ij} + \frac{|\kappa_i(D_i)|}{b_{jk}}$$
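To make the formula concrete, here is a minimal sketch in Python of the sequential estimate $E_{ij}$, under the assumptions stated above. All names (the bandwidth table bw, the argument names, the units) are illustrative placeholders of ours, not part of the K-Grid implementation.

```python
def sequential_cost(size_d, size_k, e_i, p_j, h, j, k, bw):
    """E_ij = |D_i|/b_hj + e_ij + |kappa_i(D_i)|/b_jk.

    size_d  -- size of the input dataset D_i (e.g., in MB)
    size_k  -- estimated size of the mined model kappa_i(D_i)
    e_i     -- normalized sequential cost of task t_i
    p_j     -- performance factor of machine m_j, so e_ij = p_j * e_i
    h, j, k -- machines holding D_i, executing t_i, receiving the results
    bw      -- bw[a][b]: average bandwidth between machines a and b
    """
    move_in = 0.0 if h == j else size_d / bw[h][j]   # zeroed when h == j
    move_out = 0.0 if j == k else size_k / bw[j][k]  # zeroed when j == k
    return move_in + p_j * e_i + move_out
```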
Parallel execution. Task $t_i$ is executed in parallel on a cluster $cl_J$, with an execution time of $e_{iJ}$. In general, we also have to consider the communications needed to move and partition $D_i$ from machine $m_h$ to the machines of cluster $cl_J$, and to return the results $\kappa_i(D_i)$ to machine $m_k$. Of course, the relative communication costs are zeroed if the dataset is already distributed and allocated on the machines of $cl_J$. The total execution time is thus:

$$E_{iJ} = \max_{m^J_t \in cl_J} \frac{|D_i|/|cl_J|}{b_{ht}} + e_{iJ} + \max_{m^J_t \in cl_J} \frac{|\kappa_i(D_i)|}{b_{tk}}$$
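The parallel estimate $E_{iJ}$ can be sketched along the same lines. Again this only illustrates the formula above: the representation of a cluster as a list of (machine, performance factor) pairs and the ovh default are our own assumptions.

```python
def parallel_cost(size_d, size_k, e_i, cluster, h, k, bw, ovh=0.0):
    """E_iJ: scatter one partition of D_i to each machine of cl_J,
    compute with quasi-perfect speedup, gather kappa_i(D_i) on m_k.

    cluster -- list of (machine_id, p_t) pairs forming cl_J
    """
    n = len(cluster)
    part = size_d / n  # |D_i| / |cl_J|, one partition per machine
    scatter = max(0.0 if h == t else part / bw[h][t] for t, _ in cluster)
    compute = max(p_t * e_i / n for _, p_t in cluster) + ovh
    gather = max(0.0 if t == k else size_k / bw[t][k] for t, _ in cluster)
    return scatter + compute + gather
```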
Finally, consider that the parallel algorithm we are considering requires co-allocation and co-scheduling of all the machines of the cluster. A different performance model should be used if we adopted a more asynchronous distributed DM algorithm, where first independent computations are performed on distinct dataset partitions, and then the various results of the distributed mining analyses are collected and combined to obtain the final results.

Performance metrics. To optimize scheduling, our batch mapper has to forecast the completion time of tasks. To this end, the mapper also has to consider the tasks that were previously scheduled, and that are still queued or running. Therefore, in the following we analyze the actual completion time of a task for the sequential case. A similar analysis could be done for the parallel case. Let $C_{ij}$ be the wall-clock time at which all communications and the sequential computation involved in the execution of $t_i$ on machine $m_j$ complete. To derive $C_{ij}$ we need to define the starting times of communications and computation on the basis of the ready times of interconnection links and machines. Let $s_{hj}$ be the starting time of the communication needed to move $D_i$ from $m_h$ to $m_j$, $s_j$ the starting time of the sequential execution of task $t_i$ on $m_j$, and, finally, $s_{jk}$ the starting time of the communication needed to move $\kappa_i(D_i)$ from $m_j$ to $m_k$. From the above definitions:

$$C_{ij} = \left(s_{hj} + \frac{|D_i|}{b_{hj}}\right) + \delta_1 + e_{ij} + \delta_2 + \frac{|\kappa_i(D_i)|}{b_{jk}} = s_{hj} + E_{ij} + \delta_1 + \delta_2$$

where $\delta_1 = s_j - (s_{hj} + |D_i|/b_{hj}) \geq 0$ and $\delta_2 = s_{jk} - (s_j + e_{ij}) \geq 0$.

If $m_j$ is the specific machine chosen by our scheduling algorithm for executing a task $t_i$, where $T$ is the set of all the tasks to be scheduled, we define $C_i = C_{ij}$. The makespan of the complete schedule is thus defined as $\max_{t_i \in T}(C_i)$, and its minimization roughly corresponds to the maximization of the system throughput.
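The completion time $C_{ij}$ chains the two transfers and the computation on the ready times of the links and of the machine, as in the formula above. The following sketch assumes, for illustration only, that ready times are kept in plain dictionaries.

```python
def completion_time(size_d, size_k, e_i, p_j, h, j, k, bw,
                    link_ready, machine_ready):
    """C_ij = s_hj + E_ij + delta_1 + delta_2 (see the formula above).

    link_ready[(a, b)] -- earliest time the link a -> b is free
    machine_ready[j]   -- earliest time machine m_j is free
    """
    s_hj = link_ready[(h, j)]
    t_in = s_hj + (0.0 if h == j else size_d / bw[h][j])  # input arrives
    s_j = max(t_in, machine_ready[j])       # delta_1 = s_j - t_in >= 0
    t_run = s_j + p_j * e_i                 # sequential execution e_ij
    s_jk = max(t_run, link_ready[(j, k)])   # delta_2 = s_jk - t_run >= 0
    return s_jk + (0.0 if j == k else size_k / bw[j][k])

def makespan(completion_times):
    """Makespan of a schedule: the maximum C_i over all tasks t_i."""
    return max(completion_times)
```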
3 Predicting Performance by Sampling

Before discussing our mapping strategy based on the cost model outlined in the previous Section, we want to discuss the feasibility of sampling as a method to predict the performance of a given DM analysis. The rationale of our approach is that, since DM tasks may be very expensive, it may be more profitable to spend a small additional time sampling their execution in order to estimate performance and schedule tasks more accurately, than to adopt a blind scheduling strategy. For example, if a task is guessed to be expensive, it may be profitable to move data in order to execute the task on a remote machine characterized by an early ready time, or to distribute data on a cluster in order to perform the task in parallel. Differently from [11], we are not interested in the accuracy of the knowledge extracted from a sampled dataset, but only in an approximate performance prediction of the task. To this end, it becomes important to study and analyze the memory requirements and completion times of a DM algorithm as a function of the size of the sample exploited, i.e. to study the scalability of the algorithm. From this scalability study we expect to derive, for each algorithm, functions that, given the measures obtained with sampling, return the predicted execution time and memory requirements for running the same analysis on the whole dataset.
[Figure 1: two plots of total time (sec) as a function of the sample rate (0.1-1). Panel (a): DCP, for minimum supports s = 0.5%, 1%, 2%. Panel (b): k-means, for file sizes 128 MB, 256 MB, 384 MB.]
Figure 1. Execution time of the DCP ARM algorithm (a), and of the k-means clustering algorithm (b), as a function of the sample rate of the input dataset.
Suppose that a given task $t_i$ is first executed on a sample $\hat{D}_i$ of dataset $D_i$ on machine $m_j$. Let $\hat{e}_{ij}$ be this execution time, and let $\hat{e}_i = \hat{e}_{ij}/p_j$ be the normalized execution time on the sample. Sampling is feasible as a method to predict the performance of task $t_i$ iff, on the basis of the results of sampling, we can derive a cost function $F()$ such that $e_i = F(|D_i|)$. In particular, the coefficients of $F()$ must be derived on the basis of the sampled execution, i.e., in terms of $\hat{e}_i$, $\hat{D}_i$, and $|D_i|$. The simplest case is when the algorithm scales linearly, so that $F()$ is a linear function of the size of the dataset, i.e. $e_i = \alpha \cdot |D_i|$, where $\alpha = \hat{e}_i / |\hat{D}_i|$.

We analyzed two DM algorithms: DCP, an ARM algorithm which exploits out-of-core techniques to enhance scalability [8], and k-means [1], the popular clustering algorithm. We ran DCP and k-means on synthetic datasets by varying the size of the sample considered. The results of the experiments are promising: both DCP and k-means exhibit quasi linear scalability with respect to the size of the sample of a given dataset, when user parameters are fixed. Figure 1.(a) reports the DCP completion times on a dataset of medium size (about 40 MB) as a function of the size of the sample, for different user parameters (namely the minimum support s% of frequent itemsets). Similarly, in Figure 1.(b) the completion time of k-means is reported for different datasets, but for identical user parameters (i.e., the number k of clusters to look for). The results obtained for other datasets and other user parameters are similar, and are not reported here for the sake of brevity. Note that the slopes of the various linear curves depend on both the specific user parameters and the features of the input dataset $D_i$. Therefore, given a dataset and the parameters for executing one of these DM algorithms, the slope of each curve can be captured by running the same algorithm on a smaller sampled dataset $\hat{D}_i$.

For other algorithms, the scalability curves may be more complex than a simple linear one, for example when the dataset size has a strong impact on the in-core or out-of-core behavior of an algorithm, or on its main memory occupation. So, in order to derive an accurate performance model for a given algorithm, it may be important to perform an off-line training of the model, for different dataset characteristics and different parameter sets. Another problem that may occur with some DM algorithms is the generation of false patterns for small sampling sizes. In fact, in accordance with [11], we found that the performance estimation for very small sampling sizes may overestimate the actual execution times on the complete datasets. An open question is to understand the impact of this overestimation in our Grid scheduling environment.
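For a linearly scaling algorithm, the extrapolation step reduces to a few lines. The sketch below assumes the linear scaling established above for DCP and k-means; the function names and the numbers in the example are ours, for illustration only.

```python
def fit_alpha(e_hat_ij, p_j, sample_size):
    """Derive alpha = e_hat_i / |D_hat_i| from one sampled run,
    after normalizing away the speed of the machine used."""
    e_hat_i = e_hat_ij / p_j
    return e_hat_i / sample_size

def predict_cost(alpha, full_size):
    """Predicted normalized cost on the whole dataset: e_i = alpha * |D_i|."""
    return alpha * full_size

# Example: a 4 MB sample of a 40 MB dataset mined in 30 s on a machine
# with p_j = 1 suggests e_i of roughly 300 s on the whole dataset.
alpha = fit_alpha(30.0, 1.0, 4.0)
print(predict_cost(alpha, 40.0))  # -> 300.0
```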
4 The MCT Mapper

We analyzed the effectiveness of a centralized on-line mapper based on the MCT (Minimum Completion Time) heuristic [7,9], which schedules DM tasks on a small organization of a K-Grid. The mapper does not consider node multitasking, is responsible for scheduling both the dataset transfers and the computations involved in the execution of a given task $t_i$, and is informed about their completion. The MCT mapping heuristic adopted is very simple. Each time a task $t_i$ is submitted, the mapper evaluates the expected ready time of each machine and of the communication links.
[Figure 2: four Gantt charts of host busy times (machine i of cluster j labeled i[j]) versus time units, comparing the blind heuristic (panels a, c) with the MCT+sampling heuristic (panels b, d) when 10% and 60% of the tasks are expensive.]
Figure 2. Gantt charts showing the busy times (in time units of 100 sec.) of our six machines when either 10% (a,b) or 60% (c,d) of the tasks are expensive: (a,c) blind scheduling heuristic, (b,d) MCT+sampling scheduling heuristic.
The expected ready time is an estimate of the ready time, i.e. the earliest time at which a given resource will be ready after the completion of the jobs previously assigned to it. On the basis of the expected ready times, our mapper evaluates all possible assignments of $t_i$, and chooses the one that minimizes the completion time of the task. Note that such an estimate is based on both the estimated and the actual execution times of all the tasks that have been assigned to the resource in the past. To update resource ready times, when the data transfers or computations involved in the execution of $t_i$ complete, a report is sent to the mapper.

Note that any MCT mapper can take correct scheduling decisions only if the expected execution time of a task is known. When no performance prediction is available for $t_i$, our mapper first generates and schedules $\hat{t}_i$, i.e. the task $t_i$ executed on the sampled dataset $\hat{D}_i$. Unfortunately, the expected execution time of the sampled task $\hat{t}_i$ is also unknown, so the mapper has to assume that it is equal to a given small constant. Since our MCT mapper is not able to optimize the assignment of $\hat{t}_i$, it simply assigns $\hat{t}_i$ to the machine that hosts the corresponding input dataset, so that no data transfers are involved in its execution. When $\hat{t}_i$ completes, the mapper is informed about its execution time. On the basis of this knowledge, it can predict the performance of the actual task $t_i$, and optimize its subsequent mapping and scheduling.
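A minimal sketch of this decision step follows. The task object with a cost attribute (None until the sampled run has completed), its dataset_host field, and the completion_time_of callback wrapping the cost model of Section 2.2 together with the current ready times are all illustrative assumptions of ours, not the paper's API.

```python
def mct_step(task, machines, clusters, completion_time_of):
    """One on-line MCT decision for a submitted task t_i."""
    if task.cost is None:
        # No prediction available yet: schedule the sampled task t_hat_i
        # (assumed to cost a small constant) on the machine that already
        # hosts D_i, so that no data transfers are involved.
        return ("sample", task.dataset_host)
    # Prediction available: evaluate every sequential and every parallel
    # assignment, and take the one with minimum expected completion time.
    candidates = [("seq", m) for m in machines] + \
                 [("par", cl) for cl in clusters]
    return min(candidates, key=lambda c: completion_time_of(task, c))
```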
4.1 Simulation framework
We designed a simulation framework to evaluate our MCT on-line scheduler, which exploits sampling as a technique for performance prediction. We compared our MCT+sampling strategy with a blind mapping strategy. Since the blind strategy is unaware of actual execution costs, it can only try to minimize data transfer costs, and thus always maps each task on the machine that holds the corresponding input dataset. Moreover, it cannot evaluate the profitability of parallel execution, so that sequential implementations are always preferred.

The simulated environment is similar to an actual Grid environment we have at our disposal, and is composed of two clusters of three machines each. Each cluster is interconnected by a switched fast Ethernet, while a slow WAN interconnection exists between the two clusters. The two clusters are homogeneous, but the machines of one cluster are two times faster than the machines of the other one. To fix the simulation parameters, we actually measured the average bandwidths $b_{WAN}$ and $b_{LAN}$ of the WAN and LAN interconnections, respectively. Unfortunately, the WAN interconnection is characterized by a long latency, so that, due to the TCP default window size, single connections are not able to saturate the actual available bandwidth. This effect is exacerbated by some packet losses, which make retransmissions necessary and prevent the TCP pipeline from being filled. Under these hypotheses, we can open a limited number of concurrent sockets, each characterized by a similar average bandwidth $b_{WAN}$ (100 KB/s).

We assumed that the DM tasks to be scheduled arrive in a burst, according to an exponential distribution. They have random execution costs, but x% of them correspond to expensive tasks (1000 sec. as mean sequential execution time on the slowest machine), while the remaining (100 - x)% are cheap tasks (50 sec. as mean sequential execution time on the slowest machine). The datasets $D_i$ are all of medium size (50 MB), and are randomly located on the machines belonging to the two clusters.
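As an illustration of these workload parameters, the following sketch generates such a burst of tasks. The use of an exponential distribution also for the individual execution costs is our assumption; the paper only fixes the means.

```python
import random

def make_workload(n_tasks, heavy_fraction, arrival_rate=1.0, seed=0):
    """Burst of tasks with exponential inter-arrival times; a fraction
    heavy_fraction are expensive (mean 1000 s sequential cost on the
    slowest machine), the rest cheap (mean 50 s); datasets are 50 MB."""
    rng = random.Random(seed)
    clock, tasks = 0.0, []
    for i in range(n_tasks):
        clock += rng.expovariate(arrival_rate)  # exponential arrivals
        mean = 1000.0 if rng.random() < heavy_fraction else 50.0
        tasks.append({"id": i, "arrival": clock,
                      "cost": rng.expovariate(1.0 / mean),  # assumed law
                      "size_mb": 50.0})
    return tasks
```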
Figure 3. Comparison of the makespans observed for different percentages of expensive tasks, when either the blind heuristic or our MCT+sampling one is adopted.
In these first simulation tests, we essentially checked the feasibility of our approach. Our goal was thus to evaluate the mapping quality, in terms of makespan, of an optimal on-line MCT+sampling technique. This mapper is optimal because it is supposed to also know in advance (through an oracle) the exact costs of the sampled tasks. In this way, we can evaluate the maximal improvement of our technique over the blind scheduling one.

Figure 2 illustrates two pairs of Gantt charts, which show the busy times of the six machines of our Grid testbed when tasks of different weights are submitted. In particular, each pair of charts refers to two simulations, in which either the blind or the MCT+sampling strategy is adopted. Machine i of cluster j is indicated with the label i[j]. Note that when the blind scheduling strategy is adopted, since cluster 0 is slower than the other and no datasets are moved, the makespan on the slower machines turns out to be higher. Note also that our MCT+sampling strategy considerably outperforms the blind one, although it introduces higher computational costs due to the sampling process. Finally, Figure 3 shows the improvements in makespan obtained by our technique over the blind one when the percentage of heavy tasks is varied.
5 Conclusions and Future Work

In this paper we have discussed an on-line MCT heuristic strategy for scheduling high performance DM tasks onto a local organization of a Knowledge Grid. Scheduling decisions are taken on the basis of cost metrics and models based on information collected during previous executions, and use sampling to forecast execution costs. We have also reported the results of some preliminary simulations showing the improvements in the makespan (system throughput) of our strategy over a blind one. Our mapping and scheduling techniques might be adopted by a centralized on-line mapper, which is part of a more complex hierarchical Grid superscheduler, where the higher levels of the superscheduler might be responsible for taking rough scheduling decisions over multiple administrative organizations, e.g., by simply balancing the load among them by only considering aggregate queue lengths and computational power. The higher levels of a superscheduler, in fact, do not own the resources involved, may have outdated information about the load on these resources, and may be unable to exert any control over the tasks currently running on those domains.

The on-line mapper we have discussed does not permit node multitasking, and schedules tasks in batch. In future work we plan to consider these features as well; e.g., the mapper could choose to concurrently execute a compute-bound and an I/O-bound task on the same machine. Finally, a possible drawback of our technique is the additional cost of sampling, even if it is worth considering that sampling has already been recognized as a feasible optimization technique in other fields, such as the optimization of SQL queries. Of course, the knowledge models extracted by the sampled tasks could in some cases be of interest for the users, who might decide on the basis of the sampling results to abort or continue the execution on the whole dataset. On the other hand, since the results obtained with sampling actually represent a partial knowledge model extracted from a partition of the dataset, we could avoid discarding these partial results. For example, we might exploit a different DM
algorithm, also suitable for distributed environments, where independent DM analyses are performed on different dataset partitions, and the partial results are then merged. According to this approach, the knowledge extracted from the sample $\hat{D}_i$ might be retained, and subsequently merged with the one obtained by executing the task on the rest of the input dataset, $D_i \setminus \hat{D}_i$.
References
1. R. Baraglia, D. Laforenza, S. Orlando, P. Palmerini, and R. Perego. Implementation issues in the design of I/O intensive data mining applications on clusters of workstations. In Proc. of the 3rd Workshop on High Performance Data Mining, Cancun, Mexico. Springer-Verlag, 2000.
2. A. Chervenak, I. Foster, C. Kesselman, C. Salisbury, and S. Tuecke. The Data Grid: towards an architecture for the distributed management and analysis of large scientific datasets. J. of Network and Comp. Appl., (23):187-200, 2001.
3. S. Fitzgerald, I. Foster, C. Kesselman, G. von Laszewski, W. Smith, and S. Tuecke. A directory service for configuring high-performance distributed computations. In Proc. 6th IEEE Symp. on High Performance Distributed Computing, pages 365-375. IEEE Computer Society Press, 1997.
4. I. Foster and C. Kesselman. Globus: A metacomputing infrastructure toolkit. Int'l J. of Supercomputer Applications, 11(2):115-128, 1997.
5. I. Foster and C. Kesselman, editors. The Grid: Blueprint for a Future Computing Infrastructure. Morgan Kaufmann, 1999.
6. The Data Mining Group. PMML 2.0. http://www.dmg.org/pmmlspecs_v2/pmml_v2_0.html.
7. M. Maheswaran, S. Ali, H. J. Siegel, D. Hensgen, and R. F. Freund. Dynamic matching and scheduling of a class of independent tasks onto heterogeneous computing systems. In Proc. of the 8th Heterogeneous Computing Workshop (HCW), 1999.
8. S. Orlando, P. Palmerini, and R. Perego. Enhancing the Apriori Algorithm for Frequent Set Counting. In Proc. of the 3rd Int. Conf. DaWaK 01, Munich, Germany. LNCS, Springer-Verlag, 2001.
9. H. J. Siegel and S. Ali. Techniques for Mapping Tasks to Machines in Heterogeneous Computing Systems. Journal of Systems Architecture, (46):627-639, 2000.
10. D. Talia and M. Cannataro. Knowledge Grid: an architecture for distributed knowledge discovery. Comm. of the ACM, 2002. To appear.
11. M. J. Zaki, S. Parthasarathy, W. Li, and M. Ogihara. Evaluation of sampling for data mining of association rules. In Proc. of the 7th Int. Workshop on Research Issues in Data Eng., pages 42-50, 1997.