Grid implementation of the Apriori algorithm

Cristian Aflori 1, Mitica Craus *
Technical University "Gh. Asachi", Department of Computer Science and Engineering, 53 A, Dimitrie Mangeron Street, 700050 Iasi, Romania
Received 26 October 2005; accepted 8 August 2006; available online 19 October 2006

Abstract

The paper presents the implementation of an association rules discovery data mining task using Grid technologies. For the mining task we use the Apriori algorithm on top of the Globus toolkit. The case study presents the design and integration of the data mining algorithm with the Globus services. The paper compares the Grid version with related work in the field and outlines conclusions and future work. © 2006 Elsevier Ltd. All rights reserved.

Keywords: Grid technologies; Data mining; Association rules; Apriori algorithm; Globus toolkit

1. Introduction

Data Mining (DM), or Knowledge Discovery in Databases (KDD) [1], is an interdisciplinary field with major impact in scientific and commercial environments. Data Mining is the iterative and interactive process of discovering valid, novel, useful, and understandable patterns or models in massive databases: searching for valuable information in large volumes of data, using exploration and analysis by automatic or semi-automatic means, in order to discover meaningful patterns and rules. The major data mining tasks [2] are prediction and description. Prediction methods use some variables to predict unknown or future values of other variables; these include classification, regression, and deviation detection. Description methods find human-interpretable patterns that describe the data; these include clustering, association rule discovery and sequential pattern discovery. KDD consists of an iterative sequence of the following steps: data selection, data cleaning, data transformation, pattern generation, validation and visualisation.
Association rule induction is a powerful method for finding regularities in data [3]. Inducing association rules means finding sets of data instances that frequently appear together; such information is usually expressed in the form of rules. An association rule expresses an association between items or sets of items. However, only those association rules that are expressive and reliable are useful. The standard measures used to assess association rules are the support and the confidence of a rule; both are computed from the support of certain itemsets.

2. Problem formulation

The problem of mining association rules was introduced in 1993 by Agrawal [3]. Notations and definitions:

• I = {i1, i2, ..., im} is the set of items;
• D is the set of transactions; each transaction t is a subset of I;
• X is a set of items from I; a transaction t contains X if X ⊆ t;
• an association rule is an implication X → Y, where X ⊂ I, Y ⊂ I, X ∩ Y = ∅;
• the rule X → Y has confidence c if c% of the transactions in D that contain X also contain Y;
• the rule X → Y has support s if s% of the transactions in D contain X ∪ Y.

0965-9978/$ - see front matter © 2006 Elsevier Ltd. All rights reserved. doi:10.1016/j.advengsoft.2006.08.011
* Corresponding author. Tel.: +40 742 024117; fax: +40 232 232430. E-mail addresses: caflori@cs.tuiasi.ro (C. Aflori), craus@cs.tuiasi.ro (M. Craus).
1 Tel.: +40 741 166401; fax: +40 232 232430.
Advances in Engineering Software 38 (2007) 295–300, www.elsevier.com/locate/advengsoft
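The two measures can be computed directly from their definitions. The following toy example (the transaction data here is illustrative, not from the paper) shows support and confidence for a candidate rule X → Y:

```python
# Toy transaction database D: each transaction is a set of items.
D = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

def support(itemset, D):
    """Fraction of transactions in D that contain every item of `itemset`."""
    return sum(itemset <= t for t in D) / len(D)

def confidence(X, Y, D):
    """support(X ∪ Y) / support(X): how often Y occurs when X occurs."""
    return support(X | Y, D) / support(X, D)

# Rule {bread} -> {milk}: support 2/4 = 0.5, confidence (2/4)/(3/4) = 2/3.
print(support({"bread", "milk"}, D))
print(confidence({"bread"}, {"milk"}, D))
```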
Problem statement: given a set of transactions D, generate all association rules (or a specified number of them) whose support and confidence are greater than a user-specified minimum support and minimum confidence.

The task of mining association rules can be divided into two steps:

• find all large itemsets, i.e. itemsets whose transaction support is greater than the minimum support;
• for each large itemset l, find all non-empty subsets a of l; for every such subset, the rule is a → (l − a) if support(l)/support(a) > minimum confidence [4].

The most popular algorithm for association rule discovery is the Apriori algorithm.

Initial conditions:
  Lk = set of large k-itemsets (k-itemsets having minimum support);
  Ck = set of candidate k-itemsets (itemsets to be counted);
  D = set of transactions, t ∈ D.

Algorithm:
  L1 = {frequent 1-itemsets};
  for (k = 2; Lk−1 ≠ ∅; k++) {
    Ck = set of new candidates generated from Lk−1;
    for all transactions t ∈ D
      for all k-subsets m of t
        if (m ∈ Ck) m.count++;
    Lk = {n ∈ Ck | n.count >= minsupp}
  }
  Set of all frequent itemsets = ∪k Lk;

The Apriori algorithm finds only the frequent itemsets; to obtain the association rules, the following step must be applied:

  for (each frequent itemset l) {
    generate all non-empty subsets of l
    for (each non-empty subset a of l)
      output rule a → (l − a) if support(l)/support(a) >= min_confidence
  }

For the general problem of association rule discovery, given m items there are potentially 2^m frequent itemsets. Discovering frequent itemsets therefore requires substantial computation and storage resources, plus many input/output (I/O) operations. In the case of the Apriori algorithm, the database is scanned in each iteration to obtain the support counts of the new candidates. If the database does not fit in memory, each iteration incurs a high I/O overhead for scanning.
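The two steps above (frequent-itemset mining followed by rule generation) can be sketched in a few lines of Python. This is a minimal in-memory illustration of the technique, not the authors' Java Grid library:

```python
from itertools import combinations

def apriori(D, minsupp):
    """Return all frequent itemsets (frozensets) with support >= minsupp.

    D is a list of transactions (sets of items); minsupp is a fraction.
    """
    n = len(D)
    def support(itemset):
        return sum(itemset <= t for t in D) / n

    # L1: frequent 1-itemsets
    items = {i for t in D for i in t}
    Lk = {frozenset([i]) for i in items if support(frozenset([i])) >= minsupp}
    frequent = set(Lk)
    k = 2
    while Lk:
        # Candidate generation: join pairs of frequent (k-1)-itemsets.
        Ck = {a | b for a in Lk for b in Lk if len(a | b) == k}
        # Prune candidates having any infrequent (k-1)-subset (Apriori property).
        Ck = {c for c in Ck
              if all(frozenset(s) in Lk for s in combinations(c, k - 1))}
        # One scan over D counts candidate supports.
        Lk = {c for c in Ck if support(c) >= minsupp}
        frequent |= Lk
        k += 1
    return frequent

def rules(frequent, D, minconf):
    """Generate rules a -> (l - a) with confidence >= minconf."""
    n = len(D)
    supp = {l: sum(l <= t for t in D) / n for l in frequent}
    out = []
    for l in frequent:
        for r in range(1, len(l)):
            for a in map(frozenset, combinations(l, r)):
                if a in supp and supp[l] / supp[a] >= minconf:
                    out.append((set(a), set(l - a)))
    return out
```

For example, on the four-transaction toy database used earlier, `apriori(D, 0.5)` returns the five itemsets with support at least 0.5, and `rules(..., 0.6)` yields rules such as {milk} → {bread}.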
The efficient and effective discovery of association rules in large databases poses numerous requirements and great challenges to researchers and developers. Scalability of the data mining process implies efficient and sufficient sampling, in-memory vs. disk-based processing, and high-performance computing. Recently, several KDD systems have been implemented on parallel computing platforms in order to achieve high performance in the analysis of large data sets stored in a single location. Large data sets, the geographic distribution of data and computationally intensive analysis demand a parallel and distributed infrastructure [5].

Advances in networking technology and computational infrastructure have made it possible to construct large-scale, high-performance distributed computing environments, or computational grids, that provide dependable, consistent, and pervasive access to high-end computational resources. The term computational grid refers to an emerging infrastructure that enables the integrated use of remote high-end computers, databases, scientific instruments, networks, and other resources [6]. Grid applications often involve large amounts of computing and/or data. For these reasons, grids can offer effective support for the implementation and use of parallel and distributed data mining systems.

3. Related work

Several systems have been proposed in the field of high-performance data mining. Most of them do not use a computational grid infrastructure for the implementation of basic services such as authentication, data access, communication and security; these systems operate on clusters of computers or over the Internet. The best-known systems for distributed data mining are presented below.

Kensington Enterprise data mining is a PDKD system based on a three-tier client/server architecture that includes a client, an application server and third-tier servers (RDBMS and parallel data mining service) [7].
The Kensington system has been implemented in Java and uses the Enterprise JavaBeans component architecture. Java Agents for Meta-learning (JAM) is an agent-based distributed data mining system developed to mine data stored at different sites, building so-called meta-models as a combination of several models learned at the different sites where the data are stored. JAM uses Java applets to move data mining agents to remote sites [8]. The Biodiversity database platform (BODHI) is another agent-based distributed data mining system implemented in Java [9]. Papyrus is a distributed data mining system developed for clusters and super-clusters of workstations, composed of four software layers: data management, data mining, predictive modeling, and agent [10]. Another interesting distributed data mining suite based on Java is the Parallel and distributed data mining application suite (PaDDMAS), a component-based tool set that integrates pre-developed or custom packages (sequential or parallel) using a dataflow approach [11].

Alongside this research work on distributed data mining, several research groups are working in the computational grid area, developing algorithms, components, and services that can be exploited in the implementation of distributed data mining systems.
The best-known effort to integrate the computational grid with data mining techniques is the Knowledge Grid [12]. The Knowledge Grid offers global services based on the cooperation and combination of local services. Its architecture is specialised for data mining tools that are compatible with lower-level Grid mechanisms and with the Data Grid services.

This article proposes a method that combines the association rule discovery data mining task with Grid technologies. The method is particularly useful for large organisations, environments and enterprises that manage and analyse data geographically distributed across different data repositories or warehouses.
The proposed method also deals with technical challenges such as communication, scheduling, security, information, data access, and fault detection.

4. Case study

Organisations now have branches located in various geographical locations, each branch having its own local database that stores information about its business. If top-level management needs to mine novel information in the process of decision making, there are two options. The first, which is not practical, is to transfer all the data to a single database and mine it there. The second is to implement a virtual organisation based on Grid technologies and to integrate mining services for exploring and analysing the data. A possible infrastructure for a virtual organisation, implemented using Grid technologies, is presented in Fig. 1. The company has a central branch and several local branches (LB). Each branch is composed of a number of Grid nodes (GN) interconnected in a Grid infrastructure. In our case study, the data mining task is the discovery of association rules in the local branch databases, and the implementation of the Grid infrastructure is based on the Globus toolkit.

The Globus toolkit is a community-based, open-architecture, open-source set of services and software libraries that supports Grids and Grid applications [13]. The toolkit addresses issues of security, information discovery, resource management, data management, communication, and portability. Globus toolkit mechanisms are in use at hundreds of sites and by dozens of major Grid projects worldwide. The Globus toolkit is based on the Open Grid Services Architecture (OGSA), in which a Grid provides an extensible set of services that virtual organisations can aggregate in various ways [14]. Building on concepts and technologies from both the Grid and Web service communities, OGSA defines uniform exposed service semantics for Grid services.
The Globus Toolkit is a set of software tools useful in developing Grid applications and represents an implementation of the Open Grid Services Infrastructure (OGSI). In the Globus Toolkit, a Grid Service is based on the Web Service, with some improvements introduced by the OGSI specifications: stateful and potentially transient services, data services, lifetime management, notifications, service groups, and portType (interface) extension.

[Fig. 1. Virtual organisation infrastructure using Grid technologies.]

5. Implementation

This article proposes a method of integrating the task of mining association rules in geographically distributed databases with the Globus Toolkit. In the OGSA context, the association rule discovery task is exposed in the form of Grid services. The mining service has several components specific to a Grid service: service data access, service data element, and service implementation. The association rule discovery service interacts with the rest of the Grid services: service registry, service creation, authorisation, notification, manageability and concurrency. The Apriori Grid Service must comply with OGSA rules, constraints, standard interfaces and behaviours. The service data access contains a standard interface and a discovery service for registering information about Grid service instances with registry services. The client application calls the standard method FindServiceData, which retrieves service information from individual Apriori Grid Service instances. The service data access defines a standard interface and semantics for dynamic service creation of the Apriori Grid Service, located at the Service Data Element level. The architecture of the association discovery service is presented in Fig. 2.
At the implementation level, the specific details of the association rule discovery task are defined: the algorithm libraries and the metadata structure. The algorithm libraries implement the sequential and the parallel versions of the association mining task. The metadata contains the format of the data sources to be mined, the data locations, and the structure and location of the knowledge database that stores the results of the mining task.

In our case study, the association rule discovery task is launched from the central branch, mining the data from the central and local branches. The hosting environment of the central branch encapsulates computing and storage resources for creating association mining services and storage services for the knowledge collected. It also implements a Virtual Organisation (VO) registry service that provides information about the location of all mining, data transformation and database services. Each service provider has a local registry service (LR) that provides information about the interfaces of the implemented services. The user application queries the VO registry to find the association mining services. Then the user invokes ''create Grid service'' requests on the two factories in the different hosting environments: the association discovery (Apriori Grid) service provider and the storage service provider. These requests create the Apriori service that will perform the data mining operation on the user's behalf, and an allocation of temporary storage for use by that computation.
Each request involves mutual authentication of the user and the relevant factory (using an authentication mechanism described in the factory's service description), followed by authorisation of the request. Each successful request results in the creation of a Grid service instance with some initial lifetime. The new data mining service instance is also provided with delegated proxy credentials that allow it to perform further remote operations on behalf of the user. The newly created association mining service uses its proxy credentials to start requesting data from the database services, placing intermediate results in local storage. The Apriori Grid Service also uses notification mechanisms to provide the user application with periodic updates on its status. Meanwhile, the user application generates periodic ''keep-alive'' requests to the two Grid service instances that it has created.

We propose a distributed framework for mining association rules in geographically distributed databases. The framework implements the VO structure and uses data mining methods and techniques together with technologies such as Grid, Java and relational databases. The client module allows the user to send parameters and commands to the rest of the components: it sets the configuration parameters of the framework and of the Apriori algorithm, then calls the mining methods. The client finds and creates the Apriori Grid Service and sends it the parameters for: locating and connecting to the remote transaction databases, locating the transaction tables or calling the pre-processing methods, the Apriori algorithm minimum support and confidence, and locating the knowledge base with the partial rules generated.
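The lifecycle pattern just described (a factory creates a transient service instance with an initial lifetime, which the client extends with periodic keep-alive requests) can be modelled schematically. All class and method names below are hypothetical; this is a sketch of the pattern, not the Globus/OGSI API:

```python
class ServiceInstance:
    """A transient Grid service instance with lease-style lifetime management."""
    def __init__(self, lifetime):
        self.remaining = lifetime      # abstract time units left to live

    def keep_alive(self, extension):
        self.remaining += extension    # a client keep-alive extends the lease

    def tick(self):
        self.remaining -= 1
        return self.remaining > 0      # False once the lease has expired

class Factory:
    """Creates service instances on request (cf. ''create Grid service'')."""
    def create(self, initial_lifetime=3):
        return ServiceInstance(initial_lifetime)

# The client asks the (hypothetical) Apriori factory for an instance,
# then keeps it alive while the mining computation runs.
apriori_factory = Factory()
instance = apriori_factory.create()
instance.keep_alive(2)
```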
In the next step, the Apriori Grid Service instances perform the association rule discovery task independently on the remote databases located in the central branch or in the local branches of the VO. The user/client application receives notifications when an association mining service completes its job, and the results can then be explored from the knowledge base. The user can explore the partial association rules generated, or can apply incremental methods to combine the partial rules into more general rules.

The Apriori Grid Service implements an original Apriori library for mining the association rules. The library is implemented in Java for portability and compatibility with the implementation of the Globus Toolkit 3 Core services and, in this phase, it contains the serial version of the algorithm.

[Fig. 2. The design of the Apriori Grid Service in the Open Grid Services Architecture.]

The implementation of the algorithm follows these steps:

• the first step finds all 1-itemset candidates and all frequent 1-itemsets (a candidate is an itemset to be counted);
• the next step finds all frequent 2-itemsets;
• the algorithm generates a hash tree structure for storing the k-itemsets;
• the algorithm scans the database multiple times and does not load it into memory, which is very useful for large databases;
• the association rules are generated from the maximal list of frequent k-itemsets, keeping those with confidence and support coefficients greater than the minimum values.

After the Apriori Grid Service finishes processing, it sends a notification to the client and stores the rules in the knowledge base.
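The "scan, don't load" strategy in the step list above can be illustrated as follows: support counts for the candidate k-itemsets are accumulated by streaming transactions one at a time, so the database never needs to reside in memory. This is an illustrative sketch (the helper name and in-memory stream are assumptions, not the paper's Java code):

```python
from itertools import combinations

def scan_counts(transactions_iter, Ck, k):
    """One database pass: count how many streamed transactions contain
    each candidate k-itemset in Ck. `transactions_iter` yields one
    transaction (a set of items) at a time, e.g. from a database cursor."""
    counts = {c: 0 for c in Ck}
    for t in transactions_iter:
        for m in combinations(sorted(t), k):
            m = frozenset(m)
            if m in counts:
                counts[m] += 1
    return counts
```

In a full implementation this pass is repeated once per iteration k, which is exactly why the I/O overhead discussed in Section 2 dominates for databases that do not fit in memory.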
6. Experiments

In order to measure the performance and accuracy of the Apriori Grid Service, an experiment was set up using an intranet that connects three local area networks (each with ten P4 2.4 GHz PCs with 1 GB RAM), running the Windows XP and Linux operating systems. Each workstation had the Globus Toolkit 3 installed and the Apriori Grid Service deployed. Some workstations were also installed with the framework that connects the user client to the Apriori Grid Service deployed in the Globus Toolkit 3. The input transactions were stored in three different database systems: Oracle 10g, PostgreSQL and MySQL. The remote databases were populated with two categories of test data: the sales transaction data of a company, and demographic data.

[Fig. 3. Apriori Grid Service implementation.]
[Fig. 4. Results evaluation of the Apriori Grid Service.]

A series of tests were run. First, the sequential standalone implementation of the Apriori algorithm called by the Apriori Grid Service was compared with a classical implementation (the Weka library) [15]. The datasets were named in the form ''TxxIyyDzzzK'', where ''xx'' denotes the average number of items per transaction, ''yy'' denotes the average support of each item in the dataset and ''zzzK'' denotes the total number of transactions in thousands. The experiments were performed for four database sizes (3000, 7000, 10,000 and 50,000 transactions), and the resulting rules had 30%, 25% and 20% support factors (see Fig. 3). The efficiency of the Apriori Grid Service algorithm compared with the classical Apriori is presented in Fig.
4, which shows an improvement in performance for the Grid version of the association algorithm over the classical Apriori of 25% for 30% support, increasing to about 27% for 25% support and 31% for 20% support, all for the larger dataset (10,000 transactions). A series of distributed tests using the ''Virtual Organisation'' implemented on the local intranet were also performed. We called the Apriori Grid Service with different series of parameters (remote databases, algorithm support and confidence factors, rules generated) and obtained accurate results in real time.

7. Conclusions and future work

This paper presents some aspects of the architectures, algorithms and implementations of two emerging fields: data mining and Grid technologies. We propose the design and implementation of the Apriori algorithm using the Grid service infrastructure. Our approach to distributed association mining is based on the Open Grid Services Architecture, and the implementation is realised using the Globus Toolkit version 3 and the Apriori algorithm. We set up an experimental evaluation and compared the results with related work. The Apriori Grid uses a library based on the classical Apriori algorithm, but the implementation is original; it was optimised and evaluated with classical and new datasets. The results were comparable with the existing implementations. The practical results show the efficiency of Grid services that use algorithms of higher complexity. The process of mining association rules implies multiple database scans and high-cost resource operations. The use of Grid technologies can improve the response time of such applications because of the capacity to combine the computing power of various geographically distributed resources.
The presented method also has other advantages, such as distributed processing and increased availability, flexibility, portability and extensibility. Future work is focused on the evaluation and optimisation of the association rule discovery system implementation. Another important issue is the implementation of the parallel version of the Apriori algorithm using the Message Passing Interface extended with Grid services (MPICH-G2).

Acknowledgement

This work was partially supported by grant no. 139/2004 of the Information Society Romanian Project.

References

[1] Piatetsky-Shapiro G. Knowledge discovery in databases. AAAI/MIT Press; 1991.
[2] Fayyad UM, Piatetsky-Shapiro G, Smyth P, Uthurusamy R. Advances in knowledge discovery and data mining. Menlo Park, CA: AAAI Press; 1996.
[3] Agrawal R, Imielinski T, Swami A. Mining association rules between sets of items in large databases. In: Proc. ACM SIGMOD Int. Conf. on Management of Data; 1993.
[4] Agrawal R, Srikant R. Fast algorithms for mining association rules. In: Proc. 20th VLDB Conference; 1994.
[5] Foster I, Kesselman C, Tuecke S. The anatomy of the Grid: enabling scalable virtual organizations. Int J High-Perform Comput Appl 2001;15(3).
[6] Foster I, Kesselman C, editors. The Grid: blueprint for a new computing infrastructure. San Francisco: Morgan Kaufmann; 1999.
[7] Chattratichat J, Darlington J, Guo Y, Hedvall S, Köhler M, Syed J. An architecture for distributed enterprise data mining. In: HPCN Europe 1999, Lecture Notes in Computer Science, vol. 1593. Springer; 1999.
[8] Stolfo SJ, Prodromidis AL, Tselepis S, Lee W, Fan DW, Chan PK. JAM: Java agents for meta-learning over distributed databases. In: Proc. KDD'97; 1997. p. 74–81.
[9] Kargupta H, Park B, Hershberger D, Johnson E. Collective data mining: a new perspective toward distributed data mining. In: Kargupta H, Chan P, editors. Advances in distributed and parallel knowledge discovery. AAAI Press; 1999.
[10] Grossman R, Bailey S, Kasif S, Mon D, Ramu A, Malhi B. The preliminary design of Papyrus: a system for high-performance, distributed data mining over clusters, meta-clusters and super-clusters. In: Proc. KDD'98; 1998.
[11] Rana OF, Walker DW, Li M, Lynden S, Ward M. PaDDMAS: parallel and distributed data mining application suite. In: Proc. Int. Parallel and Distributed Processing Symposium (IPDPS/SPDP). IEEE Computer Society Press; 2000.
[12] Cannataro M, Talia D, Trunfio P. Knowledge Grid: high-performance knowledge discovery services on the Grid. In: Proc. 2nd Int. Workshop on Grid Computing (GRID 2001), in conjunction with Supercomputing 2001, LNCS 2242. Springer; 2001.
[13] Globus Toolkit. http://www.globus.org; 2001.
[14] Foster I, et al. The physiology of the Grid: an Open Grid Services Architecture for distributed systems integration. Tech. report, Globus Project; 2002.
[15] Witten IH, Frank E. Data mining: practical machine learning tools with Java implementations. San Francisco: Morgan Kaufmann; 2000.