Grid implementation of the Apriori algorithm

Vishwapathy Pujala


Advances in Engineering Software, 2007


Advances in Engineering Software 38 (2007) 295–300
www.elsevier.com/locate/advengsoft

Cristian Aflori 1, Mitica Craus *

Technical University "Gh. Asachi", Department of Computer Science and Engineering, 53 A, Dimitrie Mangeron Street, 700050 Iasi, Romania

Received 26 October 2005; accepted 8 August 2006
Available online 19 October 2006

Abstract

The paper presents the implementation of an association rules discovery data mining task using Grid technologies. The mining task uses the Apriori algorithm on top of the Globus toolkit. The case study presents the design and integration of the data mining algorithm with the Globus services. The paper compares the Grid version with related work in the field and outlines the conclusions and future work.

© 2006 Elsevier Ltd. All rights reserved.

Keywords: Grid technologies; Data mining; Association rules; Apriori algorithm; Globus toolkit

1. Introduction

Data Mining (DM), or Knowledge Discovery in Databases (KDD) [1], is an interdisciplinary field with major impact in scientific and commercial environments. Data mining is the iterative and interactive process of discovering valid, novel, useful, and understandable patterns or models in massive databases. It searches large volumes of data, by automatic or semi-automatic exploration and analysis, in order to discover meaningful patterns and rules. The major data mining tasks [2] are prediction and description. Prediction methods use some variables to predict unknown or future values of other variables; they include classification, regression, and deviation detection. Description methods find human-interpretable patterns that describe the data; they include clustering, association rules discovery and sequential pattern discovery.
KDD consists of an iterative sequence of the following steps: data selection, data cleaning, data transformation, pattern generation, validation and visualisation.

Association rule induction is a powerful method used to find regularities in data trends [3]. Inducing association rules means finding sets of data instances that frequently appear together. Such information is usually expressed in the form of rules. An association rule expresses an association between items or sets of items. However, only those association rules that are expressive and reliable are useful. The standard measures used to assess association rules are the support and the confidence of a rule. Both are computed from the support of certain itemsets.

* Corresponding author. Tel.: +40 742 024117; fax: +40 232 232430.
E-mail addresses: caflori@cs.tuiasi.ro (C. Aflori), craus@cs.tuiasi.ro (M. Craus).
1 Tel.: +40 741 166401; fax: +40 232 232430.
0965-9978/$ - see front matter © 2006 Elsevier Ltd. All rights reserved.
doi:10.1016/j.advengsoft.2006.08.011

2. Problem formulation

The problem of mining association rules was introduced in 1993 by Agrawal et al. [3].

Notations and definitions:

• $I = \{i_1, i_2, \ldots, i_m\}$ is the set of items;
• $D$ is the set of transactions; each transaction $t$ is a subset of $I$ ($t \subseteq I$);
• $X$ is a set of items from $I$; a transaction $t$ contains $X$ if $X \subseteq t$.
• An association rule is a pair $X \rightarrow Y$, where $X \subseteq I$, $Y \subseteq I$ and $X \cap Y = \emptyset$.
• The confidence of the rule $X \rightarrow Y$ is $c$ if $c\%$ of the transactions in $D$ that contain the set $X$ also contain the set $Y$.
• The support of the rule $X \rightarrow Y$ is $s$ if $s\%$ of the transactions in $D$ contain the set $X \cup Y$.

Problem statement: Given a set of transactions $D$, generate all the association rules (or a specified number of them) whose support and confidence are greater than the user-specified minimum support and minimum confidence.
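The support and confidence definitions above can be illustrated with a minimal Python sketch. The function names and the toy transaction set are assumptions made for illustration only; they do not come from the paper's implementation.

```python
from typing import FrozenSet, Iterable, List

Transaction = FrozenSet[str]


def support(itemset: Iterable[str], transactions: List[Transaction]) -> float:
    """Percentage of transactions in D that contain every item of `itemset`."""
    items = frozenset(itemset)
    hits = sum(1 for t in transactions if items <= t)
    return 100.0 * hits / len(transactions)


def confidence(x: Iterable[str], y: Iterable[str],
               transactions: List[Transaction]) -> float:
    """Percentage of the transactions containing X that also contain Y."""
    x_items, y_items = frozenset(x), frozenset(y)
    containing_x = [t for t in transactions if x_items <= t]
    if not containing_x:
        return 0.0  # X never occurs, so the rule has no evidence
    hits = sum(1 for t in containing_x if y_items <= t)
    return 100.0 * hits / len(containing_x)


# Hypothetical toy database D with four transactions.
D = [frozenset(t) for t in (
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
)]

# Rule {bread} -> {milk}: its support is the support of X union Y.
rule_support = support({"bread", "milk"}, D)          # 50.0
rule_confidence = confidence({"bread"}, {"milk"}, D)  # two of the three "bread" transactions
```

In this toy database the rule {bread} → {milk} would be kept for a minimum support of 30% and a minimum confidence of 60%, and rejected for a minimum confidence of 70%.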
The task of data mining for association rules can be divided into two steps:

• find all large itemsets, that is, the itemsets whose transaction support is greater than the minimum support;
• for each large itemset $l$, find all non-empty subsets of $l$; for every such subset $a$, output the rule $a \rightarrow (l - a)$ if $\frac{\mathrm{support}(l)}{\mathrm{support}(a)} \geq$ minimum confidence [4].

The most popular algorithm used for association rules discovery is the Apriori algorithm.

• Initial conditions:
    L_k = set of large k-itemsets (itemsets having minimum support);
    C_k = set of candidate k-itemsets (itemsets to be counted);
    D = set of transactions, t in D.
• Algorithm:
    L_1 = {frequent 1-itemsets};
    for (k = 2; L_{k-1} != {}; k++) {
        C_k = set of new candidates;
        for all transactions t in D
            for all k-subsets m of t
                if (m in C_k) m.count++;
        L_k = {n in C_k | n.count >= minsupp};
    }
    Set of all frequent itemsets = union over k of L_k;

The Apriori algorithm finds only the frequent itemsets. To find the association rules, the following algorithm must be applied:

    for (each frequent itemset l) {
        generate all non-empty subsets of l
        for (each non-empty subset a of l) {
            output the rule a -> (l - a) if support(l) / support(a) >= min_confidence
        }
    }

For the general problem of association rule discovery, given $m$ items, there are potentially $2^m$ frequent itemsets. Discovering frequent itemsets requires a lot of computation and storage resources, plus many input/output (I/O) operations. In the case of the Apriori algorithm, the database is scanned in each iteration in order to obtain the support of the new candidates. If the database does not fit in memory, each iteration incurs a high I/O overhead for scanning.

The efficient and effective discovery of association rules in large databases poses numerous requirements and great challenges to researchers and developers. Scalability of the data mining process implies efficient and sufficient sampling, in-memory vs. disk-based processing and high-performance computing.
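The Apriori pseudocode and the rule generation loop given earlier in this section can be sketched in Python as follows. This is a minimal in-memory illustration, not the library used by the Apriori Grid Service. The candidate step (join two frequent (k-1)-itemsets, then prune any candidate that has an infrequent (k-1)-subset) is an assumption taken from the usual formulation in [4], since the pseudocode only says "set of new candidates". Counts are absolute numbers of transactions and `min_confidence` is a fraction between 0 and 1.

```python
from itertools import combinations


def apriori(transactions, min_count):
    """Return {itemset: count} for all itemsets found in at least min_count transactions."""
    transactions = [frozenset(t) for t in transactions]

    # L_1: count single items in one pass over the data.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    level = {s: c for s, c in counts.items() if c >= min_count}
    frequent = dict(level)

    k = 2
    while level:  # loop while L_{k-1} is not empty
        prev = set(level)
        # C_k: join step plus prune step (assumed, see lead-in).
        candidates = set()
        for a, b in combinations(prev, 2):
            union = a | b
            if len(union) == k and all(
                frozenset(s) in prev for s in combinations(union, k - 1)
            ):
                candidates.add(union)
        # One scan of the transaction set per iteration. Testing each candidate
        # against t gives the same counts as enumerating the k-subsets of t.
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        level = {s: c for s, c in counts.items() if c >= min_count}
        frequent.update(level)
        k += 1
    return frequent


def generate_rules(frequent, min_confidence):
    """Emit (a, l - a, confidence) for every frequent l and non-empty proper subset a."""
    rules = []
    for l, l_count in frequent.items():
        for size in range(1, len(l)):
            for a in combinations(sorted(l), size):
                a = frozenset(a)
                # Every subset of a frequent itemset is frequent, so the lookup succeeds.
                conf = l_count / frequent[a]
                if conf >= min_confidence:
                    rules.append((a, l - a, conf))
    return rules


# Hypothetical toy data, not taken from the paper's experiments.
D = [{"bread", "milk"}, {"bread", "butter"}, {"bread", "milk", "butter"}, {"milk"}]
frequent = apriori(D, min_count=2)
rules = generate_rules(frequent, min_confidence=0.7)
```

The sketch keeps all transactions in a Python list, so it does not address the I/O cost of rescanning a database that does not fit in memory; it only mirrors the control flow of the pseudocode.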
Recently, several KDD systems have been implemented on parallel computing platforms in order to achieve high performance in the analysis of large data sets that are stored in the same location. Large data sets, the geographic distribution of data and computationally intensive analysis demand a parallel and distributed infrastructure [5].

Advances in networking technology and computational infrastructure have made it possible to construct large-scale, high-performance distributed computing environments, or computational grids, that provide dependable, consistent, and pervasive access to high-end computational resources. The term computational grid refers to an emerging infrastructure that enables the integrated use of remote high-end computers, databases, scientific instruments, networks, and other resources [6]. Grid applications often involve large amounts of computing and/or data. For these reasons, grids can offer effective support for the implementation and use of parallel and distributed data mining systems.

3. Related work

Several systems have been proposed in the field of high-performance data mining. Most of them do not use a computational grid infrastructure for the implementation of basic services such as authentication, data access, communication and security. These systems operate on clusters of computers or over the Internet. The best known systems for distributed data mining are presented below.

Kensington Enterprise data mining is a PDKD system based on a three-tier client/server architecture that includes a client, an application server and third-tier servers (RDBMS and parallel data mining service) [7]. The Kensington system has been implemented in Java and uses the Enterprise JavaBeans component architecture.
Java Agents for Meta-learning (JAM) is an agent-based distributed data mining system developed to mine data stored at different sites and to build so-called meta-models as a combination of several models learned at the sites where the data are stored. JAM uses Java applets to move data mining agents to remote sites [8]. Bio-diversity database platform (BODHI) is another agent-based distributed data mining system implemented in Java [9]. Papyrus is a distributed data mining system developed for clusters and super-clusters of workstations, composed of four software layers: data management, data mining, predictive modeling, and agent [10]. Another interesting distributed data mining suite based on Java is the Parallel and distributed data mining application suite (PaDDMAS), a component-based tool set that integrates pre-developed or custom packages (that can be sequential or parallel) using a dataflow approach [11].

Alongside this research work on distributed data mining, several research groups are working in the computational grid area, developing algorithms, components, and services that can be exploited in the implementation of distributed data mining systems.

The best-known effort to integrate computational grid and data mining techniques is the Knowledge Grid [12]. The Knowledge Grid offers global services based on the cooperation and combination of local services. The system architecture is more specialised for data mining tools that are compatible with lower-level Grid mechanisms and also with the Data Grid services.

As previously stated, this article proposes a method that combines the association rules discovery data mining task with Grid technologies. This method is particularly useful for large organisations, environments and enterprises that manage and analyse data that are geographically distributed in different data repositories or warehouses.
The proposed method also deals with technical challenges such as communication, scheduling, security, information, data access, and fault detection.

4. Case study

Organisations now have different branches in various geographical locations, each branch having its own local database that stores information about its business. If top-level management needs to mine novel information for decision making, there are two options. The first one, which is not practical, is to transfer the data to a single database and mine it there. The second option is to implement a virtual organisation based on Grid technologies and to integrate mining services for exploring and analysing the data.

A possible infrastructure for a virtual organisation, implemented using Grid technologies, is presented in Fig. 1. The company has a central branch and several local branches (LB). Each branch is composed of a number of Grid nodes (GN) interconnected in a Grid infrastructure.

Fig. 1. Virtual organisation infrastructure using grid technologies.

In our case study, the data mining task is the discovery of the association rules in the local branch databases, and the implementation of the Grid infrastructure is based on the Globus toolkit. The Globus toolkit is a community-based, open-architecture, open-source set of services and software libraries that supports Grids and Grid applications [13]. The toolkit addresses issues of security, information discovery, resource management, data management, communication, and portability. Globus toolkit mechanisms are in use at hundreds of sites and by dozens of major Grid projects worldwide.

The Globus toolkit is based on the Open Grid Services Architecture (OGSA), in which a Grid provides an extensible set of services that virtual organisations can aggregate in various ways [14]. Building on concepts and technologies from both the Grid and Web service communities, OGSA defines uniform exposed service semantics for the Grid services. The Globus Toolkit is a set of software tools useful in developing Grid applications, and it represents an implementation of the Open Grid Services Infrastructure (OGSI). In the Globus Toolkit, a Grid Service is based on the Web Service, with some improvements introduced by the OGSI specifications: stateful and potentially transient services, data services, lifetime management, notifications, service groups, and portType (interface) extension.

5. Implementation

This article proposes a method of integrating the task of mining association rules in geographically distributed databases with the Globus Toolkit. In the OGSA context, the association rules discovery task is exposed in the form of Grid services. The mining service has several components specific to a Grid service: service data access, service data element, and service implementation. The association rules discovery service interacts with the rest of the Grid services: service registry, service creation, authorisation, notification, manageability and concurrency. The Apriori Grid Service must comply with OGSA rules, constraints, standard interfaces and behaviours.

The service data access contains a standard interface and a discovery service for registering information about Grid service instances with registry services. The client application calls the standard method FindServiceData, which retrieves service information from individual Apriori Grid Service instances. The service data access also defines a standard interface and semantics for the dynamic creation of the Apriori Grid Service, located at the Service Data Element level. The architecture of the association discovery service is presented in Fig. 2.
At the implementation level, the details specific to the association rules discovery task are defined: the algorithm libraries and the metadata structure. The algorithm libraries implement the sequential and the parallel versions of the association mining task. The metadata contains the format of the data sources to be mined, the data locations, and the structure and location of the knowledge database that stores the results of the mining task.

In our case study, we want to launch the association rules discovery task from the central branch, mining the data from the central and local branches. The hosting environment of the central branch encapsulates computing and storage resources for creating association mining services and storage services for the knowledge collected. It also implements a Virtual Organisation (VO) registry service that provides information about the location of all mining, data transformation and database services. Each service provider has a local registry service (LR) that provides information about the interface of the implemented services.

The user application queries the VO registry to find the association mining services. Then, the user invokes "create Grid service" requests on the two factories in the different hosting environments: the association discovery (Apriori Grid) service provider and the storage service provider. These requests create the Apriori service that will perform the data mining operation on the user's behalf, and an allocation of temporary storage for use by that computation. Each request involves mutual authentication of the user and the relevant factory (using an authentication mechanism described in the factory's service description), followed by authorisation of the request. A successful request results in the creation of a Grid service instance with some initial lifetime. The new data mining service instance is also provided with delegated proxy credentials that allow it to perform further remote operations on behalf of the user.

The newly created association mining service uses its proxy credentials to start requesting data from the database services, placing intermediate results in local storage. The Apriori Grid Service also uses notification mechanisms to provide the user application with periodic updates on its status. Meanwhile, the user application generates periodic "keep-alive" requests to the two Grid service instances that it has created.

We propose a distributed framework for mining association rules in geographically distributed databases. The framework implements the VO structure and uses data mining methods and techniques together with different technologies such as Grid, Java and relational databases.

The client module allows the user to send parameters and commands to the rest of the components. The user sets the configuration parameters of the framework and of the Apriori algorithm, then calls the mining methods. The client finds and creates the Apriori Grid Service and sends it the parameters for: locating and connecting to the remote transaction databases; locating the transaction tables or calling the pre-processing methods; the minimum support and confidence of the Apriori algorithm; and locating the knowledge base that stores the partial rules generated.
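The parameters that the client sends to the Apriori Grid Service can be pictured as a single request object. The following Python sketch is purely illustrative: the class name, field names, example URLs and validation rules are assumptions, and the paper does not describe the actual message format used between the client and the service.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class MiningRequest:
    """Hypothetical bundle of the parameters listed in the text."""
    database_url: str                 # remote transaction database to connect to
    knowledge_base_url: str           # where the partial rules are stored
    min_support: float                # fraction of transactions, in (0, 1]
    min_confidence: float             # fraction, in (0, 1]
    transaction_table: Optional[str] = None      # table holding the transactions
    preprocessing_method: Optional[str] = None   # or a pre-processing method to call

    def __post_init__(self):
        # Assumption: the service needs a table, a pre-processing method, or both.
        if self.transaction_table is None and self.preprocessing_method is None:
            raise ValueError("give a transaction table or a pre-processing method")
        for name in ("min_support", "min_confidence"):
            value = getattr(self, name)
            if not 0.0 < value <= 1.0:
                raise ValueError(f"{name} must be in (0, 1], got {value}")


# Example request for one local branch (all values are made up).
request = MiningRequest(
    database_url="jdbc:postgresql://lb1.example.org/sales",
    knowledge_base_url="jdbc:mysql://central.example.org/knowledge",
    min_support=0.25,
    min_confidence=0.8,
    transaction_table="transactions",
)
```

Validating the thresholds on the client side is a design choice of this sketch: it rejects a malformed request before any Grid service instance is created.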
In the next step, the Apriori Grid Service instances perform the association rules discovery task independently on the remote databases, located at the Central Branch or at the Local Branches of the VO. The user/client application receives notifications when an association mining service completes its job, and the results can be explored from the knowledge base. The user can explore the partial association rules generated or can apply incremental methods to combine the partial rules into more general rules.

The Apriori Grid Service implements an original Apriori library for mining the association rules. The library is implemented in Java for portability and compatibility with the implementation of the Globus Toolkit 3 Core services and, in this phase, it contains the serial version of the algorithm. The implementation of the algorithm follows these steps:

• the first step finds all candidate 1-itemsets and all frequent 1-itemsets (a candidate is an itemset to be counted);
• in the next step, the algorithm finds all frequent 2-itemsets;
• the algorithm generates a hash tree structure for storing the k-itemsets;
• the algorithm scans the database multiple times and does not load it into memory, which is very useful for large databases;
• the association rules are generated from the maximal list of the frequent k-itemsets, keeping those whose confidence and support coefficients are greater than the minimum values.

Fig. 2. The design of the Apriori Grid Service in the Open Grid Services Architecture.

After the Apriori Grid Service finishes the processing, it sends the notification to the client and stores the rules in the knowledge base.

6. Experiments

In order to measure the performance and accuracy of the Apriori Grid Service, an experiment was set up using an intranet that connects three local area networks (each with ten P4 2.4 GHz PCs with 1 GB RAM) running the Windows XP and Linux operating systems. The Globus Toolkit 3 was installed on each workstation and the Apriori Grid Service was deployed on it. Some workstations also hosted the framework that connects the user client to the Apriori Grid Service deployed in the Globus Toolkit 3. The input transactions were stored in three different database systems: Oracle 10g, PostgreSQL and MySQL. The remote databases were populated with two categories of test data: the sales transaction data of a company and demographic data.

Fig. 3. Apriori Grid Service implementation.

Fig. 4. Results evaluation of the Apriori Grid Service.

A series of tests was run. First, the sequential standalone implementation of the Apriori algorithm, called by the Apriori Grid Service, was compared with a classical implementation (Weka library) [15]. The datasets are named in the form "TxxIyyDzzzK", where "xx" denotes the average number of items present per transaction, "yy" denotes the average support of each item in the dataset and "zzzK" denotes the total number of transactions in thousands. The experiments were performed for 4 database sizes (3000, 7000, 10,000 and 50,000 transactions), and the rules were generated with support factors of 30%, 25% and 20% (see Fig. 3).

The efficiency of the Apriori Grid Service algorithm compared with the classical Apriori is presented in Fig. 4, which shows an improvement in performance for the Grid version of the association algorithm over the classical Apriori of 25% for 30% support, about 27% for 25% support and 31% for 20% support, all for the larger dataset (10k transactions).

A series of distributed tests was also performed using the "Virtual Organization" implemented on the local intranet. We called the Apriori Grid Service with different series of parameters (remote databases, algorithm support and confidence factors, rules generated) and we obtained accurate results in real time.

7. Conclusions and future work

This paper presents some aspects of the architectures, algorithms and implementations of two emerging fields: data mining and Grid technologies. We propose the design and the implementation of the Apriori algorithm using the Grid service infrastructure. Our approach to distributed association mining is based on the Open Grid Services Architecture, and the implementation is realised using the Globus Toolkit version 3 and the Apriori algorithm. We set up an experimental evaluation and the results were compared with related work.

The Apriori Grid uses a library based on the classical Apriori algorithm, but the implementation is original; it was optimised and evaluated with classical and new datasets. The results were comparable with the existing implementations. The practical results show the efficiency of Grid services that use algorithms with higher degrees of complexity. The process of mining association rules implies multiple database scans and resource-intensive operations. The use of Grid technologies can improve the response time of such applications because of the capacity to combine the computing power of various geographically distributed resources.
The presented method also has other advantages, such as distributed processing and increased availability, flexibility, portability and extensibility.

Future work is focused on the evaluation and optimisation of the association rules discovery system implementation. Another important issue is the implementation of the parallel version of the Apriori algorithm using the Message Passing Interface extended with Grid services (MPICH-G2).

Acknowledgement

This work was partially supported by grant no. 139/2004 of the Information Society Romanian Project.

References

[1] Piatetsky-Shapiro G. Knowledge discovery in databases. AAAI/MIT Press; 1991.
[2] Fayyad UM, Piatetsky-Shapiro G, Smyth P, Uthurusamy R. Advances in knowledge discovery and data mining. Menlo Park, Calif: AAAI Press; 1996.
[3] Agrawal R, Imielinski T, Swami A. Mining association rules between sets of items in large databases. In: Proc. ACM SIGMOD Intl. Conf. Management Data, 1993.
[4] Agrawal R, Srikant R. Fast algorithms for mining association rules. In: Proc. 20th VLDB Conference, 1994.
[5] Foster I, Kesselman C, Tuecke S. The anatomy of the grid: enabling scalable virtual organizations. Int'l J. High-Perform Comput Appl 2001;15(3).
[6] Foster I, Kesselman C, editors. The grid: blueprint for a new computing infrastructure. San Francisco: Morgan Kaufmann; 1999.
[7] Chattratichat J, Darlington J, Guo Y, Hedvall S, Köhler M, Syed J. An architecture for distributed enterprise data mining. HPCN Europe 1999, Lecture Notes in Computer Science, vol. 1593. Springer; 1999.
[8] Stolfo SJ, Prodromidis AL, Tselepis S, Lee W, Fan DW, Chan PK. JAM: Java agents for meta-learning over distributed databases. International KDD'97 Conference, pp. 74–81, 1997.
[9] Kargupta H, Park B, Hershberger D, Johnson E. Collective data mining: a new perspective toward distributed data mining. In: Kargupta Hillol, Chan Philip, editors. Adv Distributed Parallel Knowl Discov. AAAI Press; 1999.
[10] Grossman R, Bailey S, Kasif S, Mon D, Ramu A, Malhi B. The preliminary design of papyrus: a system for high-performance, distributed data mining over clusters, meta-clusters and super-clusters. International KDD'98 Conference, 1998.
[11] Rana OF, Walker DW, Li M, Lynden S, Ward M. PaDDMAS: parallel and distributed data mining application suite. In: Proc Int Parallel Distributed Process Sympos (IPDPS/SPDP). IEEE Computer Society Press; 2000.
[12] Cannataro M, Talia D, Trunfio P. Knowledge GRID: high-performance knowledge discovery services on the grid. 2nd Int. Workshop on Grid Computing (GRID 2001) in conjunction with Supercomputing 2001, LNCS 2242. Springer Verlag; 2001.
[13] Globus Toolkit, 2001, http://www.globus.org.
[14] Foster I, et al. The physiology of the grid: an open grid services architecture for distributed systems integration. Tech. report, Globus Project; 2002.
[15] Witten IH, Frank E. Data mining: practical machine learning tools with Java implementations. San Francisco: Morgan Kaufmann; 2000.
