Advances in Engineering Software 38 (2007) 295–300
www.elsevier.com/locate/advengsoft
Grid implementation of the Apriori algorithm
Cristian Aflori 1, Mitica Craus *
Technical University ‘‘Gh.Asachi’’, Department of Computer Science and Engineering, 53 A, Dimitrie Mangeron Street, 700050 Iasi, Romania
Received 26 October 2005; accepted 8 August 2006
Available online 19 October 2006
Abstract
The paper presents the implementation of an association rules discovery data mining task using Grid technologies. For the mining
task we are using the Apriori algorithm on top of the Globus toolkit. The case study presents the design and integration of the data
mining algorithm with the Globus services. The paper compares the Grid version with related work in the field and outlines conclusions and future work.
© 2006 Elsevier Ltd. All rights reserved.
Keywords: Grid technologies; Data mining; Association rules; Apriori algorithm; Globus toolkit
1. Introduction
Data Mining (DM) or Knowledge Discovery in Databases (KDD) [1] is an interdisciplinary field with major
impact in the scientific and commercial environments. Data
Mining is the iterative and interactive process of discovering
valid, novel, useful, and understandable patterns or models
in massive databases. Data Mining means exploring and analysing
large volumes of data, by automatic or semi-automatic means,
in order to discover meaningful
patterns and rules. The major data mining tasks [2] are prediction and description. Prediction methods use some variables to predict unknown or future values of other
variables: these include classification, regression, and deviation detection. The description methods find human-interpretable patterns that describe the data: these include
clustering, association rules discovery and sequential pattern
discovery. KDD consists of an iterative sequence of the following steps: data selection, data cleaning, data transformation, pattern generation, validation and visualisation.
*
Corresponding author. Tel.: +40 742 024117; fax: +40 232 232430.
E-mail addresses: caflori@cs.tuiasi.ro (C. Aflori), craus@cs.tuiasi.ro
(M. Craus).
1
Tel.: +40 741 166401; fax: +40 232 232430.
0965-9978/$ - see front matter 2006 Elsevier Ltd. All rights reserved.
doi:10.1016/j.advengsoft.2006.08.011
Association rule induction is a powerful method used to
find regularities in data trends [3]. Inducing association rules requires finding the sets of data instances that frequently appear
together. Such information is usually
expressed in the form of rules. An association rule expresses
an association between items or sets of items. However,
only those association rules that are expressive and reliable
are useful. The standard measures used to assess association
rules are the support and the confidence of a rule. Both are
computed from the support of certain item sets.
2. Problem formulation
The problem of mining association rules was introduced
in 1993 by Agrawal [3]. Notations and definitions:
• $I = \{i_1, i_2, \ldots, i_m\}$ – set of items;
• $D$ = set of transactions; each transaction $t$ is a subset of $I$ ($t \subseteq I$);
• $X$ = set of items from $I$; a transaction $t$ contains $X$ if $X \subseteq t$.
• An association rule is a pair $X \rightarrow Y$, where $X \subset I$, $Y \subset I$ and $X \cap Y = \emptyset$.
• Confidence of the rule $X \rightarrow Y$ is $c$ if $c\%$ of the transactions in $D$ that contain the set $X$ also contain the set $Y$.
• Support of the rule $X \rightarrow Y$ is $s$ if $s\%$ of the transactions in $D$ contain the set $X \cup Y$.
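The two measures defined above can be illustrated with a minimal Python sketch. The function names and the four-transaction toy database are assumptions made for illustration only; they are not part of the implementation described in this paper, and the values are returned as fractions rather than percentages.

```python
from typing import FrozenSet, Iterable, List

Transaction = FrozenSet[str]


def support(itemset: Iterable[str], transactions: List[Transaction]) -> float:
    """Fraction of transactions that contain every item of `itemset`."""
    items = frozenset(itemset)
    hits = sum(1 for t in transactions if items <= t)
    return hits / len(transactions)


def confidence(x: Iterable[str], y: Iterable[str],
               transactions: List[Transaction]) -> float:
    """Fraction of the transactions containing X that also contain Y.

    Assumes X occurs in at least one transaction (otherwise this divides by zero).
    """
    x_set, y_set = frozenset(x), frozenset(y)
    return support(x_set | y_set, transactions) / support(x_set, transactions)


# Hypothetical toy database D with four transactions.
D = [
    frozenset({"bread", "milk"}),
    frozenset({"bread", "butter"}),
    frozenset({"bread", "milk", "butter"}),
    frozenset({"milk"}),
]

# Rule {bread} -> {milk}
rule_support = support({"bread", "milk"}, D)          # 2 of the 4 transactions
rule_confidence = confidence({"bread"}, {"milk"}, D)  # 2 of the 3 containing bread
```

In this toy database the rule {bread} → {milk} has support 50% and confidence of about 67%.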
C. Aflori, M. Craus / Advances in Engineering Software 38 (2007) 295–300
Problem statement:
Given a set of transactions D, generate all the association rules (or a specified number of them) that have greater
support and confidence than the user-specified minimum
support and minimum confidence.
The task of data mining for association rules can be
divided into two steps:
• find all large itemsets that have transaction support
greater than the minimum support;
• for each large itemset $l$, find all non-empty subsets of $l$; for every such subset $a$, output the rule
$a \rightarrow (l - a)$ if $\mathrm{support}(l) / \mathrm{support}(a) >$ minimum confidence [4].
The most popular algorithm used for the association
rules discovery is the Apriori algorithm.
• Initial conditions:
L_k = set of large k-itemsets (sets of items having minimum support); C_k = set of candidate k-itemsets (itemsets
to be counted); D = set of transactions, t ∈ D.
• Algorithm:
L_1 = {frequent 1-itemsets};
for (k = 2; L_{k-1} ≠ ∅; k++) {
    C_k = set of new candidates;
    for all transactions t ∈ D
        for all k-subsets m of t
            if (m ∈ C_k) m.count++;
    L_k = {n ∈ C_k | n.count >= minsupp};
}
Set of all frequent itemsets = ∪_k L_k;
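The level-wise loop above can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the library used by the Apriori Grid Service: the names `generate_candidates` and `apriori`, the use of an absolute count `min_count` as the minimum support, and the toy transactions are all hypothetical. Candidate generation uses the join-and-prune step from [4].

```python
from itertools import combinations
from typing import Dict, FrozenSet, List, Set

Itemset = FrozenSet[str]


def generate_candidates(prev_frequent: Set[Itemset], k: int) -> Set[Itemset]:
    """Join frequent (k-1)-itemsets; drop candidates with an infrequent (k-1)-subset."""
    candidates = set()
    for a in prev_frequent:
        for b in prev_frequent:
            union = a | b
            if len(union) == k and all(
                frozenset(sub) in prev_frequent
                for sub in combinations(union, k - 1)
            ):
                candidates.add(union)
    return candidates


def apriori(transactions: List[Itemset], min_count: int) -> Dict[Itemset, int]:
    """Return every frequent itemset together with its absolute support count."""
    # L_1: count the single items.
    counts: Dict[Itemset, int] = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s: c for s, c in counts.items() if c >= min_count}
    frequent = dict(current)

    k = 2
    while current:  # loop while L_(k-1) is not empty
        counts = {c: 0 for c in generate_candidates(set(current), k)}
        for t in transactions:  # one full scan of D per iteration
            for m in combinations(t, k):  # all k-subsets m of t
                key = frozenset(m)
                if key in counts:
                    counts[key] += 1
        current = {s: c for s, c in counts.items() if c >= min_count}
        frequent.update(current)
        k += 1
    return frequent


# Hypothetical toy database; an itemset is frequent if it occurs at least 3 times.
toy_db = [
    frozenset("abc"), frozenset("ab"), frozenset("ac"),
    frozenset("bc"), frozenset("abc"),
]
result = apriori(toy_db, min_count=3)
```

With this toy input the three single items and the three pairs are frequent, while {a, b, c} occurs only twice and is discarded.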
The Apriori algorithm finds only the frequent itemsets;
to find the association rules we must then apply the following algorithm:
for (each frequent itemset l) {
    generate all non-empty subsets of l
    for (each non-empty subset a of l) {
        output rule a → (l − a) if support(l)/support(a) >= min confidence
    }
}
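The rule-generation loop above might look like this in Python. It is a sketch under assumptions: `generate_rules` is a hypothetical name, the input dictionary is assumed to hold the support count of every frequent itemset (so every subset of a frequent itemset can be looked up), and only proper subsets are enumerated so that the consequent l − a is never empty.

```python
from itertools import combinations
from typing import Dict, FrozenSet, List, Tuple

Itemset = FrozenSet[str]
Rule = Tuple[Itemset, Itemset, float]  # (antecedent a, consequent l - a, confidence)


def generate_rules(support_counts: Dict[Itemset, int],
                   min_confidence: float) -> List[Rule]:
    """Emit a -> (l - a) for every frequent l and non-empty proper subset a."""
    rules = []
    for itemset, count in support_counts.items():
        for size in range(1, len(itemset)):  # proper, non-empty subsets only
            for subset in combinations(itemset, size):
                a = frozenset(subset)
                # support(l) / support(a); a is frequent because l is frequent.
                conf = count / support_counts[a]
                if conf >= min_confidence:
                    rules.append((a, itemset - a, conf))
    return rules


# Hypothetical support counts for the frequent itemsets of a small database.
toy_counts = {
    frozenset({"a"}): 4,
    frozenset({"b"}): 3,
    frozenset({"a", "b"}): 3,
}
rules = generate_rules(toy_counts, min_confidence=0.8)
```

Here {b} → {a} passes with confidence 3/3, whereas {a} → {b} has confidence 3/4 and falls below the 0.8 threshold.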
For the general problem of association rule
discovery, given m items, there are potentially $2^m$ frequent itemsets. Discovering frequent itemsets requires a lot of computation and storage resources, plus many input/output (I/O)
operations. In the case of the Apriori algorithm, the database is scanned in each iteration in order to obtain the
support of the new candidates. If the database does not fit
in memory, then each iteration incurs a high I/O overhead
for scanning.
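To give a sense of scale for the number of potential itemsets mentioned above, a short worked count follows; the values of m are chosen here purely for illustration.

```latex
\[
\sum_{k=1}^{m} \binom{m}{k} = 2^{m} - 1,
\qquad
m = 20 \;\Rightarrow\; 2^{20} - 1 = 1\,048\,575,
\qquad
m = 100 \;\Rightarrow\; 2^{100} - 1 \approx 1.27 \times 10^{30}.
\]
```

Even a modest catalogue of 100 items therefore rules out exhaustive enumeration, which is why Apriori counts only the candidates whose subsets are already known to be frequent.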
The efficient and effective discovery of the association
rules in large databases poses numerous requirements and
great challenges to researchers and developers. Scalability
of the data mining process implies efficient and sufficient
sampling, in-memory vs. disk-based processing and high-
performance computing. Recently, several KDD systems
have been implemented on parallel computing platforms,
in order to achieve high-performance in the analysis of
the large data sets that are stored in the same location.
Large data sets, the geographic distribution of data and
computationally intensive analysis demand parallel and
distributed infrastructure [5].
Advances in networking technology and computational
infrastructure made it possible to construct large-scale
high-performance distributed computing environments, or
computational grids that provide dependable, consistent,
and pervasive access to high-end computational resources.
The term computational grid refers to an emerging infrastructure that enables the integrated use of remote high-end computers, databases, scientific instruments, networks,
and other resources [6]. Grid applications often involve
large amounts of computing and/or data. For these reasons,
grids can offer an effective support for the implementation
and use of parallel and distributed data mining systems.
3. Related work
There are several systems proposed in the field of the
high-performance data mining. Most of them do not use
computational grid infrastructure for the implementation
of basic services such as authentication, data access, communication and security. These systems operate on clusters
of computers or over the Internet. The best known systems
for distributed data mining are presented below.
Kensington Enterprise data mining is a parallel and distributed knowledge discovery (PDKD) system
based on a three-tier client/server architecture that
includes a client, an application server and third-tier servers
(RDBMS and parallel data mining service) [7]. The Kensington system has been implemented in Java and uses
the Enterprise JavaBeans component architecture. Java
Agents for Meta-learning (JAM) is an agent-based distributed data mining system that has been developed to mine
data stored at different sites in order to build so-called meta-models as a combination of several models learned at the
different sites where data are stored. JAM uses Java applets
to move data mining agents to remote sites [8]. Bio-diversity database platform (BODHI) is another agent-based
distributed data mining system implemented in Java [9].
Papyrus is a distributed data mining system developed
for clusters and super-clusters of workstations, composed of
four software layers: data management, data mining, predictive modeling, and agent [10]. Another interesting distributed data mining suite based on Java is the Parallel and
distributed data mining application suite (PaDDMAS), a
component-based tool set that integrates pre-developed
or custom packages (that can be sequential or parallel)
using a dataflow approach [11].
Alongside this research work on distributed data mining, several research groups are working in the computational grid area developing algorithms, components, and
services that can be exploited in the implementation of distributed data mining systems.
The best-known effort to integrate computational grid and data mining techniques is the Knowledge
Grid [12]. The Knowledge Grid offers global services based
on the cooperation and combination of local services. The
system architecture is more specialised for data mining
tools that are compatible with lower-level Grid mechanisms and also with the Data Grid services.
As stated above, this article proposes a method that
combines the association rules discovery data mining task with
Grid technologies. This method is particularly useful for
large organisations, environments and enterprises that
manage and analyse data that are geographically distributed in different data repositories or warehouses. The proposed method also deals with technical challenges such as
communication, scheduling, security, information, data
access, and fault detection.
4. Case study
Organisations now have different branches located in
various geographical locations, each branch having its
own local database that stores information about its
business.
If top level management needs to mine novel information in the process of decision making, there are two
options. The first one, which is not practical, is to transfer
data to a single database and mine it on that database. The
second option is to implement a virtual organisation based
on Grid technologies and to integrate mining services for
exploring and analysing the data.
A possible infrastructure for a virtual organisation,
implemented using Grid technologies, is presented in
Fig. 1.
The company has a central branch and several local
branches (LB). Each branch is composed of a number of
Grid nodes (GN) interconnected in a Grid infrastructure.
In our case study, the data mining task is the discovery
of the association rules in the local branch databases, and
the implementation of the Grid infrastructure is based on
the Globus toolkit.
The Globus toolkit is a community based, open architecture, open source set of services and software libraries
that supports Grids and Grid applications [13].
The toolkit addresses issues of security, information discovery, resource management, data management, communication, and portability. Globus toolkit mechanisms are in
use at hundreds of sites and by dozens of major Grid projects worldwide.
The Globus toolkit is based on the Open Grid Services
Architecture (OGSA), in which a Grid provides an extensible set of services that virtual organisations can aggregate
in various ways [14]. Building on concepts and technologies
from both the Grid and Web service communities, OGSA
defines uniform exposed service semantics for the Grid
services.
The Globus Toolkit is a set of software tools useful in
developing Grid applications and it represents an implementation
of the Open Grid Services Infrastructure
(OGSI).
Fig. 1. Virtual organisation infrastructure using grid technologies. [Diagram: a central branch and several local branches (LB), each made up of Grid nodes (GN), together forming the virtual organisation.]
In the Globus Toolkit, a Grid Service is based on the
Web Service but with some improvements introduced by
OGSI specifications: stateful and potentially transient services, data services, lifetime management, notifications, service groups, portType (interface) extension.
5. Implementation
This article proposes a method of integrating the task of
mining association rules in geographically distributed
databases with the Globus Toolkit. In the
OGSA context, the association rules discovery task is
exposed in the form of Grid services. The mining service has several components specific to a Grid service:
service data access, service data element, and service
implementation.
The association rules discovery service interacts
with the rest of the grid services: service registry, service
creation, authorisation, notification, manageability and
concurrency. The Apriori Grid Service must comply with
OGSA rules, constraints, standard interfaces and behaviours. The service data access contains a standard interface
and a discovery service, for registering information about
Grid service instances with registry services. The client
application calls the standard method FindServiceData,
which retrieves service information from individual Apriori
Grid Service instances. The service data access defines a
standard interface and semantics for dynamic service creation of the Apriori Grid Service, located at the Service
Data Element level.
The architecture of the association discovery service is
presented in Fig. 2.
At the implementation level, the specific association
rules discovery task details are defined: algorithm libraries
and metadata structure. The algorithm libraries implement
the sequential and the parallel versions of the association
mining task. The metadata contains the format of the
data-sources to be mined, the data locations and the structure and location of the knowledge database that stores the
results of the mining task.
In our case study, from the central branch we want to
launch the association rules discovery task, mining the data
from central and local branches.
The hosting environment from the central branch
encapsulates computing and storage resources for creating
association mining services and storage services for the
knowledge collected. The hosting environment from the
central branch also implements a Virtual Organisation
(VO) registry service for providing information about the
location of all mining, data transformation and database
services. Each service provider has a local registry service
(LR) that provides information about the interface of the
implemented services.
The user application queries the VO registry to search
the association mining services. Then, the user invokes
‘‘create Grid service’’ requests on the two factories in the
different hosting environment: the association discovery
(Apriori Grid) service provider and storage service provider. These requests create the Apriori service that will
perform the data mining operation on their behalf, and
an allocation of temporary storage for use by that computation. Each request involves mutual authentication of
the user and the relevant factory (using an authentication mechanism described in the factory’s service description)
followed by authorisation of the request. Each
request is successful and results in the creation of a
Grid service instance with some initial lifetime. The new
data mining service instance is also provided with delegated proxy credentials that allow it to perform further
remote operations on behalf of the user. The newly created association mining service uses its proxy credentials
to start requesting data from the database services, placing
intermediate results in local storage. The Apriori Grid
Service also uses notification mechanisms to provide the
user application with periodic updates on its status. Meanwhile, the user application generates periodic ‘‘keep-alive’’
requests to the two Grid service instances that it has
created.
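The lifetime and ‘‘keep-alive’’ behaviour described above can be pictured with a small Python sketch. It is not Globus Toolkit code: `ServiceInstance`, `create_service`, the 60-second lifetimes and the explicit `now` argument are all hypothetical and serve only to show how an instance that is not renewed eventually expires.

```python
from dataclasses import dataclass


@dataclass
class ServiceInstance:
    """Hypothetical stand-in for a transient Grid service instance."""
    name: str
    expires_at: float  # time (in seconds) after which the instance may be reclaimed

    def is_alive(self, now: float) -> bool:
        return now < self.expires_at

    def keep_alive(self, now: float, extension: float) -> None:
        """Extend the lifetime; an instance that has expired cannot be revived."""
        if not self.is_alive(now):
            raise RuntimeError(f"{self.name} has already expired")
        self.expires_at = max(self.expires_at, now + extension)


def create_service(name: str, now: float, initial_lifetime: float) -> ServiceInstance:
    """Plays the role of a factory answering a 'create Grid service' request."""
    return ServiceInstance(name, now + initial_lifetime)


# The client creates both instances at t = 0 s with a 60 s initial lifetime.
apriori_service = create_service("apriori", now=0.0, initial_lifetime=60.0)
storage_service = create_service("storage", now=0.0, initial_lifetime=60.0)

# At t = 50 s only the mining service receives a keep-alive request.
apriori_service.keep_alive(now=50.0, extension=60.0)

status_at_90 = (apriori_service.is_alive(90.0), storage_service.is_alive(90.0))
```

At t = 90 s the renewed mining instance is still alive, while the storage instance, which received no keep-alive, has expired. This is why the client in the scenario above renews both instances periodically.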
We propose a distributed framework for mining association rules in geographically distributed databases. The framework implements the VO structure and uses data mining
methods and techniques together with different technologies such as
Grid, Java and relational databases.
The client module allows the user to send the parameters
and the commands to the rest of the components. The user sets the configuration parameters of the framework and of the Apriori
algorithm, and then calls the mining methods. The client finds
and creates the Apriori Grid Service and sends it the
parameters for: locating and connecting to the remote
transaction databases, locating the transaction tables or
calling the pre-processing methods, setting the Apriori algorithm minimum support and confidence, and locating the knowledge base
that holds the partial rules generated.
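As an illustration of the parameter set listed above, the following Python sketch bundles the values a client might send. The class name `MiningRequest`, its field names and the example URLs are hypothetical; the paper does not specify the actual message format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MiningRequest:
    """Hypothetical parameter bundle sent by the client to the mining service."""
    database_url: str        # location of the remote transaction database
    transaction_table: str   # table that holds the transactions
    min_support: float       # minimum support, as a fraction in (0, 1]
    min_confidence: float    # minimum confidence, as a fraction in (0, 1]
    knowledge_base_url: str  # where the partial rules are to be stored

    def __post_init__(self) -> None:
        # Reject thresholds that cannot be interpreted as fractions.
        for name in ("min_support", "min_confidence"):
            value = getattr(self, name)
            if not 0.0 < value <= 1.0:
                raise ValueError(f"{name} must be in (0, 1], got {value}")


# Example request with made-up locations.
request = MiningRequest(
    database_url="jdbc:postgresql://branch-1.example/sales",
    transaction_table="transactions",
    min_support=0.25,
    min_confidence=0.8,
    knowledge_base_url="jdbc:postgresql://central.example/knowledge",
)
```

Validating the thresholds on the client side avoids creating a Grid service instance for a request that could never produce rules.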
In the next step, the Apriori Grid Service instances perform the association rules discovery task independently on
the remote databases, located on the Central Branch or on
the Local Branches of the VO. The user/client application
receives notifications when an association mining service
completes its job and the results can be explored from
the knowledge base. The user can explore the partial association rules generated or can apply incremental methods
to combine the partial rules into more general rules.
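The independent, per-branch execution described above can be mimicked locally with a thread pool. This is only an analogy under assumptions: `mine_branch` is a hypothetical placeholder that counts 1- and 2-itemsets, the branch names and transactions are made up, and the completion of each future stands in for the notification sent by a mining service instance.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor, as_completed
from itertools import combinations
from typing import Dict, FrozenSet, List, Set


def mine_branch(transactions: List[Set[str]],
                min_count: int) -> Dict[FrozenSet[str], int]:
    """Placeholder mining job: frequent 1- and 2-itemsets of a single branch."""
    counts: Counter = Counter()
    for t in transactions:
        for size in (1, 2):
            for subset in combinations(sorted(t), size):
                counts[frozenset(subset)] += 1
    return {s: c for s, c in counts.items() if c >= min_count}


# Hypothetical local databases of two branches of the virtual organisation.
branches = {
    "central": [{"a", "b"}, {"a", "b"}, {"b", "c"}],
    "local-1": [{"a", "c"}, {"a", "c"}, {"a", "b"}],
}

knowledge_base: Dict[str, Dict[FrozenSet[str], int]] = {}
with ThreadPoolExecutor() as pool:
    futures = {
        pool.submit(mine_branch, txns, 2): name
        for name, txns in branches.items()
    }
    for done in as_completed(futures):  # each completion acts as a notification
        knowledge_base[futures[done]] = done.result()
```

Each branch yields its own partial result, so an itemset such as {a, c} can be frequent in one branch and absent from another; combining the partial results is a separate step.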
The Apriori Grid Service implements an original Apriori
library for mining the associative rules. The library is
implemented in Java for portability and compatibility with
the implementation of the Globus Toolkit 3 Core services
and, in this phase, it contains the serial version of the
algorithm.
Fig. 2. The design of the Apriori Grid Service in the open grid service
architecture. [Diagram labels: user request; service data access (discovery, creation, notification, Grid service); Apriori Grid interface; service data element; metadata service; management service; implementation (sequential Apriori, parallel Apriori, metadata); hosting environment / runtime (J2EE, .NET, CORBA, C); Open Grid Services Architecture.]
The implementation of the algorithm follows these
steps:
• the first step finds all candidate 1-itemsets and all frequent 1-itemsets (a candidate is an itemset
to be counted).
• in the next step the algorithm finds all frequent 2-itemsets.
• the algorithm generates a hash tree structure for storing
the k-itemsets.
• the algorithm scans the database multiple times and does not
load it into memory, which is very useful for large
databases.
• the association rules are generated from the maximal list
of the frequent k-itemsets with confidence and support coefficients greater than the minimum values.
After the Apriori Grid Service finishes the processing, it
sends the notification to the client and stores the rules in
the knowledge base.
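The multi-scan step listed above, in which the database is re-read in every pass instead of being loaded into memory, can be sketched as follows. The sketch makes several assumptions: transactions are read from a hypothetical text file with one space-separated transaction per line rather than from a relational database, the helper names are invented, and a plain Python dictionary stands in for the hash tree.

```python
import os
import tempfile
from itertools import combinations
from typing import Callable, Dict, FrozenSet, Iterator, Set

Itemset = FrozenSet[str]


def read_transactions(path: str) -> Iterator[Itemset]:
    """Yield one transaction per line without keeping the file in memory."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            items = line.split()
            if items:
                yield frozenset(items)


def count_candidates(scan: Callable[[], Iterator[Itemset]],
                     candidates: Set[Itemset], k: int) -> Dict[Itemset, int]:
    """Run one pass over the data and count every candidate k-itemset."""
    counts = dict.fromkeys(candidates, 0)  # hash-based lookup of candidates
    for t in scan():  # a fresh scan each time this function is called
        for m in combinations(t, k):
            key = frozenset(m)
            if key in counts:
                counts[key] += 1
    return counts


# Hypothetical three-transaction data file, written to a temporary location.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("a b c\na b\nb c\n")
path = tmp.name


def scan() -> Iterator[Itemset]:
    return read_transactions(path)


singles = count_candidates(scan, {frozenset("a"), frozenset("b"), frozenset("c")}, 1)
pairs = count_candidates(scan, {frozenset("ab"), frozenset("bc"), frozenset("ac")}, 2)
os.remove(path)
```

Only the candidate counters are held in memory; each pass streams the transactions again, which trades extra I/O for a memory footprint that does not depend on the database size.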
6. Experiments
In order to measure the performance and accuracy of the Apriori Grid Service, an experiment was set up using an
intranet that connects three local area networks
(each with ten P4 2.4 GHz PCs with 1 GB of RAM), running the
Windows XP and Linux operating systems. The Globus Toolkit 3 was installed and
the Apriori Grid Service deployed on each workstation. Some workstations
also hosted the framework that connects the user
client to the Apriori Grid Service deployed in the Globus
Toolkit 3.
The input transactions were stored in three different
database systems: Oracle 10g, PostgreSQL and MySQL.
The remote databases were populated with two categories
of test data: the sales transaction data of a company and
demographic data.
Fig. 3. Apriori Grid Service implementation. [Diagram labels: Association Discovery (AD) user interface; VO registry; AD service provider (LR, AD factory, Apriori service); storage service provider (storage factory, knowledge base service); data providers 1 to n (LR, data extraction, data transformation, remote database).]
Fig. 4. Results evaluation of the Apriori Grid Service.
A series of tests were run. First of all, the sequential
standalone implementation of the Apriori algorithm, called
by the Apriori Grid Service, and a classical implementation
(Weka library) [15] were compared.
The organisation of the datasets was given in the form
‘‘TxxIyyDzzzK’’, where ‘‘xx’’ denotes the average number
of items present per transaction, ‘‘yy’’ denotes the average
support of each item in the dataset and ‘‘zzzK’’ denotes the
total number of transactions in thousands (‘‘K’’). The experiments were performed for four database sizes (3000, 7000,
10,000 and 50,000 transactions), and the rules
were generated with support factors of 30%, 25% and 20% (see
Fig. 3).
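The naming scheme above can be decoded mechanically. The short Python sketch below is an illustration only: the function and field names are hypothetical, the field meanings follow the description given in this section, and ‘‘T10I4D10K’’ is an example name, not one of the datasets reported here.

```python
import re
from typing import NamedTuple


class DatasetSpec(NamedTuple):
    avg_items_per_transaction: int  # the "xx" part
    avg_item_support: int           # the "yy" part
    transactions: int               # the "zzzK" part, expanded from thousands


_NAME_PATTERN = re.compile(r"^T(\d+)I(\d+)D(\d+)K$")


def parse_dataset_name(name: str) -> DatasetSpec:
    """Split a 'TxxIyyDzzzK' name into its three numeric fields."""
    match = _NAME_PATTERN.match(name)
    if match is None:
        raise ValueError(f"not a TxxIyyDzzzK dataset name: {name!r}")
    t, i, d = (int(group) for group in match.groups())
    return DatasetSpec(t, i, d * 1000)


spec = parse_dataset_name("T10I4D10K")
```

For the example name, the parser reports 10 items per transaction on average and 10,000 transactions.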
The efficiency of the Apriori Grid Service algorithm
compared with the classical Apriori is presented in
Fig. 4, which shows a performance improvement for
the Grid version of the association algorithm over
the classical Apriori of 25% for 30% support, rising
to about 27% for 25% support and to 31% for
20% support, all for the larger dataset (10,000 transactions).
Also, a series of distributed tests using the ‘‘Virtual
Organization’’ implemented on the local intranet were performed. We called the Apriori Grid Service with different
series of parameters (remote databases, algorithm support
and confidence factors, rules generated) and we obtained
accurate results in real time.
7. Conclusions and future work
This paper presents some aspects of architectures, algorithms and implementations of two emerging fields: Data
mining and Grid technologies. We propose the design
and the implementation of the Apriori algorithm using
the Grid service infrastructure. Our approach for distributed association mining is based on the Open Grid Services Architecture and the implementation is realised
using the Globus Toolkit version 3 and the Apriori algorithm.
We set up an experimental evaluation and the results
were compared with the related work. The Apriori Grid
uses a library based on the classical Apriori algorithm,
but the implementation is original, it was optimised and it
was evaluated with classical and new datasets. The results
were comparable with the existing implementations.
The practical results show the efficiency of Grid services
that use algorithms with higher degrees of complexity. The
process of mining association rules implies multiple database scans and resource-intensive operations. The
use of Grid technologies can improve the response time
of such applications because of the capacity to combine
the computing power of various geographically distributed
resources. The presented method also has other advantages,
such as distributed processing and increased availability, flexibility, portability and extensibility.
Future work is focused on the evaluation and optimisation of the association rules discovery system implementation. Another important issue is the implementation of the
parallel version of the Apriori algorithm using the Message
Passing Interface extended with Grid services (MPICH-G2).
Acknowledgement
This work was partially supported by grant no. 139/
2004 of the Information Society Romanian Project.
References
[1] Piatetsky-Shapiro G. Knowledge discovery in databases. AAAI/MIT
Press; 1991.
[2] Fayyad UM, Piatetsky-Shapiro G, Smyth P, Uthurusamy R.
Advances in knowledge discovery and data mining. Menlo Park,
Calif: AAAI Press; 1996.
[3] Agrawal R, Imielinski T, Swami A. Mining association rules between
sets of items in large databases. In: Proc. ACM SIGMOD Intl. Conf.
Management Data, 1993.
[4] Agrawal R, Srikant R. Fast algorithms for mining association rules.
In: Proc. 20th VLDB Conference, 1994.
[5] Foster I, Kesselman C, Tuecke S. The anatomy of the grid: enabling
scalable virtual organizations. Int’l J. High-Perform Comput Appl
2001;15(3).
[6] Foster I, Kesselman C, editors. The grid: blueprint for a new
computing infrastructure. San Francisco: Morgan Kaufmann; 1999.
[7] Chattratichat J, Darlington J, Guo Y, Hedvall S, Köhler M, Syed J.
An architecture for distributed enterprise data mining. HPCN Europe
1999, Lecture Notes in Computer Science, vol. 1593. Springer; 1999.
[8] Stolfo SJ, Prodromidis AL, Tselepis S, Lee W, Fan DW, Chan PK.
JAM: Java agents for meta-learning over distributed databases.
International KDD’97 Conference, pp. 74–81, 1997.
[9] Kargupta H, Park B, Hershberger D, Johnson E. Collective data
mining: a new perspective toward distributed data mining. In:
Kargupta Hillol, Chan Philip, editors. Adv Distributed Parallel
Knowl Discov. AAAI Press; 1999.
[10] Grossman R, Bailey S, Kasif S, Mon D, Ramu A, Malhi B. The
preliminary design of papyrus: a system for high-performance,
distributed data mining over clusters, meta-clusters and super-clusters.
International KDD’98 Conference, 1998.
[11] Rana OF, Walker DW, Li M, Lynden S, Ward M. PaDDMAS:
parallel and distributed data mining application suite. In: Proc Int
Parallel Distributed Process Sympos (IPDPS/SPDP). IEEE Computer Society Press; 2000.
[12] Cannataro M, Talia D, Trunfio P. Knowledge GRID: High-Performance Knowledge Discovery Services on the Grid, 2nd Int. Workshop on Grid Computing (GRID 2001) in conjunction with
Supercomputing 2001, LNCS 2242, Springer Verlag, 2001.
[13] Globus Toolkit, 2001, http://www.globus.org.
[14] Foster I, et al. The Physiology of the Grid: An Open Grid Services
Architecture for Distributed Systems Integration, tech. report,
Globus Project; 2002.
[15] Witten IH, Frank E. Data mining: practical machine learning tools
with Java implementations. San Francisco: Morgan Kaufmann; 2000.