
IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 14, NO. 3, MAY/JUNE 2002

Effect of Data Skewness and Workload Balance in Parallel Data Mining
David W. Cheung, Member, IEEE, Sau D. Lee, and Yongqiao Xiao

Abstract—To mine association rules efficiently, we have developed a new parallel mining algorithm FPM on a distributed share-nothing parallel system in which data are partitioned across the processors. FPM is an enhancement of the FDM algorithm, which we previously proposed for distributed mining of association rules [8]. FPM requires fewer rounds of message exchanges than FDM and, hence, has a better response time in a parallel environment. The algorithm has been experimentally found to outperform CD, a representative parallel algorithm for the same goal [2]. The efficiency of FPM is attributed to the incorporation of two powerful candidate-set pruning techniques: distributed and global prunings. The two techniques are sensitive to two data distribution characteristics: data skewness and workload balance. Metrics based on entropy are proposed for these two characteristics. The prunings are very effective when both the skewness and balance are high. In order to increase the efficiency of FPM, we have developed methods to partition a database so that the resulting partitions have high balance and skewness. Experiments have shown empirically that our partitioning algorithms can achieve these aims very well; in particular, the results are consistently better than a random partitioning. Moreover, the partitioning algorithms incur little overhead. So, using our partitioning algorithms and FPM together, we can mine association rules from a database efficiently.

Index Terms—Association rules, data mining, data skewness, workload balance, parallel mining, partitioning.

1 INTRODUCTION

MINING association rules in large databases is an important problem in data mining research [1], [2], [4], [6], [11], [12], [15], [17], [19], [20], [24]. It can be reduced to finding large itemsets with respect to a given support threshold [1], [2]. The problem demands a lot of CPU resources and disk I/O to solve: it needs to scan all the transactions in a database, which introduces much I/O, and, at the same time, search through a large set of candidates for large itemsets, which requires a lot of CPU computation. Thus, parallel mining may deliver an effective solution to this problem. In this paper, we study the behavior of parallel association rule mining algorithms in parallel systems with a share-nothing memory. In this model, the database is partitioned and distributed across the local disks of the processors. We investigate how the partitioning method affects the performance of the algorithms and then propose new partitioning methods to exploit this finding to speed up the parallel mining algorithms.

The prime activity in finding large itemsets is the computation of support counts of candidate itemsets. Two different paradigms have been proposed for a parallel system with distributed memory for this purpose. The first one is count distribution and the second one is data distribution [3]. Algorithms that use the count distribution paradigm include CD (Count Distribution) [3] and PDM (Parallel Data Mining) [18]. Algorithms which adopt the data distribution paradigm include DD (Data Distribution) [3], IDD (Intelligent Data Distribution) [10], and HPA (Hash Based Parallel) [23].

In the count distribution paradigm, each processor is responsible for computing the local support counts of all the candidates, which are the support counts in its partition. By exchanging the local support counts, all processors then compute the global support counts of the candidates, which are the total support counts of the candidates from all the partitions. Subsequently, large itemsets are computed by each processor independently. The merit of this approach is the simple communication scheme: the processors need only one round of communication in every iteration. This makes it very suitable for a parallel system when considering response time. CD [3] is a representative algorithm in count distribution; it was implemented on an IBM SP2. PDM [18] is a modification of CD with the inclusion of the direct hashing technique proposed in [17]. Since every processor in the count distribution approach is required to keep the local support counts of all the candidates at each iteration, one possible problem is the space required to maintain the local support counts of a large number of candidate sets.

In the data distribution paradigm, to ensure enough memory for the candidates, each processor is responsible for keeping the support counts of only a subset of the candidates. However, transactions (or their subsets) in different partitions must then be sent to other processors for counting purposes. Compared with sending support counts, sending transaction data requires a lot more communication bandwidth. DD (Data Distribution) is the first proposed data distribution algorithm [3]. It has been implemented on an IBM SP2. The candidates in DD are distributed equally over all the processors in a round-robin fashion. Then, every processor ships its database partition to all other processors for support count computation. It generates a lot of redundant computation because every transaction is processed as many times as the number of processors. In addition, it requires a lot of communication, and its performance is worse than CD [3]. IDD (Intelligent Data Distribution) and its variant HD (Hybrid Distribution) are important improvements on DD [10]. They partition the candidates across the processors based on the first item of a candidate. Therefore, each processor only needs to handle the subsets of a transaction which begin with the items assigned to the processor. This significantly reduces the redundant computation in DD. HPA (Hash Based Parallel), which is very similar to IDD, uses a hashing technique to distribute the candidates to different processors [23].

One problem of data distribution needs to be noted: it requires at least two rounds of communication in each iteration at each processor, one to send its transaction data to other processors and one to broadcast the large itemsets found subsequently to other processors for candidate set generation in the next iteration. This two-round communication scheme puts data distribution in an unfavorable situation when considering response time.

In this work, we investigate parallel mining employing the count distribution approach. This approach requires less bandwidth and has a simple one-round communication scheme. To tackle the problem of a large number of candidate sets in count distribution, we adopt two effective techniques, distributed pruning and global pruning, to prune and reduce the number of candidates in each iteration. These two techniques make use of the local support counts of large itemsets found in an iteration to prune candidates for the next iteration. They have been adopted in a mining algorithm FDM (Fast Distributed Mining) previously proposed by us for distributed databases [7], [8]. However, FDM is not suitable for a parallel environment; it requires at least two rounds of message exchanges in each iteration, which increases the response time significantly. We have adapted the two pruning techniques to develop a new parallel mining algorithm FPM (Fast Parallel Mining), which requires only one round of message exchange in each iteration. Its communication scheme is as simple as that in CD, and it has a much smaller number of candidate sets due to the pruning. In the rare case that the set of candidates is still too large to fit into the memory of each processor even after the pruning, we can integrate the pruning techniques with the algorithm HD into a 2-level cluster algorithm. This approach will provide the scalability to handle candidate sets of any size and at the same time maintain the benefit of effective pruning. (For details, please see the discussion in Section 7.2.)

In this paper, we focus on studying the performance behavior of FPM and CD. It depends heavily on the distribution of data among the partitions of the database. To study this issue, we first introduce two metrics, skewness and balance, to describe the distribution of data in the databases. Then, we analytically study their effects on the performance of the two mining algorithms and verify the results empirically. Next, we propose algorithms to partition the database so that good skewness and balance values are obtained. Finally, we do experiments to find out how effective these partitioning algorithms are.

We have captured the distribution characteristics in two factors: data skewness and workload balance. Intuitively, a partitioned database has high data skewness if most globally large itemsets¹ are locally large only at a few partitions. Loosely speaking, a partitioned database has a high workload balance if all the processors have similar numbers of locally large itemsets.² We have defined quantitative metrics to measure data skewness and workload balance. We found that both distributed and global prunings have superior performance in the best case of high data skewness and high workload balance. The combination of high balance with moderate skewness is the second best case. Inspired by this finding, we investigate the feasibility of planning the partitioning of the database. We want to divide the data into different partitions so as to maximize the workload balance and yield high skewness. Mining a database by partitioning it appropriately and then employing FPM gives us excellent mining performance. We have implemented FPM on an IBM SP2 parallel machine with 32 processors. Extensive performance studies have been carried out. The results confirm our observation on the relationship between pruning effectiveness and data distribution.

For the purpose of partitioning, we have proposed four algorithms and implemented them to study their effectiveness. K-means clustering, like most clustering algorithms, provides good skewness; however, it would, in general, destroy the balance. Random partitioning, in general, can deliver high balance but very low skewness. We introduce an optimization constraint to control the balance factor in the k-means clustering algorithm. This modification, called Bk (balanced k-means clustering), produces results which exhibit as good a balance as the random partitioning and also high skewness. In conclusion, we found that Bk is the most favorable partitioning algorithm among those we have studied.

We summarize our contributions as follows:

1. We have enhanced FDM to FPM for mining association rules on a distributed share-nothing parallel system, which requires fewer rounds of message communication.
2. We have analytically shown that the performance of the pruning techniques in FPM is very sensitive to the data distribution characteristics of skewness and balance, and we have proposed entropy-based metrics to measure these two characteristics.
3. We have implemented FPM on an SP2 parallel machine and experimentally verified its performance behavior with respect to skewness and balance.
4. We have proposed four partitioning algorithms and empirically verified that Bk, among the four, is the most effective in introducing balance and skewness into database partitions.

1. An itemset is locally large at a processor if it is large within the partition at the processor. It is globally large if it is large with respect to the whole database [7], [8]. Note that every globally large itemset must be locally large at some processor. Refer to Section 3 for details.
2. More precise definitions of skewness and workload balance will be given in Section 4.

The authors are with the Department of Computer Science and Information Systems, The University of Hong Kong, Hong Kong. E-mail: {dcheung, sdlee, yqxiao}@csis.hku.hk.
Manuscript received 28 July 1997; revised 1 Jan. 1999; accepted 11 Dec. 2000; posted to Digital Library 7 Sept. 2001. For information on obtaining reprints of this article, please send e-mail to tkde@computer.org and reference IEEECS Log Number 105428.

Fig. 1. The count distribution algorithm.

The rest of this paper is organized as follows: Section 2 overviews the parallel mining of association rules. The techniques of distributed and global prunings, together with the FPM algorithm, are described in Section 3. In the same section, we also investigate the relationship between the effectiveness of the prunings and the data distribution characteristics. In Section 4, we define two metrics to measure the data skewness and workload balance of a database partitioning and then present the results of an experimental study on the performance behavior of FPM and CD. Based on the results of Section 4, we introduce algorithms in Section 5 to partition a database so as to improve the performance of FPM. This is achieved by arranging the tuples in the database carefully to increase skewness and balance. Experiments in Section 6 evaluate the effectiveness of the partitioning algorithms. In Section 7, we discuss a few issues, including possible extensions of FPM to enhance its scalability, and we give our conclusions in Section 8.

2 PARALLEL MINING OF ASSOCIATION RULES

2.1 Association Rules

Let $I = \{i_1, i_2, \ldots, i_m\}$ be a set of items and D be a database of transactions, where each transaction T consists of a set of items such that $T \subseteq I$. An association rule is an implication of the form $X \Rightarrow Y$, where $X \subseteq I$, $Y \subseteq I$, and $X \cap Y = \emptyset$. An association rule $X \Rightarrow Y$ has support s in D if the probability that a transaction in D contains both X and Y is s. The association rule $X \Rightarrow Y$ holds in D with confidence c if the probability that a transaction in D which contains X also contains Y is c. The task of mining association rules is to find all the association rules whose support is larger than a given minimum support threshold and whose confidence is larger than a given minimum confidence threshold. For an itemset X, we use X.sup to denote its support count in database D, which is the number of transactions in D containing X. An itemset $X \subseteq I$ is large if $X.sup \geq minsup \times |D|$, where minsup is the given minimum support threshold. For the purpose of presentation, we sometimes just use support to stand for the support count of an itemset.

It has been shown that the problem of mining association rules can be decomposed into two subproblems [1]: 1) find all large itemsets for a given minimum support threshold and 2) generate the association rules from the large itemsets found. Since 1) dominates the overall cost, research has been focused on how to efficiently solve the first subproblem.

In the parallel environment, it is useful to distinguish between the two different notions of locally large and globally large itemsets. Suppose the entire database D is partitioned into $D_1, D_2, \ldots, D_n$ and distributed over n processors. Let X be an itemset; the global support of X is the support X.sup of X in D. When referring to a partition $D_i$, the local support of X at processor i, denoted by $X.sup^{(i)}$, is the support of X in $D_i$. X is globally large if $X.sup \geq minsup \times |D|$. Similarly, X is locally large at processor i if $X.sup^{(i)} \geq minsup \times |D_i|$. Note that, in general, an itemset X which is locally large at some processor i may not necessarily be globally large. On the contrary, every globally large itemset X must be locally large at some processor i. This result and its application have been discussed in detail in [7].

For convenience, we use the short form k-itemset to stand for a size-k itemset, which consists of exactly k items, and we use $L_k$ to denote the set of globally large k-itemsets. We have pointed out above that there is a distinction between locally and globally large itemsets. For discussion purposes, we will call a globally large itemset which is also locally large at processor i gl-large at processor i. We will use $GL_k^{(i)}$ to denote the set of gl-large k-itemsets at processor i. Note that $GL_k^{(i)} \subseteq L_k$ for all i, $1 \leq i \leq n$.
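The local/global distinction can be made concrete with a minimal sketch (the toy data and helper names are hypothetical, and minsup is a fraction as defined above):

    # Hypothetical toy data illustrating local vs. global support.
    partitions = [
        [{"A", "B"}, {"A", "C"}, {"A", "B", "C"}],  # D_1
        [{"B", "C"}, {"C"}, {"B", "C"}],            # D_2
    ]
    minsup = 0.5  # minimum support threshold, as a fraction

    def sup(itemset, transactions):
        """Support count: number of transactions containing `itemset`."""
        return sum(1 for t in transactions if itemset <= t)

    D = [t for p in partitions for t in p]
    X = {"B", "C"}
    locally_large = [sup(X, p) >= minsup * len(p) for p in partitions]
    globally_large = sup(X, D) >= minsup * len(D)
    # X = {B, C} is not locally large in D_1 (count 1 < 1.5) but is
    # locally large in D_2 (count 2 >= 1.5); it is globally large
    # (count 3 >= 3), consistent with the property that every globally
    # large itemset is locally large at some processor.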
2.2 Count Distribution Algorithm for Parallel Mining

Apriori is the most well-known serial algorithm for mining association rules [2]. It relies on the apriori_gen function to generate the candidate sets at each iteration. CD (Count Distribution) is a parallelized version of Apriori for parallel mining [3]. The database D is partitioned into $D_1, D_2, \ldots, D_n$ and distributed across n processors. In the first iteration of CD, every processor i scans its partition $D_i$ to compute the local supports of all the size-1 itemsets. All processors then engage in one round of support count exchange. After that, they independently find out the global support counts of all the items and then the large size-1 itemsets. For the other iterations k (k > 1), each processor i runs the program fragment in Fig. 1. In Step 1, it computes the candidate set $C_k$ by applying the apriori_gen function on $L_{k-1}$, the set of large itemsets found in the previous iteration. In Step 2, local support counts of the candidates in $C_k$ are computed by scanning $D_i$. In Step 3, local support counts are exchanged with all other processors to get global support counts. In Step 4, the globally large itemsets $L_k$ are computed independently by each processor. In the next iteration, CD increases k by one and repeats Steps 1-4 until no more candidates are found.
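One CD iteration at a processor can be sketched as follows. This is a minimal sketch, not the paper's implementation: apriori_gen is the standard join-and-prune candidate generation, and exchange_counts is a hypothetical stand-in for the one-round all-to-all count exchange.

    from itertools import combinations

    def apriori_gen(L_prev):
        """Candidates one size larger, all of whose (k-1)-subsets are
        in L_prev (join and prune in one pass; inefficient but simple)."""
        k = len(next(iter(L_prev))) + 1
        items = sorted({x for s in L_prev for x in s})
        return {c for c in map(frozenset, combinations(items, k))
                if all(frozenset(s) in L_prev
                       for s in combinations(c, k - 1))}

    def cd_iteration(L_prev, D_i, db_size, minsup, exchange_counts):
        C_k = apriori_gen(L_prev)                                # Step 1
        local = {X: sum(1 for t in D_i if X <= t) for X in C_k}  # Step 2
        global_counts = exchange_counts(local)                   # Step 3
        return {X for X, c in global_counts.items()              # Step 4
                if c >= minsup * db_size}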
3 PRUNING TECHNIQUES AND THE FPM ALGORITHM

CD has not taken advantage of the data partitioning in the parallel setting to prune its candidate sets. We propose a new parallel mining algorithm FPM which has adopted the distributed and global prunings proposed first in [7].

3.1 Candidate Pruning Techniques

3.1.1 Distributed Pruning

In Step 4 of CD (Fig. 1), after the support count exchange in the kth iteration, each processor can find out not only the large itemsets $L_k$, but also the processors at which an itemset X is locally large, for all itemsets $X \in L_k$. In other words, the subsets $GL_k^{(i)}$, $1 \leq i \leq n$, of $L_k$ can be identified at every processor. This information about locally large itemsets turns out to be very valuable in developing a pruning technique to reduce the number of candidates generated in CD.

Suppose the database is partitioned into $D_1$ and $D_2$ on processors 1 and 2. Further assume that both A and B are size-1 globally large itemsets, A is gl-large at processor 1 but not processor 2, and B is gl-large at processor 2 but not processor 1. Then, AB can never be globally large and, hence, does not need to be considered as a candidate. A simple proof of this result is as follows. If AB were globally large, it would be locally large (i.e., gl-large) at some processor. Assume that it is gl-large at processor 1; then its subset B must also be gl-large at processor 1, which contradicts the assumption. Similarly, AB cannot be gl-large at processor 2. Hence, AB cannot be globally large at all.

Following the above result, no two 1-itemsets which are not gl-large together at the same processor can be combined to form a size-2 globally large itemset. This observation can be generalized to size-k candidates. The subsets $GL_{k-1}^{(i)}$ ($1 \leq i \leq n$) together form a partition of $L_{k-1}$ (some of them may overlap). For $i \neq j$, no candidate needs to be generated by joining sets from $GL_{k-1}^{(i)}$ and $GL_{k-1}^{(j)}$. In other words, candidates can be generated by applying apriori_gen to each $GL_{k-1}^{(i)}$ ($1 \leq i \leq n$) separately and then taking their union. The set of size-k candidates generated with this technique is equal to $CG_k = \bigcup_{i=1}^{n} CG_k^{(i)} = \bigcup_{i=1}^{n} apriori\_gen(GL_{k-1}^{(i)})$. Since $GL_{k-1}^{(i)} \subseteq L_{k-1}$ for all i, $1 \leq i \leq n$, the number of candidates in $CG_k$ could be much less than that in $C_k = apriori\_gen(L_{k-1})$, the candidates in the Apriori and CD algorithms.

Based on the above result, we can prune away a size-k candidate if there is no processor at which all its size-(k-1) subsets are gl-large. This pruning technique is called distributed pruning [7]. The following example, taken from [7], shows that distributed pruning is more effective in reducing the candidate sets than the pruning in CD.

Example 1. Assume there are three processors which partition the database D into $D_1$, $D_2$, and $D_3$. Suppose the set of large 1-itemsets (computed at the first iteration) is $L_1 = \{A, B, C, D, E, F, G, H\}$, in which A, B, and C are locally large at processor 1; B, C, and D are locally large at processor 2; and E, F, G, and H are locally large at processor 3. Therefore, $GL_1^{(1)} = \{A, B, C\}$, $GL_1^{(2)} = \{B, C, D\}$, and $GL_1^{(3)} = \{E, F, G, H\}$.

Based on the above discussion, the set of size-2 candidate sets from processor 1 is $CG_2^{(1)} = apriori\_gen(GL_1^{(1)}) = \{AB, AC, BC\}$. Similarly, $CG_2^{(2)} = \{BC, BD, CD\}$ and $CG_2^{(3)} = \{EF, EG, EH, FG, FH, GH\}$. Hence, the set of size-2 candidate sets is $CG_2 = CG_2^{(1)} \cup CG_2^{(2)} \cup CG_2^{(3)}$, a total of 11 candidates.

However, if apriori_gen is applied to $L_1$, the set of size-2 candidate sets $C_2 = apriori\_gen(L_1)$ would have 28 candidates. This shows that it is very effective to use distributed pruning to reduce the candidate sets.

TABLE 1: High Data Skewness and High Workload Balance Case

3.1.2 Global Pruning

As a result of the count exchange at iteration k-1 (Step 3 of CD in Fig. 1), the local support counts $X.sup^{(i)}$, for all large (k-1)-itemsets and all processors i ($1 \leq i \leq n$), are available at every processor. Another powerful pruning technique, called global pruning, is developed by using this information. Let X be a candidate k-itemset. At each processor i, $X.sup^{(i)} \leq Y.sup^{(i)}$ if $Y \subseteq X$. Thus, $X.sup^{(i)}$ can never exceed $\min\{Y.sup^{(i)} \mid Y \subset X \text{ and } |Y| = k-1\}$. Hence,

$$X.maxsup = \sum_{i=1}^{n} \min\{Y.sup^{(i)} \mid Y \subset X \text{ and } |Y| = k-1\}$$

is an upper bound of X.sup. X can then be pruned away if $X.maxsup < minsup \times |D|$. This technique is called global pruning. Note that the upper bound is computed from the local support counts resulting from the previous count exchange.

Table 1 is an example in which global pruning is more effective than distributed pruning. In this example, the global support count threshold is 15 and the local support count threshold at each processor is five. Distributed pruning cannot prune away CD, as C and D are both gl-large at processor 2, whereas global pruning can prune away CD, as $CD.maxsup = 1 + 12 + 1 < 15$.

In fact, global pruning subsumes distributed pruning, as shown in the following theorem.

Theorem 1. If X is a k-itemset (k > 1) which is pruned away in the kth iteration by distributed pruning, then X is also pruned away by global pruning in the same iteration.

Proof. If X can be pruned away by distributed pruning, then there does not exist a processor at which all the size-(k-1) subsets of X are gl-large. Thus, at each processor i, there exists a size-(k-1) subset Y of X such that $Y.sup^{(i)} < minsup \times |D_i|$. Hence,

$$X.maxsup < \sum_{i=1}^{n} (minsup \times |D_i|) = minsup \times |D|.$$

Therefore, X is pruned away by global pruning. ∎

The reverse of Theorem 1 is not necessarily true, however. From the above discussions, it can be seen that the three pruning techniques (the one in apriori_gen, distributed pruning, and global pruning) have increasing pruning power, and the later ones subsume the earlier ones.
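Both prunings can be sketched in a few lines on top of the apriori_gen sketch from Section 2.2. This is a hedged illustration, not the authors' code: GL_prev[i] holds the gl-large (k-1)-itemsets at processor i, and local_sup[i] maps each globally large (k-1)-itemset to its local support count there (all of which are available at every processor after the count exchange).

    from itertools import combinations

    def distributed_pruning_candidates(GL_prev):
        """Union of apriori_gen applied to each GL_{k-1}(i) separately."""
        CG_k = set()
        for GL_i in GL_prev:          # one generation pass per processor
            CG_k |= apriori_gen(GL_i)
        return CG_k

    def maxsup(X, local_sup):
        """Global-pruning upper bound on X.sup (Section 3.1.2)."""
        k = len(X)
        return sum(min(counts[frozenset(Y)]
                       for Y in combinations(X, k - 1))
                   for counts in local_sup)

    def global_prune(CG_k, local_sup, minsup, db_size):
        return {X for X in CG_k
                if maxsup(X, local_sup) >= minsup * db_size}

    # With Example 1's GL_prev = [{A,B,C}, {B,C,D}, {E,F,G,H}] (as sets
    # of singleton frozensets), distributed_pruning_candidates returns
    # the 11 size-2 candidates instead of the 28 that CD would generate.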

Fig. 2. The FPM algorithm.

3.2 Fast Parallel Mining Algorithm (FPM)

We present the FPM algorithm in this section. It improves CD by adopting the two pruning techniques. The first iteration of FPM is the same as CD: each processor scans its partition to find the local support counts of all size-1 itemsets and uses one round of count exchange to compute the global support counts. At the end, in addition to $L_1$, each processor also finds out the gl-large itemsets $GL_1^{(i)}$, for $1 \leq i \leq n$. Starting from the second iteration, prunings are used to reduce the number of candidate sets. Fig. 2 is the program fragment of FPM at processor i for the kth (k > 1) iteration. In Step 1, distributed pruning is used: apriori_gen is applied to the sets $GL_{k-1}^{(i)}$, for all $i = 1, \ldots, n$, instead of to the set $L_{k-1}$.³ In Step 2, global pruning is applied to the candidates that survive the distributed pruning. The remaining steps are the same as those in CD. As has been discussed, FPM, in general, enjoys smaller candidate sets than CD. Furthermore, it uses a simple one-round message exchange scheme, the same as CD. If we compare FPM with the FDM algorithm proposed in [7], we will see that this simple communication scheme makes FPM more suitable than FDM in terms of response time in a parallel system.

Note that, since the number of processors, n, would not be very large, the cost of generating the candidates with the distributed pruning in Step 1 (Fig. 2) should be of the same order as that in CD. As for global pruning, since all local support counts are available at each processor, no additional count exchange is required to perform the pruning. Furthermore, the pruning in Step 2 (Fig. 2) is performed only on the candidates remaining after the distributed pruning. Therefore, the cost of global pruning is small compared with database scanning and count updates.

3.3 Data Skewness and Workload Balance

In a partitioned database, two data distribution characteristics, data skewness and workload balance, affect the effectiveness of the pruning and, hence, the performance of FPM. Intuitively, the data skewness of a partitioned database is high if most large itemsets are locally large only at a few processors. It is low if a high percentage of the large itemsets are locally large at most of the processors. For a partitioning with high skewness, even though it is highly likely that each large itemset will be locally large at only a small number of partitions, the set of large itemsets together can still be distributed either evenly or extremely skewed among the partitions. In one extreme case, most partitions have similar numbers of locally large itemsets, and the workload balance is high. In the other extreme case, the large itemsets are concentrated at a few partitions and, hence, there are large differences in the numbers of locally large itemsets among different partitions; in this case, the workload is unbalanced. These two characteristics have an important bearing on the performance of FPM.

Example 2. Table 1 is a case of high data skewness and high workload balance. The supports of the itemsets are each clustered mostly in a single partition, so the skewness is high. On the other hand, every partition has the same number (two) of locally large itemsets.⁴ Hence, the workload balance is also high. CD will generate $\binom{6}{2} = 15$ candidates in the second iteration, while distributed pruning will generate only three candidates, AB, CD, and EF, which shows that the pruning has a good effect.

Table 2 is an example of high workload balance and low data skewness. The support counts of the items A, B, C, D, E, and F are almost equally distributed over the three processors. Hence, the data skewness is low. However, the workload balance is high because every partition has the same number (five) of locally large itemsets. Both CD and distributed pruning generate the same 15 candidate sets in the second iteration. However, global pruning can prune away the candidates AC, AE, and CE. FPM thus still exhibits a 20 percent improvement over CD in this pathological case of high balance and low skewness.

TABLE 2: High Workload Balance and Low Data Skewness Case

We will define formal metrics for measuring the data skewness and workload balance of partitions in Section 4. We will also see how high values of balance and skewness can be obtained by suitably and carefully partitioning the database in Section 5. In the following, we show the effects of distributed pruning analytically for some special cases.

3. Comparing this step with Step 1 of Fig. 1 is useful in order to see the difference between FPM and CD.
4. As mentioned above, the support threshold in Table 1 for globally large itemsets is 15, while that for locally large itemsets is 5 for all the partitions.
Theorem 2. Let $L_1$ be the set of size-1 large itemsets, and let $C_2^{(c)}$ and $C_2^{(d)}$ be the sets of size-2 candidates generated by CD and by distributed pruning, respectively. Suppose that each size-1 large itemset is gl-large at one and only one processor and that the size-1 large itemsets are distributed evenly among the processors, i.e., the number of size-1 gl-large itemsets at each processor is $|L_1|/n$, where n is the number of processors. Then,

$$\frac{|C_2^{(d)}|}{|C_2^{(c)}|} = \frac{(|L_1|/n) - 1}{|L_1| - 1} \approx \frac{1}{n}.$$

Proof. Since CD will generate all the combinations in $L_1$ as size-2 candidates, $|C_2^{(c)}| = \binom{|L_1|}{2} = \frac{|L_1|}{2}(|L_1| - 1)$. As for distributed pruning, each processor will generate candidates independently from the gl-large size-1 itemsets at the processor. The total number of candidates generated is $|C_2^{(d)}| = \binom{|L_1|/n}{2} \times n = \frac{|L_1|}{2}(|L_1|/n - 1)$. Since n is the number of processors, which is much smaller than $|L_1|$, we have $|L_1| \gg n \geq 1$. Hence, we can take the approximations $|L_1| - 1 \approx |L_1|$ and $|L_1|/n - 1 \approx |L_1|/n$. So, $\frac{|C_2^{(d)}|}{|C_2^{(c)}|} = \frac{(|L_1|/n) - 1}{|L_1| - 1} \approx \frac{1}{n}$. ∎

Theorem 2 shows that distributed pruning can dramatically prune away almost an $\frac{n-1}{n}$ fraction of the size-2 candidates generated by CD in the high-balance, good-skewness case.

We now consider a special case for the kth iteration (k > 2) of FPM. In general, if $A_{k-1}$ is the set of size-(k-1) large itemsets, then the maximum number of size-k candidates that can be generated by applying apriori_gen to $A_{k-1}$ is equal to $\binom{m}{k}$, where m is the smallest integer such that $\binom{m}{k-1} \geq |A_{k-1}|$. In the following, we use this maximal case to estimate the number of candidates that can be generated in the kth iteration. Let $C_k^{(c)}$ and $C_k^{(d)}$ be the sets of size-k candidates generated by CD and by distributed pruning, respectively. Similar to Theorem 2, we investigate the case in which all gl-large (k-1)-itemsets are locally large at only one processor and the number of gl-large itemsets at each processor is the same. Let m be the smallest integer such that $\binom{m}{k-1} = |L_{k-1}|$. Then, we have $|C_k^{(c)}| = \binom{m}{k}$ and, hence, $|C_k^{(c)}| = \frac{m-k+1}{k} \times |L_{k-1}|$. Similarly, let $m'$ be the smallest integer such that

$$\binom{m'}{k-1} = \frac{|L_{k-1}|}{n}.$$

Then,

$$|C_k^{(d)}| = \binom{m'}{k} \times n.$$

Hence, $|C_k^{(d)}| = \frac{m'-k+1}{k} \times \frac{|L_{k-1}|}{n} \times n$. Therefore,

$$\frac{|C_k^{(d)}|}{|C_k^{(c)}|} = \frac{m'-k+1}{m-k+1}.$$

When k = 2, this result reduces to Theorem 2. In general, $m' < m$, which shows that distributed pruning has a significant effect in almost all iterations. However, the effect decreases as $m'$ converges to m when k increases.

4 METRICS FOR DATA SKEWNESS AND WORKLOAD BALANCE

In this section, we define metrics to measure data skewness and workload balance. Then, we present empirical results on the effects of skewness and balance on the performance of FPM and CD.

It is important to note that the simple intuition of defining balance as a measure of the evenness of distributing transactions among the partitions is not suitable in our studies. The performance of the prunings is linked to the distribution of the large itemsets, not that of the transactions. Furthermore, the metrics for skewness and balance should be consistent between themselves. In the following, we explain our entropy-based metrics for these two notions.

4.1 Data Skewness

We develop a skewness metric based on the well-established notion of entropy [5]. Given a random variable X, its entropy is a measurement of how even or uneven its probability distribution is over its values. If a database is partitioned over n processors, the value $p_X(i) = X.sup^{(i)} / X.sup$ can be regarded as the probability that a transaction containing itemset X comes from partition $D_i$ ($1 \leq i \leq n$). The entropy $H(X) = -\sum_{i=1}^{n} p_X(i) \log(p_X(i))$ is an indication of how evenly the supports of X are distributed over the partitions.⁵ For example, if X is skewed completely into a single partition $D_k$ ($1 \leq k \leq n$), i.e., it only occurs in $D_k$, then $p_X(k) = 1$ and $p_X(i) = 0$ for all $i \neq k$; the value $H(X) = 0$ is minimal in this case. On the other hand, if X is evenly distributed among all the partitions, then $p_X(i) = 1/n$, $1 \leq i \leq n$, and the value $H(X) = \log(n)$ is maximal in this case. Therefore, the following metric can be used to measure the skewness of a database partitioning.

Definition 1. Given a database with n partitions, the skewness S(X) of an itemset X is defined by $S(X) = \frac{H_{max} - H(X)}{H_{max}}$, where $H(X) = -\sum_{i=1}^{n} p_X(i) \log(p_X(i))$ and $H_{max} = \log(n)$.

The skewness S(X) has the following properties:

- S(X) = 0 when all $p_X(i)$ ($1 \leq i \leq n$) are equal. So, the skewness is at its lowest value when X is distributed evenly over all partitions.
- S(X) = 1 when one $p_X(i)$ equals one and all the others are zero. So, the skewness is at its highest value when X occurs in only one partition.
- 0 < S(X) < 1 in all the other cases.

5. In the computation of H(X), some of the probability values $p_X(i)$ may be zero. In that case, we take $0 \log 0 = 0$, in the sense that $\lim_{h \to 0} h \log h = 0$.

Following the property of entropy, higher values of S(X) correspond to higher skewness of X. The above definition gives a metric for the skewness of an itemset. In the following, we define the skewness of a partitioned database as a weighted sum of the skewness of all its itemsets.

Definition 2. Given a database D with n partitions, the skewness TS(D) is defined by

$$TS(D) = \sum_{X \in IS} S(X) \cdot w(X),$$

where IS is the set of all the itemsets, $w(X) = \frac{X.sup}{\sum_{Y \in IS} Y.sup}$ is the weight of the support of X over all the itemsets, and S(X) is the skewness of itemset X.

TS(D) has properties similar to those of S(X):

- TS(D) = 0 when the skewness of every itemset is at its minimal value.
- TS(D) = 1 when the skewness of every itemset is at its maximal value.
- 0 < TS(D) < 1 in all the other cases.

We can compute the skewness of a partitioned database according to Definition 2. However, in general, the number of itemsets may be very large. One approximation is to compute the skewness over the set of globally large itemsets only and take the approximation $p_X(i) \approx 0$ if X is not gl-large in $D_i$. In the generation of candidate itemsets, only globally large itemsets will be joined together to form new candidates. Hence, only their skewness would impact the effectiveness of pruning. Therefore, this approximation is a reasonable and practical measure.

4.2 Workload Balance

Workload balance is a measurement of the distribution of the total weight of the locally large itemsets among the processors. Based on the definition of w(X) in Definition 2, we define $W_i = \sum_{X \in IS} w(X) \cdot p_X(i)$ to be the itemset workload of partition $D_i$, where IS is the set of all the itemsets. Note that $\sum_{i=1}^{n} W_i = 1$. A database has high workload balance if the $W_i$'s are the same for all partitions $D_i$, $1 \leq i \leq n$. On the other hand, if the values of $W_i$ exhibit large differences among themselves, the workload balance is low. Thus, our definition of the workload balance metric is also based on the entropy measure.

Definition 3. For a database D with n partitions, the workload balance factor (workload balance, for short) TB(D) is defined as

$$TB(D) = \frac{-\sum_{i=1}^{n} W_i \log(W_i)}{\log(n)}.$$

The metric TB(D) has the following properties:

- TB(D) = 1 when the workloads across all processors are the same.
- TB(D) = 0 when the workload is concentrated at one processor.
- 0 < TB(D) < 1 in all the other cases.

Similar to the skewness metric, we can approximate the value of TB(D) by considering only the globally large itemsets.
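Definitions 2 and 3, restricted as suggested above to the globally large itemsets, can be sketched as follows, reusing the skewness helper from Section 4.1. The input shape (a dict mapping each itemset to its list of local support counts) is an assumption of this sketch:

    from math import log

    def ts_tb(local_sups_by_itemset, n):
        """Approximate TS(D) and TB(D) over the globally large itemsets."""
        total = sum(sum(s) for s in local_sups_by_itemset.values())
        TS = sum(skewness(s) * (sum(s) / total)          # Definition 2
                 for s in local_sups_by_itemset.values())
        W = [sum(s[i] for s in local_sups_by_itemset.values()) / total
             for i in range(n)]                          # workload W_i
        TB = -sum(w * log(w) for w in W if w > 0) / log(n)  # Definition 3
        return TS, TB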
The data skewness and workload balance are not independent of each other. Theoretically, each of them may attain values between zero and one, inclusive. However, some combinations of their values are not admissible. For instance, we cannot have a database partitioning with both very low balance and very low skewness: a very low skewness is accompanied by a high balance, while a very low balance is accompanied by a high skewness.

Theorem 3. Let $D_1, D_2, \ldots, D_n$ be the partitions of a database D.

1. If TS(D) = 1, then the admissible values of TB(D) range from zero to one. Moreover, if TS(D) = 0, then TB(D) = 1.
2. If TB(D) = 1, then the admissible values of TS(D) range from zero to one. Moreover, if TB(D) = 0, then TS(D) = 1.

Proof.

1. By definition, $0 \leq TB(D) \leq 1$. What we need to prove is that the boundary cases are admissible when TS(D) = 1. TS(D) = 1 implies that S(X) = 1 for every large itemset X. Therefore, each large itemset is large at one and only one partition. If all the large itemsets are large at the same partition $D_i$, then $W_i = 1$ and $W_k = 0$ ($1 \leq k \leq n$, $k \neq i$); thus, TB(D) = 0 is admissible. On the other hand, if every partition has the same number of large itemsets, then $W_i = 1/n$ ($1 \leq i \leq n$) and, hence, TB(D) = 1. Furthermore, if TS(D) = 0, then S(X) = 0 for every large itemset X. This implies that the $W_i$ are the same for all $1 \leq i \leq n$. Hence, TB(D) = 1.
2. It follows from the first result of this theorem that both TS(D) = 0 and TS(D) = 1 are admissible when TB(D) = 1, so the first part is proven. Furthermore, if TB(D) = 0, there exists a partition $D_i$ such that $W_i = 1$ and $W_k = 0$ ($1 \leq k \leq n$, $k \neq i$). This implies that all large itemsets are locally large only at $D_i$. Hence, TS(D) = 1. ∎

Even though $0 \leq TS(D) \leq 1$ and $0 \leq TB(D) \leq 1$, not all possible combinations are admissible. In general, the admissible combinations form a subset of the unit square, represented by the shaded region in Fig. 6. It always contains the two line segments TS(D) = 1 (S = 1 in Fig. 6) and TB(D) = 1 (B = 1 in Fig. 6), but not the origin (S = 0, B = 0). Having defined the metrics and studied their characteristics, we can experimentally validate our analysis (see Section 3.3) of the relationship between data skewness, workload balance, and the performance of FPM and CD.

We would like to note that the two metrics are based on total entropy, which is a good model for measuring the evenness (or unevenness) of data distribution. Also, they are consistent with each other.

4.3 Performance Behaviors of FPM and CD

We study the performance behaviors of FPM and CD in response to various skewness and balance values on an IBM SP2 parallel processing machine with 32 nodes. Each node consists of a POWER2 processor with a CPU clock rate of 66.7 MHz and 64 MB of main memory. The system runs the AIX operating system. Communication between processors is done through a high-performance switch with an aggregated peak bandwidth of 40 MBps and a latency of about 40 microseconds. The appropriate database partition is downloaded to the local disk of each processor before mining starts. The databases used for the experiments are all synthesized according to a model which is an enhancement of the model adopted in [2]. Due to insufficient paper space, the description is omitted.

TABLE 3: Performance Improvement of FPM over CD

Fig. 3. Relative performance on databases with high balance.

4.3.1 Improvement of FPM over CD

In order to compare the performance of FPM and CD, we have generated a number of databases. The size of every partition of these databases is about 100 MB, and the number of partitions is 16, i.e., n = 16.⁶ We also set N = 1,000 and L = 2,000 and the correlation level to 0.5. The name of each database is of the form Dx.Ty.Iz.Sr.Bl, where x is the average number of transactions per partition, y is the average size of the transactions, and z is the average size of the itemsets. These three values are the control values for the database generation. The two values r and l are the control values (in percentage) of the skewness and balance, respectively. They are added to the name in the sense that they are intrinsic properties of the database.

We ran FPM and CD on various databases. The minimum support threshold is 0.5 percent. The improvements of FPM over CD in response time on these databases are recorded in Table 3. In the table, each entry corresponds to the results of one database, and the value of the entry is the speedup ratio of FPM to CD, i.e., the response time of CD over that of FPM. Entries corresponding to the same skewness value are put in the same row, while entries in the same column correspond to databases with the same balance value. The results are very encouraging: FPM is consistently faster than CD in all cases. Comparing the figures for the databases D2016K.T10.I4 against those of D3278K.T5.I2.S10, and comparing D1140K.T20.I6.S90 against D2016K.T10.I4.S10, we observe that the larger the transaction sizes and the longer the large itemsets, the more significant the improvements are, in general. Obviously, the distributed and global prunings adopted in FPM are very effective: the sizes of the candidate sets are significantly reduced and, hence, FPM outperforms CD significantly.

4.3.2 Performance of FPM with High Workload Balance

Fig. 3 shows the response times of the FPM and CD algorithms for databases with various skewness values and a high balance value of B = 100.⁷ FPM outperforms CD significantly even when the skewness is in the moderate range ($0.1 \leq s \leq 0.5$). This trend can also be read from Table 3. When B = 100, FPM is 36 percent to 110 percent faster than CD. In the more skewed cases, i.e., $70 \leq S \leq 90$, FPM is at least 107 percent faster than CD. When the skewness is moderate (S = 30), FPM is still 55 percent faster than CD. When B = 90, FPM maintains its performance lead over CD. The results clearly demonstrate that, given a high workload balance, FPM outperforms CD significantly when the skewness is in the high to moderate range.

4.3.3 Performance of FPM with High Skewness

Fig. 4 plots the response times of FPM and CD for databases with various balance values. The skewness is maintained at S = 90. In this case, FPM behaves slightly differently from the high workload balance case presented in the previous section. FPM performs much better than CD when the workload balance is relatively high (b > 0.5). However, its performance improvement over CD in the moderate balance range ($0.1 \leq b \leq 0.5$) is marginal. This implies that FPM is more sensitive to workload balance than to skewness. In other words, if the workload balance has dropped to a moderate value, even a high skewness cannot stop the degradation in performance improvement.

This trend can also be inferred from Table 3. When S = 90, FPM is 6 percent to 110 percent faster than CD, depending on the workload balance. In the more balanced databases, i.e., $90 \leq B \leq 100$, FPM is at least 69 percent faster than CD. In the moderate balance case ($50 \leq B \leq 70$), the performance gain drops to the 23 percent to 36 percent range. This result shows that a high skewness has to be accompanied by a high workload balance in order for FPM to deliver a good improvement. The effect of a high skewness with a moderate balance is not as good as that of a high balance with a moderate skewness.

6. Even though the SP2 we use has 32 nodes, because of administration policy, we can only use 16 nodes in our experiments.
7. We use B and S to represent the control values of the balance and skewness of the databases in Table 3. Hence, their unit is percentage and their values are in the range [0, 100]. On the other hand, in Figs. 3, 4, 5, and 6, we use s and b to represent the skewness and balance of the databases, which are values in the range [0, 1]. In real terms, B and b, and S and s, have the same values except for different units.

Fig. 4. Relative performance on databases with high skewness.


Fig. 5. Relative performance on databases when both skewness and balance are varied.
4.3.4 Performance of FPM with Moderate Skewness and Balance

In Fig. 5, we vary the skewness and balance together from a low-value combination to a high-value combination. The trend shows that the improvement of FPM over CD increases from a low percentage at s = 0.5, b = 0.5 to a high percentage at s = 0.9, b = 0.9. Reading Table 3, we find that the performance gain of FPM increases from around 6 percent (S = 50, B = 50) to 69 percent (S = 90, B = 90) as skewness and balance increase simultaneously. The combination S = 50, B = 50, in fact, is a point of low performance gain for FPM in the set of all admissible combinations in our experiments.

4.3.5 Summary of the Performance Behaviors of FPM and CD

We have done some other experiments to study the effects of skewness and balance on the performance of FPM and CD. Combining these results with our observations in the above three cases, we can divide the admissible area into several regions, as shown in Fig. 6. Region A is the region in which FPM outperforms CD the most. In this region, the balance is high, the skewness varies from high to moderate, and FPM performs 45 percent to 221 percent faster than CD. In region B, the workload balance has degraded moderately while the skewness remains high. The change in workload balance brings the performance gain of FPM down to a lower range of 35 percent to 45 percent. Region C covers combinations that have undesirable workload balance. Even though the skewness can be rather high in this region, because of the low balance value, the performance gain of FPM drops to a moderate range of 15 percent to 30 percent. Region D contains those combinations at the bottom of the performance ladder, in which FPM has only a marginal performance gain.

Thus, our empirical study clearly shows that high values of skewness and balance favor FPM. Furthermore, between skewness and balance, workload balance is more important than skewness. The study not only has shown that the performance of the prunings is very sensitive to the data distribution, it also has demonstrated that the two metrics are useful in distinguishing "favorable" distributions from "unfavorable" ones. In Fig. 6, regions C and D may cover more than half of the whole admissible area. For database partitions which fall in these regions, FPM may not be much better than CD. Because of this, it is important to partition a database in such a way that the resulting partitions fall in a more favorable region. In Sections 5 and 6, we will show that this is, in fact, possible by using the partitioning algorithms we propose.
Fig. 6. Division of the admissible regions according to the performance improvement of FPM over CD (FPM/CD).

5 PARTITIONING OF THE DATABASE

Suppose now that we have a centralized database and want to mine it for association rules. We have to divide the database into partitions and then run FPM on the partitions. If we can divide the database in a way that yields high balance and skewness across the partitions, we can obtain much greater savings in resources by using FPM. Even for an already partitioned database, redistributing the data among the partitions may also increase the skewness and balance, thus benefiting FPM. So, how to divide the database into partitions to achieve high balance and skewness becomes an interesting and important problem.

Note that not all databases can be divided into a given number of partitions so as to yield high skewness and balance. If the data in the database is already highly uniform and homogeneous, without much variation, then any method of dividing it into the given number of partitions will produce similar skewness and balance. However, most real-life databases are not uniform, and there are many variations within them. It is then possible to find a wise way of dividing the data tuples into different partitions that gives much higher balance and skewness than an arbitrary partitioning. Therefore, if a database intrinsically has nonuniformity, we may dig out this nonuniformity and exploit it to partition the database.

So, it is beneficial to partition the database carefully. Ideally, the partitioning method should maximize the skewness and balance metrics for any given database. However, such an optimization would be no easier than finding the association rules themselves, and its overhead would be too high to be worthwhile. So, instead of optimizing the skewness and balance values, we use low-cost algorithms that produce reasonably high balance and skewness values. These algorithms should be simple enough that not much overhead is incurred; the small overhead is far outweighed by the subsequent savings in running FPM.

5.1 Framework of the Partitioning Algorithms

To keep the partitioning algorithms simple, we base them on the following framework. The core part of the framework is a clustering algorithm, for which we will plug in different clustering algorithms to give different partitioning algorithms. The framework can be divided into three steps.

Conceptually, the first step of the framework divides the transactions in the database into equal-sized chunks $C_i$. Each chunk contains the same number, y, of transactions, so there will be a total of $z = \lceil |D|/y \rceil$ chunks. For each chunk, we define a signature $\sigma_i$, which is an |I|-dimensional vector. The jth ($j = 1, 2, \ldots, |I|$) element of the signature, $\sigma_{i,j}$, is the support count of the 1-itemset containing item number j in chunk $C_i$. Note that each signature $\sigma_i$ is a vector. This allows us to use functions and operations on vectors to describe our algorithms. The signatures $\sigma_i$ are then used as representatives of their corresponding chunks $C_i$. All the $\sigma_i$'s can be computed by scanning the database once. Moreover, we can immediately deduce the support counts of all 1-itemsets as $\sum_i \sigma_i$. This can indeed be exploited by FPM to avoid the first iteration, thus saving one scan in FPM (see Section 7.1 for details). So, overall, we can obtain the signatures without any extra database scans.

The second step is to divide the signatures $\sigma_i$ into n groups $G_k$, where n is the number of partitions to be produced. Each group corresponds to a partition in the resulting partitioning. The number of partitions n should be equal to the number of processors to be used for running FPM. A good partitioning algorithm should assign the signatures to the groups according to the following criteria. To increase the resulting skewness, we should put the signatures into groups so that the distance⁸ between signatures in the same group is small, but the distance between signatures in different groups is large. This tends to make the resulting partitions more different from one another and the transactions within each partition more similar to one another; hence, the partitioning has higher skewness. To increase the workload balance, each group should have a similar signature sum. One way to achieve this is to assign more or less the same number of signatures to each group such that the totals of the signatures in the groups are very close.

The third step of the framework distributes the transactions to the different partitions. For each chunk $C_i$, we check which group its signature $\sigma_i$ was assigned to in Step 2. If it was assigned to group $G_k$, then we send all transactions in that chunk to partition k. After sending out all the chunks, the whole database is partitioned.

It can easily be noticed that Step 2 is the core part of the partitioning algorithm framework; Steps 1 and 3 are simply the preprocessing and postprocessing parts. By using a suitable chunk size, we can reduce the total number of chunks, and hence signatures, to a suitable value, so that they can all be processed in RAM by the clustering algorithm in Step 2. This effectively reduces the amount of information that our clustering algorithm has to handle and, hence, the partitioning algorithms are very efficient.

We would like to note that, in our approach, we have made use only of the signatures from size-1 itemsets. We could have used those of larger itemsets. However, that would cost more and, eventually, it becomes the problem of finding the large itemsets. Using size-1 itemsets is a good trade-off and, as we have noted, the resources used in finding them are not wasted. Furthermore, our empirical results (in Section 6) show that by just using the size-1 itemsets, we can already achieve reasonable skewness and high balance.

8. Any valid distance function in the |I|-dimensional space may be used.
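The three-step framework can be sketched as follows. This is a hedged sketch under two assumptions: items are encoded as integer indices 0..|I|-1, and assign_groups is the pluggable Step-2 clustering routine discussed next.

    def build_signatures(transactions, y, num_items):
        """Step 1: one |I|-dimensional support-count vector per chunk
        of y transactions, computed in a single pass."""
        sigs = []
        for start in range(0, len(transactions), y):
            sig = [0] * num_items
            for t in transactions[start:start + y]:
                for item in t:
                    sig[item] += 1
            sigs.append(sig)
        return sigs

    def partition_database(transactions, y, num_items, n, assign_groups):
        sigs = build_signatures(transactions, y, num_items)
        group_of = assign_groups(sigs, n)     # Step 2: plug-in clustering
        parts = [[] for _ in range(n)]
        for c, g in enumerate(group_of):      # Step 3: ship whole chunks
            parts[g].extend(transactions[c * y:(c + 1) * y])
        return parts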

5.2 The Clustering Problem

With this framework, we have reduced the partitioning problem to a clustering problem. Our original problem is to partition a given database so that the resulting partitions give high skewness and balance values, with balance receiving more attention. Now, we have turned it into a clustering problem, which is stated as follows:

Problem. Given z |I|-dimensional vectors $\sigma_i$ ($i = 1, 2, \ldots, z$), called "signatures," assign them to n groups $G_j$ ($j = 1, 2, \ldots, n$) so as to achieve the following two criteria:

1. (Skewness) $\sum_{i=1}^{z} \sum_{j=1}^{n} \|\sigma_i - \mu_j\|^2 \, \delta(i, j)$ is minimized, where $\mu_j = \frac{1}{\nu_j} \sum_{i=1}^{z} \sigma_i \, \delta(i, j)$, $\delta(i, j) = 1$ if $\sigma_i$ is assigned to $G_j$ and $\delta(i, j) = 0$ otherwise, and $\nu_j = \sum_{i=1}^{z} \delta(i, j)$.
2. (Balance) $\nu_{j_1} = \nu_{j_2}$ for all $j_1, j_2 = 1, 2, \ldots, n$.

Here, note that the vector $\mu_j$ is the geometric centroid of the signatures assigned to group $G_j$, while $\nu_j$ is the number of signatures assigned to that group.⁹ The notation $\|x\|$ denotes the distance of vector x from the origin, measured with the chosen distance function.

The first criterion says that we want to minimize the distance between the signatures assigned to the same group; this is to achieve high skewness. The second criterion reads that each group shall be assigned the same number of signatures, so as to achieve high balance.

Note that it is nontrivial to meet both criteria at the same time, so we develop algorithms that attempt to find approximate solutions. Below, we give several clustering algorithms that are to be plugged into Step 2 of the framework to give various partitioning algorithms.

5.3 Some Straightforward Approaches

The simplest idea for solving the clustering problem is to assign the signatures $\sigma_i$ to the groups $G_j$ randomly.¹⁰ For each signature, we choose a uniformly random integer r between 1 and n (the number of partitions) and assign the signature to group $G_r$. As a result, each group will eventually receive roughly the same number of signatures, and hence chunks and transactions. This satisfies the balance criterion (see Section 5.2) but leaves the skewness criterion unattacked. So, this clustering method should yield high balance, close to unity; however, the skewness is close to zero because of the completely random assignment of signatures to groups. With this clustering algorithm, we get a partitioning algorithm which we will refer to as "random partitioning."

To achieve good skewness, we shall assign signatures to groups such that signatures near to one another go to the same group. Many clustering algorithms with such a goal have been developed. Here, we use one of the most famous ones: the k-means algorithm [14]. (We refer the reader to the relevant publications for a detailed description of the algorithm.) Since the k-means algorithm minimizes the sum of the distances of the signatures to the geometric centroids of their corresponding groups, it meets the skewness criterion (see Section 5.2). However, the balance criterion is completely ignored, since we impose no restrictions on the size of each group $G_j$. Consequently, some groups may get more signatures than the others and, hence, the corresponding partitions will receive more chunks of transactions. Thus, this algorithm should yield high skewness but does not guarantee good workload balance. In the subsequent discussions, we shall use the symbol "k" to denote the partitioning algorithm employing the k-means clustering algorithm.

Note that the random partitioning algorithm yields high balance but poor skewness, while k yields high skewness but low balance. Neither achieves our goal of getting high skewness as well as high balance. Nonetheless, these algorithms give us an idea of how high a balance or skewness value can be achieved by suitably partitioning a given database: the result of the random algorithm suggests the highest achievable balance of a database, while the k algorithm gives the highest achievable skewness value. Thus, they give us reference values with which to evaluate the effectiveness of the following two algorithms.

9. In the subsequent sections, the index j will be used for the domain of groups. So, it will implicitly take values from 1 to n.
10. In the subsequent sections, the index i will be used for the domain of signatures. So, it will implicitly take values 1, 2, ..., z.
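The two baselines can be sketched as Step-2 plug-ins for the framework of Section 5.1. This is a minimal sketch, not the authors' implementation: plain Lloyd-style k-means with squared Euclidean distance and a fixed number of rounds; a library k-means could equally be substituted.

    import random

    def assign_random(sigs, n):
        """Random partitioning: high balance, low skewness."""
        return [random.randrange(n) for _ in sigs]

    def assign_kmeans(sigs, n, rounds=20):
        """k-means: high skewness, but group sizes are unconstrained."""
        cents = random.sample(sigs, n)
        for _ in range(rounds):
            groups = [min(range(n),
                          key=lambda j: sum((x - c) ** 2
                                            for x, c in zip(s, cents[j])))
                      for s in sigs]
            for j in range(n):            # recompute the group centroids
                members = [s for s, g in zip(sigs, groups) if g == j]
                if members:
                    cents[j] = [sum(col) / len(members)
                                for col in zip(*members)]
        return groups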
assign the signatures i to the groups Gj randomly.10 For 5.4 Sorting by the Highest-Entropy Item (SHEI)
each signature, we choose a uniformly random integer r To achieve high skewness, we may use sorting. This idea
between 1 and n (the number of partitions) and assign the comes from the fact that sorting decreases the degree of
signature to group Gr . As a result, each group would disorder and hence entropy. So, the skewness measure
eventually receive roughly the same amount of signatures, should, according to Definitions 1 and 2, increase. If we sort
and hence chunks and transactions. This satisfies the the signatures i (i ˆ 1; 2; . . . ; z) in ascending order of the
balance criterion (see Section 5.2), but leaves the skewness k1 th coordinate value (i.e., i;k1 ), which is the support count
criterion unattacked. So, this clustering method should of item number k1 , and then divide the sorted list evenly
yield high balance, which is close to unity. However, into n equal-length consecutive sublists and assign each
skewness is close to zero because of the completely random sublist to a group Gj , we can obtain a partitioning with
assignment of signatures to groups. With this clustering good skewness. Since each sublist has the same length, the
algorithm, we get a partitioning algorithm, which we will resulting groups have the same number of signatures.
refer to ªrandom partitioning.º Consequently, the partitions generated will have equal
To achieve good skewness, we shall assign signatures to amount of transactions. This should give a balance better
groups such that signatures near to one another should go then k.
to the same group. Many clustering algorithms with such a This idea is illustrated with the example database in
goal have been developed. Here, we shall use one of the Table 4. This database has only six signatures and we divide
most famous ones: the k-means algorithm [14]. (We refer the it into two groups (z ˆ 6, n ˆ 2). The table shows the
readers to relevant publications for a detailed description of coordinate values of the items k1 and k2 of each signature.
the algorithm.) Since the k-means algorithm minimizes the Using the idea mentioned above, we sort the signatures
sum of the distances of the signatures to the geometric according to the k1 th coordinate values. This gives the
resulting ordering as shown in the table. Next, we divide the
9. In the subsequent sections, the index j will be used for the domain of signatures into two groups (since n ˆ 2), each of size 3, with
groups. So, it will implicitly take values from 1 to n.
10. In the subsequent sections, the index i will be used for the domain of the order of the signature being preserved. So, the first three
signatures. So, it will implicitly take values 1; 2; . . . ; z. signatures are assigned to group G1 , while the last three
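For concreteness, both baselines can be sketched in a few lines (our illustration; a plain Lloyd-style iteration stands in for whichever k-means variant of [14] is actually used):

```python
import numpy as np

def random_assign(signatures: np.ndarray, n: int, seed: int = 0) -> np.ndarray:
    """Random partitioning: each signature goes to a uniformly random group,
    giving balance close to unity but skewness close to zero."""
    rng = np.random.default_rng(seed)
    return rng.integers(0, n, size=len(signatures))

def kmeans_assign(signatures: np.ndarray, n: int, iters: int = 50, seed: int = 0) -> np.ndarray:
    """The k baseline: plain k-means on the signatures, which pursues the
    skewness criterion only and ignores group sizes (balance)."""
    rng = np.random.default_rng(seed)
    centroids = signatures[rng.choice(len(signatures), n, replace=False)].astype(float)
    for _ in range(iters):
        # Assign every signature to its nearest centroid (Euclidean distance).
        dists = np.linalg.norm(signatures[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each centroid to the mean of its members; keep it if the group is empty.
        for j in range(n):
            if np.any(labels == j):
                centroids[j] = signatures[labels == j].mean(axis=0)
    return labels
```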
5.4 Sorting by the Highest-Entropy Item (SHEI)

To achieve high skewness, we may use sorting. This idea comes from the fact that sorting decreases the degree of disorder and, hence, entropy. So, the skewness measure should, according to Definitions 1 and 2, increase. If we sort the signatures $\sigma_i$ ($i = 1, 2, \ldots, z$) in ascending order of the $k_1$th coordinate value (i.e., $\sigma_{i,k_1}$), which is the support count of item number $k_1$, and then divide the sorted list evenly into $n$ equal-length consecutive sublists and assign each sublist to a group $G_j$, we can obtain a partitioning with good skewness. Since each sublist has the same length, the resulting groups have the same number of signatures. Consequently, the partitions generated will have equal amounts of transactions. This should give a balance better than k.

TABLE 4
An Example Demonstrating the Idea of SHEI

This idea is illustrated with the example database in Table 4. This database has only six signatures and we divide it into two groups ($z = 6$, $n = 2$). The table shows the coordinate values of the items $k_1$ and $k_2$ of each signature. Using the idea mentioned above, we sort the signatures according to the $k_1$th coordinate values. This gives the resulting ordering as shown in the table. Next, we divide the signatures into two groups (since $n = 2$), each of size 3, with the order of the signatures being preserved. So, the first three signatures are assigned to group $G_1$, while the last three signatures are assigned to $G_2$. This assignment is also shown in the table. The database is subsequently partitioned by delivering the corresponding chunks $C_i$ to the corresponding partitions. Observe how the sorting has brought the transactions that contribute to the support count of item $k_1$ to partition 2. Since the coordinate values are indeed support counts of the corresponding 1-itemsets, we know that partition 1 has a support count of $0 + 3 + 3 = 6$ for item $k_1$, while partition 2 has a support count of $3 + 5 + 7 = 15$. Thus, the item $k_1$ has been made to be skewed towards partition 2.

Now, why did we choose item $k_1$ for the sort key, but not item $k_2$? Indeed, we should not choose an arbitrary item for the sort key because not all items can give good skewness after the sorting. For example, if the item $k_1$ occurs equally frequently in each chunk $C_i$, then all signatures $\sigma_i$ will have the same value in the coordinate corresponding to the support count of item $k_1$. Sorting the signatures using this key will not help increase the skewness. On the other hand, if item $k_1$ has a very uneven distribution among the chunks $C_i$, sorting would tend to deliver chunks with a higher occurrence of $k_1$ to the same or nearby groups. In this case, skewness can be increased significantly by sorting. So, we shall choose items with an uneven distribution among the chunks $C_i$ for the sort key. To measure the unevenness, we use the statistical entropy measure again. For every item $X$, we evaluate its statistical entropy value among the signature coordinate values $\sigma_{i,X}$ over all chunks $C_i$. The item with the highest entropy value is the most unevenly distributed. Besides considering the unevenness, we have to consider the support count of item $X$, too. If the support count of $X$ is very small, then we gain little by sorting on $X$. So, we should consider both the unevenness and the support count of each item $X$ in order to determine the sort key. We multiply the entropy value with the total support count of the item in the whole database (which can be computed by summing up all signature vectors). The product gives us a measure of how frequent and how uneven an item is in the database. The item which gets the highest value for this product is chosen for the sort key because it is both frequent and unevenly distributed in the database. In other words, we choose as the sort key the item $X$ which has the largest value of $\Phi_X = \mathrm{entropy}(\{\sigma_{i,X}\}_{i=1}^{z}) \times \sum_{i=1}^{z} \sigma_{i,X}$. The item with the second highest value for this product is used for the secondary sort key. We can similarly determine tertiary and quaternary sort keys, etc.

Sorting the signatures according to the keys selected will yield reasonably high skewness. However, the balance factor is not good. The balance factor is primarily guaranteed by the equal size of the groups $G_j$. However, this is not sufficient. This is because the sorting tends to move the signatures with large coordinate values, and hence the large itemsets, to the same group. The "heavier" signatures will be concentrated in a few groups, while the "lighter" signatures are concentrated in a few other groups. Since the coordinate values are indeed support counts, the sorting would thus distribute the workloads unevenly. To partly overcome this problem, we sort on the primary key in ascending order of coordinate values (which is the support count of the item), on the secondary key in descending order of the coordinate values, etc. By alternating the direction of sorting in successive keys, our algorithm can distribute the workload quite evenly, while maintaining a reasonably high skewness.

This idea of using auxiliary sort keys and alternating sort orders is illustrated in Table 4. Here, the primary sort key is $k_1$, while $k_2$ is the secondary sort key. The primary sort key is processed in ascending order, while the secondary key is processed in descending order. Note how the secondary sort key has helped in assigning $\sigma_6$ and $\sigma_3$ to one group and $\sigma_2$ to another, when they have the same value in the primary sort key. This has helped in increasing the skewness by gathering the support counts of $k_2$ for those signatures sharing the same value in the primary sort key. Note also how the alternating sort order has slightly improved the balance, by assigning one less unit of support count, carried by $C_2$ (corresponding to $\sigma_2$), to $G_2$.

So, this resulting partitioning algorithm, which we shall call "Sorting by the Highest-Entropy Item" (abbreviated as "SHEI"), should give high balance and reasonably good skewness.
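The whole procedure, from key selection with a score of the form $\Phi_X$ down to the equal-length cut, can be sketched as follows. This is our illustration only: the normalized "anti-entropy" used below as the unevenness measure is a stand-in for the paper's exact entropy-based measure of Definitions 1 and 2.

```python
import numpy as np

def shei_assign(signatures: np.ndarray, n: int, n_keys: int = 2) -> np.ndarray:
    """SHEI sketch: pick the most frequent-and-uneven items as sort keys,
    sort the signatures with alternating key directions, then cut the sorted
    list into n equal-length consecutive sublists, one per group."""
    z = len(signatures)
    totals = signatures.sum(axis=0).astype(float)        # total support of each item
    shares = signatures / np.where(totals == 0, 1.0, totals)
    with np.errstate(divide="ignore", invalid="ignore"):
        h = -np.where(shares > 0, shares * np.log(shares), 0.0).sum(axis=0)
    unevenness = 1.0 - h / np.log(z)   # 1 if all support sits in one chunk, 0 if spread evenly
    score = unevenness * totals        # analogue of Phi_X = entropy measure * total support
    keys = np.argsort(score)[::-1][:n_keys]
    # Alternating directions: primary key ascending, secondary descending, ...
    cols = [signatures[:, k] if d % 2 == 0 else -signatures[:, k]
            for d, k in enumerate(keys)]
    order = np.lexsort(tuple(reversed(cols)))  # np.lexsort treats the last key as primary
    labels = np.empty(z, dtype=int)
    for j, sublist in enumerate(np.array_split(order, n)):
        labels[sublist] = j                    # equal-length consecutive sublists
    return labels
```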
5.5 Balanced k-means Clustering (Bk)

The balanced k-means clustering algorithm, which we will abbreviate as "Bk," is a modification of the k-means algorithm. The k algorithm achieves the skewness criterion. However, it does not pay any effort to achieve the balance criterion. In Bk, we remedy this by assigning the signatures $\sigma_i$ to the groups $G_j$ while minimizing the value of the following expression. We also add the constraint that each group receives the same number of signatures. The problem is stated as follows:

$$\text{Minimize } F = \frac{1}{\alpha + \beta} \left( \alpha \sum_{i=1}^{z} \sum_{j=1}^{n} \|\sigma_i - \mu_j\|^2 \cdot \delta(i,j) + \beta \sum_{j=1}^{n} \nu_j \cdot \|\mu_j - E\|^2 \right)$$

$$\text{subject to } \quad \nu_{j_1} = \nu_{j_2} \ (j_1, j_2 = 1, 2, \ldots, n), \qquad \delta(i,j) = 0 \text{ or } 1 \ (i = 1, 2, \ldots, z;\ j = 1, 2, \ldots, n),$$

where $E = \frac{1}{z} \sum_{j=1}^{n} \nu_j \mu_j$ is constant (it depends only on the whole database) and $\alpha$, $\beta$ are constant control parameters. Actually, $E$ is the arithmetic mean of all the signatures. (Note that $\mu_j$ and $\nu_j$, $j = 1, \ldots, n$, have been defined in Section 5.2.) All $\mu_j$s and $\delta(i,j)$s are variables in the above problem. Each $\mu_j$ represents the geometric centroid of each group $G_j$, while each $\delta(i,j)$ takes the value of 1 or 0 according to whether the signature $\sigma_i$ is currently assigned to group $G_j$ or not.

Note that the first term inside the parentheses is exactly the skewness criterion when we set each $\mu_j$ to the geometric centroid of group $G_j$. Thus, minimizing this term brings about high skewness. The second term inside the parentheses is introduced so as to achieve balance. Since the vector $E$ is the average of all signatures, it gives the ideal value of $\mu_j$ (the position of the geometric centroids of each group of signatures) for high balance. The second term measures how far away the actual values of $\mu_j$ are from this ideal value. Minimizing this term would bring us balance. Therefore, minimizing the above expression would achieve both high balance and high skewness. The values of $\alpha$ and $\beta$ let us control the weight of each criterion. A higher value of $\alpha$ gives more emphasis to the skewness criterion, while a higher value of $\beta$ would make the algorithm focus on achieving high balance. In our experiments (see Section 6), we set $\alpha = \beta = 1.0$. In addition, we use the Euclidean distance function in the calculations.

This minimization problem is not trivial. Therefore, we take an iterative approach, based on the framework of the k algorithm. We first make an arbitrary initial assignment of the signatures to the groups, thus giving an initial value for each $\delta(i,j)$. Then, we iteratively improve this assignment to lower the value of the objective function.

Each iteration is divided into two steps. In the first step, we treat the values of $\delta(i,j)$ as constants and try to minimize $F$ by assigning suitable values to each $\mu_j$. In the next step, we treat all $\mu_j$ as constants and adjust the values of $\delta(i,j)$ to minimize $F$. Thus, the values of $\mu_j$ and $\delta(i,j)$ are adjusted alternately to reduce the value of $F$. The details are as follows: In each iteration, we first use the same approach as k to calculate the geometric centroid $\bar{\mu}_j$ of each group. To reduce the value of the second term (the balance consideration) in the objective function $F$, we temporarily treat the values of $\delta(i,j)$ as constants and find the partial derivatives of the objective function with respect to each $\mu_j$. Solving $\frac{\partial F}{\partial \mu_j} = 0$ ($j = 1, 2, \ldots, n$), we find that we shall make the assignments $\mu_j = \frac{\alpha \bar{\mu}_j + \beta E}{\alpha + \beta}$, where $j = 1, 2, \ldots, n$, in order to minimize the objective function $F$. After determining $\mu_j$, we next adjust the values of $\delta(i,j)$, treating the values of $\mu_j$ as constants. Since the second term inside the parentheses now does not involve any variable $\delta(i,j)$, we may reduce the minimization problem to the following problem:

$$\text{Minimize } \sum_{i=1}^{z} \sum_{j=1}^{n} \|\sigma_i - \mu_j\|^2 \cdot \delta(i,j)$$

$$\text{subject to } \quad \nu_{j_1} = \nu_{j_2} \ (j_1, j_2 = 1, 2, \ldots, n), \qquad \delta(i,j) = 0 \text{ or } 1 \ (i = 1, 2, \ldots, z;\ j = 1, 2, \ldots, n),$$

where the $\delta(i,j)$ are the variables. Note that this is a linear programming problem. We shall call it the "generalized assignment problem." It is indeed a generalization of the Assignment Problem and a specialization of the Transportation Problem in the literature of linear programming. There are many efficient algorithms for solving such problems. The Hungarian algorithm [9], which is designed for solving the Assignment Problem, has been extended to solve the generalized assignment problem. This extended Hungarian algorithm is incorporated as a part of the clustering algorithm; like k, it iteratively improves its solution. The iterations are stopped when the assignment becomes stable.

Since the algorithm imposes the constraint that each group gets assigned the same number of signatures,11 workload balance is guaranteed in the final partitioning. Under this constraint, it strives to maximize the skewness (by minimizing the signature-centroid distances) like the k algorithm does. So, the skewness is at a reasonably high level. The algorithm actually produces very high balance and, while maintaining such high workload balance, it attempts to maximize skewness. So, essentially, the balance factor is given primary consideration. This should suit FPM well.

11. Practically, the constraint is relaxed to allow up to a difference of one between the $\nu_j$s, due to the remainder when the number of signatures is divided by the number of groups.
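A compact version of this alternating scheme can be written down directly from the two update rules. In this sketch of ours, SciPy's Hungarian solver applied to replicated centroids stands in for the extended Hungarian algorithm of [9], and $\alpha = \beta = 1.0$ as in Section 6:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def bk_assign(signatures, n, alpha=1.0, beta=1.0, iters=20, seed=0):
    """Bk sketch: alternately (1) set mu_j = (alpha*centroid_j + beta*E)/(alpha+beta),
    pulling each center toward the global mean E for balance, and (2) reassign
    signatures to centers under an (almost) equal-group-size constraint."""
    sig = np.asarray(signatures, dtype=float)
    z = len(sig)
    E = sig.mean(axis=0)                          # arithmetic mean of all signatures
    rng = np.random.default_rng(seed)
    labels = rng.permutation(np.arange(z) % n)    # arbitrary balanced initial assignment
    slots = np.repeat(np.arange(n), -(-z // n))   # ceil(z/n) copies of each group id
    for _ in range(iters):
        # Step 1: delta(i,j) fixed; solving dF/dmu_j = 0 gives the centroid/mean blend.
        def center(j):
            members = sig[labels == j]
            c = members.mean(axis=0) if len(members) else E
            return (alpha * c + beta * E) / (alpha + beta)
        mu = np.stack([center(j) for j in range(n)])
        # Step 2: mu fixed; balanced assignment by replicating each center
        # ceil(z/n) times and solving one rectangular assignment problem.
        cost = ((sig[:, None, :] - mu[slots][None, :, :]) ** 2).sum(axis=2)
        rows, cols = linear_sum_assignment(cost)
        new_labels = labels.copy()
        new_labels[rows] = slots[cols]
        if np.array_equal(new_labels, labels):    # stop when the assignment is stable
            break
        labels = new_labels
    return labels
```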
6 EXPERIMENTAL EVALUATION OF THE PARTITIONING ALGORITHMS

To find out whether the partitioning algorithms introduced in Section 5 are effective, we have done two sets of experiments. In these experiments, we first generate synthetic databases. The generated databases are already partitioned, with the desired skewness and workload balance. So, the databases are intrinsically nonuniform. This is, however, not suitable for experiments that evaluate whether our partitioning algorithms can dig out the skewness and workload balance from a database. So, we "destroy" the apparent skewness and workload balance that already exist among the partitions by concatenating the partitions to form a centralized database and then shuffling the transactions in the concatenated database. The shuffling destroys the ordering of the transactions, so that an arbitrary partitioning of the resulting database would give a partitioning with low balance and skewness. We can then test whether our partitioning algorithms can produce partitions that give higher workload balance and skewness than an arbitrary partitioning.

The databases used here are similar to those generated in Section 4.3. The number of partitions in the databases is 16. Each partition has 100,000 transactions. The chunk size is set to 1,000 transactions; hence, each partition has 100 chunks and the total number of chunks is 1,600. In order to evaluate the effectiveness of the partitioning algorithms, we have to compare the skewness and workload balance of the resulting partitions against the skewness and balance intrinsic in the database. For this purpose, we take the skewness and workload balance before concatenation as the intrinsic values. All the skewness and balance values reported below are obtained by measurement on the partitions before the concatenation as well as after the partitioning, not from the corresponding control values for the data generation. So, they reflect the actual values of those metrics.

For each generated database, we run all four partitioning algorithms given in Section 5. The skewness and workload balance of the resulting partitions are noted and compared with one another together with the intrinsic values. As discussed in Section 5.3, the result of the random algorithm suggests the highest achievable balance value, while the result of k gives the highest achievable skewness value.

We did two series of experiments: 1) the first series varied the intrinsic skewness while keeping the intrinsic balance at a high level; 2) the second series varied the intrinsic balance value while keeping the intrinsic skewness almost constant.
Fig. 7. High intrinsic balance and varied intrinsic skewness: resulting skewness.

Fig. 8. High intrinsic balance and varied intrinsic skewness: resulting balance.
6.1 Effects of High Intrinsic Balance and Varied Intrinsic Skewness

The first series of experiments was to find out how the intrinsic balance and skewness values would affect the effectiveness of SHEI and Bk, given that the intrinsic balance is at a high value and the skewness changes from high to low. Fig. 7 shows the results for the skewness of the resulting partitionings. The vertical axis gives the skewness values of the resulting databases. Every four points on the same vertical line represent the results of partitioning the same initial database. The intrinsic skewness of the database is given on the horizontal axis. For reference, the intrinsic balance values are given directly under the skewness values of the databases. The different curves in the figure show the results of the different partitioning algorithms.

The k algorithm, as explained before, gives the highest skewness achievable by a suitable partitioning. Indeed, the resulting skewness values of k are very close to the intrinsic values. Both SHEI and Bk do not achieve this skewness value. This is primarily because they put more emphasis on balance than on skewness. Yet, the results show that some of the intrinsic skewness can be recovered. According to the figure, the resulting skewness of Bk is almost always twice that of SHEI. So, the Bk algorithm performs better than SHEI in terms of resulting skewness. This is due to the fact that Bk uses a more sophisticated method of achieving high skewness. Most importantly, the skewness achieved by Bk is between 50 percent and 60 percent of that of the benchmark algorithm k, which indicates that Bk can maintain a significant degree of the intrinsic skewness.

Fig. 8 shows the workload balance values for the same partitioned databases. This time, the vertical axis shows the resulting balance. Again, every four points on the same vertical line represent the partitioning of the same original database. The horizontal axis gives the intrinsic balance values of the databases, with the intrinsic skewness values of the corresponding databases given in the row below.

The generated databases all have an intrinsic balance very close to 0.90. Random partitioning, of course, yields a high balance value very close to 1.0, which can be taken as the highest achievable balance value. It is encouraging to discover that SHEI and Bk also give good balance values which are very close to that of the random partitioning.

From this series of experiments, we can conclude the following: Given a database with good intrinsic balance, even if the intrinsic skewness is not high, both Bk and SHEI can increase the balance to a level as good as that of a random partitioning; in addition, Bk can, at the same time, deliver a good level of skewness, much better than that from the random partitioning. The skewness achieved by Bk is also better than that of SHEI and is of an order comparable to what can be achieved by the benchmark algorithm k.

Another way to look at the results of these experiments is to fit the intrinsic skewness and balance value pairs (those on the horizontal axis of Fig. 7) into the regions in Fig. 6. These pairs, which represent the intrinsic skewness and balance of the initial partitions, all fall into region C. Combining the results from Figs. 7 and 8, among the resulting partitions from Bk, five of them have moved to region A, the most favorable region. For the other three, even though their workload balance has been increased, their skewness has not been increased enough to move them out of region C. In summary, a high percentage of the resulting partitions have benefited substantially from using Bk.

6.2 Effects of Reducing Intrinsic Balance

Our second series of experiments attempted to find out how SHEI and Bk would be affected when the intrinsic balance is reduced to a lower level.

Fig. 9 presents the resulting skewness values against the intrinsic skewness. The numbers in the row below show the intrinsic balance values for the corresponding databases. Fig. 10 shows the resulting balance values of the four algorithms on the same partitioning results.

Again, the random algorithm suggests the highest achievable balance value, which is close to 1.0 in all cases. Both SHEI and Bk are able to achieve the same high balance value, which is the most important requirement (Fig. 10). Thus, they are very good at yielding a good balance even if the intrinsic balance is low. As for the resulting skewness, the k algorithm gives the highest achievable skewness values, which are very close to the intrinsic values (Fig. 9). Both SHEI and Bk can recover parts of the intrinsic skewness. However, the skewness is reduced more when the intrinsic balance is in the less favorable range ($< 0.7$). These results are consistent with our understanding.
Fig. 9. Reducing intrinsic balance: resulting skewness.

Fig. 10. Reducing intrinsic balance: resulting balance.
When the intrinsic balance is low, spending effort in rearranging the transactions in the partitions to achieve high balance would tend to reduce the skewness.

Both Bk and SHEI are low-cost algorithms; they spend more effort in achieving a better balance while at the same time trying to maintain a certain level of skewness. Between Bk and SHEI, the resulting skewness of Bk is at least twice that of SHEI in all cases. This shows that Bk is better than SHEI in both cases of high and low intrinsic balance. It is also important to note that the skewness achieved by Bk is always better than that of the random partitioning.

Again, we can fit the skewness and balance values of the initial databases and their resulting partitions (Figs. 9 and 10) into the regions in Fig. 6. What we have found is that, after the partitioning performed by Bk, four partitionings have moved from region C to A, two from region D to C, and two others remain unchanged. This again is very encouraging and shows the effectiveness of Bk: more than 70 percent of the databases would have their performance improved substantially by using Bk and FPM together.

6.3 Summary of the Experimental Results

The above results show that our clustering algorithms SHEI and Bk are very good preprocessors, which prepare a partitioned database that can be mined by FPM efficiently. It is encouraging to note that both Bk and SHEI can achieve a balance as good as random partitioning and also a much better skewness. Between themselves, in general, Bk gives much better skewness values than SHEI. Also, these results hold for a wide range of intrinsic balance and skewness values. Referring back to Section 4.3, what we have achieved is being able to partition a database such that the workload balance falls into the ideal high-level range while at the same time maintaining a certain level of skewness. Therefore, the resulting partitioning would fall into the favorable regions in Fig. 6. Given this result, we recommend using Bk for the partitioning (or repartitioning) of the database before running FPM.

Note that we did no study on the time performance of the partitioning algorithms. This is primarily because the algorithms are so simple that they consume negligible amounts of CPU time. In our experiments, the amount of CPU time is no more than 5 percent of the time spent by the subsequent run of FPM. As for I/O overhead, the general framework of the partitioning algorithms (see Section 5.1) requires only one extra scan of the database, whose purpose is to calculate the signatures $\sigma_i$. This cost can be compensated by the saving of the first database scan of FPM (see Section 7.1). After that, no more extra I/O is required. We assume that the chunk size $y$, specified by the user, is large enough so that the total number of chunks $z$ is small enough to allow all the signatures $\sigma_i$ to be handled in main memory. In this case, the total overhead of the partitioning algorithms is far compensated by the subsequent resource savings. For a more detailed discussion on the overhead of these partitioning algorithms, please refer to Section 7.1.

7 DISCUSSION

Restricting the search for a large itemset to a small set of candidates is essential to the performance of mining association rules. After a database is partitioned over a number of processors, we have information on the support counts of the itemsets at a finer granularity. This enables us to use the distributed and global prunings discussed in Section 3. However, the effectiveness of these pruning techniques is very dependent on the distribution of transactions among the partitions. We discuss two issues here related to database partitioning and the performance of FPM.

7.1 Overhead of Using the Partitioning Algorithms

We have already shown that FPM benefits over CD the most when the database is partitioned in a way such that the skewness and workload balance measures are high. Consequently, we suggest that partitioning algorithms such as Bk and SHEI be used before FPM, so as to increase the skewness and workload balance for FPM to work faster. But this suggestion is good only if the overhead of the partitioning is not high. We support this claim by dividing the overhead of the partitioning algorithms into two parts for analysis.

The first part is the CPU cost. First, the partitioning algorithms calculate the signatures of the data chunks. This involves only simple arithmetic operations. Next, the algorithms call a clustering algorithm to divide the signatures into groups. Since the number of signatures is much smaller than the number of transactions in the whole database, the algorithms process much less information than a mining algorithm. Moreover, the clustering algorithms are designed to be simple, so that they are computationally not costly. Finally, the program delivers the transactions in the database to the different partitions. This involves little CPU cost. So, overall, the CPU overhead of the partitioning algorithms is very low. Experimental results show that it is no more than 5 percent of the CPU cost of the subsequent run of FPM.
The second part is the I/O cost. The partitioning algorithms in Section 5 all read the original database twice and write the partitioned database to disk once. In order to enjoy the power of parallel machines for mining, we have to partition the database anyway. Compared with the simplest partitioning algorithm, which must inevitably read the original database once and write the partitioned database to disk once, our partitioning algorithms perform only one extra database scan. But it shall be remarked that, in our clustering algorithms, this extra scan is for computing the signatures of the chunks. Once the signatures are found, the support counts of all 1-itemsets can be deduced by summation, which involves no extra I/O overhead. So, we can indeed find out the support counts of all 1-itemsets essentially for free. This can be exploited to eliminate the first iteration of FPM, so that FPM can start straight into the second iteration to find the large 2-itemsets. This saves one database scan from FPM and, hence, as a whole, the one-scan overhead of the partitioning algorithm is compensated.

Thus, the partitioning algorithms essentially introduce negligible CPU and I/O overhead to the whole mining activity. Therefore, it is worthwhile to employ our partitioning algorithms to partition the database before running FPM. The great savings from running FPM on a carefully partitioned database far compensate for the overhead of our partitioning algorithms.
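To illustrate the saving, the first FPM iteration reduces to a summation over the signatures already in memory (a sketch of ours, under the same signature layout as in Section 5, where column i of the signature matrix holds a chunk's support count for item i):

```python
import numpy as np

def large_one_itemsets(signatures: np.ndarray, min_sup: int) -> np.ndarray:
    """The signatures already store per-chunk 1-itemset support counts, so the
    global counts are obtained by summation, with no extra database scan;
    FPM can then start directly at the second iteration."""
    totals = signatures.sum(axis=0)            # global support count of every item
    return np.flatnonzero(totals >= min_sup)   # ids of the large 1-itemsets
```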
7.2 Scalability in FPM

Our performance studies of FPM were carried out on a 32-processor SP2 (Section 4.3). If the number of processors, $n$, is very large, global pruning may need a large amount of memory to store the local support counts from all the partitions for all the large itemsets found in an iteration. Also, there could be cases in which the set of candidates generated after pruning is still too large to fit into the memory. We suggest using a cluster approach to solve this problem. The $n$ processors can be grouped into $p$ clusters ($p \leq n$), so that each cluster has $\frac{n}{p}$ processors. In the top level, support counts are exchanged between the $p$ clusters instead of the $n$ processors. The counts exchanged at this level are the sums of the supports from the processors within each cluster. Both distributed and global prunings can be applied by treating the data in a cluster together as one partition. Within a cluster, the candidates are distributed across the processors, and the support counts at this second level can be computed by count exchange among the processors inside the cluster. In this approach, we only need to ensure that the total distributed memory of the processors in each cluster is large enough to hold the candidates. Given this setting, the approach is highly scalable. In fact, we can regard this approach as an integration of our pruning techniques into the parallel algorithm HD (Section 1). Our pruning technique is orthogonal to the parallelization technique in HD and can be used to enhance its performance.
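The arithmetic of the two levels of count exchange can be sketched as follows (our illustration; in a real run, the two summations would be reduction operations over the interconnect, e.g., with MPI, rather than in-memory sums):

```python
import numpy as np

def two_level_counts(local_counts: np.ndarray, p: int) -> np.ndarray:
    """Cluster approach sketch: local_counts[k] holds processor k's support
    counts for the candidate itemsets. Counts are first summed inside each of
    the p clusters (second level); only the p per-cluster vectors are then
    exchanged at the top level to form the global counts."""
    n_proc = len(local_counts)
    assert n_proc % p == 0, "assumes equal-sized clusters of n/p processors"
    per_cluster = local_counts.reshape(p, n_proc // p, -1).sum(axis=1)
    return per_cluster.sum(axis=0)             # top-level exchange over the p clusters
```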
8 CONCLUSIONS

A parallel algorithm, FPM, for mining association rules has been proposed. FPM is a modification of FDM and requires fewer rounds of message exchanges. Performance studies carried out on an IBM SP2 shared-nothing memory parallel system show that FPM consistently outperforms CD. The gain in performance of FPM is due mainly to the pruning techniques incorporated.

It has been found that the effectiveness of the pruning techniques depends highly on two data distribution characteristics: data skewness and workload balance. An entropy-based metric has been proposed to measure these two characteristics. Our analysis and experimental results show that the pruning techniques are very sensitive to workload balance, though good skewness will also have important positive effects. The techniques are very effective in the best case of high balance and high skewness. The combination of high balance and moderate skewness is the second-best case.

This is our motivation for introducing algorithms to partition a database in a wise way, so as to get higher balance and skewness values. We have compared four partitioning algorithms. With the balanced k-means (Bk) clustering algorithm, we can achieve a very high workload balance and, at the same time, a reasonably good skewness. Our experiments have demonstrated that many unfavorable partitions can be repartitioned by Bk into partitions that allow FPM to perform more efficiently. Moreover, the overhead of the partitioning algorithms is negligible and can be compensated by saving one database scan in the mining process. Therefore, we can obtain very high association rule mining efficiency by partitioning a database with Bk and then mining it with FPM. We have also discussed a cluster approach which can bring scalability to FPM.

ACKNOWLEDGMENTS

This research is supported in part by the Hong Kong Research Grants Council (RGC) grant, project number HKU 7023/98E.

REFERENCES

[1] R. Agrawal, T. Imielinski, and A. Swami, "Mining Association Rules between Sets of Items in Large Databases," Proc. ACM-SIGMOD Int'l Conf. Management of Data, 1993.
[2] R. Agrawal and R. Srikant, "Fast Algorithms for Mining Association Rules," Proc. 20th Very Large Databases Conf., 1994.
[3] R. Agrawal and J.C. Shafer, "Parallel Mining of Association Rules: Design, Implementation and Experience," Technical Report TJ10004, IBM Research Division, Almaden Research Center, 1996.
[4] S. Brin, R. Motwani, J. Ullman, and S. Tsur, "Dynamic Itemset Counting and Implication Rules for Market Basket Data," Proc. ACM-SIGMOD Int'l Conf. Management of Data, 1997.
[5] T.M. Cover and J.A. Thomas, Elements of Information Theory. John Wiley & Sons, 1991.
[6] D.W. Cheung, J. Han, V. Ng, and C.Y. Wong, "Maintenance of Discovered Association Rules in Large Databases: An Incremental Updating Technique," Proc. 12th Int'l Conf. Data Eng., 1996.
[7] D.W. Cheung, J. Han, V.T. Ng, A.W. Fu, and Y. Fu, "A Fast Distributed Algorithm for Mining Association Rules," Proc. Fourth Int'l Conf. Parallel and Distributed Information Systems, 1996.
[8] D.W. Cheung, V.T. Ng, A.W. Fu, and Y. Fu, "Efficient Mining of Association Rules in Distributed Databases," Special Issue on Data Mining, IEEE Trans. Knowledge and Data Eng., vol. 8, no. 6, pp. 911-922, Dec. 1996.
[9] S.K. Gupta, Linear Programming and Network Models. New Delhi: Affiliated East-West Press, 1985.
[10] E. Han, G. Karypis, and V. Kumar, "Scalable Parallel Data Mining for Association Rules," Proc. ACM-SIGMOD Int'l Conf. Management of Data, 1997.
[11] J. Han and Y. Fu, "Discovery of Multiple-Level Association Rules from Large Databases," Proc. 21st Very Large Databases Conf., 1995.
[12] M.A.W. Houtsma and A.N. Swami, "Set-Oriented Mining for Association Rules in Relational Databases," Proc. 11th Int'l Conf. Data Eng., 1995.
[13] International Business Machines, Scalable POWERparallel Systems, GA23-2475-02 ed., Feb. 1995.
[14] L. Kaufman and P.J. Rousseeuw, Finding Groups in Data: An Introduction to Cluster Analysis. John Wiley and Sons, 1990.
[15] H. Mannila, H. Toivonen, and A.I. Verkamo, "Efficient Algorithms for Discovering Association Rules," AAAI Workshop Knowledge Discovery in Databases (KDD-94), July 1994.
[16] Message Passing Interface Forum, MPI: A Message-Passing Interface Standard, May 1994.
[17] J.S. Park, M.S. Chen, and P.S. Yu, "An Effective Hash-Based Algorithm for Mining Association Rules," Proc. ACM-SIGMOD Int'l Conf. Management of Data, May 1995.
[18] J.S. Park, M.S. Chen, and P.S. Yu, "Efficient Parallel Mining for Association Rules," Proc. Fourth Int'l Conf. Information and Knowledge Management, 1995.
[19] A. Savasere, E. Omiecinski, and S. Navathe, "An Efficient Algorithm for Mining Association Rules in Large Databases," Proc. 21st Very Large Databases Conf., 1995.
[20] R. Srikant and R. Agrawal, "Mining Generalized Association Rules," Proc. 21st Very Large Databases Conf., 1995.
[21] R. Srikant and R. Agrawal, "Mining Sequential Patterns: Generalizations and Performance Improvements," Proc. Fifth Int'l Conf. Extending Database Technology, 1996.
[22] R. Srikant and R. Agrawal, "Mining Quantitative Association Rules in Large Relational Tables," Proc. ACM-SIGMOD Int'l Conf. Management of Data, 1996.
[23] T. Shintani and M. Kitsuregawa, "Hash Based Parallel Algorithms for Mining Association Rules," Proc. Fourth Int'l Conf. Parallel and Distributed Information Systems, 1996.
[24] H. Toivonen, "Sampling Large Databases for Mining Association Rules," Proc. 22nd Very Large Databases Conf., 1996.
[25] M.J. Zaki, M. Ogihara, S. Parthasarathy, and W. Li, "Parallel Data Mining for Association Rules on Shared-Memory Multi-Processors," Technical Report 618, Computer Science Dept., The Univ. of Rochester, May 1996.

David W. Cheung received the MSc and PhD degrees in computer science from Simon Fraser University, Canada, in 1985 and 1989, respectively. He also received the BSc degree in mathematics from the Chinese University of Hong Kong. From 1989 to 1993, he was with Bell Northern Research, Canada, where he was a senior member of the scientific staff. Since 1994, he has been a faculty member of the Department of Computer Science and Information Systems in The University of Hong Kong. He is now the associate director of the E-Business Technology Institute in HKU. His research interests include data mining, data warehouses, Web-based information retrieval, and XML technology for e-commerce. He is the program committee chairman of the Fifth Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD-01) to be held in Hong Kong. He is also a member of the ACM, the IEEE, and the IEEE Computer Society.

Sau D. Lee received the MPhil degree in computer science in 1998 and the BSc (computer science) degree with first class honors in 1995, both from the University of Hong Kong. He is a technology officer in the E-Business Technology Institute, the University of Hong Kong. He is working in the team specializing in XML-related technologies, taking the role of system architect and project coordinator. Prior to joining ETI in 1999, he worked as a research assistant at the University of Hong Kong during the years 1997 and 1998. His research interests include Web-based technologies, database systems, data mining, and indexing. From 1995 to 1997, he was also a teaching assistant in the Computer Science Department at the University of Hong Kong.

Yongqiao Xiao received the BS degree in accounting with a minor in information systems from Renmin University of China in 1992, the MS degree in computer science from Zhongshan University in 1995, and the PhD degree in computer science from Southern Methodist University in 2000. He is currently working with Trilogy Software Inc., Austin, Texas. He has been a reviewer for IEEE Transactions on Knowledge and Data Engineering, the Very Large Databases Conference, etc. His major research interests include data mining, clickstream analysis, and parallel computing.