Effect of Data Skewness and Workload Balance in Parallel Data Mining
Abstract: To mine association rules efficiently, we have developed a new parallel mining algorithm, FPM, on a distributed share-nothing parallel system in which data are partitioned across the processors. FPM is an enhancement of the FDM algorithm, which we previously proposed for distributed mining of association rules [8]. FPM requires fewer rounds of message exchanges than FDM and, hence, has a better response time in a parallel environment. The algorithm has been experimentally found to outperform CD, a representative parallel algorithm for the same goal [2]. The efficiency of FPM is attributed to the incorporation of two powerful candidate-set pruning techniques: distributed and global prunings. The two techniques are sensitive to two data distribution characteristics: data skewness and workload balance. Metrics based on entropy are proposed for these two characteristics. The prunings are very effective when both the skewness and balance are high. To increase the efficiency of FPM, we have developed methods to partition a database so that the resulting partitions have high balance and skewness. Experiments have shown empirically that our partitioning algorithms achieve these aims very well; in particular, the results are consistently better than those of a random partitioning. Moreover, the partitioning algorithms incur little overhead. Using our partitioning algorithms and FPM together, we can therefore mine association rules from a database efficiently.
Index Terms: Association rules, data mining, data skewness, workload balance, parallel mining, partitioning.
1 INTRODUCTION
generates a lot of redundant computation because every transaction is processed as many times as the number of processors. In addition, it requires a lot of communication, and its performance is worse than that of CD [3]. IDD (Intelligent Data Distribution) and its variant HD (Hybrid Distribution) are important improvements on DD [10]. They partition the candidates across the processors based on the first item of a candidate. Therefore, each processor only needs to handle the subsets of a transaction which begin with the items assigned to the processor. This significantly reduces the redundant computation in DD. HPA (Hash-Based Parallel), which is very similar to IDD, uses a hashing technique to distribute the candidates to different processors [23].

One problem of data distribution needs to be noted: it requires at least two rounds of communication in each iteration at each processor, one to send its transaction data to the other processors, and one to broadcast the large itemsets found subsequently to the other processors for candidate set generation in the next iteration. This two-round communication scheme puts data distribution in an unfavorable situation when considering response time.

In this work, we investigate parallel mining employing the count distribution approach. This approach requires less bandwidth and has a simple one-round communication scheme. To tackle the problem of a large number of candidate sets in count distribution, we adopt two effective techniques, distributed pruning and global pruning, to prune and reduce the number of candidates in each iteration. These two techniques make use of the local support counts of large itemsets found in an iteration to prune candidates for the next iteration. They have been adopted in a mining algorithm, FDM (Fast Distributed Mining), previously proposed by us for distributed databases [7], [8]. However, FDM is not suitable for a parallel environment: it requires at least two rounds of message exchanges in each iteration, which increases the response time significantly. We have adopted the two pruning techniques to develop a new parallel mining algorithm, FPM (Fast Parallel Mining), which requires only one round of message exchange in each iteration. Its communication scheme is as simple as that in CD, and it has a much smaller number of candidate sets due to the pruning. In the rare case that the set of candidates is still too large to fit into the memory of each processor even after the pruning, we can integrate the pruning techniques with the algorithm HD into a two-level cluster algorithm. This approach provides the scalability to handle candidate sets of any size and, at the same time, maintains the benefit of effective pruning. (For details, please see the discussion in Section 7.2.)

In this paper, we focus on studying the performance behavior of FPM and CD, which depends heavily on the distribution of data among the partitions of the database. To study this issue, we first introduce two metrics, skewness and balance, to describe the distribution of data in the databases. Then, we analytically study their effects on the performance of the two mining algorithms and verify the results empirically. Next, we propose algorithms to partition the database so that good skewness and balance values are obtained. Finally, we do experiments to find out how effective these partitioning algorithms are.

We have captured the distribution characteristics in two factors: data skewness and workload balance. Intuitively, a partitioned database has high data skewness if most globally large itemsets1 are locally large only at a few partitions. Loosely speaking, a partitioned database has high workload balance if all the processors have a similar number of locally large itemsets.2 We have defined quantitative metrics to measure data skewness and workload balance. We found that both distributed and global prunings perform very well in the best case of high data skewness and high workload balance. The combination of high balance with moderate skewness is the second-best case. Inspired by this finding, we investigate the feasibility of planning the partitioning of the database. We want to divide the data into different partitions so as to maximize the workload balance and yield high skewness. Mining a database by partitioning it appropriately and then employing FPM gives us excellent mining performance. We have implemented FPM on an IBM SP2 parallel machine with 32 processors. Extensive performance studies have been carried out. The results confirm our observation on the relationship between pruning effectiveness and data distribution.

For the purpose of partitioning, we have proposed four algorithms. We have implemented these algorithms to study their effectiveness. K-means clustering, like most clustering algorithms, provides good skewness. However, it would, in general, destroy the balance. Random partitioning, in general, can deliver high balance but very low skewness. We introduce an optimization constraint to control the balance factor in the k-means clustering algorithm. This modification, called Bk (balanced k-means clustering), produces results which exhibit as good a balance as random partitioning and also high skewness. In conclusion, we found that Bk is the most favorable partitioning algorithm among those we have studied.

We summarize our contributions as follows:

1. We have enhanced FDM to FPM for mining association rules on a distributed share-nothing parallel system, which requires fewer rounds of message communication.
2. We have analytically shown that the performance of the pruning techniques in FPM is very sensitive to the data distribution characteristics of skewness and balance, and we have proposed entropy-based metrics to measure these two characteristics.
3. We have implemented FPM on an SP2 parallel machine and experimentally verified its performance behavior with respect to skewness and balance.
4. We have proposed four partitioning algorithms and empirically verified that Bk, among the four, is the most effective in introducing balance and skewness into database partitions.

1. An itemset is locally large at a processor if it is large within the partition at the processor. It is globally large if it is large with respect to the whole database [7], [8]. Note that every globally large itemset must be locally large at some processor. Refer to Section 3 for details.
2. More precise definitions of skewness and workload balance will be given in Section 4.
500 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 14, NO. 3, MAY/JUNE 2002
The rest of this paper is organized as follows: Section 2 overviews the parallel mining of association rules. The techniques of distributed and global pruning, together with the FPM algorithm, are described in Section 3. In the same section, we also investigate the relationship between the effectiveness of the prunings and the data distribution characteristics. In Section 4, we define two metrics to measure the data skewness and workload balance of a database partitioning and then present the results of an experimental study on the performance behavior of FPM and CD. Based on the results of Section 4, we introduce algorithms in Section 5 to partition a database so as to improve the performance of FPM. This is achieved by carefully arranging the tuples in the database to increase skewness and balance. Experiments in Section 6 evaluate the effectiveness of the partitioning algorithms. In Section 7, we discuss a few issues, including possible extensions of FPM to enhance its scalability, and we give our conclusions in Section 8.

2 PARALLEL MINING OF ASSOCIATION RULES

2.1 Association Rules

Let I = {i_1, i_2, ..., i_m} be a set of items and D be a database of transactions, where each transaction T consists of a set of items such that T ⊆ I. An association rule is an implication of the form X ⇒ Y, where X ⊆ I, Y ⊆ I, and X ∩ Y = ∅. An association rule X ⇒ Y has support s in D if the probability that a transaction in D contains both X and Y is s. The association rule X ⇒ Y holds in D with confidence c if the probability that a transaction in D which contains X also contains Y is c. The task of mining association rules is to find all the association rules whose support is larger than a given minimum support threshold and whose confidence is larger than a given minimum confidence threshold. For an itemset X, we use X.sup to denote its support count in database D, which is the number of transactions in D containing X. An itemset X ⊆ I is large if X.sup ≥ minsup × |D|, where minsup is the given minimum support threshold. For the purpose of presentation, we sometimes just use support to stand for the support count of an itemset.

It has been shown that the problem of mining association rules can be decomposed into two subproblems [1]: 1) find all large itemsets for a given minimum support threshold and 2) generate the association rules from the large itemsets found. Since 1) dominates the overall cost, research has focused on how to efficiently solve the first subproblem.

In the parallel environment, it is useful to distinguish between the two different notions of locally large and globally large itemsets. Suppose the entire database D is partitioned into D_1, D_2, ..., D_n and distributed over n processors. Let X be an itemset; the global support of X is the support X.sup of X in D. When referring to a partition D_i, the local support of X at processor i, denoted by X.sup_i, is the support of X in D_i. X is globally large if X.sup ≥ minsup × |D|. Similarly, X is locally large at processor i if X.sup_i ≥ minsup × |D_i|. Note that, in general, an itemset X which is locally large at some processor i may not necessarily be globally large. On the contrary, every globally large itemset X must be locally large at some processor i. This result and its application have been discussed in detail in [7].

For convenience, we use the short form k-itemset to stand for a size-k itemset, which consists of exactly k items. We use L_k to denote the set of globally large k-itemsets. We have pointed out above that there is a distinction between locally and globally large itemsets. For discussion purposes, we will call a globally large itemset which is also locally large at processor i gl-large at processor i. We will use GL_k^i to denote the set of gl-large k-itemsets at processor i. Note that GL_k^i ⊆ L_k, ∀i, 1 ≤ i ≤ n.

2.2 Count Distribution Algorithm for Parallel Mining

Apriori is the most well-known serial algorithm for mining association rules [2]. It relies on the apriori_gen function to generate the candidate sets at each iteration. CD (Count Distribution) is a parallelized version of Apriori for parallel mining [3]. The database D is partitioned into D_1, D_2, ..., D_n and distributed across n processors. In the first iteration of CD, every processor i scans its partition D_i to compute the local supports of all the size-1 itemsets. All processors then engage in one round of support count exchange. After that, they independently find the global support counts of all the items and then the large size-1 itemsets. For every other iteration k (k > 1), each processor i runs the program fragment in Fig. 1. In Step 1, it computes the candidate set C_k by applying the apriori_gen function to L_{k-1}, the set of large itemsets found in the previous iteration. In Step 2, local support counts of the candidates in C_k are computed by scanning D_i. In Step 3, local support counts are exchanged with all other processors to get global support counts. In Step 4, the globally large itemsets L_k are computed independently by each processor. In the next iteration, CD increases k by one and repeats Steps 1-4 until no more candidates are found.

3 PRUNING TECHNIQUES AND THE FPM ALGORITHM

CD does not take advantage of the data partitioning in the parallel setting to prune its candidate sets. We propose a new parallel mining algorithm, FPM, which adopts the distributed and global prunings first proposed in [7].
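The Steps 1-4 loop of CD described above can be sketched as follows. This is an illustrative sketch rather than the authors' implementation: transactions are modeled as Python sets, and the Step 3 count exchange is collapsed into a plain sum over the per-processor counts instead of a real message round.

```python
from itertools import combinations

def apriori_gen(L_prev, k):
    """Generate size-k candidates by joining (k-1)-itemsets and keeping
    only candidates whose (k-1)-subsets are all large."""
    L_prev = set(L_prev)
    cands = set()
    for a in L_prev:
        for b in L_prev:
            u = a | b
            if len(u) == k and all(frozenset(s) in L_prev
                                   for s in combinations(u, k - 1)):
                cands.add(frozenset(u))
    return cands

def cd_iteration(partitions, L_prev, k, minsup):
    """One CD iteration: every processor counts candidates in its own
    partition (Steps 1-2), counts are summed as a stand-in for the
    exchange round (Step 3), and L_k is computed (Step 4)."""
    C_k = apriori_gen(L_prev, k)
    # Step 2: local support counts, one scan of each partition D_i.
    local = [{c: sum(1 for t in D_i if c <= t) for c in C_k}
             for D_i in partitions]
    # Step 3: count exchange, modeled here as a global sum.
    total = {c: sum(cnt[c] for cnt in local) for c in C_k}
    # Step 4: keep the globally large k-itemsets.
    D_size = sum(len(D_i) for D_i in partitions)
    return {c for c in C_k if total[c] >= minsup * D_size}
```

Because every processor ends each iteration with the full global counts, each one can derive L_k independently, which is what keeps the scheme to a single communication round per iteration.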
CHEUNG ET AL.: EFFECT OF DATA SKEWNESS AND WORKLOAD BALANCE IN PARALLEL DATA MINING 501
3.2 Fast Parallel Mining Algorithm (FPM)

We present the FPM algorithm in this section. It improves CD by adopting the two pruning techniques. The first iteration of FPM is the same as that of CD. Each processor scans its partition to find the local support counts of all size-1 itemsets and uses one round of count exchange to compute the global support counts. At the end, in addition to L_1, each processor also finds the gl-large itemsets GL_1^i, for 1 ≤ i ≤ n. Starting from the second iteration, prunings are used to reduce the number of candidate sets. Fig. 2 is the program fragment of FPM at processor i for the kth (k > 1) iteration. In Step 1, distributed pruning is used: apriori_gen is applied to the sets GL_{k-1}^i, ∀i = 1, ..., n, instead of to the set L_{k-1}.3 In Step 2, global pruning is applied to the candidates that survive the distributed pruning. The remaining steps are the same as those in CD. As has been discussed, FPM, in general, enjoys smaller candidate sets than CD. Furthermore, it uses a simple one-round message exchange scheme, the same as CD. If we compare FPM with the FDM algorithm proposed in [7], we will see that this simple communication scheme makes FPM more suitable than FDM in terms of response time in a parallel system.

Note that, since the number of processors, n, would not be very large, the cost of generating the candidates with the distributed pruning in Step 1 (Fig. 2) should be of the same order as that in CD. As for global pruning, since all local support counts are available at each processor, no additional count exchange is required to perform the pruning. Furthermore, the pruning in Step 2 (Fig. 2) is performed only on the remaining candidates after the distributed pruning. Therefore, the cost of global pruning is small compared with database scanning and count updates.

3.3 Data Skewness and Workload Balance

In a partitioned database, two data distribution characteristics, data skewness and workload balance, affect the effectiveness of the pruning and, hence, the performance of FPM. Intuitively, the data skewness of a partitioned database is high if most large itemsets are locally large only at a few processors. It is low if a high percentage of the large itemsets are locally large at most of the processors. For a partitioning with high skewness, even though it is highly likely that each large itemset will be locally large at only a small number of partitions, the set of large itemsets as a whole can still be distributed either evenly or extremely skewed among the partitions. In one extreme case, most partitions have a similar number of locally large itemsets, and the workload balance is high. In the other extreme case, the large itemsets are concentrated at a few partitions and, hence, there are large differences in the number of locally large itemsets among different partitions. In this case, the workload is unbalanced. These two characteristics have an important bearing on the performance of FPM.

Example 2. Table 1 is a case of high data skewness and high workload balance. The supports of the itemsets are clustered mostly in one partition, and the skewness is high. On the other hand, every partition has the same number (two) of locally large itemsets.4 Hence, the workload balance is also high. CD will generate C(6,2) = 15 candidates in the second iteration, while distributed pruning will generate only three candidates, AB, CD, and EF, which shows that the pruning has a good effect.

Table 2 is an example of high workload balance and low data skewness. The support counts of the items A, B, C, D, E, and F are almost equally distributed over the three processors. Hence, the data skewness is low. However, the workload balance is high because every partition has the same number (five) of locally large itemsets. Both CD and distributed pruning generate the same 15 candidate sets in the second iteration. However, global pruning can prune away the candidates AC, AE, and CE. FPM still exhibits a 20 percent improvement over CD in this pathological case of high balance and low skewness.

We will define formal metrics for measuring data skewness and workload balance for partitions in Section 4. We will also see how high values of balance and skewness can be obtained by suitably and carefully partitioning the database in Section 5. In the following, we show the effects of distributed pruning analytically for some special cases.

Theorem 2. Let L_1 be the set of size-1 large itemsets, and let C_2^c and C_2^d be the size-2 candidates generated by CD and distributed pruning, respectively. Suppose that each size-1 large itemset is

3. Comparing this step with Step 1 of Fig. 1 is useful in order to see the difference between FPM and CD.
4. As has been mentioned above, the support threshold in Table 1 for globally large itemsets is 15, while that for locally large itemsets is 5 for all the partitions.
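Step 1 of FPM applies apriori_gen to each processor's gl-large sets GL_{k-1}^i separately and unions the results, instead of making one call on L_{k-1} as CD does. A minimal sketch of this step (function and variable names such as distributed_pruning and gl_sets are ours, not the paper's):

```python
from itertools import combinations

def apriori_gen(L_prev, k):
    """Standard Apriori candidate generation: join (k-1)-itemsets and
    keep candidates whose (k-1)-subsets are all in L_prev."""
    L_prev = set(L_prev)
    return {a | b for a in L_prev for b in L_prev
            if len(a | b) == k and
               all(frozenset(s) in L_prev
                   for s in combinations(a | b, k - 1))}

def distributed_pruning(gl_sets, k):
    """Candidates surviving distributed pruning: generate size-k
    candidates from each processor's gl-large (k-1)-itemsets and take
    the union.  Since every globally large k-itemset must be gl-large
    at some processor, no large itemset can be missed."""
    out = set()
    for GL_i in gl_sets:  # one set of gl-large (k-1)-itemsets per processor
        out |= apriori_gen(GL_i, k)
    return out
```

On the gl-large size-1 itemsets of Example 2 (Table 1), where the three partitions hold {A, B}, {C, D}, and {E, F} respectively, this yields only the three candidates AB, CD, and EF, versus the C(6,2) = 15 candidates generated by CD.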
TABLE 2
High Workload Balance and Low Data Skewness Case

Hence, the number of size-k candidates generated by distributed pruning, |C_k^d|, is substantially smaller than the number |C_k^c| generated by CD, and the ratio |C_k^d| / |C_k^c| is governed by the gap between m' and m. When k = 2, this result becomes Theorem 2. In general, m' ≪ m, which shows that distributed pruning has a significant effect in almost all iterations. However, the effect will decrease as m' converges to m when k increases.
Following the properties of entropy, higher values of S(X) correspond to higher skewness for X. The above definition gives a metric for the skewness of an itemset. In the following, we define the skewness of a partitioned database as a weighted sum of the skewness of all its itemsets.

Definition 2. Given a database D with n partitions, the skewness TS(D) is defined by

TS(D) = Σ_{X ∈ IS} S(X) w(X),

where IS is the set of all the itemsets, w(X) = X.sup / Σ_{Y ∈ IS} Y.sup is the weight of the support of X over all the itemsets, and S(X) is the skewness of itemset X.

TS(D) has some properties similar to those of S(X):

. TS(D) = 0 when the skewness of all the itemsets is at its minimal value.
. TS(D) = 1 when the skewness of all the itemsets is at its maximal value.
. 0 < TS(D) < 1 in all the other cases.

We can compute the skewness of a partitioned database according to Definition 2. However, in general, the number of itemsets may be very large. One approximation is to compute the skewness over the set of globally large itemsets only and take the approximation p_i^X = 0 if X is not gl-large in D_i. In the generation of candidate itemsets, only globally large itemsets will be joined together to form new candidates. Hence, only their skewness would impact the effectiveness of pruning. Therefore, this approximation is a reasonable and practical measure.

4.2 Workload Balance

Workload balance is a measurement of the distribution of the total weights of the locally large itemsets among the processors. Based on the definition of w(X) in Definition 2, we define W_i = Σ_{X ∈ IS} w(X) p_i^X to be the itemset workload of partition D_i, where IS is the set of all the itemsets. Note that Σ_{i=1}^{n} W_i = 1. A database has high workload balance if the W_i's are the same for all partitions D_i, 1 ≤ i ≤ n. On the other hand, if the values of W_i exhibit large differences among themselves, the workload balance is low. Thus, our definition of the workload balance metric is also based on the entropy measure.

Definition 3. For a database D with n partitions, the workload balance factor (workload balance, for short) TB(D) is defined as

TB(D) = − ( Σ_{i=1}^{n} W_i log W_i ) / log n.

The metric TB(D) has the following properties:

. TB(D) = 1 when the workloads across all processors are the same.
. TB(D) = 0 when the workload is concentrated at one processor.
. 0 < TB(D) < 1 in all the other cases.

Similar to the skewness metric, we can approximate the value of TB(D) by considering only globally large itemsets.

The data skewness and workload balance are not independent of each other. Theoretically, each one of them may attain values between zero and one, inclusively. However, some combinations of their values are not admissible. For instance, we cannot have a database partitioning with very low balance and very low skewness. This is because a very low skewness would be accompanied by a high balance, while a very low balance would be accompanied by a high skewness.

Theorem 3. Let D_1, D_2, ..., D_n be the partitions of a database D.

1. If TS(D) = 1, then the admissible values of TB(D) range from zero to one. Moreover, if TS(D) = 0, then TB(D) = 1.
2. If TB(D) = 1, then the admissible values of TS(D) range from zero to one. Moreover, if TB(D) = 0, then TS(D) = 1.

Proof.

1. By definition, 0 ≤ TB(D) ≤ 1. What we need to prove is that the boundary cases are admissible when TS(D) = 1. TS(D) = 1 implies that S(X) = 1 for all large itemsets X. Therefore, each large itemset is large at one and only one partition. If all the large itemsets are large at the same partition D_i, then W_i = 1 and W_k = 0, ∀ 1 ≤ k ≤ n, k ≠ i. Thus, TB(D) = 0 is admissible. On the other hand, if every partition has the same number of large itemsets, then W_i = 1/n, ∀ 1 ≤ i ≤ n and, hence, TB(D) = 1. Furthermore, if TS(D) = 0, then S(X) = 0 for all large itemsets X. This implies that the W_i are the same for all 1 ≤ i ≤ n. Hence, TB(D) = 1.
2. It follows from the first result of this theorem that both TS(D) = 0 and TS(D) = 1 are admissible when TB(D) = 1. Therefore, the first part is proven. Furthermore, if TB(D) = 0, there exists a partition D_i such that W_i = 1 and W_k = 0, ∀ 1 ≤ k ≤ n, k ≠ i. This implies that all large itemsets are locally large only at D_i. Hence, TS(D) = 1. □

Even though 0 ≤ TS(D) ≤ 1 and 0 ≤ TB(D) ≤ 1, not all possible combinations are admissible. In general, the admissible combinations form a subset of the unit square, represented by the shaded region in Fig. 6. It always contains the two line segments TS(D) = 1 (S = 1 in Fig. 6) and TB(D) = 1 (B = 1 in Fig. 6), but not the origin (S = 0, B = 0). After defining the metrics and studying their characteristics, we can experimentally validate our analysis (see Section 3.3) of the relationship between data skewness, workload balance, and the performance of FPM and CD.

We would like to note that the two metrics are based on total entropy, which is a good model to measure the evenness (or unevenness) of a data distribution. Also, they are consistent with each other.

4.3 Performance Behaviors of FPM and CD

We study the performance behaviors of FPM and CD in response to various skewness and balance values on an IBM SP2 parallel processing machine with 32 nodes. Each node consists of a POWER2 processor with a CPU clock rate of 66.7 MHz and 64 MB of main memory. The system runs the AIX operating system. Communication between processors
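For concreteness, the two metrics can be computed from per-partition support counts as sketched below. This is our own illustrative sketch; in particular, the form S(X) = 1 − H(p^X)/log n for the per-itemset skewness, with p_i^X = X.sup_i / X.sup, is an assumption consistent with the entropy-based discussion here, since Definition 1 itself is not reproduced in this excerpt.

```python
from math import log

def entropy(probs):
    """Shannon entropy; terms with p = 0 contribute 0."""
    return -sum(p * log(p) for p in probs if p > 0)

def skewness_S(local_supports):
    """S(X) for one itemset X, given its local supports X.sup_i.
    Assumed form: 1 - H(p)/log n.  S = 1 when X is supported at a
    single partition, 0 when its support is spread evenly."""
    n, total = len(local_supports), sum(local_supports)
    p = [s / total for s in local_supports]
    return 1.0 - entropy(p) / log(n)

def TS(itemset_supports):
    """TS(D): support-weighted sum of S(X) over itemsets (Definition 2).
    itemset_supports maps each itemset X to its list of X.sup_i."""
    grand = sum(sum(v) for v in itemset_supports.values())
    return sum(skewness_S(v) * (sum(v) / grand)
               for v in itemset_supports.values())

def TB(itemset_supports):
    """TB(D): normalized entropy of the itemset workloads W_i
    (Definition 3), where W_i reduces to (sum_X X.sup_i) / grand total."""
    n = len(next(iter(itemset_supports.values())))
    grand = sum(sum(v) for v in itemset_supports.values())
    W = [sum(v[i] for v in itemset_supports.values()) / grand
         for i in range(n)]
    return entropy(W) / log(n)
```

As a sanity check against Theorem 3: supports concentrated one itemset per partition give TS = 1 with TB = 1 still admissible, while perfectly even supports force TS = 0 and TB = 1.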
TABLE 3
Performance Improvement of FPM over CD
Fig. 6. Division of the admissible regions according to the performance improvement of FPM over CD (FPM/CD).
partitions would be in a more favorable region. In Sections 5 and 6, we will show that this is, in fact, possible by using the partitioning algorithms we propose.

5 PARTITIONING OF THE DATABASE

Suppose now that we have a centralized database and want to mine it for association rules. We have to divide the database into partitions and then run FPM on the partitions. If we can divide the database in a way that yields high balance and skewness across the partitions, we can achieve much greater savings in resources by using FPM. Even for an already partitioned database, redistributing the data among the partitions may also increase the skewness and balance, thus benefiting FPM. So, how to divide the database into partitions to achieve high balance and skewness becomes an interesting and important problem.

Note that not all databases can be divided into a given number of partitions to yield high skewness and balance. If the data in the database is already highly uniform and homogeneous, without much variation, then any method of dividing it into the given number of partitions would produce similar skewness and balance. However, most real-life databases are not uniform, and there are many variations within them. It is then possible to find a wise way of dividing the data tuples into different partitions that gives much higher balance and skewness than an arbitrary partition. Therefore, if a database intrinsically has nonuniformity, we may dig out such nonuniformity and exploit it to partition the database.

So, it would be beneficial to partition the database carefully. Ideally, the partitioning method should maximize the skewness and balance metrics for any given database. However, such an optimization would be no easier than finding the association rules themselves, and its overhead would be too high to be worthwhile. So, instead of optimizing the skewness and balance values, we use low-cost algorithms that produce reasonably high balance and skewness values. These algorithms should be simple enough that not much overhead is incurred. Such a small overhead is far outweighed by the subsequent savings in running FPM.

5.1 Framework of the Partitioning Algorithms

To keep the partitioning algorithms simple, we base them on the following framework. The core part of the framework is a clustering algorithm, for which we will plug in different clustering algorithms to give different partitioning algorithms. The framework can be divided into three steps.

Conceptually, the first step of the framework divides the transactions in the database into equal-sized chunks C_i. Each chunk contains the same number, y, of transactions. So, there will be a total of z = ⌈|D|/y⌉ chunks. For each chunk, we define a signature σ_i, which is an |I|-dimensional vector. The jth (j = 1, 2, ..., |I|) element of the signature, σ_{i,j}, is the support count of the 1-itemset containing item number j in chunk C_i. Note that each signature σ_i is a vector. This allows us to use functions and operations on vectors to describe our algorithms. The signatures σ_i are then used as representatives of their corresponding chunks C_i. All the σ_i's can be computed by scanning the database once. Moreover, we can immediately deduce the support counts of all 1-itemsets as Σ_i σ_i. This can indeed be exploited by FPM to avoid the first iteration, thus saving one scan in FPM (see Section 7.1 for details). So, overall, we can obtain the signatures without any extra database scans.

The second step is to divide the signatures σ_i into n groups G_k, where n is the number of partitions to be produced. Each group corresponds to a partition in the resulting partitioning. The number of partitions n should be equal to the number of processors to be used for running FPM. A good partitioning algorithm should assign the signatures to the groups according to the following criteria. To increase the resulting skewness, we should put the signatures into groups so that the distance8 between signatures in the same group is small, but the distance between signatures in different groups is high. This would tend to make the resulting partitions more different from one another and the transactions within each partition more similar to one another. Hence, the partitioning would have higher skewness. To increase the workload balance, each group should have a similar signature sum. One way to achieve this is to assign more or less the same number of signatures to each group such that the signature totals of the groups are very close.

The third step of the framework distributes the transactions to the different partitions. For each chunk C_i, we check which group its signature σ_i was assigned to in Step 2. If it was assigned to group G_k, then we send all transactions in that chunk to partition k. After sending out all the chunks, the whole database is partitioned.

It can easily be noticed that Step 2 is the core part of the partitioning algorithm framework. Steps 1 and 3 are simply the preprocessing and postprocessing parts. By using a suitable chunk size, we can reduce the total number of chunks, and hence signatures, to a suitable value, so that they can all be processed in RAM by the clustering algorithm in Step 2. This effectively reduces the amount of information that our clustering algorithm has to handle and, hence, makes the partitioning algorithms very efficient.

We would like to note that, in our approach, we have only made use of the signatures from size-1 itemsets. We could have used those of larger itemsets. However, that would cost more and, eventually, it becomes the problem of finding the large itemsets. Using size-1 itemsets is a good trade-off, and, as we have noted, the resources used in finding them are not wasted. Furthermore, our empirical results (in Section 6) have shown that by just using the size-1 itemsets, we can already achieve reasonable skewness and high balance.

8. Any valid distance function in the |I|-dimensional space may be used.
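The three-step framework above can be sketched as follows. The function names are ours; group_signatures is the pluggable Step 2 clustering algorithm, and the round_robin stand-in below is only a placeholder, not one of the four partitioning algorithms studied in the paper.

```python
def chunk(transactions, y):
    """Step 1a: split D into chunks of y transactions each (the last
    chunk may be smaller), giving ceil(|D|/y) chunks in total."""
    return [transactions[i:i + y] for i in range(0, len(transactions), y)]

def signature(chunk_i, items):
    """Step 1b: the signature of a chunk is an |I|-dimensional vector
    whose j-th element is the support count of item j in the chunk."""
    return [sum(1 for t in chunk_i if item in t) for item in items]

def partition_database(transactions, items, y, n, group_signatures):
    """Steps 1-3 of the framework.  group_signatures(sigs, n) must
    return a group index in 0..n-1 for every signature."""
    chunks = chunk(transactions, y)
    sigs = [signature(c, items) for c in chunks]
    assignment = group_signatures(sigs, n)   # Step 2 (the core part)
    partitions = [[] for _ in range(n)]
    for c, g in zip(chunks, assignment):     # Step 3: deliver chunks
        partitions[g].extend(c)
    return partitions

def round_robin(sigs, n):
    """Placeholder grouping used only to exercise the framework."""
    return [i % n for i in range(len(sigs))]
```

A larger chunk size y shrinks the number of signatures the Step 2 algorithm must handle, which is what lets the clustering run entirely in RAM.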
signatures are assigned to G2. This assignment is also shown in the table. The database is subsequently partitioned by delivering the corresponding chunks C_i to the corresponding partitions. Observe how the sorting has brought the transactions that contribute to the support count of item k1 to partition 2. Since the coordinate values are indeed support counts of the corresponding 1-itemsets, we know that partition 1 has a support count of 0 + 3 + 3 = 6 for item k1, while partition 2 has a support count of 3 + 5 + 7 = 15. Thus, the item k1 has been made to be skewed towards partition 2.

Now, why did we choose item k1 for the sort key, but not item k2? Indeed, we should not choose an arbitrary item for the sort key, because not all items can give good skewness after the sorting. For example, if the item k1 occurs equally frequently in each chunk C_i, then all signatures σ_i will have the same value in the coordinate corresponding to the support count of item k1. Sorting the signatures using this key will not help increase the skewness. On the other hand, if item k1 has a very uneven distribution among the chunks C_i, sorting would tend to deliver chunks with higher occurrences of k1 to the same or nearby groups. In this case, skewness can be increased significantly by sorting. So, we shall choose the items with uneven distribution among the chunks C_i for the sort key.

To measure the unevenness, we use the statistical entropy measure again. For every item X, we evaluate its statistical entropy value among the signature coordinate values σ_{i,X} over all chunks C_i. The item with the highest entropy value is the most unevenly distributed. Besides the unevenness, we have to consider the support count of item X, too. If the support count of X is very small, then we gain little by sorting on X. So, we should consider both the unevenness and the support count of each item X in order to determine the sort key. We multiply the entropy value by the total support count of the item in the whole database (which can be computed by summing up all the signature vectors). The product gives us a measure of how good an item is as the sort key.

Several items can serve as sort keys: the primary key in ascending order of the coordinate values (the support count of the item), the secondary key in descending order of the coordinate values, etc. By alternating the direction of sorting in successive keys, our algorithm can distribute the workload quite evenly, while maintaining a reasonably high skewness.

This idea of using auxiliary sort keys and alternating sort orders is illustrated in Table 4. Here, the primary sort key is k1, while k2 is the secondary sort key. The primary sort key is processed in ascending order, while the secondary key is processed in descending order. Note how the secondary sort key has helped assign σ_6 and σ_3 to one group and σ_2 to another when they have the same value in the primary sort key. This has helped in increasing the skewness by gathering the support counts of k2 for those signatures sharing the same value in the primary sort key. Note also how the alternating sort order has slightly improved the balance, by assigning one less unit of support count, carried by C_2 (corresponding to σ_2), to G2.

So, this resulting partitioning algorithm, which we shall call "Sorting by Highest Entropy Item" (abbreviated as "SHEI"), should thus give high balance and reasonably good skewness.

5.5 Balanced k-means Clustering (Bk)

The balanced k-means clustering algorithm, which we will abbreviate as "Bk," is a modification of the k-means algorithm. The k-means algorithm achieves the skewness criterion. However, it makes no effort to achieve the balance criterion. In Bk, we remedy this by assigning the signatures σ_i to the groups G_j while minimizing the value of the following expression. We also add the constraint that each group receives the same number of signatures. The problem is stated as follows:

Minimize F = α Σ_{i=1}^{z} Σ_{j=1}^{n} ||σ_i − θ_j||² δ(i, j) + β Σ_{j=1}^{n} ||θ_j − E||²,

where δ(i, j) = 1 if signature σ_i is assigned to group G_j (and 0 otherwise), and θ_j is the centroid of group G_j.
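For concreteness, the objective F can be evaluated directly from an assignment. The sketch below is hypothetical (the toy numbers are made up, and the function name is our own); it follows the notation above, with δ(i, j) represented by an explicit group index per signature, θ_j the group centroids, and E the balance target vector:

```python
def objective(signatures, assign, centroids, E, alpha=1.0, beta=1.0):
    """Bk-style objective: alpha * (sum of squared signature-centroid
    distances over the assignment) plus beta * (sum of squared
    centroid-to-E distances)."""
    def sqdist(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))

    skew_term = sum(sqdist(sig, centroids[j])          # first term
                    for sig, j in zip(signatures, assign))
    bal_term = sum(sqdist(c, E) for c in centroids)    # second term
    return alpha * skew_term + beta * bal_term

# Toy data: two 2-dimensional signatures, two groups.
sigs = [[2.0, 0.0], [0.0, 2.0]]
assign = [0, 1]                    # signature i belongs to group assign[i]
cents = [[2.0, 0.0], [0.0, 2.0]]   # centroids coincide with the signatures
E = [1.0, 1.0]                     # balance target (assumed)
print(objective(sigs, assign, cents, E))  # 0 + (2 + 2) = 4.0
```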
The first term measures the signature-centroid distances (the skewness consideration), while the second term measures the workload balance. Therefore, minimizing the above expression would achieve both high balance and skewness. The values of α and β let us control the weight of each criterion. A higher value of α gives more emphasis to the skewness criterion, while a higher value of β would make the algorithm focus on achieving high balance. In our experiments (see Section 6), we set α = β = 1.0. In addition, we use the Euclidean distance function in the calculations.

This minimization problem is not trivial. Therefore, we take an iterative approach, based on the framework of the k-means algorithm. We first make an arbitrary initial assignment of the signatures to the groups, thus giving an initial value for each δ(i, j). Then, we iteratively improve this assignment to lower the value of the objective function.

Each iteration is divided into two steps. In the first step, we treat the values of δ(i, j) as constants and try to minimize F by assigning suitable values to each θ_j. In the next step, we treat all θ_j as constants and adjust the values of δ(i, j) to minimize F. Thus, the values of θ_j and δ(i, j) are adjusted alternately to reduce the value of F. The details are as follows: In each iteration, we first use the same approach as k-means to calculate the geometric centroids θ_j for each group. To reduce the value of the second term (the balance consideration) in the objective function F, we temporarily treat the values of δ(i, j) as constants and find the partial derivatives of the objective function with regard to each θ_j. Solving ∂F/∂θ_j = 0 (j = 1, 2, ..., n), we find that we shall make the assignments

θ_j = (α Σ_{i=1}^{z} σ_i δ(i, j) + β E) / (α m_j + β),  j = 1, 2, ..., n,

where m_j = Σ_{i=1}^{z} δ(i, j) is the number of signatures currently assigned to group G_j, in order to minimize the objective function F. After determining the θ_j, we next adjust the values of δ(i, j), treating the values of θ_j as constants. Since the second term now does not involve any variable δ(i, j), we may reduce the minimization problem to the following problem:

Minimize Σ_{i=1}^{z} Σ_{j=1}^{n} ||σ_i − θ_j||² δ(i, j)

subject to

Σ_{i=1}^{z} δ(i, j1) = Σ_{i=1}^{z} δ(i, j2),  j1, j2 = 1, 2, ..., n;
δ(i, j) = 0 or 1,  i = 1, 2, ..., z; j = 1, 2, ..., n;

where the δ(i, j) are the variables. Note that this is a linear programming problem. We shall call it the "generalized assignment problem." It is indeed a generalization of the Assignment Problem and a specialization of the Transportation Problem in the literature of linear programming. There are many efficient algorithms for solving such problems. The Hungarian algorithm [9], which is designed for solving the Assignment Problem, has been extended to solve the generalized assignment problem. This extended Hungarian algorithm is incorporated as a part of the clustering algorithm; like k-means, it iteratively improves its solution. The iterations are stopped when the assignment becomes stable.

Since the algorithm imposes the constraint that each group gets assigned the same number of signatures (see footnote 11), workload balance is guaranteed in the final partitioning. Under this constraint, it strives to maximize the skewness (by minimizing the signature-centroid distances) like the k-means algorithm does. So, the skewness is at a reasonably high level. The algorithm actually produces very high balance and, while maintaining such high workload balance, it attempts to maximize skewness. So, essentially, the balance factor is given primary consideration. This should suit FPM well.

11. Practically, the constraint is relaxed to allow up to a difference of one between the group sizes, due to the remainder when the number of signatures is divided by the number of groups.

6 EXPERIMENTAL EVALUATION OF THE PARTITIONING ALGORITHMS

To find out whether the partitioning algorithms introduced in Section 5 are effective, we have done two sets of experiments. In these experiments, we first generate synthetic databases. The generated databases are already partitioned, with the desired skewness and workload balance. So, the databases are intrinsically nonuniform. This is, however, not suitable for the experiments for evaluating whether our partitioning algorithms can dig out the skewness and workload balance from a database. So, we "destroy" the apparent skewness and workload balance that already exist among the partitions, by concatenating the partitions to form a centralized database and then shuffling the transactions in the concatenated database. The shuffling destroys the ordering of the transactions, so that an arbitrary partitioning of the resulting database would give a partitioning with low balance and skewness. We can then test whether our partitioning algorithms can produce partitions that give higher workload balance and skewness than an arbitrary partitioning.

The databases used here are similar to those generated in Section 4.3. The number of partitions in the databases is 16. Each partition has 100,000 transactions. Chunk size is set to 1,000 transactions; hence, each partition has 100 chunks and the total number of chunks is 1,600. In order to evaluate the effectiveness of the partitioning algorithms, we have to compare the skewness and workload balance of the resulting partitions against the skewness and balance intrinsic in the database. For this purpose, we take the skewness and workload balance before concatenation as the intrinsic values. All the skewness and balance values reported below are obtained by measurement on the partitions before the concatenation as well as after the partitioning, not from the corresponding control values for the data generation. So, they reflect the actual values for those metrics.

For each generated database, we run all four partitioning algorithms given in Section 5. The skewness and workload balance of the resulting partitions are noted and compared with one another together with the intrinsic values. As discussed in Section 5.3, the result of the random algorithm suggests the highest achievable balance value, while the result of k-means gives the highest achievable skewness value.

We did two series of experiments: 1) the first series varied the intrinsic skewness while keeping the intrinsic balance at a high level; 2) the second series varied the intrinsic balance value while keeping the intrinsic skewness almost constant.
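The two-step iteration described above can be sketched as follows. This is a simplified, hypothetical rendering (not the paper's implementation): the centroid step uses the closed form derived from ∂F/∂θ_j = 0 above, but the equal-size assignment step uses a simple greedy pass in place of the extended Hungarian algorithm, and E is an assumed balance target:

```python
def balanced_kmeans(signatures, n, E, alpha=1.0, beta=1.0, iters=20):
    """Simplified Bk sketch: alternate a balanced assignment step with
    the centroid update theta_j = (alpha*sum_assigned + beta*E) / (alpha*m_j + beta)."""
    z, dim = len(signatures), len(signatures[0])
    group_size = z // n                                  # assume z divisible by n
    centroids = [list(signatures[j]) for j in range(n)]  # arbitrary initial centroids

    def sqdist(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))

    assign = None
    for _ in range(iters):
        # Assignment step: greedily fill groups, respecting equal sizes.
        counts = [0] * n
        assign = []
        for sig in signatures:
            j = min((g for g in range(n) if counts[g] < group_size),
                    key=lambda g: sqdist(sig, centroids[g]))
            assign.append(j)
            counts[j] += 1
        # Centroid step: closed-form minimizer of F for a fixed assignment.
        for j in range(n):
            members = [s for s, g in zip(signatures, assign) if g == j]
            m = len(members)
            centroids[j] = [(alpha * sum(s[d] for s in members) + beta * E[d])
                            / (alpha * m + beta) for d in range(dim)]
    return assign, centroids

sigs = [[9.0, 1.0], [8.0, 2.0], [1.0, 9.0], [2.0, 8.0]]
E = [5.0, 5.0]   # assumed per-group balance target
assign, cents = balanced_kmeans(sigs, n=2, E=E)
print(assign)    # [0, 0, 1, 1]: equal group sizes, similar signatures together
```

The greedy pass guarantees the equal-size constraint by construction, but unlike the extended Hungarian algorithm it does not guarantee an optimal assignment.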
CHEUNG ET AL.: EFFECT OF DATA SKEWNESS AND WORKLOAD BALANCE IN PARALLEL DATA MINING 511
Fig. 7. High intrinsic balance and varied intrinsic skewness: resulting skewness.
Fig. 8. High intrinsic balance and varied intrinsic skewness: resulting balance.
6.1 Effects of High Intrinsic Balance and Varied Intrinsic Skewness

The first series of experiments was to find out how the intrinsic balance and skewness values would affect the effectiveness of SHEI and Bk, given that the intrinsic balance is at a high value and the skewness changes from high to low. Fig. 7 shows the results for the skewness of the resulting partitionings. The vertical axis gives the skewness values of the resulting databases. Every four points on the same vertical line represent the results of partitioning the same initial database. The intrinsic skewness of the database is given on the horizontal axis. For reference, the intrinsic balance values are given directly under the skewness values of the databases. Different curves in the figure show the results of different partitioning algorithms. The k-means algorithm, as explained before, gives the highest skewness achievable by a suitable partitioning. Indeed, the resulting skewness values of k-means are very close to the intrinsic values. Both SHEI and Bk do not achieve this skewness value. This is primarily because they put more emphasis on balance than skewness. Yet, the results show that some of the intrinsic skewness can be recovered. According to the figure, the resulting skewness of Bk is almost always twice that of SHEI. So, the Bk algorithm performs better than SHEI in terms of resulting skewness. This is due to the fact that Bk uses a more sophisticated method of achieving high skewness. Most importantly, the skewness achieved by Bk is between 50 percent and 60 percent of that of the benchmark algorithm k-means, which indicates that Bk can maintain a significant degree of the intrinsic skewness.

Fig. 8 shows the workload balance values for the same partitioned databases. This time, the vertical axis shows the resulting balance. Again, every four points on the same vertical line represent the partitioning of the same original database. The horizontal axis gives the intrinsic balance values of the databases, with the intrinsic skewness values of the corresponding databases given in the row below. The generated databases all have an intrinsic balance very close to 0.90. Random partitioning, of course, yields a high balance value very close to 1.0, which can be taken as the highest achievable balance value. It is encouraging to discover that SHEI and Bk also give good balance values, which are very close to that of the random partitioning.

From this series of experiments, we can conclude that, given a database with good intrinsic balance, even if the intrinsic skewness is not high, both Bk and SHEI can increase the balance to a level as good as that in a random partitioning; in addition, Bk can, at the same time, deliver a good level of skewness, much better than that from the random partitioning. The skewness achieved by Bk is also better than that of SHEI and is of an order comparable to what can be achieved by the benchmark algorithm k-means.

Another way to look at the result of these experiments is to fit the intrinsic skewness and balance value pairs (those on the horizontal axis of Fig. 7) into the regions in Fig. 6. These pairs, which represent the intrinsic skewness and balance of the initial partitions, all fall into region C. Combining the results from Figs. 7 and 8, among the resulting partitions from Bk, five of them have moved to region A, the most favorable region. For the other three, even though their workload balance has been increased, their skewness has not been increased enough to move them out of region C. In summary, a high percentage of the resulting partitions have benefited substantially from using Bk.

6.2 Effects of Reducing Intrinsic Balance

Our second series of experiments attempted to find out how SHEI and Bk would be affected when the intrinsic balance is reduced to a lower level.

Fig. 9 presents the resulting skewness values against the intrinsic skewness. The numbers in the row below show the intrinsic balance values for the corresponding databases. Fig. 10 shows the resulting balance values of the four algorithms on the same partitioning results.

Again, the random algorithm suggests the highest achievable balance value, which is close to 1.0 in all cases. Both SHEI and Bk are able to achieve the same high balance value, which is the most important requirement (Fig. 10). Thus, they are very good at yielding a good balance even if the intrinsic balance is low. As for the resulting skewness, the k-means algorithm gives the highest achievable skewness values, which are very close to the intrinsic values (Fig. 9). Both SHEI and Bk can recover parts of the intrinsic skewness. However, the skewness is reduced more when the intrinsic balance is in the less favorable range (< 0.7). These results are consistent with our understanding. When the intrinsic balance is low,
512 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 14, NO. 3, MAY/JUNE 2002
Fig. 9. Reducing intrinsic balance: resulting skewness.
Fig. 10. Reducing intrinsic balance: resulting balance.
spending effort in rearranging the transactions in the partitions to achieve high balance would tend to reduce the skewness.

Both Bk and SHEI are low-cost algorithms; they spend more effort in achieving a better balance while at the same time trying to maintain a certain level of skewness. Between Bk and SHEI, the resulting skewness of Bk is at least twice that of SHEI in all cases. This shows that Bk is better than SHEI in both cases of high and low intrinsic balance. It is also important to note that the skewness achieved by Bk is always better than that of the random partitioning.

Again, we can fit the skewness and balance values of the initial databases and their resulting partitions (Figs. 9 and 10) into the regions in Fig. 6. What we have found is that, after the partitioning performed by Bk, four partitionings have moved from region C to A, two from region D to C, and two others remain unchanged. This again is very encouraging and shows the effectiveness of Bk: more than 70 percent of the databases would have their performance improved substantially by using Bk and FPM together.

6.3 Summary of the Experimental Results

The above results show that our clustering algorithms SHEI and Bk are very good preprocessors, which prepare a partitioned database that can be mined by FPM efficiently. It is encouraging to note that both Bk and SHEI can achieve a balance as good as random partitioning and also a much better skewness. Between the two, in general, Bk gives much better skewness values than SHEI. Also, the results hold for a wide range of intrinsic balance and skewness values. Referring back to Section 4.3, what we have achieved is the ability to partition a database such that the workload balance falls into the ideal high range while at the same time maintaining a certain level of skewness. Therefore, the resulting partitioning would fall into the favorable regions in Fig. 6. Given this result, we recommend using Bk for the partitioning (or repartitioning) of the database before running FPM.

Note that we did not study the time performance of the partitioning algorithms. This is primarily because the algorithms are so simple that they consume negligible amounts of CPU time. In our experiments, the amount of CPU time is no more than 5 percent of the time spent by the subsequent running of FPM. As for I/O overhead, the general framework of the partitioning algorithms (see Section 5.1) requires only one extra scan of the database, whose purpose is to calculate the signatures σ_i. This cost can be compensated by the saving of the first database scan of FPM (see Section 7.1). After that, no extra I/O is required. We assume that the chunk size y, specified by the user, is large enough so that the total number of chunks z is small enough to allow all the signatures σ_i to be handled in main memory. In this case, the total overhead of the partitioning algorithms is far outweighed by the subsequent resource savings. For a more detailed discussion on the overhead of these partitioning algorithms, please refer to Section 7.1.

7 DISCUSSION

Restricting the search for large itemsets to a small set of candidates is essential to the performance of mining association rules. After a database is partitioned over a number of processors, we have information on the support counts of the itemsets at a finer granularity. This enables us to use the distributed and global prunings discussed in Section 3. However, the effectiveness of these pruning techniques is very dependent on the distribution of transactions among the partitions. We discuss two issues here related to database partitioning and the performance of FPM.

7.1 Overhead of Using the Partitioning Algorithms

We have already shown that FPM benefits over CD the most when the database is partitioned in a way such that the skewness and workload balance measures are high. Consequently, we suggest that partitioning algorithms such as Bk and SHEI be used before FPM, so as to increase the skewness and workload balance for FPM to work faster. But this suggestion is good only if the overhead of the partitioning is not high. We support this claim by dividing the overhead of the partitioning algorithms into two parts for analysis.

The first part is the CPU cost. First, the partitioning algorithms calculate the signatures of the data chunks. This involves only simple arithmetic operations. Next, the algorithms call a clustering algorithm to divide the signatures into groups. Since the number of signatures is much smaller than the number of transactions in the whole
database, the algorithms process much less information than a mining algorithm. Moreover, the clustering algorithms are designed to be simple, so that they are not computationally costly. Finally, the program delivers the transactions in the database to the different partitions. This involves little CPU cost. So, overall, the CPU overhead of the partitioning algorithms is very low. Experimental results show that it is no more than 5 percent of the CPU cost of the subsequent run of FPM.

The second part is the I/O cost. The partitioning algorithms in Section 5 all read the original database twice and write the partitioned database to disk once. In order to enjoy the power of parallel machines for mining, we have to partition the database anyway. Compared with the simplest partitioning algorithm, which must inevitably read the original database once and write the partitioned database to disk once, our partitioning algorithms do only one extra database scan. But it should be remarked that, in our clustering algorithms, this extra scan is for computing the signatures of the chunks. Once the signatures are found, the support counts of all 1-itemsets can be deduced by summation, which involves no extra I/O overhead. So, we can indeed find out the support counts of all 1-itemsets essentially for free. This can be exploited to eliminate the first iteration of FPM, so that FPM can start straight into the second iteration to find the large 2-itemsets. This saves one database scan from FPM and, hence, as a whole, the one-scan overhead of the partitioning algorithm is compensated.

Thus, the partitioning algorithms essentially introduce negligible CPU and I/O overhead to the whole mining activity. Therefore, it is worthwhile to employ our partitioning algorithms to partition the database before running FPM. The great savings from running FPM on a carefully partitioned database far outweigh the overhead of our partitioning algorithms.

7.2 Scalability in FPM

Our performance studies of FPM were carried out on a 32-processor SP2 (Section 4.3). If the number of processors, n, is very large, global pruning may need a large amount of memory to store the local support counts from all the partitions for all the large itemsets found in an iteration. Also, there could be cases in which the set of candidates generated after pruning is still too large to fit into memory. We suggest using a cluster approach to solve this problem. The n processors can be grouped into p clusters (p < n), so that each cluster would have n/p processors. At the top level, support counts are exchanged between the p clusters instead of the n processors. The counts exchanged at this level are the sums of the supports from the processors within each cluster. Both distributed and global prunings can be applied by treating the data in a cluster together as one partition. Within a cluster, the candidates are distributed across the processors, and the support counts at this second level can be computed by count exchange among the processors inside the cluster. In this approach, we only need to ensure that the total distributed memory of the processors in each cluster is large enough to hold the candidates. Given this setting, the approach is highly scalable. In fact, we can regard this approach as an integration of our pruning techniques into the parallel algorithm HD (Section 1). Our pruning technique is orthogonal to the parallelization technique in HD and can be used to enhance its performance.

8 CONCLUSIONS

A parallel algorithm, FPM, for mining association rules has been proposed. FPM is a modification of FDM and requires fewer rounds of message exchanges. Performance studies carried out on an IBM SP2 shared-nothing parallel system show that FPM consistently outperforms CD. The gain in performance of FPM is due mainly to the pruning techniques incorporated.

It has been found that the effectiveness of the pruning techniques depends highly on two data distribution characteristics: data skewness and workload balance. An entropy-based metric has been proposed to measure each of these two characteristics. Our analysis and experimental results show that the pruning techniques are very sensitive to workload balance, though good skewness also has important positive effects. The techniques are very effective in the best case of high balance and high skewness. The combination of high balance and moderate skewness is the second-best case.

This is our motivation to introduce algorithms that partition a database in a wise way, so as to get higher balance and skewness values. We have compared four partitioning algorithms. With the balanced k-means (Bk) clustering algorithm, we can achieve a very high workload balance and, at the same time, a reasonably good skewness. Our experiments have demonstrated that many unfavorable partitions can be repartitioned by Bk into partitions that allow FPM to perform more efficiently. Moreover, the overhead of the partitioning algorithms is negligible and can be compensated by saving one database scan in the mining process. Therefore, we can obtain very high association rule mining efficiency by partitioning a database with Bk and then mining it with FPM. We have also discussed a cluster approach which can bring scalability to FPM.

ACKNOWLEDGMENTS

This research is supported in part by a Hong Kong Research Grants Council (RGC) grant, project number HKU 7023/98E.

REFERENCES

[1] R. Agrawal, T. Imielinski, and A. Swami, "Mining Association Rules between Sets of Items in Large Databases," Proc. ACM-SIGMOD Int'l Conf. Management of Data, 1993.
[2] R. Agrawal and R. Srikant, "Fast Algorithms for Mining Association Rules," Proc. 20th Very Large Databases Conf., 1994.
[3] R. Agrawal and J.C. Shafer, "Parallel Mining of Association Rules: Design, Implementation and Experience," Technical Report TJ10004, IBM Research Division, Almaden Research Center, 1996.
[4] S. Brin, R. Motwani, J. Ullman, and S. Tsur, "Dynamic Itemsets Counting and Implication Rules for Market Basket Data," Proc. ACM-SIGMOD Int'l Conf. Management of Data, 1997.
[5] T.M. Cover and T.A. Thomas, Elements of Information Theory. John Wiley & Sons, 1991.
[6] D.W. Cheung, J. Han, V. Ng, and C.Y. Wong, "Maintenance of Discovered Association Rules in Large Databases: An Incremental Updating Technique," Proc. 12th Int'l Conf. Data Eng., 1996.
[7] D.W. Cheung, J. Han, V.T. Ng, A.W. Fu, and Y. Fu, "A Fast Distributed Algorithm for Mining Association Rules," Proc. Fourth Int'l Conf. Parallel and Distributed Information Systems, 1996.
[8] D.W. Cheung, V.T. Ng, A.W. Fu, and Y. Fu, "Efficient Mining of Association Rules in Distributed Databases," IEEE Trans. Knowledge and Data Eng. (Special Issue on Data Mining), vol. 8, no. 6, pp. 911-922, Dec. 1996.
[9] S.K. Gupta, Linear Programming and Network Models. New Delhi: Affiliated East-West Press, 1985.
[10] E. Han, G. Karypis, and V. Kumar, "Scalable Parallel Data Mining for Association Rules," Proc. ACM-SIGMOD Int'l Conf. Management of Data, 1997.
[11] J. Han and Y. Fu, "Discovery of Multiple-Level Association Rules from Large Databases," Proc. 21st Very Large Databases Conf., 1995.
[12] M.A.W. Houtsma and A.N. Swami, "Set-Oriented Mining for Association Rules in Relational Databases," Proc. 11th Int'l Conf. Data Eng., 1995.
[13] International Business Machines, Scalable POWERparallel Systems, GA23-2475-02 ed., Feb. 1995.
[14] L. Kaufman and P.J. Rousseeuw, Finding Groups in Data: An Introduction to Cluster Analysis. John Wiley and Sons, 1990.
[15] H. Mannila, H. Toivonen, and A.I. Verkamo, "Efficient Algorithms for Discovering Association Rules," AAAI Workshop Knowledge Discovery in Databases (KDD-94), July 1994.
[16] Message Passing Interface Forum, MPI: A Message-Passing Interface Standard, May 1994.
[17] J.S. Park, M.S. Chen, and P.S. Yu, "An Effective Hash-Based Algorithm for Mining Association Rules," Proc. ACM-SIGMOD Int'l Conf. Management of Data, May 1995.
[18] J.S. Park, M.S. Chen, and P.S. Yu, "Efficient Parallel Mining for Association Rules," Proc. Fourth Int'l Conf. Information and Knowledge Management, 1995.
[19] A. Savasere, E. Omiecinski, and S. Navathe, "An Efficient Algorithm for Mining Association Rules in Large Databases," Proc. 21st Very Large Databases Conf., 1995.
[20] R. Srikant and R. Agrawal, "Mining Generalized Association Rules," Proc. 21st Very Large Databases Conf., 1995.
[21] R. Srikant and R. Agrawal, "Mining Sequential Patterns: Generalizations and Performance Improvements," Proc. Fifth Int'l Conf. Extending Database Technology, 1996.
[22] R. Srikant and R. Agrawal, "Mining Quantitative Association Rules in Large Relational Tables," Proc. ACM-SIGMOD Int'l Conf. Management of Data, 1996.
[23] T. Shintani and M. Kitsuregawa, "Hash Based Parallel Algorithms for Mining Association Rules," Proc. Fourth Int'l Conf. Parallel and Distributed Information Systems, 1996.
[24] H. Toivonen, "Sampling Large Databases for Mining Association Rules," Proc. 22nd Very Large Databases Conf., 1996.
[25] M.J. Zaki, M. Ogihara, S. Parthasarathy, and W. Li, "Parallel Data Mining for Association Rules on Shared-Memory Multi-Processors," Technical Report 618, Computer Science Dept., Univ. of Rochester, May 1996.

David W. Cheung received the MSc and PhD degrees in computer science from Simon Fraser University, Canada, in 1985 and 1989, respectively. He also received the BSc degree in mathematics from the Chinese University of Hong Kong. From 1989 to 1993, he was with Bell Northern Research, Canada, where he was a senior member of the scientific staff. Since 1994, he has been a faculty member of the Department of Computer Science and Information Systems in The University of Hong Kong. He is now the associate director of the E-Business Technology Institute in HKU. His research interests include data mining, data warehousing, Web-based information retrieval, and XML technology for e-commerce. He is the program committee chairman of the Fifth Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD-01) to be held in Hong Kong. He is also a member of the ACM, the IEEE, and the IEEE Computer Society.

Sau D. Lee received the MPhil degree in computer science in 1998 and the BSc degree in computer science with first class honors in 1995, both from the University of Hong Kong. He is a technology officer in the E-Business Technology Institute, the University of Hong Kong. He is working in the team specializing in XML-related technologies, taking the role of system architect and project coordinator. Prior to joining ETI in 1999, he worked as a research assistant at the University of Hong Kong during 1997 and 1998. His research interests include Web-based technologies, database systems, data mining, and indexing. From 1995 to 1997, he was also a teaching assistant in the Computer Science Department at the University of Hong Kong.

Yongqiao Xiao received the BS degree in accounting with a minor in information systems from Renmin University of China in 1992, the MS degree in computer science from Zhongshan University in 1995, and the PhD degree in computer science from Southern Methodist University in 2000. He is currently working with Trilogy Software Inc., Austin, Texas. He has been a reviewer for IEEE Transactions on Knowledge and Data Engineering, the Very Large Databases Conference, etc. His major research interests include data mining, clickstream analysis, and parallel computing.