Proceedings of the 37th Hawaii International Conference on System Sciences - 2004
CBW: An Efficient Algorithm for Frequent Itemset Mining∗
Ja-Hwung Su
Institute of Information Engineering
I-Shou University
Kaohsiung 840, Taiwan
bb0820@ms22.hinet.net
Wen-Yang Lin
Dept. of Information Management
I-Shou University
Kaohsiung 840, Taiwan
wylin@isu.edu.tw
Abstract
Frequent itemset generation is the prerequisite and most time-consuming step of association rule mining. Most efficient Apriori-like algorithms rely heavily on the minimum support constraint to prune the vast number of non-candidate itemsets. This pruning technique, however, becomes less useful in real applications where the supports of interesting itemsets are extremely small, such as medical diagnosis and fraud detection. In this paper, we propose a new algorithm that maintains its performance even at relatively low supports. Empirical evaluations show that our algorithm is, on average, more than an order of magnitude faster than Apriori-like algorithms.
1. Introduction
Mining association rules from a large database of business data, such as transaction records, has been a hot topic within the area of data mining. This problem is motivated by applications known as market basket analysis, which seek relationships between items purchased by customers [2], that is, what kinds of products tend to be purchased together.
An association rule is an expression of the form X ⇒ Y, where X and Y are sets of items. Such a rule reveals that transactions in the database containing the items in X tend to contain the items in Y as well. The probability of this, measured as the fraction of transactions containing X that also contain Y, is called the confidence of the rule. The support of the rule is the fraction of all transactions that contain every item in both X and Y.
For an association rule to hold, the support and the confidence of the rule should satisfy a user-specified minimum support, called minsup, and minimum confidence, called minconf, respectively. The problem of mining association rules is to discover all association rules that satisfy minsup and minconf.
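As a concrete illustration of these two measures, the following minimal sketch computes the support and confidence of a rule over a small transaction list. The data are the ten transactions that appear later as Table 2; the function names (`count`, `support`, `confidence`) are chosen here for illustration and are not part of the paper.

```python
# Toy data: the ten transactions listed in Table 2 of this paper.
TRANSACTIONS = [
    {"B", "C", "D", "E"}, {"A", "C"}, {"B"}, {"A", "B", "C", "D"},
    {"A", "B", "D"}, {"C", "D"}, {"A", "B", "D", "E"}, {"B", "C", "D"},
    {"B", "C", "D", "E"}, {"B", "C", "D"},
]


def count(itemset, transactions):
    """Number of transactions that contain every item of `itemset`."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t)


def support(x, y, transactions):
    """Support of X => Y: fraction of transactions containing both X and Y."""
    return count(set(x) | set(y), transactions) / len(transactions)


def confidence(x, y, transactions):
    """Confidence of X => Y: fraction of transactions with X that also have Y."""
    return count(set(x) | set(y), transactions) / count(x, transactions)


if __name__ == "__main__":
    print(support({"D"}, {"C"}, TRANSACTIONS))     # 0.6
    print(confidence({"D"}, {"C"}, TRANSACTIONS))  # 0.75
```

With minsup = 40% and minconf = 80%, the rule {D} ⇒ {C} would satisfy the support constraint but not the confidence constraint.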
In general, the work of association rule mining can be decomposed into two phases: (1) Frequent itemset generation: find all itemsets whose support is at least minsup, and (2) Rule construction: from the frequent itemsets, generate all association rules whose confidence is at least minconf. Since the second phase is straightforward and less expensive, we concentrate only on the first phase, finding all frequent itemsets.
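To show why the second phase is comparatively cheap, here is a small sketch of rule construction from already-counted frequent itemsets. It assumes a dictionary `COUNTS` that maps every frequent itemset to its support count; the counts below were computed by hand from the Table 2 data at minsup = 40%, and `generate_rules` is a hypothetical helper, not a procedure from the paper.

```python
from itertools import combinations


def generate_rules(counts, minconf):
    """Return (antecedent, consequent, confidence) for all rules meeting minconf.

    `counts` must contain every frequent itemset; by downward closure it then
    also contains every non-empty subset of each of them.
    """
    rules = []
    for itemset, n in counts.items():
        for size in range(1, len(itemset)):
            for antecedent in combinations(sorted(itemset), size):
                x = frozenset(antecedent)
                conf = n / counts[x]  # no database access is needed here
                if conf >= minconf:
                    rules.append((x, itemset - x, conf))
    return rules


# Support counts of the frequent itemsets of Table 2 at minsup = 40%.
COUNTS = {
    frozenset("A"): 4, frozenset("B"): 8, frozenset("C"): 7, frozenset("D"): 8,
    frozenset("BC"): 5, frozenset("BD"): 7, frozenset("CD"): 6,
    frozenset("BCD"): 5,
}

if __name__ == "__main__":
    for x, y, conf in generate_rules(COUNTS, 0.8):
        print(sorted(x), "=>", sorted(y), round(conf, 3))
```

Every confidence is a ratio of two stored counts, so no further database scan is required once the frequent itemsets and their supports are known.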
Most efficient Apriori-like algorithms rely heavily on the minimum support constraint to prune the vast number of non-candidate itemsets. This pruning technique, however, becomes less useful in real applications where the supports of interesting itemsets are extremely small, such as medical diagnosis and fraud detection. This is because the number of candidate itemsets increases exponentially as the minimum support threshold decreases, and ultimately almost all itemsets become candidates at a very low support threshold.
In this paper, we propose a new algorithm, called CBW, which employs a bi-directional search strategy and hybridizes various techniques in frequent itemset generation. Empirical evaluations show that our algorithm can maintain its performance even at relatively low support thresholds, and can be more than two orders of magnitude faster than Apriori-like algorithms.
The rest of this paper is organized as follows. A
review of previous work is given in Section 2. In
Section 3, we describe the proposed algorithm for
finding frequent itemsets. Empirical evaluations of our
algorithm on Foodmart2000 and IBM’s synthetic data
set are described in Section 4. Finally, conclusion and
future work are stated in Section 5.
∗ This work was partially supported by the National Science Council of ROC under grant No. NSC90-2213-E-214-040.
0-7695-2056-1/04 $17.00 (C) 2004 IEEE
2. Previous work
In the literature, a substantial number of methods have been proposed for mining association rules. The most well-known and influential algorithm is Apriori [2], which uses a priori knowledge of the frequent k-itemsets to generate candidate (k+1)-itemsets and employs an innovative technique to prune non-promising candidates.
The most criticized drawback of Apriori is that when the cardinality of the longest frequent itemsets is k, Apriori needs k passes over the database. In addition, the Apriori algorithm is computation-intensive in generating the candidate itemsets and counting their support values, especially for applications with a very low support threshold and/or a vast number of items.
Many variants have thus been proposed to improve efficiency, including DHP [11], Partition [12], DIC [4], Eclat [14], Top-down [14], and FP-growth [6], among others. Although these variants adopt different techniques and different points of view, they can be categorized along three algorithmic aspects: (1) Counting strategy: vertical intersection vs. horizontal counting; (2) Search strategy: breadth-first search (BFS) vs. depth-first search (DFS); and (3) Search direction: top-down vs. bottom-up. Although the first two aspects have been addressed in [7][8], no comparison has revealed the influence of the last aspect. Table 1 shows a three-dimensional classification of prevailing Apriori-like algorithms.
Table 1. A three-dimensional classification view of prevailing Apriori-like algorithms
Counting strategy   Bottom-up, DFS   Bottom-up, BFS      Top-down (DFS / BFS)
counting            FP-growth        Apriori, DHP, DIC   Top-down
intersection        Eclat            Partition
Counting strategy. This refers to the method used to count the occurrences of candidate itemsets. To date, there are two main approaches: horizontal counting and vertical intersection. Horizontal counting determines the support value of a candidate itemset by scanning the transactions one by one and incrementing the counter of the itemset whenever it is a subset of the transaction. This approach works well for a rarely occurring candidate because only those transactions containing that itemset need to be inspected. The candidate lookup operation, however, is costly for candidates of large size.
Vertical intersection, on the other hand, is employed when the database is represented in a vertical format, in which each record is associated with an item and stores the identifiers of the transactions containing that item, called its tidlist. Although the vertical intersection scheme eliminates the I/O cost of database scans, it has the following deficiency: when the support count of a candidate itemset is much smaller than the number of transactions, a large number of unnecessary intersections occur.
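The two counting strategies can be contrasted with a short sketch. It is only an illustration on the Table 2 data: the helper names are hypothetical, plain Python sets stand in for tidlists, and no hash tree or disk-resident structure is modeled.

```python
TRANSACTIONS = [
    {"B", "C", "D", "E"}, {"A", "C"}, {"B"}, {"A", "B", "C", "D"},
    {"A", "B", "D"}, {"C", "D"}, {"A", "B", "D", "E"}, {"B", "C", "D"},
    {"B", "C", "D", "E"}, {"B", "C", "D"},
]


def horizontal_count(candidate, transactions):
    """Scan the transactions one by one and count those containing the candidate."""
    candidate = set(candidate)
    return sum(1 for t in transactions if candidate <= t)


def build_tidlists(transactions):
    """Vertical format: map each item to the set of tids (1-based) containing it."""
    tidlists = {}
    for tid, t in enumerate(transactions, start=1):
        for item in t:
            tidlists.setdefault(item, set()).add(tid)
    return tidlists


def vertical_count(candidate, tidlists):
    """Intersect the tidlists of the candidate's items; the size is the support."""
    lists = [tidlists.get(item, set()) for item in candidate]
    return len(set.intersection(*lists)) if lists else 0


if __name__ == "__main__":
    tidlists = build_tidlists(TRANSACTIONS)
    print(horizontal_count({"B", "C", "D"}, TRANSACTIONS))  # 5
    print(vertical_count({"B", "C", "D"}, tidlists))        # 5
```

Both functions return the same support count; they differ in what they touch: the horizontal version reads every transaction, while the vertical version reads only the tidlists of the candidate's items.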
Search direction. Most Apriori-like approaches adopt a bottom-up traversal of the search space, starting from the frequent 1-itemsets and moving upward to the longest frequent itemsets. The main advantage of this paradigm is that it can effectively prune the search space by exploiting the downward closure property: once an itemset is recognized as infrequent, all of its supersets are infrequent as well. This advantage fades, however, when a relatively small support threshold places most of the maximal frequent itemsets near the largest itemset of the search lattice. In this case, there are very few itemsets to be pruned.
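The pruning effect of downward closure can be sketched as a candidate generation step in the style of Apriori. This is a simplified illustration (a quadratic join over sets rather than the prefix-based join of [2]); `apriori_gen` is a name chosen here for the example.

```python
from itertools import combinations


def apriori_gen(frequent_k, k):
    """Build (k+1)-candidates from frequent k-itemsets.

    A candidate is kept only if all of its k-subsets are frequent, which is
    exactly the downward closure property used for pruning.
    """
    frequent_k = set(frequent_k)
    candidates = set()
    for a in frequent_k:
        for b in frequent_k:
            union = a | b
            if len(union) != k + 1:
                continue
            if all(frozenset(s) in frequent_k for s in combinations(union, k)):
                candidates.add(union)
    return candidates


if __name__ == "__main__":
    f2 = {frozenset("AB"), frozenset("BC"), frozenset("BD"), frozenset("CD")}
    # {A, B, C} and {A, B, D} are pruned because {A, C} and {A, D} are not in f2.
    print(apriori_gen(f2, 2))
```

When nearly every k-itemset is frequent, the subset test almost never rejects a candidate, which is the situation the paragraph above describes for low support thresholds.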
Another traversal proceeds in the opposite direction, i.e., starting from the longest itemsets and moving downward to the frequent 1-itemsets, or top-down for short. This strategy is traditionally adopted for discovering maximal frequent itemsets [1][3][13]. Note, however, that although all frequent itemsets can be derived from the maximal ones, further counting is required to obtain their exact supports for computing the confidences of association rules. Moreover, if there is a vast number of items and/or the support threshold is very low, many infrequent itemsets have to be visited before the maximal frequent itemsets are identified. This is why most work on frequent itemset mining embraces the bottom-up paradigm instead.
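The point about maximal itemsets can be made concrete with a small sketch: the maximal frequent itemsets determine which itemsets are frequent, but not their supports. The example assumes the Table 2 data at minsup = 40%, for which the maximal frequent itemsets are {A} and {B, C, D} (derived by hand); the helper names are hypothetical.

```python
from itertools import combinations

TRANSACTIONS = [
    {"B", "C", "D", "E"}, {"A", "C"}, {"B"}, {"A", "B", "C", "D"},
    {"A", "B", "D"}, {"C", "D"}, {"A", "B", "D", "E"}, {"B", "C", "D"},
    {"B", "C", "D", "E"}, {"B", "C", "D"},
]

# Maximal frequent itemsets of Table 2 at minsup = 40% (assumed, derived by hand).
MAXIMAL = [frozenset("A"), frozenset("BCD")]


def expand_maximal(maximal_itemsets):
    """Every non-empty subset of a maximal frequent itemset is frequent."""
    frequent = set()
    for m in maximal_itemsets:
        items = sorted(m)
        for size in range(1, len(items) + 1):
            frequent.update(frozenset(c) for c in combinations(items, size))
    return frequent


def count_supports(itemsets, transactions):
    """The extra counting pass needed before rule confidences can be computed."""
    return {x: sum(1 for t in transactions if x <= t) for x in itemsets}


if __name__ == "__main__":
    frequent = expand_maximal(MAXIMAL)
    print(len(frequent))  # 8
    print(count_supports(frequent, TRANSACTIONS)[frozenset("BCD")])  # 5
```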
Search strategy. While the search direction determines the way in which the search space is explored, the search strategy refers to the order in which itemsets are visited. Most Apriori-like algorithms employ breadth-first search because it facilitates the pruning of candidates by downward closure. This strategy, however, requires more memory to keep the frequent subsets of the pruned candidates.
An alternative strategy called DFS, on the other
hand, recursively visits the descendants of an itemset.
In the literature [14], this strategy is usually combined
with the counting strategy of vertical intersection
because it suffices to keep in memory the tidlists
corresponding to the itemsets on the path from the root
down to the currently inspected one.
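A minimal sketch of depth-first search combined with tidlist intersection is given below. It follows the general idea described for [14] but is not that paper's algorithm; `dfs` and `mine` are hypothetical names, and the recursion keeps only the tidlist of the current prefix alive on each path.

```python
TRANSACTIONS = [
    {"B", "C", "D", "E"}, {"A", "C"}, {"B"}, {"A", "B", "C", "D"},
    {"A", "B", "D"}, {"C", "D"}, {"A", "B", "D", "E"}, {"B", "C", "D"},
    {"B", "C", "D", "E"}, {"B", "C", "D"},
]


def dfs(prefix, prefix_tids, extensions, min_count, out):
    """Extend `prefix` by each remaining item, recursing while still frequent."""
    for i, (item, tids) in enumerate(extensions):
        new_tids = tids if prefix_tids is None else prefix_tids & tids
        if len(new_tids) >= min_count:
            itemset = prefix + (item,)
            out[itemset] = len(new_tids)
            # Only new_tids (the tidlist of the current path) is passed down.
            dfs(itemset, new_tids, extensions[i + 1:], min_count, out)


def mine(transactions, min_count):
    """Return {itemset tuple: support count} for all frequent itemsets."""
    tidlists = {}
    for tid, t in enumerate(transactions, start=1):
        for item in t:
            tidlists.setdefault(item, set()).add(tid)
    out = {}
    dfs((), None, sorted(tidlists.items()), min_count, out)
    return out


if __name__ == "__main__":
    result = mine(TRANSACTIONS, 4)
    print(result[("B", "C", "D")])  # 5
```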
3. The proposed Cut-Both-Ways (CBW)
algorithm
3.1. Algorithm basics
As mentioned in the previous section, the performance bottleneck of frequent itemset generation lies in two aspects: database scanning and support counting. Most contemporary efficient algorithms for frequent itemset generation are devoted to attacking these two issues. We notice, however, that almost all of these algorithms suffer from performance degradation as the minimum support decreases: they behave well under large minimum supports, but their performance drops significantly as the minimum support decreases. Unfortunately, in some applications the minimum support must be set relatively small in order to mine interesting patterns from the database.
The effect of a lower support threshold has two facets: on the one hand, many more candidate itemsets are generated and inspected; on the other hand, the cardinalities of the maximal frequent itemsets become larger. Therefore, the cost of support counting grows dramatically. To alleviate this problem, we propose an algorithm called CBW (Cut the space & employ Both-Ways search). The basic idea is illustrated in Figure 1.
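[Figure: itemset pyramid. Levels from top to bottom: longest frequent itemsets; frequent (α+1)-itemsets; frequent α-itemsets (cutting level α); frequent (α−1)-itemsets; frequent 1-itemsets. The upward search runs from the cutting level toward the top, the downward search from the cutting level toward the bottom.]
Figure 1. Concept illustration of CBW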
Viewing the solution space as a pyramid in which each frequent itemset is located at the level equal to its cardinality, we first seek an appropriate cutting level α that divides the space into two parts. After identifying all frequent itemsets at this level, we perform a downward search to enumerate all frequent itemsets below the cutting level α and determine their support values, followed by an upward search to enumerate all frequent itemsets with cardinalities larger than α.
The insight behind this paradigm is that, as stated in Section 2, no approach based on a single algorithmic strategy performs best in all cases. Bottom-up search suffers from too many database scans, as well as wasted set enumeration to count the supports of candidates at higher cardinality levels, especially when the support threshold is low. Top-down search, on the other hand, cannot utilize the anti-monotone property to prune the search space. Furthermore, according to the investigation in [7], vertical intersection performs much better in counting the support values of candidates with larger cardinality, while horizontal counting favors the opposite situation. All of this suggests hybridizing different algorithmic approaches to attack the itemset mining problem.
Our paradigm has the following features. First, by guessing an appropriate cutting level, we can identify the most promising cardinality at which to perform the downward search. If our guess is correct, i.e., most of the itemsets under this level are frequent, then the benefit of Apriori pruning is outweighed by its cost. It is therefore more economical to enumerate all candidate itemsets and count their supports within one database scan. Second, in the upward search for frequent itemsets with cardinalities larger than the cutting level, the synergy gained from Apriori pruning and vertical intersection can save a great deal of unnecessary computation.
3.2. Algorithm detail
Before carrying out the paradigm shown in Figure 1, we need to determine an appropriate cutting level, a crucial factor in the effectiveness of CBW. As will become clear later, if the cutting level is too low, unnecessary intersections will happen frequently during the upward search. Conversely, if the cutting level is too high, the downward search will spend much more time enumerating itemsets and counting their supports. We therefore have to balance the costs of the upward and downward searches to find an appropriate cutting level. An intuitive idea is to use the average cardinality of the frequent itemsets, expecting that most of the frequent itemsets will appear at this level. This value, however, cannot be obtained without knowing the frequent itemsets. We thus adopt the simple heuristic described below.
Definition 1. Let D be a transaction table and ti the i-th transaction. The cutting level α is defined as
$$\alpha = \mathrm{INT}\left[\frac{\sum_{i=1}^{|D|} \left| t_i^{\perp minsup} \right|}{|D|}\right], \qquad (1)$$
where INT[r] denotes the nearest integer to r, for r ≥ 1, and ti⊥minsup (written $t_i^{\perp minsup}$ in the formula) is the set of items in ti whose support is at least minsup. More specifically,
ti⊥minsup = {x | x ∈ ti and sup(x) ≥ minsup}.
As an illustration, consider the transaction data in Table 2 and assume that minsup = 40%. Then the frequent 1-itemsets are {A}, {B}, {C}, and {D}. The cutting level α is thus (3 + 2 + 1 + 4 + 3 + 2 + 3 + 3 + 3 + 3) / 10 = 2.7, which rounds to 3.
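The heuristic of Definition 1 can be sketched in a few lines. The sketch takes the minimum support as an absolute count (4 of 10 transactions for minsup = 40%) and implements INT as round-half-up, which is an assumption made here; `cutting_level` is a hypothetical name.

```python
TRANSACTIONS = [
    {"B", "C", "D", "E"}, {"A", "C"}, {"B"}, {"A", "B", "C", "D"},
    {"A", "B", "D"}, {"C", "D"}, {"A", "B", "D", "E"}, {"B", "C", "D"},
    {"B", "C", "D", "E"}, {"B", "C", "D"},
]


def cutting_level(transactions, min_count):
    """Average number of frequent items per transaction, rounded to an integer."""
    item_counts = {}
    for t in transactions:
        for item in t:
            item_counts[item] = item_counts.get(item, 0) + 1
    frequent_items = {x for x, c in item_counts.items() if c >= min_count}
    total = sum(len(t & frequent_items) for t in transactions)
    # INT[r]: nearest integer, implemented here as round-half-up (assumption).
    return int(total / len(transactions) + 0.5)


if __name__ == "__main__":
    print(cutting_level(TRANSACTIONS, 4))  # 3
```

On the Table 2 data the per-transaction counts sum to 27, so the average is 2.7 and the returned level is 3, matching the worked example above.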
Table 2. An example transaction data set
tid   items
1     B, C, D, E
2     A, C
3     B
4     A, B, C, D
5     A, B, D
6     C, D
7     A, B, D, E
8     B, C, D
9     B, C, D, E
10    B, C, D
C2 = the set of candidate 2-itemsets generated from F1;
numf = 0;
for i = 1 to |D| do
    scan the i-th transaction ti;
    delete the items in ti that are not in F1;
    numf += |ti|;
    if |ti| ≥ 3 then
        add this tid into the tidlist of each item in ti;
    end if
    for each 2-subset X of ti with X ∈ C2 do
        X.count++;
    end for
end for
F2 = {X | X ∈ C2, sup(X) ≥ minsup};
α = INT[numf / |D|];
return F2 and α;
Figure 3. Procedure Trans
A detailed description of the CBW algorithm is given in Figures 2, 3, 4 and 5, where sup(X) denotes the support of an itemset X and T denotes the set of tidlists.
Input: The transaction database D and minimum support minsup;
Output: The set of frequent itemsets F;
scan D to generate all frequent 1-itemsets F1;
Trans(D, T, F1, F2, α);
Dwnsearch(D, DF, Fα, α, minsup);
Upsearch(T, UF, Fα, α, minsup);
return F = DF ∪ UF;
Figure 2. Algorithm CBW
The algorithm starts with scanning the database to
generate the set of all frequent 1-itemsets F1, which is
necessary for computing the cutting level defined in
Equation 1.
for i = 1 to |D| do
    scan the i-th transaction ti;
    delete the items in ti that appear in fewer than two itemsets of F2;
    for each subset X of ti with 3 ≤ |X| ≤ α do
        X.count++;
    end for
end for
DF = {X | sup(X) ≥ minsup};
return DF;
Figure 4. Procedure Dwnsearch
if Fα = ∅ then return UF = ∅;
read the set of tidlists T and prune non-candidate tidlists;
UF = ∅; k = α; Fk = Fα;
repeat
    k++;
    Ck = the set of new candidate k-itemsets generated from Fk−1;
    for each X ∈ Ck do
        perform bit-vector intersection on X;
        compute the support of X, sup(X);
    end for
    Fk = {X | sup(X) ≥ minsup, X ∈ Ck};
    UF = UF ∪ Fk;
until Fk = ∅
return UF;
Figure 5. Procedure Upsearch
Procedure Trans is responsible for three different tasks: (1) computing the cutting level α; (2) generating the set of frequent 2-itemsets F2; and (3) transforming the database into vertical tidlists T. For each scanned transaction t, the number of frequent items is accumulated for the later computation of α. Also, for each frequent item in t, we add the tid of t to the item's tidlist if the cardinality of t (after infrequent items are removed) is no less than 3. In essence, only those transactions with cardinality larger than α have to be kept, because the upward search starts from level α+1. Recall, however, that at this stage we do not yet know α; cardinality 3 is the best bound currently available to facilitate tidlist pruning. To reduce the memory requirement, all of the tidlists generated in this phase are stored on disk for later use in the course of the upward search.
For example, consider Table 2 again. We have C2 = {{A, B}, {A, C}, {A, D}, {B, C}, {B, D}, {C, D}}. Since item E does not appear in F1, there is no need to create the tidlist of E. Furthermore, the tids of t2, t3 and t6 are not included in the tidlist of any frequent item because their cardinalities are less than 3. The resulting tidlists after this stage are shown in Figure 6.
item   tidlist
A      4, 5, 7
B      1, 4, 5, 7, 8, 9, 10
C      1, 4, 8, 9, 10
D      1, 4, 5, 7, 8, 9, 10
Figure 6. The resulting tidlists generated by procedure Trans
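The three tasks of procedure Trans can be sketched in Python as follows. This is an in-memory illustration under stated assumptions, not the authors' implementation: tidlists are plain lists rather than disk-resident bit vectors, the minimum support is an absolute count, INT is implemented as round-half-up, and the function name `trans` and its return shape are chosen for the example.

```python
from itertools import combinations

TRANSACTIONS = [
    {"B", "C", "D", "E"}, {"A", "C"}, {"B"}, {"A", "B", "C", "D"},
    {"A", "B", "D"}, {"C", "D"}, {"A", "B", "D", "E"}, {"B", "C", "D"},
    {"B", "C", "D", "E"}, {"B", "C", "D"},
]


def trans(transactions, f1, min_count):
    """One pass that yields F2 (with counts), the cutting level, and tidlists."""
    c2 = {frozenset(p): 0 for p in combinations(sorted(f1), 2)}
    tidlists = {item: [] for item in f1}
    numf = 0
    for tid, t in enumerate(transactions, start=1):
        kept = sorted(set(t) & set(f1))   # drop items that are not in F1
        numf += len(kept)
        if len(kept) >= 3:                # shorter transactions cannot support
            for item in kept:             # any itemset of cardinality >= 3
                tidlists[item].append(tid)
        for pair in combinations(kept, 2):
            c2[frozenset(pair)] += 1
    f2 = {x: c for x, c in c2.items() if c >= min_count}
    alpha = int(numf / len(transactions) + 0.5)  # round-half-up (assumption)
    return f2, alpha, tidlists


if __name__ == "__main__":
    f2, alpha, tidlists = trans(TRANSACTIONS, {"A", "B", "C", "D"}, 4)
    print(alpha)          # 3
    print(tidlists["A"])  # [4, 5, 7]
```

Run on Table 2 with F1 = {A, B, C, D}, the sketch reproduces the tidlists of Figure 6 and yields F2 = {{B, C}, {B, D}, {C, D}}.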
Next, procedure Dwnsearch is executed. Each transaction is scanned and, according to Proposition 1, those items that appear in fewer than two frequent itemsets of F2 are pruned. The trimmed transaction then undergoes set enumeration to generate all candidate itemsets with cardinalities between 3 and α, and their supports are counted.
Proposition 1. If an item x appears in fewer than k−1 frequent itemsets of Fk−1, then x is not contained in any frequent itemset of Fk.
Rationale. Let I = {a1, a2, …, ai = x, ai+1, …, ak} be a k-itemset of Fk. Note that I has exactly k subsets of cardinality k−1, of which k−1 contain item x; the only one that does not is {a1, a2, …, ai−1, ai+1, …, ak}. According to downward closure, if I is frequent then all of its subsets are frequent as well, so x appears in at least k−1 itemsets of Fk−1. The proposition then follows.
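Proposition 1 translates into a simple occurrence count over Fk−1. The sketch below returns the items that survive the test; the function name is hypothetical.

```python
def items_surviving_prop1(frequent_prev, k):
    """Items that occur in at least k-1 itemsets of F_{k-1}.

    By Proposition 1, any other item cannot belong to a frequent k-itemset.
    """
    occurrences = {}
    for itemset in frequent_prev:
        for item in itemset:
            occurrences[item] = occurrences.get(item, 0) + 1
    return {item for item, n in occurrences.items() if n >= k - 1}


if __name__ == "__main__":
    f2 = {frozenset("AB"), frozenset("BC"), frozenset("BD"), frozenset("CD")}
    # A occurs in only one 2-itemset, so it cannot be part of a frequent 3-itemset.
    print(sorted(items_surviving_prop1(f2, 3)))  # ['B', 'C', 'D']
```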
For example, for the first transaction t1, item E is pruned first because it is not frequent. Transactions t2, t3 and t6 are discarded because their cardinalities are less than 3. In this way, we obtain the support counts of all candidate itemsets. Finally, we discard the itemsets with supports less than minsup. The resulting set of frequent 3-itemsets is {{B, C, D}}.
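The downward search can be sketched as a single pass that trims each transaction and enumerates its subsets of cardinality 3 to α. This is an illustration under assumptions (in-memory data, absolute support count, hypothetical function name `dwnsearch`), not the authors' code.

```python
from itertools import combinations

TRANSACTIONS = [
    {"B", "C", "D", "E"}, {"A", "C"}, {"B"}, {"A", "B", "C", "D"},
    {"A", "B", "D"}, {"C", "D"}, {"A", "B", "D", "E"}, {"B", "C", "D"},
    {"B", "C", "D", "E"}, {"B", "C", "D"},
]


def dwnsearch(transactions, f2, alpha, min_count):
    """Count all itemsets of cardinality 3..alpha in one scan and keep the frequent ones."""
    occurrences = {}
    for pair in f2:
        for item in pair:
            occurrences[item] = occurrences.get(item, 0) + 1
    # Proposition 1 with k = 3: keep items found in at least two itemsets of F2.
    keep = {item for item, n in occurrences.items() if n >= 2}
    counts = {}
    for t in transactions:
        trimmed = sorted(set(t) & keep)
        for size in range(3, min(alpha, len(trimmed)) + 1):
            for subset in combinations(trimmed, size):
                key = frozenset(subset)
                counts[key] = counts.get(key, 0) + 1
    return {x: c for x, c in counts.items() if c >= min_count}


if __name__ == "__main__":
    f2 = {frozenset("BC"), frozenset("BD"), frozenset("CD")}
    print(dwnsearch(TRANSACTIONS, f2, 3, 4))
```

With F2 = {{B, C}, {B, D}, {C, D}} and α = 3, the only itemset returned is {B, C, D} with a count of 5, which agrees with the example above.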
After generating all frequent itemsets with cardinalities no more than α, the CBW algorithm performs an upward search for the remaining frequent itemsets. Following the Apriori paradigm, the Upsearch procedure generates the frequent itemsets level by level in a bottom-up fashion, starting from the frequent itemsets at level α. It first checks whether Fα is empty; if so, it returns immediately. Otherwise, it reads the tidlists T and prunes the tidlists of items that appear in fewer than α frequent itemsets of Fα. Furthermore, to accelerate the tidlist intersections used for counting the itemsets, we adopt the techniques of fast intersection [14] and caching of intermediate results [8].
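The level-wise upward search can be sketched with Python integers serving as bit vectors. This is a simplified illustration: the fast-intersection and caching techniques of [14] and [8] and the tidlist pruning step are omitted, candidate generation uses a plain quadratic join, and `upsearch` is a hypothetical name.

```python
from itertools import combinations


def upsearch(tidlists, f_alpha, alpha, min_count):
    """Return {itemset: count} for frequent itemsets of cardinality > alpha."""
    if not f_alpha:
        return {}
    # Encode each tidlist as an integer whose bit `tid` is set.
    bits = {item: sum(1 << tid for tid in tids) for item, tids in tidlists.items()}
    uf = {}
    prev, k = set(f_alpha), alpha
    while prev:
        k += 1
        candidates = set()
        for a in prev:
            for b in prev:
                union = a | b
                if len(union) == k and all(
                    frozenset(s) in prev for s in combinations(union, k - 1)
                ):
                    candidates.add(union)
        current = set()
        for cand in candidates:
            vector = -1                      # all bits set; ANDing narrows it down
            for item in cand:
                vector &= bits[item]
            count = bin(vector).count("1")   # support = number of surviving tids
            if count >= min_count:
                uf[cand] = count
                current.add(cand)
        prev = current
    return uf


# Tidlists of Figure 6.
TIDLISTS = {
    "A": [4, 5, 7],
    "B": [1, 4, 5, 7, 8, 9, 10],
    "C": [1, 4, 8, 9, 10],
    "D": [1, 4, 5, 7, 8, 9, 10],
}

if __name__ == "__main__":
    # Running example: alpha = 3 and F3 = {{B, C, D}}; nothing lies above level 3.
    print(upsearch(TIDLISTS, {frozenset("BCD")}, 3, 4))  # {}
```

For the running example the upward search finds nothing, because a single frequent 3-itemset cannot produce a 4-itemset candidate. If the cut were hypothetically placed at α = 2 with F2 = {{B, C}, {B, D}, {C, D}}, the same call would discover {B, C, D} with a count of 5 by intersection alone.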
4. Empirical evaluations
To evaluate the performance of CBW, we tested several data sets, including Foodmart2000, provided with Microsoft SQL2000, and the synthetic data sets T6.I4.D100K, T15.I4.D200K, T15.I6.D200K and T15.I8.D200K, which were generated using the IBM data generator [2]. The experiments were performed on an HP LH6000R workstation with 1GB RAM and an 18GB HD running Windows 2000 Server.
For comparison, we implemented two leading Apriori variants, Apriori and Partition; the former is based on the horizontal counting strategy using a hash tree structure, while the latter relies on vertical tidlist intersection. Note that, for a fair comparison with the other methods, our implementation of Partition did not "partition" the database into several chunks; that is, our Partition is an in-core method. We also included in all evaluations a publicly available implementation of FP-growth provided by the Database Research Group at the Chinese University of Hong Kong [5].
4.1. Foodmart2000 database
We first compared the execution times of CBW and the three other approaches on Foodmart2000, using different minimum support counts ranging from 3 to 15. Table 3 shows the characteristics of Foodmart2000. The results are shown in Figure 7.
Table 3. Data parameters of Foodmart2000
Parameter                             Value
|D|   Number of transactions          60000
|t|   Average size of transactions    4
N     Number of items                 200
We observed that
1. Because the support counts of most itemsets in Foodmart2000 are very small, minsup must be set relatively small to generate interesting patterns. But a small minsup leads to a huge number of candidate itemsets, causing too many redundant intersections of tidlists. As a result, computing support counts by tidlist intersection costs more than computing them with a hash tree. That is why Apriori outperforms Partition.
2. At high minimum supports, the speedup of CBW over Apriori is not significant, because the number of candidate itemsets generated by CBW at lower itemset levels is larger than that generated by Apriori. But at higher itemset levels, the counting cost of CBW is much less than that of Apriori.
3. FP-growth performs well for high support thresholds but degrades rapidly as the support threshold decreases. Although FP-growth does not explicitly generate and store candidate itemsets, it needs to construct conditional pattern bases, and from them conditional FP-trees, to generate all frequent itemsets. When minsup is very low, the overhead of building the conditional pattern bases and conditional FP-trees, and of recursively traversing the trees to generate frequent patterns, overwhelms the cost saved in candidate itemset generation.
Figure 7. Execution times of Apriori, Partition, FP-growth and CBW on Foodmart2000 (x-axis: minimum support count, from 15 down to 3; y-axis: time in seconds)
Figure 8. Execution times of CBW for various α's, with minsup count = 12 (x-axis: cutting level α, from 1 to 6; y-axis: time in seconds)
We also conducted experiments to see the influence of different cutting levels. The results are depicted in Figure 8. We observed that the cutting level has a great effect on the performance of CBW, and that there exists a best cutting level. The reason is that cutting at a higher level speeds up the upward search but increases the computation of the downward search. Conversely, a lower cutting level speeds up the downward search but slows down the upward search. Figure 9 shows that the best cutting level is affected by the minimum support count: the larger the minimum support count, the smaller the best cutting level.
Figure 9. Evolution of the best cutting level under different minsups (x-axis: minimum support count, from 12 down to 3; y-axis: best α)
4.2. Synthetic database
We next compared the four algorithms on the synthetic data sets under different minsups. The results are shown in Figures 10, 11 and 12. Our observations are as follows:
1. CBW outperformed all other methods in all cases.
2. While all algorithms suffered from the combinatorial exploration of itemsets due to low support constraints, CBW was the best at maintaining its performance.
3. The longer the itemsets became, the worse all four algorithms performed. The reason is that the cost of candidate generation, support counting, and conditional pattern base and FP-tree construction grows as the itemset length increases.
Figure 10. Execution times of Apriori, Partition, FP-growth and CBW on T15.I4.D200K (x-axis: minsup (%), from 0.0 to 2.5; y-axis: time in seconds)
Figure 11. Execution times of Apriori, Partition, FP-growth and CBW on T15.I6.D200K (x-axis: minsup (%), from 0.0 to 2.5; y-axis: time in seconds)
Figure 12. Execution times of Apriori, Partition, FP-growth and CBW on T15.I8.D200K (x-axis: minsup (%), from 0.0 to 2.5; y-axis: time in seconds)
We also evaluated the execution times of CBW under different cutting levels (minsup = 1.0%) and the influence of minsup on the best cutting level. We show only the results for T6.I4.D100K, in Figures 13 and 14; similar results were observed for the other data sets. The results conform to those observed in Figures 8 and 9.
Finally, we conducted an experiment to evaluate the scalability of the four algorithms. The results are shown in Figure 15, where we omitted Apriori because its performance was significantly inferior to the others. As the figure shows, CBW exhibited the best scalability while FP-growth exhibited the worst.
Figure 13. Execution times of CBW on T6.I4.D100K for various α's (x-axis: cutting level α, from 1 to 6; y-axis: time in seconds)
5. Concluding remarks
5.1. Summary
In this paper, we have described a new efficient algorithm for frequent itemset mining. Unlike contemporary algorithms, which adopt either a top-down or a bottom-up traversal of the itemset lattice to search for frequent itemsets, our algorithm makes an educated guess at the most promising itemset level (the cutting level) and generates all frequent itemsets located there. It then performs a downward search, followed by an upward search, to discover all other frequent itemsets. Our empirical study showed that the algorithm is more than an order of magnitude faster than the Apriori variants.
Figure 14. Evolution of the best cutting level for CBW under different minsups, running on T6.I4.D100K (x-axis: minsup, from 1.00% down to 0.25%; y-axis: best cutting level)
Our CBW algorithm has been incorporated into an
online multidimensional association rule mining
system currently under development [10]. In the future,
we will incorporate into CBW the taxonomy
information and extend it to allow multiple minimum
support specification [13].
Figure 15. Scalability evaluation of CBW, FP-growth and Partition running on T15.I8.D200K with minsup = 1.0% (x-axis: number of transactions (x 10,000), from 2 to 20; y-axis: time in seconds)
5.2. Comparison with related work
To our knowledge, [9] is the only work on
combining the top-down and the bottom-up searches
for association mining. But their approach and
intention are quite different from ours.
First, rather than starting from the middle of the
search space and progressively searching towards both
ends, their approach proceeds from both ends of the
search lattice and progressively searches towards the
middle.
Second, their approach aims at discovering not all frequent itemsets but only the maximal frequent itemsets, i.e., frequent itemsets having no frequent supersets, a task that is considerably simpler than discovering all frequent itemsets. Furthermore, if their approach is applied to frequent itemset mining, the "top-down pruning" technique on which it relies becomes useless, because the subsets of a frequent itemset found in the top-down search still have to be counted to obtain their supports. In this way, the top-down search becomes unnecessary and their method degenerates to the Apriori algorithm.
In contrast, we believe that our method can be adapted to the problem of mining maximal frequent itemsets [1][3][9]. Indeed, we are currently working on applying CBW to this problem and hope to report results in the near future.
References
[1] R.C. Agarwal, C.C. Aggarwal, and V.V.V. Prasad,
"Depth first generation of long patterns," in Proceedings
of 6th ACM SIGKDD International Conference on
Knowledge Discovery and Data Mining, 2000, pp.
108−118.
[2] R. Agrawal and R. Srikant, "Fast Algorithms for Mining
Association Rules," in Proceedings of the 20th VLDB
Conference, 1994, pp. 487−499.
[3] R.J. Bayardo Jr., "Efficiently Mining Long Patterns from
Databases," in Proceedings of 1998 ACM SIGMOD
International Conference on Management of Data.
Seattle, Washington, USA, 1998, pp. 85−93.
[4] S. Brin, R. Motwani, J.D. Ullman, and S. Tsur,
"Dynamic Itemset Counting and Implication Rules for
Market Basket Data," SIGMOD Record, Vol. 26, 1997,
pp. 255−264.
[5] Database Research Group in the Department of
Computer Science and Engineering at the Chinese
University of Hong Kong,
http://www.cse.cuhk.edu.hk/~kdd/program.html.
[6] J. Han, J. Pei, and Y. Yin, "Mining Frequent Patterns
Without Candidate Generation," in Proceedings of the
2000 ACM SIGMOD International Conference on
Management of Data, Dallas, TX, USA, 2000, pp. 1−12.
[7] J. Hipp, U. Guntzer, and G. Nakhaeizadeh, "Algorithms
for Association Rule Mining - A General Survey and
Comparison," SIGKDD Explorations, Vol. 2, 2000, pp.
58−64.
[8] J. Hipp, U. Guntzer, and G. Nakhaeizadeh, "Mining
Association Rules: Deriving a Superior Algorithm by
Analyzing Today’s Approaches," in Proceedings of 4th
European Symposium on Principles of Data Mining and
Knowledge Discovery (PKDD’00), 2000, pp. 159−168.
[9] D. Lin and Z.M. Kedem, "Pincer-search: An Efficient
Algorithm for Discovering the Maximum Frequent Set,"
IEEE Transactions on Knowledge and Data
Engineering, Vol. 14, No. 3, 2002, pp. 553−566.
[10] W.Y. Lin, J.H. Su and M.C. Tseng, "OMARS: The
Framework of an Online Multi-dimensional Association
Rules Mining System," in Proceedings of the 2nd
International Conference on Electronic Business, Taipei,
Taiwan, 2002, pp. 216−225.
[11] J.S. Park, M.S. Chen, and P.S. Yu, "An Effective
Hash-Based Algorithm for Mining Association Rules," in
Proceedings of the 1995 ACM SIGMOD International
Conference on Management of Data, San Jose, CA,
USA, 1995, pp. 175−186.
[12] A. Savasere, E. Omiecinski, and S. Navathe, "An
Efficient Algorithm for Mining Association Rules in
Large Databases," in Proceedings of the 21st VLDB
Conference, 1995, pp. 432−444.
[13] M.C. Tseng and W.Y. Lin, "Mining Generalized
Association Rules with Multiple Minimum Supports," in
Proceedings of International Conference on Data
Warehousing and Knowledge Discovery, Munich,
Germany, 2001, pp. 11-20.
[14] M.J. Zaki, "Scalable Algorithms for Association
Mining," IEEE Transactions on Knowledge and Data
Engineering, Vol. 12, No. 2, 2000, pp. 372−390.