CBW: an efficient algorithm for frequent itemset mining

Wen-Yang  Lin

CBW: an efficient algorithm for frequent itemset mining

37th Annual Hawaii International Conference on System Sciences, 2004. Proceedings of the, 2004

CBW: An Efficient Algorithm for Frequent Itemset Mining ∗ Ja-Hwung Su Institute of Information Engineering I-Shou University Kaohsiung 840, Taiwan bb0820@ms22.hinet.net Wen-Yang Lin Dept. of Information Management I-Shou University Kaohsiung 840, Taiwan wylin@isu.edu.tw ∗ This work was partially supported by the National Science Council of ROC under grant No.NSC90-2213-E-214-040. Abstract Frequent itemset generation is the prerequisite and most time-consuming process for association rule mining. Nowadays, most efficient Apriori-like algorithms rely heavily on the minimum support constraint to prune a vast amount of non-candidate itemsets. This pruning technique, however, becomes less useful for some real applications where the supports of interesting itemsets are extremely small, such as medical diagnosis, fraud detection, among the others. In this paper, we propose a new algorithm that maintains its performance even at relative low supports. Empirical evaluations show that our algorithm is, on the average, more than an order of magnitude faster than Apriori-like algorithms. 1. Introduction Mining association rules from a large database of business data, such as transaction records, has been a hot topic within the area of data mining. This problem is motivated by applications known as market basket analysis to find relationships between items purchased by customers [2], that is, what kinds of products tend to be purchased together. An association rule is an expression of the form X Y, where X and Y are sets of items. Such a rule reveals that transactions in the database containing items in X tend to contain items in Y, and the probability, measured as the fraction of transactions containing X also containing Y, is called the confidence of the rule. The support of the rule is the fraction of the transactions that contain all items both in X and Y. For an association rule to hold, the support and the confidence of the rule should satisfy a user-specified minimum support, called minsup, and minimum confidence, called minconf, respectively. The problem of mining association rules is to discover all association rules that satisfy minsup and minconf. In general, the work of association rules mining can be decomposed into two phases: (1) Frequent itemsets generation: find out all itemsets that sufficiently exceed the minsup, and (2) Rules construction: from the frequent itemsets generate all association rules having confidence higher than the minconf. Since the second phase is straightforward and less expensive, we concentrate only on the first phase for finding all frequent itemsets. Nowadays, most efficient Apriori-like algorithms rely heavily on the minimum support constraint to prune a vast amount of non-candidate itemsets. This pruning technique, however, becomes less useful for some real applications where the supports of interesting itemsets are extremely small, such as medical diagnosis, fraud detection, among the others. This is because the number of candidate itemsets exponentially increases as the minimum support threshold decreases, and ultimately, almost all itemsets will become candidates at very low support threshold. In this paper, we propose a new algorithm, called CBW, which employs a bi-directional search strategy and hybridizes various techniques in frequent itemset generation. Empirical evaluations show that our algorithm can maintain its performance even at relative low support thresholds, and can be more than two orders of magnitude faster than Apriori-like algorithms. The rest of this paper is organized as follows. A review of previous work is given in Section 2. In Section 3, we describe the proposed algorithm for finding frequent itemsets. Empirical evaluations of our algorithm on Foodmart2000 and IBM’s synthetic data set are described in Section 4. Finally, conclusion and future work are stated in Section 5. Proceedings of the 37th Hawaii International Conference on System Sciences - 2004 0-7695-2056-1/04 $17.00 (C) 2004 IEEE 1

2. Previous work In the literature, there have been a substantial number of methods for mining association rules. The most well-known and influential algorithm is Apriori [2], which uses an a priori knowledge of frequent k- itemsets to generate candidate (k+1)-itemets and employs an innovative technique to prune non- promising candidates. The most criticized drawback of Apriori is that when the cardinality of the longest frequent itemsets is k, Apriori needs k passes of database scans. In addition, the Apirori algorithm is computation-intensive in generating the candidate itemsets and counting the support values, especially for applications with very low support threshold and/or a vast amount of items. Many variants thus were proposed to improve the efficiency, including DHP [11], Partition [12], DIC [4], Eclat [14], Top-down [14], FP-growth [6], among the others. Although these variants adopt different techniques and employ in different view of points, they can be categorized from three different algorithmic aspects: (1) Counting strategy: vertical intersection vs. horizontal counting; (2) Search strategy: breadth-first search (BFS) vs. depth-first search (DFS); and (3) Search direction: top-down vs. bottom-up. Although the first two aspects have been addressed in [7][8], no comparison has revealed the influence of the last aspect. Table 1 shows a three-dimensional classification of prevailing Apriori-like algorithms. Table 1. A three-dimensional classification view of prevailing Apriori-like algorithms Search direction: bottom-up top-down Counting Search strategy: Search strategy: strategy: DFS BFS DFS BFS counting FP- growth Apriori DHP DIC Top-down intersection Eclat Partition Counting Strategy. This refers to the methods used to count the occurrences of candidate itemsets. Up to date, there are two main approaches: horizontal counting and vertical intersection. The horizontal counting determines the support value of a candidate itemset by scanning transaction one by one, and increasing the counter of the itemset if it is a subset of the transaction. This approach works well for a rarely occurred candidate because only those transactions containing that itemset need to be inspected. The candidate look up operation, however, is costly for candidates of large size. Vertical intersection, on the other hand, is employed when the database is represented as a vertical format such that each record is associated with an item to store the identifiers of the transactions containing that item, called tidlist. Though the vertical intersection scheme eliminates the I/O cost for database scan, it has the following deficiency: When the support count of a candidate itemset is quite less than the number of transactions, there occurs a large amount of unnecessary intersections. Search direction. Nowadays, most Apriori-like approaches adopt bottom-up traversal of the search space, starting from all frequent 1-itemsets upward to the longest frequent itemsets. The main advantage of this paradigm is that it can effectively prune the search space by exploiting downward closure property: once an itemset is recognized as infrequent, all of its supersets are infrequent as well. This advantage fades, however, when most of the maximal frequent itemsets locating near the largest itemset of the search lattice, due to a relatively small support threshold. In this case, there are very few itemsets to be pruned. Another itemset traversal is employed in the opposite direction, i.e. starting from the longest itemsets downward to the frequent 1-itemsets, or top- down for short. This strategy is traditionally adopted for discovering maximal frequent itemsets [1][3][13]. But notice that though all of the frequent itemsets can be derived from their maximal ones, further counting strategies are required to obtain their exact supports for computing the confidences of association rules. Meanwhile, if there are vast numbers of items and/or the support threshold is very low, many infrequent itemsets have to be visited before the maximal frequent itemsets are identified. This is why most work on frequent itemsets mining embraces the bottom-up paradigm instead. Search strategy. While the search direction guides the way that the search space is exploited, the search strategy refers to the order in which itemsets are visited. Most Apriori-like algorithms employ breadth-first search because it can facilitate the pruning of candidates with downward closure. This strategy, however, requires more memory to keep the frequent subsets of the pruned candidates. An alternative strategy called DFS, on the other hand, recursively visits the descendants of an itemset. In the literature [14], this strategy is usually combined with the counting strategy of vertical intersection because it suffices to keep in memory the tidlists corresponding to the itemsets on the path from the root down to the currently inspected one. Proceedings of the 37th Hawaii International Conference on System Sciences - 2004 0-7695-2056-1/04 $17.00 (C) 2004 IEEE 2

Proceedings of the 37th Hawaii International Conference on System Sciences - 2004 CBW: An Efficient Algorithm for Frequent Itemset Mining∗ Ja-Hwung Su Institute of Information Engineering I-Shou University Kaohsiung 840, Taiwan bb0820@ms22.hinet.net Abstract Frequent itemset generation is the prerequisite and most time-consuming process for association rule mining. Nowadays, most efficient Apriori-like algorithms rely heavily on the minimum support constraint to prune a vast amount of non-candidate itemsets. This pruning technique, however, becomes less useful for some real applications where the supports of interesting itemsets are extremely small, such as medical diagnosis, fraud detection, among the others. In this paper, we propose a new algorithm that maintains its performance even at relative low supports. Empirical evaluations show that our algorithm is, on the average, more than an order of magnitude faster than Apriori-like algorithms. 1. Introduction Mining association rules from a large database of business data, such as transaction records, has been a hot topic within the area of data mining. This problem is motivated by applications known as market basket analysis to find relationships between items purchased by customers [2], that is, what kinds of products tend to be purchased together. An association rule is an expression of the form X Y, where X and Y are sets of items. Such a rule reveals that transactions in the database containing items in X tend to contain items in Y, and the probability, measured as the fraction of transactions containing X also containing Y, is called the confidence of the rule. The support of the rule is the fraction of the transactions that contain all items both in X and Y. For an association rule to hold, the support and the confidence of the rule should satisfy a user-specified minimum support, called minsup, and minimum confidence, called minconf, respectively. The problem ∗ Wen-Yang Lin Dept. of Information Management I-Shou University Kaohsiung 840, Taiwan wylin@isu.edu.tw of mining association rules is to discover all association rules that satisfy minsup and minconf. In general, the work of association rules mining can be decomposed into two phases: (1) Frequent itemsets generation: find out all itemsets that sufficiently exceed the minsup, and (2) Rules construction: from the frequent itemsets generate all association rules having confidence higher than the minconf. Since the second phase is straightforward and less expensive, we concentrate only on the first phase for finding all frequent itemsets. Nowadays, most efficient Apriori-like algorithms rely heavily on the minimum support constraint to prune a vast amount of non-candidate itemsets. This pruning technique, however, becomes less useful for some real applications where the supports of interesting itemsets are extremely small, such as medical diagnosis, fraud detection, among the others. This is because the number of candidate itemsets exponentially increases as the minimum support threshold decreases, and ultimately, almost all itemsets will become candidates at very low support threshold. In this paper, we propose a new algorithm, called CBW, which employs a bi-directional search strategy and hybridizes various techniques in frequent itemset generation. Empirical evaluations show that our algorithm can maintain its performance even at relative low support thresholds, and can be more than two orders of magnitude faster than Apriori-like algorithms. The rest of this paper is organized as follows. A review of previous work is given in Section 2. In Section 3, we describe the proposed algorithm for finding frequent itemsets. Empirical evaluations of our algorithm on Foodmart2000 and IBM’s synthetic data set are described in Section 4. Finally, conclusion and future work are stated in Section 5. This work was partially supported by the National Science Council of ROC under grant No.NSC90-2213-E-214-040. 0-7695-2056-1/04 $17.00 (C) 2004 IEEE 1 Proceedings of the 37th Hawaii International Conference on System Sciences - 2004 2. Previous work In the literature, there have been a substantial number of methods for mining association rules. The most well-known and influential algorithm is Apriori [2], which uses an a priori knowledge of frequent kitemsets to generate candidate (k+1)-itemets and employs an innovative technique to prune nonpromising candidates. The most criticized drawback of Apriori is that when the cardinality of the longest frequent itemsets is k, Apriori needs k passes of database scans. In addition, the Apirori algorithm is computation-intensive in generating the candidate itemsets and counting the support values, especially for applications with very low support threshold and/or a vast amount of items. Many variants thus were proposed to improve the efficiency, including DHP [11], Partition [12], DIC [4], Eclat [14], Top-down [14], FP-growth [6], among the others. Although these variants adopt different techniques and employ in different view of points, they can be categorized from three different algorithmic aspects: (1) Counting strategy: vertical intersection vs. horizontal counting; (2) Search strategy: breadth-first search (BFS) vs. depth-first search (DFS); and (3) Search direction: top-down vs. bottom-up. Although the first two aspects have been addressed in [7][8], no comparison has revealed the influence of the last aspect. Table 1 shows a three-dimensional classification of prevailing Apriori-like algorithms. Table 1. A three-dimensional classification view of prevailing Apriori-like algorithms Search direction: bottom-up top-down Counting Search strategy: Search strategy: DFS BFS DFS BFS strategy: counting FPApriori Top-down growth DHP DIC intersection Eclat Partition Counting Strategy. This refers to the methods used to count the occurrences of candidate itemsets. Up to date, there are two main approaches: horizontal counting and vertical intersection. The horizontal counting determines the support value of a candidate itemset by scanning transaction one by one, and increasing the counter of the itemset if it is a subset of the transaction. This approach works well for a rarely occurred candidate because only those transactions containing that itemset need to be inspected. The candidate look up operation, however, is costly for candidates of large size. Vertical intersection, on the other hand, is employed when the database is represented as a vertical format such that each record is associated with an item to store the identifiers of the transactions containing that item, called tidlist. Though the vertical intersection scheme eliminates the I/O cost for database scan, it has the following deficiency: When the support count of a candidate itemset is quite less than the number of transactions, there occurs a large amount of unnecessary intersections. Search direction. Nowadays, most Apriori-like approaches adopt bottom-up traversal of the search space, starting from all frequent 1-itemsets upward to the longest frequent itemsets. The main advantage of this paradigm is that it can effectively prune the search space by exploiting downward closure property: once an itemset is recognized as infrequent, all of its supersets are infrequent as well. This advantage fades, however, when most of the maximal frequent itemsets locating near the largest itemset of the search lattice, due to a relatively small support threshold. In this case, there are very few itemsets to be pruned. Another itemset traversal is employed in the opposite direction, i.e. starting from the longest itemsets downward to the frequent 1-itemsets, or topdown for short. This strategy is traditionally adopted for discovering maximal frequent itemsets [1][3][13]. But notice that though all of the frequent itemsets can be derived from their maximal ones, further counting strategies are required to obtain their exact supports for computing the confidences of association rules. Meanwhile, if there are vast numbers of items and/or the support threshold is very low, many infrequent itemsets have to be visited before the maximal frequent itemsets are identified. This is why most work on frequent itemsets mining embraces the bottom-up paradigm instead. Search strategy. While the search direction guides the way that the search space is exploited, the search strategy refers to the order in which itemsets are visited. Most Apriori-like algorithms employ breadth-first search because it can facilitate the pruning of candidates with downward closure. This strategy, however, requires more memory to keep the frequent subsets of the pruned candidates. An alternative strategy called DFS, on the other hand, recursively visits the descendants of an itemset. In the literature [14], this strategy is usually combined with the counting strategy of vertical intersection because it suffices to keep in memory the tidlists corresponding to the itemsets on the path from the root down to the currently inspected one. 0-7695-2056-1/04 $17.00 (C) 2004 IEEE 2 Proceedings of the 37th Hawaii International Conference on System Sciences - 2004 3. The proposed Cut-Both-Ways (CBW) algorithm 3.1. Algorithm basics As mentioned in the previous section, the performance bottleneck of frequent itemsets generation lies in two aspects: the database scan and support counting. Most contemporary efficient algorithms for frequent itemset generation are devoted to attack these two issues. We notice, however, almost of these algorithms suffer from performance degradation as the minimum support decreases; they behave well under large minimum supports, but as the minimum support decreases, their performances decrease significantly. Unfortunately, in some applications, the minimum support must be specified relatively small to mine interesting patterns from database. The effect of a lower support threshold has two facets: On the one hand, much more candidate itemsets are generated and inspected, and on the other hand, the cardinalities of maximum frequent itemsets become larger. Therefore, the computation of support counting grows dramatically. To alleviate this problem, we propose an algorithm called CBW (Cut the space & employ Both-Ways search). The basic idea is illustrated in Figure 1. Itemset Pyramid Longest frequent itemsets Cutting level α Upward search Frequent (α+1)-itmesets Frequent α-itmesets Frequent (α−1)-itmesets Downward search Frequent 1-itmesets Figure 1. Concept illustration of CBW Viewing the solution space as a pyramid that contains frequent itemsets located at different levels equal to their cardinalities, we first pursue an appropriate cutting level α to divide the space into two different parts. After identifying all frequent itemset at this level, we perform a downward search to enumerate all frequent itemsets below the cutting level α and determine their support values, followed by an upward search to enumerate all frequent itemsets with cardinalities larger than α. The insight behind this paradigm is that, as stated in Section 2, no approach based on single algorithmic strategy performs the best in all cases. Bottom-up search will suffer too many database scans as well as wasted set enumeration to count the supports of candidates at higher cardinality levels, especially in the case of low support threshold. Top-down search, on the other hand, can not utilize the anti-monotone property to prune the search space. Furthermore, according to the investigation in [7], vertical intersection performs much better in counting the support values of candidates with larger cardinality, while horizontal counting favors the opposite situation. All of these suggest hybridizing different algorithmic approaches to attack the itemset mining problem. Our paradigm has the following features. First, by guessing the appropriate cutting level, we can identify the most promising cardinality to perform downward search. If our guess is correct, i.e., most of the itemsets under this level are frequent, then the effectiveness of Apriori pruning is overwhelmed by its cost. Therefore, it is more economic to enumerate all candidate itemsets and count their supports within one database scan. Second, for upward searching frequent itemsets with cardinalities larger than the cutting level, the synergy gained from the Apriori pruning and vertical intersection can save lots of unnecessary computations. 3.2. Algorithm detail Before carrying out the paradigm shown in Figure 1, we need to determine an appropriate cutting level, a crucial factor to the effectiveness of CBW. As will be clear later, if the cutting level is too low, unnecessary intersections will happen frequently during upward searching. On the contrary, if the cutting level is too high, the downward search will spend much more time in itemsets enumeration and counting their supports. Therefore, we have to trade the favors between upward search and downward search to find an appropriate cutting level. An insightful idea is to pursue the average cardinality of frequent itemsets, expecting that most of the frequent itemsets will appear in this level. This value, however, is impossible to be obtained without knowing the frequent itemsets. We thus adopt a simple heuristic described in the following. Definition 1. Let D be a transaction table and ti the i-th transaction. The cutting level α is defined below: ª ¦ t i⊥minsup º », D « » ¬ ¼ α = ΙΝΤ « 0-7695-2056-1/04 $17.00 (C) 2004 IEEE 3 Proceedings of the 37th Hawaii International Conference on System Sciences - 2004 where INT[r] denote the nearest integer of r, for r ≥ 1, and ti ⊥minsup be the set of items in ti with support larger than minsup. More specifically, ti ⊥minsup = {x | x ∈ ti and sup(x) ≥ minsup}. For an illustration, consider the transaction data in Table 2. Assume that minsup = 40%. Then, the frequent 1-itemsets include {A}, {B}, {C}, and {D}. The cutting level α thus is (3 + 2 + 1 + 4 + 3 + 2 + 3 + 3 + 3 + 3) / 10 ≈ 3. Table 2. An example transaction data tid 1 2ʳ 3 4 5 6 7 8 9 10 items B, C, D, Eʳ A, C B A, B, C, D A, B, D C, D A, B, D, E B, C, D B, C, D, E B, C, D 1. 2. 3. 4. 5. 6. 7. 8. 9. 10. 11. 12. 13. 14. C2 = the set of candidate 2-itemsets generated from F1; numf = 0; for i = 1 to |D| do scan the i-th transaction ti; delete the items in ti that is not in F1; numf += | ti|; if | ti| ≥ 3 then add this tid into the tidlist of each item in ti; for each 2-subset X of ti and X ∈ C2 do X.count++; end for F2={X | sup(X) ≥ minsup}; α = numf / |D|; return F2 and α; Figure 3. Procedure Trans 1. 2. 3. Detail description of the CBW algorithm is given in Figures 2, 3, 4 and 5, where sup(X) denotes the support of an itemset X and T denotes the set of tidlists. Input: The transaction database D and minimum support minsup; Output: The set of frequent itemsets F; 1. scan D to generate all frequent 1-itemsets F1; 2. Trans(D, T, F1, F2, α); 3. Dwnsearch (D, DF, Fα, α, minsup); 4. Upsearch (T, UF, Fα, α, minsup); 5. return F = DF ∪ UF; Figure 2. Algorithm CBW The algorithm starts with scanning the database to generate the set of all frequent 1-itemsets F1, which is necessary for computing the cutting level defined in Equation 1. Procedure Trans is responsible for three different tasks: (1) computing the cutting level α; (2) generating the set of frequent 2-itemsets F2; and (3) transforming the database into vertical tidlists T. For each scanned transaction t , the number of frequent items is accumulated for later computation of α. Also, for each frequent item, we add this tid into its tidlist if the cardinality of t is no less than 3. In essence, only those transactions with cardinality larger than α have to be 4. 5. 6. 7. 8. for i = 1 to |D| do scan the i-th transaction ti; delete the items in ti that appear less than two itemsets in F2; for each subset X of ti and 3 ≤ |X| ≤ α do X.count++; end for DF={X | sup(X) ≥ minsup}; return DF; Figure 4. Procedure Dwnsearch 1. 2. 3. 4. 5. 6. 7. 8. 9. 10. 11. 12. 13. 14. if Fα = ∅ return UF= ∅; read the set of tidlists T and prune non-candidate tidlists; k = α, Fk = Fα; repeat k++; Ck = the set of new candidate k-itemsets generated from Fk-1; for each X ⊆ Ck do perform bit-vector intersection on X; compute the support of X, sup(X); endfor Fk = {X| sup(X) ≥ minsup, X ∈ Ck}; UF = UF ∪ Fk; until Fk = ∅ return UF; Figure 5. Procedure Upsearch kept because the upward search starting from level α+1. But recall that in this stage we still have no idea of α. Cardinality 3 represents the best we can achieve currently to facilitate tidlist pruning. To reduce the memory requirement, all of the tidlists generated in 0-7695-2056-1/04 $17.00 (C) 2004 IEEE 4 Proceedings of the 37th Hawaii International Conference on System Sciences - 2004 this phase are stored in disk for later use in the course of upward search. For example, consider Table 2 again. We have C2 = {{A, B}, {A, C}, {A, D}, {B, C}, {B, D}, {C, D}}. Since item E does not appear in F1, there is no need to create the tidlist of E. Furthermore, tids of t2, t3 and t6 are not included in the tidlist of any frequent item because their cardinalities are less than 3. The resulting tidlists after this stage is shown in Figure 6. item A B Cʳ D tidlist 4, 5, 7 1, 4, 5, 7, 8, 9, 10ʳ 1, 4, 8, 9, 10 1, 4, 5, 7, 8, 9, 10 Figure 6. The resulting tidlists generated by procedure Trans Next, procedure Dwnsearch is executed. Each transaction is scanned, and, according to Proposition 1, those items that appear less than two frequent itemsets of F2 are pruned. The trimmed transaction then undergoes set enumeration to generate all candidate itemsets with cardinalities between α and 3, and count their supports. Proposition 1. If an item x appears in less than k−1 frequent itemsets of Fk−1, then x is not contained in any frequent itemset of Fk. Rationale. Let I be a k-itemset of Fk, I = {a1, a2, …, ai = x, aI+1, …, ak}. Note that I has exactly k (k−1)itemsets, of which there are k−1 itemsets containing item x, except itemset {a1, a2, …, ai−1, ai+1…, ak}. According to downward closure, if I is frequent then all of its subsets should be frequent as well. The proposition then follows. For example, for the first transaction t1, item E is pruned first because it is not frequent. Transaction t2 and t3 are discarded because their cardinalities are less than 3. In this way, we can get the support counts of all itemsets. Finally, we discard the itemsets with supports less than minsup. The resulting 3-itemsets is {{B, C, D}}. After generating all frequent itemsets with cardinalities no more than α, the CBW algorithm performs upward searching for the other frequent itemsets. Following the Apriori paradigm, the Upsearch procedure generates the frequent itemsets level by level in a bottom-up fashion starting from the frequent itemsets at level α. It first checks if Fα is empty. Otherwise, it reads the tidlists T and prunes the tidlists for items that appear in less than α frequent itemsets of Fα. Furthermore, to accelerate the intersections of tidlists for counting the itemsets, we adopt the techniques of fast intersection [14] and caching intermediate result [8]. 4. Empirical evaluations To evaluate the performance of CBW, we tested several data sets, including Foodmart2000 provided in Microsoft SQL2000, and several synthetic data T6.I4.D100K, T15.I4.D200K, T15.I6.D200K and T15.I8.D200K, which were generated using the IBM data generator [2]. The experiments were performed on a HP LH6000R workstation with 1GB RAM and 18GB HD running Windows 2000 Server. For comparison, we has implemented two leading Apriori variants: Apriori and Partition, the former is based on horizontal counting strategy using hash tree structure, while the latter relies on vertical tidlists intersection. But note that for pair comparison with other methods, our implementation of Partition did not "partition" the database into several chunks. That is, our Partition is an in-core method. We also included a publicly available implementation of FP-growth provided by Database Research Group at Chinese University of Hong Kong [5] in all evaluations. 4.1. Foodmart2000 database We first compared the execution times of the three approaches on Foodmart2000, using different minimum support counts ranging from 3 to 15. Table 3 showed the characteristics of Foodmart2000. The results were shown in Figure 7. Table 3. Data parameters of Foodmart2000 |D| |t| N Parameter Number of transactions Average size of transactions Number of items Value 60000 4 200 We observed that 1. Because the support counts of most itemsets in Foodmart2000 are very small, the minsup must be set relatively small to generate interesting patterns. But a small minsup leads to a huge amount of candidate itemsets, causing too many redundant intersections of tidlists. As such, the computation of support count using tidlist intersection is more than that using hash tree. That is why Apriori outperforms Partition. 2. At high minimum supports, the speedup of CBW over Apriori is not significant, because the number of candidate itemsets at lower itemset 0-7695-2056-1/04 $17.00 (C) 2004 IEEE 5 Proceedings of the 37th Hawaii International Conference on System Sciences - 2004 levels generated by CBW is larger than that by Apriori. But at higher itemset levels, the counting cost of CBW is much less than that of Apriori. 35 30 Time (sec.) 3. FP-growth performs well for high support thresholds but degrades rapidly as the support threshold decreases. This is because though FPgrowth does not explicitly generate and store candidate itemsets, it needs to construct conditional pattern base and from which to construct FP-tree to generate all frequent itemsets. When minsup is very low, the overhead spent on building the conditional pattern base and conditional FP-tree, and recursively traversal of the tree to generate frequent patterns will overwhelm the cost saved in candidate itemset generation. 40 25 20 15 10 5 0 1 2 3 4 5 6 Cutting level α Figure 8. Execution times of CBW for various α’s, with minsup count = 12 700 CBW FP-growth Partition Apriori 600 4 400 3 Best α Time(sec.) 500 5 300 200 2 1 100 0 15 12 9 6 Mininmum support count 3 Figure 7. Execution times of Apriori, Partition, FPgrowth and CBW on Foodmart2000 We also conducted experiments to see the influence of different cutting levels. The results were depicted in Figure 8. We observed that the cutting level has great effects on the performance of CBW, and that there exists a best cutting level. The reason is that counting at a higher level would speed the execution of upward searching but increase the computation of downward searching. On the contrary, a lower cutting level would speed up the execution of downward searching but slow down the performance of upward searching. Figure 9 shows that the best cutting level is affected by the minimum support count; the larger the minimum support count, the smaller the best cutting level. 0 12 9 6 Minimum support count 3 Figure 9. Evolution of the best cutting level under different minsups 4.2 Synthetic database We next compared the four algorithms on the synthetic data sets under different minsups. The results were shown in Figures 10, 11 and 12. Our observations were as follows: 1. CBW outperformed all other methods in all cases. 2. While all algorithms suffered from the combinatorial exploration of itemsets due to low support constraints, our CBW exhibited the best in maintaining its performance. 3. The longer the itemsets become, the worse all four algorithms performed. The reason is that the cost for candidate generation, support counting, 0-7695-2056-1/04 $17.00 (C) 2004 IEEE 6 Proceedings of the 37th Hawaii International Conference on System Sciences - 2004 and conditional pattern and FP-tree construction grows as the itemset length increases. 2400 CBW FP-growth Partition Apriori 2000 2000 CBW FP-growth Partition Apriori Time(sec.) Time(sec.) 1600 1600 1200 1200 800 400 800 0 400 0.0 0 0.0 0.5 1.0 1.5 minsup (%) 2.0 2.5 0.5 1.0 1.5 minsup (%) 2.0 2.5 Figure 12. Execution times of Apriori, Partition, FP-growth and CBW on T15.I8.D200K Figure 10. Execution times of Apriori, Partition, FP-growth and CBW on T15.I4.D200K 120 2400 CBW FP-growth Partition Apriori 2000 Time(sec.) Time (sec.) 100 1600 80 60 40 20 1200 0 800 1 400 2 3 4 5 6 Cutting level α 0 0.0 0.5 1.0 1.5 minsup (%) 2.0 2.5 Figure 11. Execution times of Apriori, Partition, FP-growth and CBW on T15.I6.D200K We also evaluated the execution times of CBW under different cutting levels (minsup = 1.0%), and the influence of minsup to the cutting levels. We only showed the results for T6.I4.D100K in Figure 13 and 14; similar results were observed for the other datasets. The results conformed to those observed in Figures 8 and 9. Finally, we conducted an experiment to evaluate the scalability of the four algorithms. The results were shown in Figure 15, where we omitted Apriori because its performance was significantly inferior to the others. As the figure showed, our CBW exhibited the best in scalability while FP-growth exhibited the worst. Figure 13. Execution times of CBW on T6.I4.D100K for various α’s 5. Concluding remarks 5.1. Summary In this paper, we have described a new efficient algorithm for frequent itemsets mining. Unlike contemporary algorithms that either adopt a top-down or a bottom-up traversal throughout the itemset lattice to search for frequent itemsets, our algorithm employs a clever guess on the most promising itemset level (cutting-level) to generate all frequent itemsets located there. Then it performs a downward search, followed by an upward search to discover all other frequent itemsets. Empirical study showed that our algorithm is more than an order of magnitude faster than the Apriori variants. 0-7695-2056-1/04 $17.00 (C) 2004 IEEE 7 Proceedings of the 37th Hawaii International Conference on System Sciences - 2004 Best cutting level 5 4 3 2 1 0 1.00% 0.75% 0.50% 0.25% minsup Figure 14. Evolution of the best cutting level for CBW under different minsups, running on T6.I4.D100K Our CBW algorithm has been incorporated into an online multidimensional association rule mining system currently under development [10]. In the future, we will incorporate into CBW the taxonomy information and extend it to allow multiple minimum support specification [13]. 800 CBW FP-growth Partition 700 Time (sec.) 600 500 400 300 200 100 0 2 4 6 8 10 12 14 16 18 Number of transactions (x 10,000) 20 Figure 15. Scalability evaluation of CBW, FPgrowth and Partition running on T15.I8.D200K with minsup = 1.0% 5.2. Comparison with related work To our knowledge, [9] is the only work on combining the top-down and the bottom-up searches for association mining. But their approach and intention are quite different from ours. First, rather than starting from the middle of the search space and progressively searching towards both ends, their approach proceeds from both ends of the search lattice and progressively searches towards the middle. Second, their approach aims at discovering, instead of all frequent itemsets, the maximal frequent itemsets, i.e., itemsets having no supersets, which work is quite simple compared to the work for discovering all frequent itemsets. Furthermore, on applying their approach to the work of frequent itemsets mining, the "top-down pruning" technique on which their approach relies will become useless because the subsets of a frequent itemset found in top-down search still have to been counted to know their supports. In this way, the top-down search becomes unnecessary and their method will degenerate to the Apriori algorithm. On the contrary, we believe that our method can be adapted to the problem of mining maximal frequent itemsets [1][3][9]. Indeed, we are currently working on applying our CBW to this problem and hope to have result in the near future. References [1] R.C. Agarwal, C.C. Aggarwal, and V.V.V. Prasad, "Depth first generation of long patterns," in Proceedings of 6th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2000, pp. 108−118. [2] R. Agrawal and R. Srikant, "Fast Algorithms for Mining Association Rules," in Proceedings of the 20th VLDB Conference, 1994, pp. 487−499. [3] R.J. Bayardo Jr., "Efficiently Mining Long Patterns from Databases," in Proceedings of 1998 ACM SIGMOD International Conference on Management of Data. Seattle, Washington, USA, 1998, pp. 85−93. [4] S. Brin, R. Motwani, J.D. Ullman, and S. Tsur, "Dynamic Itemset Counting and Implication Rules for Market Baseket Data," SIGMOD Record, Vol. 26, 1997, pp. 255−264. [5] Database Research Group in the Department of Computer Science and Engineering at the Chinese University of Hong Kong, http://www.cse.cuhk.edu.hk/~kdd/program.html. [6] J. Han, J. Pei, and Y. Yin, "Mining Frequent Patterns Without Candidate Generation," in Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, Dallas, TX, USA, 2000, pp. 1−12. [7] J. Hipp, U. Guntzer, and G. Nakhaeizadeh, "Algorithms for Association Rule MiningA General Survey and Comparison," SIGKDD Explorations, Vol. 2, 2000, pp. 58−64. [8] J. Hipp, U. Guntzer, and G. Nakhaeizadeh, "Mining Association Rules: Deriving a Superior Algorithm by Analyzing Today’s Approaches," in Proceedings of 4th European Symposium on Principles of Data Mining and Knowledge Discovery (PKDD’00), 2000, pp. 159−168. [9] D. Lin and Z.M. Kedem, "Pincer-search: An Efficient Algorithm for Discovering the Maximum Frequent Set," 0-7695-2056-1/04 $17.00 (C) 2004 IEEE 8 Proceedings of the 37th Hawaii International Conference on System Sciences - 2004 IEEE Transactions on Knowledge and Data Engineering, Vol. 14, No. 3, 2002, pp. 553−566. [10] W.Y. Lin, J.H. Su and M.C. Tseng, "OMARS: The Framework of an Online Multi-dimensional Association Rules Mining System," in Proceedings of the 2nd International Conference on Electronic Business, Taipei, Taiwan, 2002, pp. 216−225. [11] J.S. Park, M.S. Chen, and P.S. Yu, "An Effective HashBased Algorithm for Mining Association Rules," in Proceedings of the 1995 ACM SIGMOD International Conference on Management of Data, San Jose, CA, USA, 1995, pp. 175−186. [12] A. Savasere, E. Omiecinski, and S. Navathe, "An Efficient Algorithm for Mining Association Rules in Large Databases," in Proceedings of the 24th VLDB Conference, 1995, pp. 432−444. [13] M.C. Tseng and W.Y. Lin, "Mining Generalized Association Rules with Multiple Minimum Supports," in Proceedings of International Conference on Data Warehousing and Knowledge Discovery, Munich, Germany, 2001, pp. 11-20. [14] M.J. Zaki, "Scalable Algorithms for Association Mining," IEEE Transactions on Knowledge and Data Engineering, Vol. 12, No. 2, 2000, pp. 372−390. 0-7695-2056-1/04 $17.00 (C) 2004 IEEE View publication stats 9

Log In

CBW: an efficient algorithm for frequent itemset mining