
A Hybrid Approach for Mining Frequent Itemsets

2013 IEEE International Conference on Systems, Man, and Cybernetics, 2013
Bay Vo
Ton Duc Thang University, Ho Chi Minh City, Viet Nam
bayvodinh@gmail.com

Frans Coenen
Department of Computer Science, University of Liverpool, UK
coenen@liverpool.ac.uk

Tuong Le
University of Food Industry, Ho Chi Minh City, Viet Nam
tuonglecung@gmail.com

Tzung-Pei Hong
Department of Computer Science and Information Engineering, National University of Kaohsiung, Taiwan
tphong@nuk.edu.tw

Abstract: Frequent itemset mining is a fundamental element of many data mining problems. The recently proposed PrePost algorithm mines frequent itemsets using the idea of N-lists and in most cases outperforms other current state-of-the-art algorithms. In this paper, we present an improved version of PrePost that uses a hash table to speed up the creation of the N-lists associated with 1-itemsets, together with an improved N-list intersection algorithm. Furthermore, two new theorems are proposed for determining the “subsume index” of frequent 1-itemsets based on the N-list concept. The experimental results show that the performance of the proposed algorithm improves on that of PrePost.

Keywords: frequent itemset, PPC-tree, N-list, data mining

I. INTRODUCTION

Frequent itemset mining was first introduced in 1993 [1] and plays an important role in the mining of association rules [1, 2, 7, 10]. A large number of algorithms now mine frequent itemsets effectively. They may be divided into three main groups:
(1) Methods that use a candidate generate-and-test strategy, of which Apriori [2] and BitTableFI [4] are exemplar algorithms.
(2) Methods that adopt a divide-and-conquer strategy and a compressed data structure, of which FP-Growth [6] and FP-Growth* [5] are exemplar algorithms.
(3) Methods that use a hybrid approach, of which Eclat [11], dEclat [12] and Index-BitTableFI [8] are examples.
Although many solutions have been proposed, the complexity of the frequent itemset mining problem remains a challenge, so more computationally efficient solutions are desirable. Recently, Deng et al. [3] introduced the PrePost algorithm for mining frequent itemsets based on the idea of PPC-trees (Pre-order Post-order Code trees), an FP-tree-like structure. PrePost operates as follows. First, a tree construction algorithm is used to build a PPC-tree. Then N-lists are generated, each associated with a 1-itemset contained in the tree. The N-list of a k-itemset is a list describing its features; it is a compact form of a transaction ID list (TID list). A divide-and-conquer strategy is then used for mining frequent itemsets. Unlike FP-tree-based approaches, this approach does not build additional trees on each iteration; it mines frequent itemsets directly using the N-list concept.

The efficiency of PrePost is achieved because: (i) N-lists are much more compact than previously proposed vertical structures, and (ii) the support of a candidate frequent itemset can be determined through N-list intersection operations which are O(m+n+k), where m and n are the cardinalities of the two N-lists and k is the cardinality of the resulting N-list. This process is more efficient than finding the intersection of TID lists because it avoids unnecessary comparisons. The experimental results in [3] show that PrePost is more efficient than FP-Growth [6], FP-Growth* [5] and dEclat [12].

978-1-4799-0652-9/13/$31.00 ©2013 IEEE.

In this paper, we propose a hybrid algorithm based on PrePost, which features the following improvements:
(1) Use of a hash table to speed up the process of creating the N-lists associated with frequent 1-itemsets.
(2) An improved N-list intersection procedure for determining the intersection between two N-lists.
Song et al. [8] proposed the concept of the “subsume index”.
Broadly, the subsume index of a frequent 1-itemset is the list of frequent 1-itemsets that co-occur with it. This idea and the N-list concept have also been incorporated into the proposed hybrid algorithm. The main contributions of this paper are: (i) an improved N-list intersection function, (ii) two new theorems associated with the generation of subsume indexes and (iii) the use of the two theorems proposed in [8] within the proposed algorithm to reduce the runtime and memory usage.

The rest of the paper is organized as follows. Section II presents the basic concepts. The proposed algorithm is presented in Section III and an example of its operation is given in Section IV. Section V shows the results of experiments. Finally, the paper is concluded in Section VI with a summary and some future research issues.

II. BASIC CONCEPTS

A. Frequent itemsets

We assume a dataset DB comprised of n transactions such that each transaction contains a number of items belonging to I, where I is the set of all items in DB. An example transaction dataset is presented in Table I (the meaning of the third column will become clear later in this paper); this dataset will be used for illustrative purposes throughout the remainder of this paper. The support of an itemset X, denoted by σ(X), where X ⊆ I, is the number of transactions in DB which contain all the items in X. An itemset X is a “frequent itemset” if σ(X) ≥ ⌈minSup × n⌉, where minSup is a given threshold. Note that a frequent itemset with k elements is called a frequent k-itemset and I1 is the set of frequent 1-itemsets sorted in frequency descending order.

TABLE I. AN EXAMPLE TRANSACTION DATASET
Transaction   Items         Ordered frequent items
1             a, b          a, b
2             a, b, c, d    c, a, b, d
3             a, c, e       c, a, e
4             a, b, c, e    c, a, b, e
5             c, d, e, f    c, d, e
6             c, d          c, d

B. PPC-tree

Deng et al. [3] presented the PPC-tree (an FP-tree-like structure) and the PPC-tree construction algorithm as follows:

Definition 1 (The PPC-tree). A PPC-tree, R, is a tree where each node Ni holds five values: Ni.name, Ni.frequency, Ni.childnodes, Ni.pre and Ni.post, which are respectively the frequent 1-itemset in I1 labelling the node, the frequency of the node, the set of child nodes of the node, the rank of the node in a pre-order traversal of the tree and the rank of the node in a post-order traversal of the tree. Note that the root of the tree, R_root, has R_root.name = “null” and R_root.frequency = 0.

procedure Construct_PPC_tree(DB, minSup)
  scan DB to find I1 and their frequency
  sort I1 in frequency descending order
  create H1, the hash table of I1
  create the root of a PPC-tree, R, and label it as ‘null’
  let threshold = ⌈minSup × n⌉
  for each transaction T ∈ DB do
    remove the items whose supports do not satisfy the threshold
    sort its 1-itemsets in frequency descending order
    Insert_Tree(T, R)
  traverse the PPC-tree to generate the pre and post values associated with each node
  return R, I1, H1 and threshold

procedure Insert_Tree(T, R)
  while (T is not null) do
    t ← the first item of T and T ← T \ t
    if R has a child N such that N.name = t then
      N.frequency++
    else
      create a new node N with N.name = t and N.frequency = 1, and add N to R.childnodes
    Insert_Tree(T, N)

Figure 1. The PPC-tree construction

The PPC-tree construction algorithm is presented in Figure 1. The example transaction dataset from Table I will be used with minSup = 30% to illustrate the operation of this algorithm. First, the algorithm removes all items whose frequency does not satisfy the minSup threshold and sorts the remaining items in descending order of frequency (see column three in Table I). The algorithm then inserts, in turn, the remaining items in each transaction into the PPC-tree, as shown in Figure 2 with respect to our example dataset.

[Figure 2: six panels, (a) to (f), showing the tree after each successive transaction is inserted.]
Figure 2. Illustration of the creation of a PPC-tree using the example transaction dataset with minSup = 30%

Finally, the algorithm traverses the full tree (Figure 2 (f)) to generate the required pre and post values associated with each node. The final PPC-tree is presented in Figure 3.

[Figure 3, rendered as text. Each node is shown as name:frequency followed by its (pre, post) pair; N1 and N2 mark the two highlighted nodes.]
null (0,10)
  c:5 (1,7)   N1
    d:2 (2,1)
      e:1 (3,0)
    a:3 (4,6)
      b:2 (5,4)   N2
        e:1 (6,2)
        d:1 (7,3)
      e:1 (8,5)
  a:1 (9,9)
    b:1 (10,8)
Figure 3. The final PPC-tree created from the example transaction dataset with minSup = 30%

C. N-list

Deng et al. [3] presented the definition of the N-list concept and three theorems associated with it. We summarize these as follows:

Definition 2 (The PP-code). The PP-code, Ci, of each node Ni in a PPC-tree is the tuple:

$$C_i = \langle N_i.\mathrm{pre},\ N_i.\mathrm{post},\ N_i.\mathrm{frequency} \rangle \tag{1}$$

Example 1. The highlighted nodes N1 and N2 in Figure 3 have the PP-codes C1 = ⟨1,7,5⟩ and C2 = ⟨5,4,2⟩ respectively.

Theorem 1 [3]. A PP-code Ci is an ancestor of another PP-code Cj if and only if Ci.pre ≤ Cj.pre and Ci.post ≥ Cj.post. Note that any PP-code is also considered to be its own ancestor.

Example 2. According to Example 1, we have C1 = ⟨1,7,5⟩ and C2 = ⟨5,4,2⟩. Based on Theorem 1, C1 is an ancestor of C2 because C1.pre = 1 < C2.pre = 5 and C1.post = 7 > C2.post = 4.

Definition 3 (The N-list of a frequent 1-itemset). The N-list associated with an item A, denoted by NL(A), is the set of PP-codes associated with the nodes in the PPC-tree whose name is equal to A. Thus:

$$NL(A) = \bigcup_{N_i \in R,\ N_i.\mathrm{name} = A} C_i \tag{2}$$

where Ci is the PP-code associated with Ni.

Example 3. Let A = {c} and B = {e}. According to the PPC-tree in Figure 3, NL(A) = {⟨1,7,5⟩} and NL(B) = {⟨3,0,1⟩, ⟨6,2,1⟩, ⟨8,5,1⟩}.

Theorem 2 [3]. Let A be a 1-itemset with the associated N-list NL(A). The support for A, σ(A), is calculated by:

$$\sigma(A) = \sum_{C_i \in NL(A)} C_i.\mathrm{frequency} \tag{3}$$

Example 4. According to Example 3 we have NL(A) = {⟨1,7,5⟩} and NL(B) = {⟨3,0,1⟩, ⟨6,2,1⟩, ⟨8,5,1⟩}. Therefore, σ(A) = 5 and σ(B) = 1 + 1 + 1 = 3.

Definition 4 (The N-list of a k-itemset). Let XA and XB be two (k-1)-itemsets with the same prefix X (X can be an empty set) such that A is before B according to the I1 ordering. NL(XA) and NL(XB) are the two N-lists associated with XA and XB respectively. The N-list associated with XAB is determined as follows:
(1) For each PP-code Ci ∈ NL(XA) and Cj ∈ NL(XB), if Ci is an ancestor of Cj, add ⟨Ci.pre, Ci.post, Cj.frequency⟩ to NL(XAB).
(2) Traverse NL(XAB) to combine the PP-codes which have the same pre and post values.

Example 5. According to Example 4 we have NL(A) = {⟨1,7,5⟩} and NL(B) = {⟨3,0,1⟩, ⟨6,2,1⟩, ⟨8,5,1⟩}. Therefore NL(AB) = {⟨1,7,1⟩, ⟨1,7,1⟩, ⟨1,7,1⟩} = {⟨1,7,3⟩}.

Theorem 3 (The support of a k-itemset) [3]. Let X be an itemset and NL(X) be the N-list associated with X. The support of X, denoted by σ(X), is calculated as follows:

$$\sigma(X) = \sum_{C_i \in NL(X)} C_i.\mathrm{frequency} \tag{4}$$

Example 6. According to Example 5 we have NL(AB) = {⟨1,7,3⟩}, therefore σ(AB) = 3.

D. The subsume index of frequent 1-itemsets

To reduce the search space, the concept of the subsume index was proposed in [8]. It is based on the following function:

$$g(X) = \{\, T.\mathrm{ID} \in DB \mid X \subseteq T \,\} \tag{5}$$

where T.ID is the ID of the transaction T, and g(X) is the set of IDs of the transactions which include all items i ∈ X.

Example 7. Let A = {c}. We have g(A) = {2, 3, 4, 5, 6} because A exists in transactions 2, 3, 4, 5 and 6.

Definition 5 [8]. The subsume index of a frequent 1-itemset, A, denoted by subsume(A), is defined as follows:

$$\mathrm{subsume}(A) = \{\, B \in I_1 \mid g(A) \subseteq g(B) \,\} \tag{6}$$

Example 8. Let A = {e} and B = {c}. We have g(A) = {3, 4, 5} and g(B) = {2, 3, 4, 5, 6}. Because g(A) ⊆ g(B), B ∈ subsume(A).
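The constructions of this section can be checked end to end on the Table I dataset. The following is a minimal Python sketch, not the authors' implementation (which was written in C#). It rests on three assumptions that the text does not state: transactions are inserted last-to-first (the order that the panels of Figure 2 appear to follow, and the one that reproduces the codes of Figure 3), ties in frequency are broken alphabetically, and an item is left out of its own subsume index. All function and variable names are illustrative.

```python
from collections import Counter
from math import ceil

# Table I: transaction ID -> items.
DB = {1: "ab", 2: "abcd", 3: "ace", 4: "abce", 5: "cdef", 6: "cd"}


class Node:
    def __init__(self, name):
        self.name = name
        self.frequency = 0
        self.children = []
        self.pre = self.post = -1


def frequent_items(db, min_sup):
    """Return (I1, threshold): frequent items, most frequent first."""
    counts = Counter(item for items in db.values() for item in items)
    threshold = ceil(min_sup * len(db))
    i1 = sorted((x for x in counts if counts[x] >= threshold),
                key=lambda x: (-counts[x], x))  # assumed tie-break: alphabetical
    return i1, threshold


def build_ppc_tree(db, i1):
    rank = {item: r for r, item in enumerate(i1)}
    root = Node("null")
    # Assumption: insert transactions last-to-first, as Figure 2 appears to do.
    for tid in sorted(db, reverse=True):
        node = root
        for item in sorted((x for x in db[tid] if x in rank), key=rank.get):
            child = next((c for c in node.children if c.name == item), None)
            if child is None:
                child = Node(item)
                node.children.append(child)
            child.frequency += 1
            node = child
    assign_codes(root, [0], [0])
    return root


def assign_codes(node, pre, post):
    """Set pre-order and post-order ranks; counters are one-element lists."""
    node.pre = pre[0]
    pre[0] += 1
    for child in node.children:
        assign_codes(child, pre, post)
    node.post = post[0]
    post[0] += 1


def n_lists(node, out=None):
    """Collect (pre, post, frequency) per item name, in pre-order."""
    out = {} if out is None else out
    for child in node.children:
        out.setdefault(child.name, []).append(
            (child.pre, child.post, child.frequency))
        n_lists(child, out)
    return out


def subsume_index(db, i1):
    """Definition 5 via TID sets g(X); A itself is left out of subsume(A)."""
    g = {x: {tid for tid, items in db.items() if x in items} for x in i1}
    return {a: [b for b in i1 if b != a and g[a] <= g[b]] for a in i1}


I1, THRESHOLD = frequent_items(DB, 0.3)
ROOT = build_ppc_tree(DB, I1)
NL = n_lists(ROOT)
# Theorem 2: the support of an item is the sum of its PP-code frequencies.
SUPPORT = {item: sum(f for _, _, f in codes) for item, codes in NL.items()}
SUBSUME = subsume_index(DB, I1)
print(I1, THRESHOLD)
print(NL)
print(SUBSUME)
```

Under these assumptions the sketch yields the ordering c, a, b, d, e with a threshold of 2, and three PP-codes for e, whose frequencies sum to its support of 3.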
In the notation of Definition 5, Example 8 shows that {c} ∈ subsume({e}).

In [8] the following two theorems concerning the subsume index idea were also presented, which in turn can be used to speed up the frequent itemset mining process.

Theorem 4 [8]. Let A be a frequent 1-itemset. If the support associated with A is equal to ⌈minSup × n⌉, then there exists no item B with σ(B) ≥ σ(A) and B ∉ subsume(A) such that AB is a frequent itemset.

Theorem 5 [8]. Let the subsume index of an item A be {a1, a2, …, am}. The support of each itemset formed by combining A with one of the 2^m − 1 nonempty subsets of {a1, a2, …, am} is equal to the support of A.

Example 9. Let A = {e} and B = {c}. According to Example 8, we have subsume(A) = {B}. Here m = 1, so the only nonempty subset of subsume(A) is {B}. By Theorem 5, each of the 2^m − 1 itemsets obtained by combining a nonempty subset of subsume(A) with A has support σ(A). In this case, we have σ(AB) = σ(A) = 3. In addition, the support of a frequent itemset XA is equal to the support of the frequent itemset XAB. For instance, ae is a frequent itemset with σ(ae) = 2, so aec is also a frequent itemset and σ(aec) = 2.

III. THE PROPOSED ALGORITHM

A. The N-list intersection function

Deng et al. [3] proposed an N-list intersection function for determining the intersection of two N-lists which was O(n+m+k), where n, m and k are the lengths of the first, the second and the resulting N-lists (the function traverses the resulting N-list so as to merge identical PP-codes). In this section we present an improved N-list intersection function which is O(n+m). The improved function has the advantage that it does not traverse the resulting N-list to merge identical PP-codes.
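To make the contrast concrete, the two-pass scheme (collect the matching PP-codes, then traverse the result to merge codes that share the same pre and post values) might be sketched as below. This is an illustration in the spirit of Definition 4 and not the code of Deng et al.; the function name `intersect_two_pass` is hypothetical and the N-lists are those of the example dataset with minSup = 30%.

```python
def intersect_two_pass(ps1, ps2):
    """Return (raw, merged) for two N-lists of (pre, post, frequency) tuples.

    Pass 1 collects one code per ancestor/descendant match.
    Pass 2 walks the k collected codes and merges adjacent codes that
    carry the same ancestor, which is the extra traversal behind the k term.
    """
    raw = []
    i = j = 0
    while i < len(ps1) and j < len(ps2):
        pre1, post1, _ = ps1[i]
        pre2, post2, freq2 = ps2[j]
        if pre1 < pre2:
            if post1 > post2:          # ps1[i] is an ancestor of ps2[j]
                raw.append((pre1, post1, freq2))
                j += 1
            else:
                i += 1
        else:
            j += 1

    merged = []
    for pre, post, freq in raw:        # second traversal, over the result
        if merged and merged[-1][0] == pre:
            merged[-1] = (pre, post, merged[-1][2] + freq)
        else:
            merged.append((pre, post, freq))
    return raw, merged


NL_C = [(1, 7, 5)]
NL_A = [(4, 6, 3), (9, 9, 1)]
NL_E = [(3, 0, 1), (6, 2, 1), (8, 5, 1)]

raw_ce, merged_ce = intersect_two_pass(NL_C, NL_E)
raw_ae, merged_ae = intersect_two_pass(NL_A, NL_E)
print(raw_ce, merged_ce)
print(raw_ae, merged_ae)
```

For c and e the first pass produces three codes with the same ancestor, which the second pass collapses into a single code of frequency 3.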
Furthermore, we also propose an early abandoning strategy comprised of three steps: (i) determine the total frequency of the first and the second N-lists, denoted by sF, (ii) for each PP-code Ci that does not belong to the resulting N-list, update sF = sF - Ci.frequency, and (iii) if sF falls below ⌈minSup × n⌉, stop (the itemset currently being considered is not frequent). Given the above, the improved N-list intersection function is presented in Figure 4.

function NL_intersection(PS1, PS2)
  PS3 ← ∅
  let sF be the sum of the frequencies of PS1 and PS2
  let i = 0, j = 0 and frequency = 0
  while i < PS1.size and j < PS2.size do
    if PS1[i].pre < PS2[j].pre then
      if PS1[i].post > PS2[j].post then
        if PS3.size > 0 and PS3[PS3.size-1].pre = PS1[i].pre then
          PS3[PS3.size-1].frequency += PS2[j].frequency
        else
          add the tuple ⟨PS1[i].pre, PS1[i].post, PS2[j].frequency⟩ to PS3
        frequency += PS2[j++].frequency
      else
        sF = sF - PS1[i++].frequency
    else
      sF = sF - PS2[j++].frequency
    if sF < threshold then   // using early abandoning strategy
      return null   // stop the procedure
  return PS3 and frequency

Figure 4. The improved N-list intersection function

B. The subsume index associated with each frequent 1-itemset

Theorem 6. Let A be a frequent 1-itemset. We have:

$$\mathrm{subsume}(A) = \{\, B \in I_1 \mid \forall C_i \in NL(A),\ \exists C_j \in NL(B) \text{ such that } C_j \text{ is an ancestor of } C_i \,\} \tag{7}$$

Proof. If every PP-code in NL(A) has an ancestor PP-code in NL(B), then all transactions that contain A also contain B. Thus g(A) ⊆ g(B), which implies that B ∈ subsume(A). Therefore, the theorem is proven.

Example. Let A = {e} and B = {c}. We have NL(B) = {⟨1,7,5⟩} and NL(A) = {⟨3,0,1⟩, ⟨6,2,1⟩, ⟨8,5,1⟩}. The codes ⟨3,0,1⟩, ⟨6,2,1⟩ and ⟨8,5,1⟩ ∈ NL(A) are all descendants of ⟨1,7,5⟩ ∈ NL(B). Therefore, according to Theorem 6, B ∈ subsume(A).

Theorem 7. Let A, B, C ∈ I1 be three frequent 1-itemsets. If A ∈ subsume(B) and B ∈ subsume(C) then A ∈ subsume(C).

Proof. We have A ∈ subsume(B) and B ∈ subsume(C), therefore g(B) ⊆ g(A) and g(C) ⊆ g(B). So g(C) ⊆ g(A) and thus the theorem is proven.

To find all frequent 1-itemsets associated with the subsume index of each A ∈ I1, I1 should be sorted in ascending order of frequency. However, I1 has already been sorted in descending order of frequency for the PPC-tree constructed previously. Therefore, in the procedure that generates the subsume index, we propose a different traversal (see Figure 5) to avoid the cost of this reordering and also to facilitate the use of Theorem 7.

procedure Find_Subsume(I1)
  for i ← 1 to I1.size - 1 do
    for j ← i - 1 to 0 do
      if j ∈ I1[i].Subsumes then continue
      if checkSubsume(I1[i].N-list, I1[j].N-list) = true then   // using Theorem 6
        add I1[j].name and its index, j, to I1[i].Subsumes
        add all elements in I1[j].Subsumes to I1[i].Subsumes   // using Theorem 7

function checkSubsume(N-list a, N-list b)
  let i = 0 and j = 0
  while j < a.size and i < b.size do
    if b[i].pre < a[j].pre and b[i].post > a[j].post then
      j++
    else
      i++
  if j = a.size then
    return true
  return false

Figure 5. The procedure for generating the subsume index

C. Algorithm

The two theorems proposed in [8] and re-presented in Section II.D were also adopted in the proposed algorithm to reduce the runtime (Figure 6). These theorems also help reduce the memory usage because it is not necessary to determine and store the N-lists associated with a number of frequent itemsets in order to determine their supports. Using the subsume index concept therefore reduces not only the runtime but also the memory usage.

Input: A dataset DB and minSup
Output: FIs, the set of all frequent itemsets
  Construct_PPC_tree(DB, minSup) to generate R, I1, H1 and threshold
  Generate_NList(R, I1)
  Find_Subsume(I1)
  FIs ← ∅
  Subsumes ← {}
  Find_FIs(I1, Subsumes)
  return FIs

procedure Generate_NList(R, I1)
  C ← ⟨R.pre, R.post, R.frequency⟩
  H1[R.name].N-list.add(C)
  H1[R.name].frequency += C.frequency
  for each child in R.children
    Generate_NList(child, I1)

procedure Find_FIs(Is, S)
  for i ← Is.size - 1 to 0 do
    Is_next ← ∅
    if Is[i].Subsumes.size > 0 then
      let S be the set of subsets generated from all elements of Is[i].Subsumes
      for each subset in S
        add ⟨subset, Is[i].frequency⟩ to FIs   // using Theorem 5
    else if Is[i].size = 1 then
      S ← {}
    if Is[i].size = 1 and Is[i].frequency = threshold then   // using Theorem 4
      continue
    indexS = Is[i].Subsumes.size - 1
    for j ← i - 1 to 0 do
      if indexS >= 0 and the index of Is[i].Subsumes[indexS] equals j then
        indexS = indexS - 1
        continue
      let efirst be the first item of Is[j]
      FI ← {efirst} + Is[i]
      (FI.N-list and frequency) ← NL_intersection(Is[j].N-list, Is[i].N-list)
      if FI.N-list = null then continue   // using early abandoning strategy
      FI.frequency = frequency
      if (FI.frequency ≥ threshold) then
        add FI to FIs
        insert FI at position 0 in Is_next
        for each subsume in S do
          let f = FI + subsume
          f.frequency = FI.frequency
          add f to FIs   // using Theorem 5
    Find_FIs(Is_next, S)

Figure 6. The proposed algorithm

IV. THE ILLUSTRATION

An illustrative example is presented in this section using our example dataset. First, the proposed algorithm scans the dataset to create the PPC-tree (Figure 3). Then, the algorithm traverses the PPC-tree to generate the N-lists associated with the frequent 1-itemsets in I1 (Figure 7).

[Figure 7, rendered as text: each frequent 1-itemset with its support and N-list.]
c:5   {⟨1,7,5⟩}
a:4   {⟨4,6,3⟩, ⟨9,9,1⟩}
b:3   {⟨5,4,2⟩, ⟨10,8,1⟩}
d:3   {⟨2,1,2⟩, ⟨7,3,1⟩}
e:3   {⟨3,0,1⟩, ⟨6,2,1⟩, ⟨8,5,1⟩}
Figure 7. I1 and its N-lists on the example dataset (minSup=30%)

Next, the algorithm combines, in turn, the frequent (k-1)-itemsets in I1 in reverse order using a divide-and-conquer strategy to create the k-itemset candidates.
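The intersection function of Figure 4 and the check of Figure 5 can be tried directly on the N-lists of Figure 7. The following Python transcription is a sketch rather than the authors' C# code: the names `nl_intersection` and `check_subsume` are illustrative, the threshold is passed explicitly, and an abandoned intersection is signalled by `None`. The early-abandoning test is placed at the end of each loop iteration, which is an assumption about the indentation of Figure 4.

```python
def nl_intersection(ps1, ps2, threshold):
    """One-pass intersection of two N-lists of (pre, post, frequency) tuples.

    Returns (n_list, support), or None once the bound sF drops below threshold.
    """
    ps3, support = [], 0
    # sF starts at the total frequency of both lists and only decreases.
    s_f = sum(c[2] for c in ps1) + sum(c[2] for c in ps2)
    i = j = 0
    while i < len(ps1) and j < len(ps2):
        pre1, post1, freq1 = ps1[i]
        pre2, post2, freq2 = ps2[j]
        if pre1 < pre2:
            if post1 > post2:                       # ps1[i] is an ancestor
                if ps3 and ps3[-1][0] == pre1:      # merge while appending
                    ps3[-1] = (pre1, post1, ps3[-1][2] + freq2)
                else:
                    ps3.append((pre1, post1, freq2))
                support += freq2
                j += 1
            else:
                s_f -= freq1
                i += 1
        else:
            s_f -= freq2
            j += 1
        if s_f < threshold:                         # early abandoning
            return None
    return ps3, support


def check_subsume(a, b):
    """True if every PP-code in a has an ancestor in b (Theorem 6)."""
    i = j = 0
    while j < len(a) and i < len(b):
        if b[i][0] < a[j][0] and b[i][1] > a[j][1]:
            j += 1
        else:
            i += 1
    return j == len(a)


# N-lists of Figure 7.
NL = {
    "c": [(1, 7, 5)],
    "a": [(4, 6, 3), (9, 9, 1)],
    "b": [(5, 4, 2), (10, 8, 1)],
    "d": [(2, 1, 2), (7, 3, 1)],
    "e": [(3, 0, 1), (6, 2, 1), (8, 5, 1)],
}
THRESHOLD = 2  # ceil(0.3 * 6)

print(nl_intersection(NL["a"], NL["e"], THRESHOLD))
print(nl_intersection(NL["d"], NL["e"], THRESHOLD))
print(check_subsume(NL["e"], NL["c"]), check_subsume(NL["e"], NL["a"]))
```

With the threshold of 2, intersecting a with e gives a single code with support 2, so ae is frequent, whereas d with e gives support 1. Because sF starts from the combined frequency of both lists it is a loose upper bound, so the function can return a below-threshold support without abandoning; the caller still compares the returned support against the threshold, as Find_FIs does.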
In detail, e, the last frequent 1-itemset in I1, is used to:
(i) find the 2^m − 1 nonempty subsets of the m frequent 1-itemsets in subsume({e}) and combine them with {e} to generate the 2^m − 1 frequent itemsets S. In this case, subsume({e}) = {c}, so S = {ec};
(ii) combine, in turn, with the remaining frequent 1-itemsets {d, b, a} (c is skipped because c ∈ subsume({e})) to create the candidate 2-itemsets {de, be, ae}. However, only {ae} is frequent, thus Is_next = {ae}. Next, the algorithm combines the elements in Is_next with the elements in S to create further frequent itemsets without calculating their support. In this case, only {aec} is created; and
(iii) combine the elements in Is_next with each other to create further candidate itemsets. In this case, the algorithm stops here because Is_next has only one element (see Figure 8).

[Figure 8, rendered as text: e:3 {⟨3,0,1⟩, ⟨6,2,1⟩, ⟨8,5,1⟩}; ae:2 {⟨4,6,2⟩}; ec:3; aec:2]
Figure 8. The frequent itemsets generated from e on the example dataset (minSup=30%)

Then, using the above strategy, the other frequent 1-itemsets are processed in turn to create the tree which contains all frequent itemsets, as shown in Figure 9. In Figure 9, the proposed algorithm does not compute and store the N-lists of the nodes {ba, cba, dc, ae, ec, aec}.

[Figure 9, rendered as text: the frequent itemsets with their supports are c:5, a:4, b:3, d:3, e:3, ca:3 {⟨1,7,3⟩}, cb:2 {⟨1,7,2⟩}, cba:2, ba:3, dc:3, ae:2 {⟨4,6,2⟩}, ec:3 and aec:2.]
Figure 9. All frequent itemsets on the example dataset (minSup=30%)

V. EXPERIMENTAL RESULTS

All experiments presented in this section were performed on an ASUS laptop with an Intel Core i3-3110M 2.4GHz processor and 4 GB of RAM. The operating system was Microsoft Windows 8. All the programs were coded in C# in Microsoft Visual Studio 2012 and run on Microsoft .NET Framework Version 4.5.50709. The experiments were conducted using the following UCL datasets: Accidents, Chess, Mushroom, Pumsb_star and Retail (1). Some statistics concerning these datasets are shown in Table II. We report the runtime (total execution time) of the proposed algorithm and compare it to the runtime of PrePost.

(1) Downloaded from http://fimi.cs.helsinki.fi/data/

TABLE II. STATISTICAL SUMMARY OF THE EXPERIMENTAL DATASETS
Dataset       #Trans     #Items
Accidents     340,183    468
Chess         3,196      76
Mushroom      8,124      120
Pumsb_star    49,046     7,117
Retail        88,162     16,470

The experimental results are presented in Figure 10. From the figure it can be observed that, given a sparse dataset such as Retail, the proposed algorithm is a little slower than PrePost. This is explained as follows. Generating the subsume index involves a cost. However, the subsume index associated with each of the frequent 1-itemsets in a sparse dataset usually has few elements. Therefore, using the subsume index concept is not effective in this case. Fortunately, this cost is usually relatively low, about 4 seconds for the Retail dataset with minSup = 0.1 (0.072% of the runtime) (see Figure 10(e)). However, given a dense dataset, the performance of the proposed algorithm is better than that of PrePost (see Figure 10(a), (b), (c) and (d)), especially with low thresholds. The proposed algorithm thus generally outperforms PrePost.

[Figure 10: five panels, (a) to (e), each plotting Runtime (seconds) against minSup (%) for PrePost and the proposed algorithm.]
Figure 10. The runtime of the proposed and PrePost algorithms using UCL datasets: (a) Accidents, (b) Chess, (c) Mushroom, (d) Pumsb_star and (e) Retail datasets

VI. CONCLUSIONS AND FUTURE WORK

In this paper we have proposed a hybrid algorithm for mining frequent itemsets. First, we proposed several improvements on the previously published PrePost: (i) use of a hash table to speed up the process of creating the N-lists associated with the frequent 1-itemsets and (ii) an improved intersection function to find the intersection between two N-lists. Then, two theorems were proposed for determining the subsume index of frequent 1-itemsets, which were used in the proposed algorithm to improve the runtime. The proposed algorithm does not improve over PrePost with respect to sparse datasets, but the time gap is not significant. With respect to dense datasets the proposed algorithm is faster than PrePost. We therefore conclude that the proposed algorithm generally outperforms PrePost.

For future work we will initially focus on applying the N-list concept and the hybrid approach to mining frequent closed/maximal itemsets.

REFERENCES

[1] Agrawal, R., Imielinski, T., Swami, A.N.: Mining association rules between sets of items in large databases. SIGMOD'93, 207-216, 1993.
[2] Agrawal, R., Srikant, R.: Fast algorithms for mining association rules. VLDB'94, 487-499, 1994.
[3] Deng, Z., Wang, Z., Jiang, J.J.: A new algorithm for fast mining frequent itemsets using N-lists. SCIENCE CHINA Information Sciences, 55(9), 2008-2030, 2012.
[4] Dong, J., Han, M.: BitTableFI: An efficient mining frequent itemsets algorithm. Knowledge-Based Systems, 20, 329-335, 2007.
[5] Grahne, G., Zhu, J.: Fast algorithms for frequent itemset mining using FP-trees. IEEE Transactions on Knowledge and Data Engineering, 17, 1347-1362, 2005.
[6] Han, J., Pei, J., Yin, Y.: Mining frequent patterns without candidate generation. SIGMOD'00, 1-12, 2000.
[7] Pasquier, N., Bastide, Y., Taouil, R., Lakhal, L.: Efficient Mining of Association Rules using Closed Itemset Lattices. Information Systems, 24(1), 25-46, 1999.
[8] Song, W., Yang, B., Xu, Z.: Index-BitTableFI: An improved algorithm for mining frequent itemsets. Knowledge-Based Systems, 21, 507-513, 2008.
[9] Vo, B., Hong, T.P., Le, B.: Dynamic bit vectors: An efficient approach for mining frequent itemsets. Scientific Research and Essays, 6(25), 5358-5368, 2011.
[10] Vo, B., Hong, T.P., Le, B.: A Lattice-based Approach for Mining Most Generalization Association Rules. Knowledge-Based Systems, 45, 20-30, 2013.
[11] Zaki, M.J., Parthasarathy, S., Ogihara, M., Li, W.: New algorithms for fast discovery of association rules. KDD'97, 283-286, 1997.
[12] Zaki, M.J., Hsiao, C.J.: Efficient algorithms for mining closed itemsets and their lattice structure. IEEE Transactions on Knowledge and Data Engineering, 17(4), 462-478, 2005.