Efficient Graph-Based Algorithms For Discovering and Maintaining Association Rules in Large Databases
Guanling Lee, K. L. Lee and Arbee L. P. Chen
Department of Computer Science, National Tsing Hua University
Hsinchu, Taiwan 300, R.O.C.
Email: alpchen@cs.nthu.edu.tw

Abstract

In this paper, we study the issues of mining and maintaining association rules in a large database of customer transactions. The problem of mining association rules can be mapped into the problem of finding large itemsets, which are sets of items bought together in a sufficient number of transactions. We revise a graph-based algorithm to further speed up the process of itemset generation. In addition, we extend the revised algorithm to maintain discovered association rules when incremental or decremental updates are made to the databases. Experimental results show the efficiency of our algorithms. The revised algorithm significantly improves over the original one on mining association rules. The algorithms for maintaining association rules are more efficient than re-running the mining algorithms on the whole updated database, and they outperform previously proposed algorithms that need multiple passes over the database.

Keywords: association rule, rule maintenance, graph-based approach, bit vector.
1. Introduction

Data mining, or knowledge discovery in databases (KDD), has been considered a promising new area in database research [2, 26]. When the amount of data is large, it is important to extract potentially useful knowledge embedded in it; extracting such previously unknown knowledge from large databases is the task of data mining. Various types of knowledge can be mined from large databases, such as characteristic and classification rules [1, 6, 8, 12, 13, 16, 21, 32, 33], association rules [3, 4, 15, 17, 19, 20, 22, 23, 29, 34], and sequential patterns [5, 27, 34].

Data mining has been widely applied in the retail industry to improve marketing strategies. Each customer transaction stored in the database typically consists of a customer identifier, a transaction time, and the set of items bought in the transaction. It is important to analyze customer transactions to discover customer purchasing behaviors. The problem of mining association rules over customer transactions was introduced in [3]. An association rule describes an association among items such as "80% of customers who purchase items X and Y also buy item Z in the same transaction", where X, Y, and Z are initially unknown. There have been many works exploring this problem and its variations, such as mining quantitative association rules [26], generalized association rules [20, 25], and multilevel association rules [14].

When new transactions are added to a database or old ones are removed, rules or patterns previously discovered should be updated. The problem of maintaining discovered association rules was first studied in [9], which proposed the FUP algorithm to discover new association rules when incremental updates are made to the database. Other algorithms proposed in [10, 11] improve FUP by generating and counting fewer candidates.

The graph-based algorithm DLG proposed in [34] efficiently solves the problem of mining association rules. In [34], DLG is shown to outperform other algorithms which need to make multiple passes over the database. In this paper, we first propose the revised algorithm DLG* to achieve higher performance. Then we develop the algorithm DUP (DLG* for database UPdates), based on the framework of DLG*, to handle the problem of maintaining discovered association rules in the cases of insertion and deletion of transaction data.

The remainder of this paper is organized as follows. Section 2 gives detailed descriptions of the above two problems. The algorithm DLG* is described in Section 3, and the algorithm DUP for maintaining association rules in Section 4. Experimental results are discussed in Section 5, and Section 6 concludes our study.

2. Problem Descriptions

2.1 Mining association rules
The following definitions refer to [4]. Let I = {i1, i2, ..., im} be a set of literals, called items. A set of items is called an itemset, and the number of items in an itemset is called its length; itemsets of length k are referred to as k-itemsets. Let D be a database of transactions. A transaction T contains itemset X if and only if X ⊆ T. An association rule is an implication of the form X ⇒ Y, where X ⊂ I, Y ⊂ I, and X ∩ Y = ∅. The support count of itemset X, supX, is the number of transactions in D containing X. The association rule X ⇒ Y has support s% if s% of the transactions in D contain X ∪ Y, i.e., supX∪Y / |D| = s%. The rule X ⇒ Y has confidence c% if c% of the transactions in D that contain X also contain Y, i.e., supX∪Y / supX = c%. The problem of mining association rules is to generate all rules whose support and confidence exceed the user-specified thresholds, minimum support and minimum confidence. As mentioned before, the problem can be divided into the following steps:

1. Find all itemsets whose supports are above the user-specified minimum support. Each such itemset is referred to as a large itemset. The set of all large itemsets in D is L, and Lk is the set of large k-itemsets.

2. Generate the association rules from the large itemsets with respect to the other threshold, minimum confidence.

The second step is relatively straightforward. However, the first step is not trivial if the total number of items |I| and the maximum number of items in each transaction |MT| are large. For example, if |I| = 1000 and |MT| = 10, there are 2^1000 possible itemsets, and identifying the large itemsets among the Σ_{i=1}^{10} C(1000, i) itemsets that can appear in a transaction is the main problem of mining association rules.

2.2 Update of association rules
The following definitions refer to [11]. Let L be the set of large itemsets in the original database D, and let s% be the minimum support. Assume the support count supX of each large itemset X is available. Let d+ (d-) be the set of added (deleted) transactions, and denote the support count of itemset X in d+ (d-) by sup+X (sup-X). With respect to the same minimum support s%, an itemset X is large in the updated database D' if and only if its support count in D', sup'X, is no less than |D'| × s%, i.e., sup'X = supX - sup-X + sup+X ≥ (|D| - |d-| + |d+|) × s%. Thus the problem of updating association rules is to find the set L' of new large itemsets in D'. Note that a large itemset in L may not appear in L'; on the other hand, an itemset not in L may become a large itemset in L'.
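For concreteness, this update test can be phrased directly in code. The following Python sketch (the function and variable names are ours, purely illustrative) decides whether an itemset is large after an update, given its stored support counts:

    def is_large_after_update(sup_x, sup_plus, sup_minus, size_d, size_plus, size_minus, s):
        # sup_x: support count of X in the original database D
        # sup_plus / sup_minus: support counts of X in d+ / d-
        # size_d, size_plus, size_minus: |D|, |d+|, |d-|; s: minimum support as a fraction
        new_sup = sup_x - sup_minus + sup_plus        # support count of X in D'
        new_size = size_d - size_minus + size_plus    # |D'| = |D| - |d-| + |d+|
        return new_sup >= new_size * s

    # e.g., an itemset with supX = 5 in a 7-transaction database, deleted twice and
    # never inserted (|d+| = |d-| = 2, s = 25%): 3 >= 7 * 0.25, so it stays large
    print(is_large_after_update(5, 0, 2, 7, 2, 2, 0.25))  # True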
2.3 Previous Works
The Apriori [4] and DHP [22] algorithms, of the generate-and-test type, are quite successful for mining association rules. The Apriori algorithm generates the candidate k-itemsets Ck in the k-th iteration by applying the apriori-gen function [4] on the set Lk-1, the set of large (k-1)-itemsets found in the previous iteration. After generating Ck, Apriori scans the database to count the support of each itemset in Ck and then determines Lk, which is used to generate Ck+1 in the next iteration. The DHP algorithm improves over Apriori by reducing the number of candidates and trimming the database progressively using hashing techniques.

DIC (dynamic itemset counting) is introduced in [7]. Unlike Apriori, where large itemsets and candidate itemsets are generated at the end of each scan, DIC takes these actions while scanning the database. Each place where such an action is taken is called a stop, and each scan contains a number of stops. In [28], an incremental pruning technique (DICIP) is used to improve the DIC algorithm: at each stop, the candidate itemsets generated at the previous stop that cannot be large are pruned. DICIP improves the performance of DIC. Moreover, these two algorithms require fewer database scans and never generate more candidate sets than Apriori does.

In [34], we proposed the graph-based algorithm DLG to efficiently solve the problem of mining association rules. DLG constructs an association graph to indicate the associations between large items, and then traverses the graph to generate large itemsets. The DLG algorithm is very efficient because it does not scan the database to count candidates; it scans the database only once. However, it can be improved further by reducing the number of candidates. We revise the DLG algorithm in Section 3 and show its improvement in Section 5.2.

The algorithm FUP [9] was first designed for maintaining association rules when an incremental update is made to the database. FUP makes use of the stored support counts of the original large itemsets to generate a much smaller number of candidates to be checked against the updated database. The algorithms FUP* [10] and FUP2 [11] are faster versions of FUP; FUP2 is able to maintain the discovered association rules in the cases of insertion, deletion, and modification of transactions. As with Apriori and DHP, all of the FUP family have to repeatedly scan the database to count candidates. In [30], we proposed another approach to the rule maintenance problem: information about the itemsets which are not large at the moment but may become large after updating the database is stored, and this information is then used to reduce the number of database scans. In this paper, we extend our previous work [18] to efficiently update the rules by scanning the database only once. We compare with FUP2 in the cases of insertion and deletion, and the experimental results reveal that our approach significantly outperforms FUP2.

3. The Revised Algorithm DLG*

In this section, we first briefly describe the algorithm DLG [34] for efficient large itemset generation. DLG is a three-phase algorithm. The large 1-itemset generation phase finds large items and records related information. The graph construction phase constructs an association graph between large items, and at the same time generates large 2-itemsets. The large itemset generation phase generates large k-itemsets (k > 2) based on this association graph. The DLG* algorithm reduces the execution time of the large itemset generation phase by recording additional information in the graph construction phase.

3.1 Large 1-itemset generation phase
The DLG algorithm scans the database to count the support of each item and builds a bit vector for each item. The length of a bit vector is the number of transactions in the database. The bit vector associated with item i is denoted BVi. The jth bit of BVi is set to 1 if item i appears in the jth transaction; otherwise, it is set to 0. The number of 1s in BVi is equal to the support count of item i. For example, Table 1 records a database of transactions, where TID is the transaction identifier and Itemset records the items purchased in the transaction. Assume the minimum support count is 2 transactions (|D| × minimum support). The large items and the associated bit vectors are shown in Table 2.
Table 1. A database of transactions

  TID   Itemset
  1     2 3 4 5 6
  2     1 3 7
  3     3 6
  4     1 2 3 4
  5     1 4
  6     1 2 3
  7     2 4 5

Table 2. Large items and their bit vectors

  Large item   Bit vector
  1            0101110
  2            1001011
  3            1111010
  4            1001101
  5            1000001
  6            1010000
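As a sketch of this phase (not the authors' original implementation), the bit vectors of Table 2 can be built in a single database scan, with Python integers serving as bit vectors; bit j stands for the jth transaction (Table 2 prints bit 1 leftmost, so the integer form is simply the mirror image):

    from collections import defaultdict

    # the database of Table 1
    transactions = [{2, 3, 4, 5, 6}, {1, 3, 7}, {3, 6}, {1, 2, 3, 4},
                    {1, 4}, {1, 2, 3}, {2, 4, 5}]
    minsup = 2  # minimum support count

    bv = defaultdict(int)                  # item -> bit vector stored as an integer
    for j, t in enumerate(transactions):   # the single database scan
        for item in t:
            bv[item] |= 1 << j             # set the jth bit of BV_item

    # the number of 1s in BV_i is the support count of item i
    L1 = sorted(i for i, v in bv.items() if bin(v).count("1") >= minsup)
    print(L1)  # [1, 2, 3, 4, 5, 6]; item 7 appears only once and is dropped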
3.2 Graph construction phase

The support count of the itemset {i1, i2, ..., ik} is the number of 1s in BVi1 ∧ BVi2 ∧ ... ∧ BVik, where ∧ denotes the logical AND operation. Hence, the support count of {i1, i2, ..., ik} can be found directly by applying logical AND operations on the bit vectors of the k items instead of scanning the database. If the number of 1s in BVi ∧ BVj (i < j) is no less than the minimum support count, a directed edge from item i to item j is constructed in the association graph, and {i, j} is a large 2-itemset. For the database in Table 1, the association graph is shown in Figure 1, and L2 = {{1,2}, {1,3}, {1,4}, {2,3}, {2,4}, {2,5}, {3,4}, {3,6}, {4,5}}.
[Figure 1. The association graph for the database in Table 1]
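Continuing the sketch above, the graph construction phase needs only pairwise ANDs of the bit vectors; no further database scan is involved (edges and L2 as built here feed the next phase):

    from itertools import combinations

    edges = defaultdict(set)   # association graph: item -> heads of its outgoing edges
    L2 = []
    for i, j in combinations(L1, 2):                  # every two large items, i < j
        if bin(bv[i] & bv[j]).count("1") >= minsup:   # support count of {i, j}
            edges[i].add(j)                           # directed edge from i to j
            L2.append((i, j))
    print(L2)  # [(1,2), (1,3), (1,4), (2,3), (2,4), (2,5), (3,4), (3,6), (4,5)]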
3.3 Large itemset generation phase
For each large k-itemset {i1, i2, ..., ik} in Lk (k > 1), the last item ik is used to extend the itemset into (k+1)-itemsets. If there is a directed edge from item ik to item j, the itemset {i1, i2, ..., ik, j} is a candidate (k+1)-itemset. If the number of 1s in BVi1 ∧ BVi2 ∧ ... ∧ BVik ∧ BVj is no less than the minimum support count, {i1, i2, ..., ik, j} is a large (k+1)-itemset in Lk+1. If no large k-itemset is generated in the k-th iteration, the algorithm terminates.

Consider the above example. Candidate 3-itemsets are generated based on L2: {{1,2,3}, {1,2,4}, {1,2,5}, {1,3,4}, {1,3,6}, {1,4,5}, {2,3,4}, {2,3,6}, {2,4,5}, {3,4,5}}. After applying the logical AND operations on the bit vectors of the three items in each candidate, L3 = {{1,2,3}, {2,3,4}, {2,4,5}} is generated. The candidate 4-itemsets are {{1,2,3,4}, {1,2,3,6}, {2,3,4,5}}. After applying the logical AND operations on the bit vectors of the four items in each candidate, no large 4-itemset is generated, and the algorithm terminates.

3.4 Improvements over DLG
In the k-th (k > 2) iteration, DLG generates candidate k-itemsets by extending each large (k-1)-itemset according to the association graph. Suppose the average out-degree of a node in the association graph is q. The number of candidate itemsets is then |Lk-1| × q, and DLG must perform |Lk-1| × q × (k-1) logical AND operations on bit vectors to determine all large k-itemsets. The key idea of the DLG* algorithm is to reduce the number of candidate itemsets. The following property is used by DLG* to reduce the number of candidates.

Lemma 1. If a (k+1)-itemset X = {i1, ..., ik, ik+1} ∈ Lk+1 (k ≥ 2), then {ij, ik+1} ∈ L2 for 1 ≤ j ≤ k; that is, item ik+1 is contained in at least k large 2-itemsets.

Proof. Any transaction that contains X also contains every subset of X. If the support of X = {i1, ..., ik, ik+1} meets the minimum support, all k 2-itemsets {ij, ik+1} (1 ≤ j ≤ k) are also large. In addition, ik+1 may be contained in other large 2-itemsets which are not subsets of X. Therefore, item ik+1 is contained in at least k large 2-itemsets.

In the large itemset generation phase, DLG* extends each large k-itemset in Lk (k ≥ 2) into (k+1)-itemsets as the original DLG algorithm does. Suppose {i1, i2, ..., ik} is a large k-itemset and there is a directed edge from item ik to item i. From Lemma 1, if the (k+1)-itemset {i1, i2, ..., ik, i} is large, it must satisfy the following two conditions (otherwise, it cannot be large and is excluded from the set of candidate (k+1)-itemsets):

1. Every {ij, i} (1 ≤ j ≤ k) must be large. In other words, the in-degree of the node associated with item i must be at least k.

2. Since the directed edge from ik to item i already means that {ik, i} is a large 2-itemset, we only need to check that all {ij, i} (1 ≤ j ≤ k-1) are large.

These simple checks significantly reduce the number of candidate itemsets. In order to speed up the checks, we record some information during the graph construction phase. For the first condition, we count the in-degree of each large item. For the second condition, a bitmap with |L1| × |L1| bits is built to record the association graph: if there is a directed edge from item i to item j, the bit associated with (i, j) is set to 1; otherwise, it is set to 0. DLG* requires extra memory space quadratic in the number of large items, but speeds up the performance significantly.
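In code, the two conditions reduce to one in-degree lookup and at most k-1 bitmap probes before any bit-vector AND is attempted. A minimal sketch, with hypothetical names:

    # in_degree[i]: number of large 2-itemsets {j, i} with j < i (row "Total" of Table 3)
    # bitmap: set of pairs (i, j) such that there is a directed edge from i to j
    def is_candidate(itemset, i, in_degree, bitmap):
        """itemset = (i1, ..., ik) is in Lk and there is an edge from ik to i."""
        k = len(itemset)
        if in_degree.get(i, 0) < k:    # condition 1: i closes at least k large 2-itemsets
            return False
        # condition 2: {ij, i} must be large for 1 <= j <= k-1 ({ik, i} is the edge itself)
        return all((ij, i) in bitmap for ij in itemset[:-1])

For instance, is_candidate((1, 2), 5, ...) fails because (1, 5) is not in the bitmap, matching the pruning of {1,2,5} in the worked example that follows.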
Table 3. The related information recorded by DLG* for Figure 1

Bitmap (the bit for (i, j) is 1 iff there is a directed edge from i to j):

         2   3   4   5   6
  1      1   1   1   0   0
  2          1   1   1   0
  3              1   0   1
  4                  1   0
  5                      0

In-degree of each large item:

  Item    1   2   3   4   5   6
  Total   0   1   2   3   2   1
The DLG* algorithm is illustrated with the example in Table 1 in the following. The extra information recorded by DLG* is shown in Table 3. The large 2-itemset {1,2} can be extended into {1,2,3}, {1,2,4}, and {1,2,5}. Consider {1,2,3}: the in-degree of item 3 is 2 (≥ 2), and the bit associated with (1,3) in the bitmap is 1; therefore, {1,2,3} is a candidate. Consider {1,2,5}: the in-degree of item 5 is 2 (≥ 2), but the bit associated with (1,5) in the bitmap is 0; therefore, {1,2,5} is not a candidate. In the third iteration, DLG generates 10 candidates, but DLG* generates only 5 candidates ({1,2,3}, {1,2,4}, {1,3,4}, {2,3,4}, {2,4,5}). The large 3-itemset {1,2,3} can be extended into {1,2,3,4} and {1,2,3,6}. Consider {1,2,3,4}: the in-degree of item 4 is 3 (≥ 3), and the bits associated with (1,4) and (2,4) in the bitmap are both 1; therefore, {1,2,3,4} is a candidate. Consider {1,2,3,6}: the in-degree of item 6 is 1 (< 3); therefore, {1,2,3,6} is not a candidate. In the fourth iteration, DLG generates 3 candidates, but DLG* generates only 1 candidate, {1,2,3,4}. In this example, DLG* has reduced the number of candidate k-itemsets (k > 2) by ((13 - 6) / 13) × 100% ≈ 54%.

The DLG* algorithm, a revision of the DLG algorithm in [34], is shown as follows.

/* Large 1-itemset generation phase */
forall items i do set all bits of BVi to 0;
for (j = 1; j <= N; j++) do begin      /* N is the number of transactions */
  forall items i in the jth transaction do begin
    i.count++;
    set the jth bit of BVi to 1;
  end
end
L1 = ∅;
forall items i in database D do
  if i.count >= minsup then L1 = L1 ∪ {i};   /* minsup is the minimum support count */

/* Graph construction phase */
if L1 ≠ ∅ then begin
  initialize all |L1| × |L1| bits of Bitmap to 0;
  forall large 1-itemsets l ∈ L1 do total[l] = 0;   /* total[l] records the in-degree of l */
  L2 = ∅;
  for every two large items i, j (i < j) do
    if (the number of 1s in BVi ∧ BVj) >= minsup then begin
      CreateEdge(i, j);   /* create a directed edge from i to j */
      L2 = L2 ∪ {{i, j}};
      Bitmap[i][j] = 1;
      total[j]++;         /* only the in-degree of j grows; the edge leaves i */
    end
end

/* Large itemset generation phase */
k = 2;
while |Lk| >= k + 1 do begin
  Lk+1 = ∅;
  forall itemsets {i1, i2, ..., ik} ∈ Lk do
    forall items i such that there is a directed edge from ik to i do
      if total[i] >= k then
        if Bitmap[i1][i] = Bitmap[i2][i] = ... = Bitmap[ik-1][i] = 1 then
          if (the number of 1s in BVi1 ∧ BVi2 ∧ ... ∧ BVik ∧ BVi) >= minsup then
            Lk+1 = Lk+1 ∪ {{i1, i2, ..., ik, i}};
  k = k + 1;
end
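For concreteness, the large itemset generation phase translates into the following Python sketch, reusing bv, minsup, L2 and edges from the earlier sketches (itemsets are kept as sorted tuples); for simplicity it loops until no new itemset appears instead of testing |Lk| ≥ k+1:

    bitmap = set(L2)                 # (i, j) present iff there is an edge i -> j
    in_degree = defaultdict(int)
    for _, j in L2:
        in_degree[j] += 1            # in-degree of j in the association graph

    level = L2
    while level:
        next_level = []
        for itemset in level:
            for i in sorted(edges[itemset[-1]]):      # extend via edges of the last item
                if in_degree[i] < len(itemset):       # DLG* condition 1
                    continue
                if any((p, i) not in bitmap for p in itemset[:-1]):   # DLG* condition 2
                    continue
                v = bv[i]
                for p in itemset:                     # AND of the k+1 bit vectors
                    v &= bv[p]
                if bin(v).count("1") >= minsup:
                    next_level.append(itemset + (i,))
        if next_level:
            print(next_level)   # [(1, 2, 3), (2, 3, 4), (2, 4, 5)], then the loop ends
        level = next_level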
4. Efficient Update Algorithms Based on DLG*

In this section, we introduce the update algorithm for transaction insertion and deletion. The algorithm DUP is based on the framework of DLG* and can be split into three phases. As in [9], we assume the support counts of all large itemsets found in the previous mining operation are available. If a candidate itemset X was large in the original database D, we can directly obtain its support count supX. Otherwise, we must apply logical AND operations on the bit vectors associated with D to find supX. However, the following properties reduce the cost of performing logical AND operations. As defined in Section 2, sup+X is the support count of itemset X in the set of inserted transactions d+, and sup-X is its support count in the set of deleted transactions d-. The following lemma is similar to Lemma 4 in [11].

Lemma 2. If an itemset X is not large in the original database, then X is large in the updated database only if sup+X - sup-X > (|d+| - |d-|) × s%.

Proof. Since X is not large in the original database D, supX < |D| × s%. When sup+X - sup-X ≤ (|d+| - |d-|) × s%, sup'X = supX + sup+X - sup-X < (|D| + |d+| - |d-|) × s%, so X cannot be large in the updated database.

For each k-itemset X not in Lk, we first apply logical AND operations on the bit vectors associated with the changed part of the database (d+ and d-). For an itemset X satisfying sup+X - sup-X ≤ (|d+| - |d-|) × s%, we can determine that it will not be in L' without applying logical AND operations on the bit vectors associated with the unchanged part of the database (D - d-). In the following, we describe the DUP algorithm in detail.

4.1 Large 1-itemset generation phase
The DUP algorithm scans the sets of inserted and deleted transactions d+ and d-. For these two sets, bit vectors BV+i and BV-i are built for each item i. In order to determine more efficiently which items are large in the updated database D', DUP requires the support count of each item in the original database D to be stored during the previous mining operation. Hence, we can directly calculate the new support count sup'{i} = sup{i} + sup+{i} - sup-{i}. If an item i is large in D', DUP scans the remaining database D - d- (the original database minus the deleted transactions) and builds a bit vector BV(D-d-)i. After this phase, the bit vectors BV(D-d-)i, BV+i and BV-i are available for each large item i. Storing sup{i} for each item i requires extra storage space linear in |I|, but reduces the cost of building bit vectors and counting supports for items that are not large in the updated database. For each large item i, we also allocate three bits i.σ, i.σ+ and i.σ-. The bit i.σ (i.σ+, i.σ-) is set to 1 if item i is large in D (d+, d-); otherwise, the bit is set to 0. The usage of these three bits is explained in the following two phases.
For example, consider the original, inserted and deleted transactions in Table 4: TIDs 100 and 200 are deleted from the original database, and TIDs 800 and 900 are inserted. Assume the minimum support is 25%; that is, an itemset is in L' if its support count is no less than 7 × 0.25 = 1.75. The bit vectors and the three associated bits (σ, σ+ and σ-) of each large item are shown in Table 5. The large items in the updated database are items 1, 2, 3, 4, and 5. Since items 1, 2, 3 and 4 are large in L1, their σ bits are set to 1; item 5 is not large in the original database, so its σ bit is set to 0. For the inserted (deleted) transactions, an item is large in d+ (d-) if its support count is no less than |d+| × 0.25 = 0.5 (|d-| × 0.25 = 0.5). Since sup+{2}, sup+{4} and sup+{5} are all greater than 0.5 while sup+{1} = sup+{3} = 0, we set 2.σ+ = 4.σ+ = 5.σ+ = 1 and 1.σ+ = 3.σ+ = 0. (sup-{1}, ..., sup-{5} are all greater than 0.5, so the σ- bits are all set to 1.)
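A minimal sketch of this phase under the conventions of the earlier sketches (the dictionary layout and the name update_large_items are ours, purely illustrative):

    def count_ones(v):
        return bin(v).count("1")

    # old_sup[i]: stored support count of item i in D
    # bv_plus / bv_minus: bit vectors of the items over d+ / d-
    def update_large_items(old_sup, bv_plus, bv_minus, size_d, size_p, size_m, s):
        bits = {}
        for i, sup in old_sup.items():
            sup_p = count_ones(bv_plus.get(i, 0))
            sup_m = count_ones(bv_minus.get(i, 0))
            if sup + sup_p - sup_m >= (size_d - size_m + size_p) * s:  # i large in D'
                bits[i] = (sup >= size_d * s,     # sigma:  i large in D
                           sup_p >= size_p * s,   # sigma+: i large in d+
                           sup_m >= size_m * s)   # sigma-: i large in d-
        return bits   # BV(D-d-) is then built for exactly these items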
Table 4. An example (TIDs 100 and 200 form the deleted set d-; TIDs 800 and 900 form the inserted set d+)

  TID   Itemset
  100   1 2 3 4 5 6
  200   1 3 7
  300   ...
  400   ...
  500   ...
  600   ...
  700   ...
  800   ...
  900   ...

Table 5. The bit vectors and associated bits of the large items in Table 4

  Large item   BV-   BV(D-d-)   BV+   σ   σ+   σ-
  1            11    01110      00    1   0    1
  2            10    01011      11    1   1    1
  3            11    11010      00    1   0    1
  4            10    00100      11    1   1    1
  5            10    00000      11    0   1    1
4.2 Graph construction phase
Each 2-itemset X = {i, j} (i < j), where i and j are large items in D', is a candidate 2-itemset. The count sup+{i,j} can be found by counting the number of 1s in BV+i ∧ BV+j; similarly, sup-{i,j} can be found by counting the number of 1s in BV-i ∧ BV-j. For X ∈ L2, supX is available from the previous mining result, so we can calculate sup'X = supX + sup+X - sup-X; if sup'X is no less than |D'| × s%, X is added into L'2. For X ∉ L2, according to Lemma 2, X ∈ L'2 only if sup+X - sup-X > (|d+| - |d-|) × s%. If sup+X - sup-X > (|d+| - |d-|) × s%, we perform BV(D-d-)i ∧ BV(D-d-)j and count the number of 1s to find the support count of X in D - d-, then add sup+X to this count to get sup'X. If sup'X is no less than |D'| × s%, X is added into L'2.

The three bits i.σ, i.σ+ and i.σ- of each large item i can be used to further improve the performance. Lemma 3 illustrates the function of these bits.

Lemma 3. If an itemset X is not large in the original database, then X cannot be large in the updated database if sup+X < |d+| × s% and sup-X ≥ |d-| × s%.

Proof. Since X is not large in the original database D, supX < |D| × s%. The support count of X after the update is sup'X = supX + sup+X - sup-X < |D| × s% + |d+| × s% - |d-| × s% = (|D| + |d+| - |d-|) × s%. Therefore, X cannot be large in the updated database.

Another property should be mentioned here. By the reasoning of Lemma 1, if any subset of itemset X is not large, X cannot be a large itemset; however, if all subsets of X are large, X may or may not be large, and further checking is needed.

According to these properties, before we check whether {i, j} ∈ L2, we can check whether both i.σ and j.σ are equal to 1. If either i.σ or j.σ is 0, X cannot be in L2, and we save the cost of the membership check, which is costly when |L2| is large. For X ∉ L2, if either i.σ+ or j.σ+ is 0, we know that sup+X cannot reach |d+| × s% without performing BV+i ∧ BV+j, which is costly when the number of transactions in d+ is large. For X ∉ L2 with sup+X < |d+| × s%, if both i.σ- and j.σ- are 1, sup-X may reach |d-| × s%; therefore, BV-i ∧ BV-j is performed to check whether X is large in d-. If X is large in d-, we know by Lemma 3 that X ∉ L'2. For example, the 2-itemset {3,5} cannot be a large 2-itemset because 3.σ ∧ 5.σ = 0, 3.σ+ ∧ 5.σ+ = 0, 3.σ- ∧ 5.σ- = 1, and {3,5} is indeed large in d-.

For each large 2-itemset {i, j}, a directed edge from item i to item j is constructed in the association graph. We also set three bits X.σ, X.σ+ and X.σ- for each large 2-itemset X. Continuing the above example, there are 8 candidate 2-itemsets, and all of them except {3,4} are in L'2: L'2 = {{1,2}, {1,3}, {1,4}, {2,3}, {2,4}, {2,5}, {4,5}}. The association graph is shown in Figure 2, and Table 6 shows the associated bits of the large 2-itemsets.
[Figure 2. The association graph for the updated database]

Table 6. The associated bits of the large 2-itemsets

  Large 2-itemset   σ   σ+   σ-
  {1,2}             1   0    1
  {1,3}             1   0    1
  {1,4}             1   0    1
  {2,3}             1   0    1
  {2,4}             0   1    1
  {2,5}             0   1    1
  {4,5}             0   1    1
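The tests of this section can be combined into one decision procedure per candidate 2-itemset. The following sketch (illustrative names; bits[i] holds the triple (σ, σ+, σ-) of item i, as produced by the earlier phase-1 sketch) applies the σ shortcuts and Lemmas 2 and 3 before any AND over the unchanged part D - d-:

    def check_pair(i, j, bits, old_l2_sup, bv_p, bv_m, bv_rem, size_d, size_p, size_m, s):
        """Return True iff {i, j} belongs to L2' (a sketch of the tests above)."""
        new_size = size_d - size_m + size_p
        sigma_i, sigma_j = bits[i], bits[j]
        if sigma_i[0] and sigma_j[0] and (i, j) in old_l2_sup:
            # {i, j} was large in D: its old support is stored, no AND over D - d-
            sup_p = count_ones(bv_p[i] & bv_p[j])
            sup_m = count_ones(bv_m[i] & bv_m[j])
            return old_l2_sup[(i, j)] + sup_p - sup_m >= new_size * s
        # {i, j} was not large in D
        if not (sigma_i[1] and sigma_j[1]):       # sigma+ shortcut: sup+ < |d+| * s%
            if sigma_i[2] and sigma_j[2] and count_ones(bv_m[i] & bv_m[j]) >= size_m * s:
                return False                      # Lemma 3: large in d-, cannot enter L2'
        sup_p = count_ones(bv_p[i] & bv_p[j])
        sup_m = count_ones(bv_m[i] & bv_m[j])
        if sup_p - sup_m <= (size_p - size_m) * s:
            return False                          # Lemma 2
        return count_ones(bv_rem[i] & bv_rem[j]) + sup_p >= new_size * s

With the data of Tables 4 and 5, check_pair prunes {3,5} at the Lemma 3 branch and accepts {2,4} after a single AND over D - d-, matching the example above.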
4.3 Large itemset generation phase
The candidate itemsets Ck are generated from L'k-1 in the same way as in DLG*. Suppose X = {i1, i2, ..., ik} is a large k-itemset and the candidate Y = {i1, i2, ..., ik, ik+1} is generated from X. Similar to the previous phase, we obtain sup+Y by performing BV+i1 ∧ BV+i2 ∧ ... ∧ BV+ik+1, and sup-Y by performing BV-i1 ∧ BV-i2 ∧ ... ∧ BV-ik+1, and then check whether Y ∈ L'k+1. The bits X.σ and ik+1.σ are used to save the cost of the membership check. For Y ∉ Lk+1, if either X.σ+ or ik+1.σ+ is 0 and Y is large in d-, we know Y ∉ L'k+1 without performing BV(D-d-)i1 ∧ ... ∧ BV(D-d-)ik+1 or BV+i1 ∧ ... ∧ BV+ik+1. Three bits Y.σ, Y.σ+ and Y.σ- are then set for each large itemset Y, to be used in the next iteration.

Continuing the above example, there are two candidate 3-itemsets, {1,2,3} and {2,4,5}, and their support counts are both greater than 1.75; therefore, {1,2,3}, {2,4,5} ∈ L'3. Table 7 compares the number of candidates of DLG* with that of DUP. DLG* performs logical AND operations on bit vectors associated with the whole updated database ((D - d-) ∪ d+), with a total of 13 candidates. DUP performs logical AND operations on bit vectors associated with D - d-, d+ and d-, with a total of 10 candidates.
Table 7. Number of candidate itemsets

              DLG*              DUP
  Iteration   (D - d-) ∪ d+     D - d-   d+   d-
  2           10                0        8    5
  3           3                 1        2    1
In the k-th iteration, N × (k-1) logical AND operations are performed to determine all large k-itemsets, where N is the number of candidates. While DLG* performs 10 × 1 + 3 × 2 = 16 operations on bit vectors of length 7, DUP performs (8 + 5) × 1 + (2 + 1) × 2 = 19 operations on bit vectors of length 2 and 1 × 2 = 2 operations on bit vectors of length 5. Therefore, DUP reduces the cost of performing logical AND operations by 1 - (19 × 2 + 2 × 5) / (16 × 7) ≈ 57%. The DUP algorithm can also deal with the cases where there are no inserted or no deleted transactions: for |d+| = 0, we only need to set i.σ+ = 0 and sup+{i} = 0 for every item i, and for |d-| = 0, we only need to set i.σ- = 1 and sup-{i} = 0.
5. Experimental Results

To assess the performance of our graph-based algorithms for discovering and maintaining large itemsets, the algorithms DLG [34], DLG*, FUP2 [11] and DUP were implemented on a Sun SPARC/20 workstation. We first show the improvement of DLG* over DLG, and then demonstrate the performance of DUP by comparing it with DLG* and FUP2, the most efficient update algorithm proposed so far.

5.1 Synthetic data generation
In each experiment, we use synthetic data as the input to evaluate the performance of the algorithms. The method for generating synthetic transactions is the same as the one used in [34]. The parameters are similar to those in [34] except for the size of the changed part of the database (d+ or d-). The parameters used to generate our synthetic databases are listed in Table 8; the reader can refer to [34] for a detailed explanation of these parameters.
Table 8. The parameters

  |D|    Number of transactions
  |d|    Number of inserted/deleted transactions
  |I|    Average size of the potentially large itemsets
  |MI|   Maximum size of the potentially large itemsets
  |L|    Number of the potentially large itemsets
  |T|    Average size of the transactions
  |MT|   Maximum size of the transactions
  N      Number of items
We generate datasets with N = 1000 and |L| = 1000. We choose two values for |T|, 10 and 20, with corresponding |MT| = 20 and 40, respectively; and two values for |I|, 3 and 5, with corresponding |MI| = 5 and 10, respectively. We use the notation Tx.Iy.Dm.dn, adopted from [9], to mean that |T| = x, |I| = y, |D| = m thousand, and |d| = n thousand. A positive value of n denotes the size of the inserted transactions |d+|, and a negative one the size of the deleted transactions |d-|. The way we create the updated database is straightforward. To generate the original database D and the inserted transactions d+, a database of size |D| + |d+| is first generated; the first |D| transactions are stored in D, and the remaining |d+| transactions are stored in d+. Similarly, to generate D and the deleted transactions d-, a database of size |D| is first generated, and its first |d-| transactions are taken as the deleted ones.

5.2 Comparison of DLG* with DLG
We perform an experiment on the dataset T10.I5.D100 with minimum support 0.75%. The results are shown in Table 9. Examining the number of candidate itemsets generated by each algorithm, DLG* reduces the number of candidate k-itemsets (k > 2) generated by DLG by 1 - 38% = 62%.

Figure 3 shows the relative execution times of DLG and DLG* on three synthetic datasets, with the minimum support varied between 0.25% and 2%. The results show that DLG* can perform 2.7 times faster than DLG on a moderately sized database of 100,000 transactions. The performance gain is achieved by the significant reduction of the number of candidates. As shown, the performance gap widens as the minimum support decreases, because the growing number of large itemsets results in more redundant candidates generated by DLG. Moreover, for a larger |I| or |T|, DLG* performs much better than DLG: DLG* prunes more candidates, and the cost of counting candidate supports increases as |I| or |T| becomes large.

5.3 Effects of the minimum support on the update algorithm
The next two experiments show the effects of the minimum support, varied between 0.25% and 2%, on the dataset T10.I5.D100.d+1.d-1. For a fair comparison, the dataset for FUP2 is also kept in main memory so that the time for scanning the disk can be neglected. As shown in Figure 4, DUP is 1.9 to 3.6 times faster than DLG*, and 1.9 to 2.8 times faster than FUP2. The results show that DUP always performs better than re-running DLG* and better than FUP2. The speed-up ratio over DLG* decreases as the minimum support increases, because the number of large itemsets becomes smaller and re-running DLG* is less costly. In general, the smaller the minimum support, the larger the speed-up ratio over FUP2; this is because a small minimum support induces a large number of large itemsets, and the computation of candidate supports is more expensive in FUP2.
5.4 Effects of the size of the changed part of the database
Next, we examine how the size of the changed part of the database affects the performance of the algorithms. Comparing DUP to re-running DLG*, the performance of the update algorithm degrades as the amount of change grows, for two major reasons. First, the previous mining results become less useful when the updated database differs substantially from the original one. Second, the number of transactions that must be handled increases. Comparing DUP to FUP2, the speed-up ratio increases with the amount of change. The reason is that candidate support counting is more efficient in DUP, and as the inserted part grows, FUP2 spends more time computing candidate supports.

Two series of experiments are conducted to support this analysis: T10.I5.D100.d+x.d-1 for the insertion case and T10.I5.D100.d+1.d-x for the deletion case, with the minimum support set to 1%. In the insertion case, we increase the number of inserted transactions from 20k to 120k. Figure 5 shows that DUP is 2.9 to 3.3 times faster than FUP2, and 1.4 to 1.9 times faster than DLG*. When comparing DUP to FUP2, the speed-up ratio increases as the inserted part grows. When comparing DUP to DLG*, although the execution time of DLG* also increases with |d+|, the speed-up ratio decreases; however, DUP still performs better even when |d+| = 120k, which is larger than the size of the original database. In the deletion case, shown in Figure 6, DUP is 2.6 to 3.0 times faster than FUP2, and 1.0 to 1.6 times faster than DLG* for 10k ≤ |d-| ≤ 30k. The execution time of DLG* decreases as |d-| increases; DUP outperforms DLG* for |d-| ≤ 30k but drops to the level of DLG* for |d-| ≥ 40k. However, DUP always performs significantly faster than FUP2. In Figure 7, we examine how the size of the deleted part affects the performance of DUP on three synthetic datasets. Compared with re-running DLG* on the remaining database (D - d-), DUP keeps a better performance as long as |d-| is limited to 30% of |D|.
6. Conclusion and Future Work

We have studied efficient graph-based algorithms for discovering and maintaining knowledge in databases. The revised algorithm DLG* is developed to efficiently solve the problem of mining association rules; DLG* improves DLG [34] by reducing both the number of candidates and the cost of performing logical AND operations on bit vectors. The update algorithm DUP, based on the framework of DLG*, is further developed to solve the problem of maintaining association rules in the cases of insertion and deletion. The experimental results show that it significantly outperforms FUP2 [11], the most efficient update algorithm developed so far. DUP always performs faster than re-running DLG*, even when |d+| is larger than the size of the original database |D|, and keeps a better performance when |d-| is not greater than 30% of |D|.

To reduce the space for storing the bit vectors, a compression method has to be considered. The characteristic of the bit vectors is that they contain a large number of zeros. Taking advantage of this characteristic, we can replace each sequence of zeros by a count of its length. To achieve a better compression rate, the zero runs should be made as long as possible; we are currently investigating a method to reorder the transactions such that the bit vectors of most items have longer runs of zeros.
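As a sketch of this idea (plain zero-run-length coding; the exact scheme is left open above), a bit vector can be stored as the lengths of the zero runs between consecutive 1s:

    def compress(bits):
        """Encode a 0/1 sequence as the zero-run length before each 1, plus trailing zeros."""
        runs, zeros = [], 0
        for b in bits:
            if b:
                runs.append(zeros)   # number of 0s in front of this 1
                zeros = 0
            else:
                zeros += 1
        return runs, zeros

    def decompress(runs, trailing):
        bits = []
        for z in runs:
            bits.extend([0] * z + [1])
        return bits + [0] * trailing

    # BV_1 of Table 2: 0101110 -> run lengths [1, 1, 0, 0] and one trailing zero
    print(compress([0, 1, 0, 1, 1, 1, 0]))   # ([1, 1, 0, 0], 1)

Reordering the transactions so that each item's 1s cluster together lengthens the zero runs and thus shrinks this encoding, which is what the rearrangement mentioned above aims at.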
[Figure 3. Relative execution times of DLG* and DLG versus minimum support (0.25%-2%), three synthetic datasets]

[Figure 4. Relative execution times DLG*/DUP and FUP2/DUP versus minimum support, dataset T10.I5.D100.d+1.d-1]

[Figure 5. Relative execution times DLG*/DUP and FUP2/DUP versus the number of inserted transactions (k), dataset T10.I5.D100.d+x.d-1]

[Figure 6. Relative execution times DLG*/DUP and FUP2/DUP versus the number of deleted transactions (k), dataset T10.I5.D100.d+1.d-x]

[Figure 7. Relative execution time DLG*/DUP versus the number of deleted transactions (k), datasets D100.d-x]
References
[1] R. Agrawal, S. Ghosh, T. Imielinski, B. Iyer, and A. Swami. An Interval Classifier for Database Mining Applications. In Proc. of the International Conference on Very Large Data Bases, pages 560-573, 1992.
[2] R. Agrawal et al. Database Mining: A Performance Perspective. IEEE Transactions on Knowledge and Data Engineering, pages 914-925, 1993.
[3] R. Agrawal, T. Imielinski, and A. Swami. Mining Association Rules between Sets of Items in Large Databases. In Proc. of ACM SIGMOD, pages 207-216, 1993.
[4] R. Agrawal and R. Srikant. Fast Algorithms for Mining Association Rules. In Proc. of the International Conference on Very Large Data Bases, pages 487-499, 1994.
[5] R. Agrawal and R. Srikant. Mining Sequential Patterns. In Proc. of the International Conference on Data Engineering, pages 3-14, 1995.
[6] T. M. Anwar, H. W. Beck, and S. B. Navathe. Knowledge Mining by Imprecise Querying: A Classification-Based Approach. In Proc. of the International Conference on Data Engineering, pages 622-630, February 1992.
[7] S. Brin, R. Motwani, J. Ullman, and S. Tsur. Dynamic Itemset Counting and Implication Rules for Market Basket Data. In Proc. of ACM SIGMOD, pages 255-264, 1997.
[8] D. W. Cheung, A. W.-C. Fu, and J. Han. Knowledge Discovery in Databases: A Rule-Based Attribute-Oriented Approach. In Proc. of the International Symposium on Methodologies for Intelligent Systems, pages 164-173, Charlotte, North Carolina, October 1994.
[9] D. W. Cheung, J. Han, V. T. Ng, and C. Y. Wong. Maintenance of Discovered Association Rules in Large Databases: An Incremental Updating Technique. In Proc. of the International Conference on Data Engineering, pages 106-114, 1996.
[10] D. W. Cheung, V. T. Ng, and B. W. Tam. Maintenance of Discovered Knowledge: A Case in Multi-level Association Rules. In Proc. of the 2nd International Conference on Knowledge Discovery and Data Mining, pages 307-310, 1996.
[11] D. W. Cheung, S. D. Lee, and B. Kao. A General Incremental Technique for Maintaining Discovered Association Rules. In Proc. of the International Conference on Database Systems for Advanced Applications, pages 185-194, 1997.
[12] J. Han, Y. Cai, and N. Cercone. Knowledge Discovery in Databases: An Attribute-Oriented Approach. In Proc. of the International Conference on Very Large Data Bases, pages 547-559, 1992.
[13] J. Han, Y. Cai, and N. Cercone. Data-Driven Discovery of Quantitative Rules in Relational Databases. IEEE Transactions on Knowledge and Data Engineering, 5:29-40, 1993.
[14] J. Han and Y. Fu. Discovery of Multiple-Level Association Rules from Large Databases. In Proc. of the International Conference on Very Large Data Bases, pages 420-431, 1995.
[15] M. Houtsma and A. Swami. Set-Oriented Mining for Association Rules in Relational Databases. In Proc. of the International Conference on Data Engineering, pages 25-33, 1995.
[16] H.-Y. Hwang and W.-C. Fu. Efficient Algorithm for Attribute-Oriented Induction. In Proc. of the 1st International Conference on Knowledge Discovery and Data Mining, pages 168-173, 1995.
[17] M. Klemettinen, H. Mannila, P. Ronkainen, H. Toivonen, and A. I. Verkamo. Finding Interesting Rules from Large Sets of Discovered Association Rules. In Proc. of the 3rd International Conference on Information and Knowledge Management, pages 401-408, Gaithersburg, Maryland, 1994.
[18] K. L. Lee, G. Lee, and A. L. P. Chen. Efficient Graph-Based Algorithm for Discovering and Maintaining Knowledge in Large Databases. In Proc. of the Third Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD-99), pages 409-419, 1999.
[19] H. Mannila, H. Toivonen, and A. I. Verkamo. Efficient Algorithms for Discovering Association Rules. In Proc. of the AAAI Workshop on Knowledge Discovery in Databases, pages 221-235, 1994.
[20] R. Meo, G. Psaila, and S. Ceri. A New SQL-like Operator for Mining Association Rules. In Proc. of the International Conference on Very Large Data Bases, pages 122-133, 1996.
[21] R. T. Ng and J. Han. Efficient and Effective Clustering Methods for Spatial Data Mining. In Proc. of the International Conference on Very Large Data Bases, pages 144-155, 1994.
[22] J. S. Park, M. S. Chen, and P. S. Yu. An Effective Hash-Based Algorithm for Mining Association Rules. In Proc. of ACM SIGMOD, pages 175-185, 1995.
[23] A. Savasere, E. Omiecinski, and S. Navathe. An Efficient Algorithm for Mining Association Rules in Large Databases. In Proc. of the International Conference on Very Large Data Bases, pages 432-444, 1995.
[24] E. Simoudis, J. Han, and U. Fayyad, editors. Proc. of the 2nd International Conference on Knowledge Discovery and Data Mining (KDD-96), 1996.
[25] R. Srikant and R. Agrawal. Mining Generalized Association Rules. In Proc. of the International Conference on Very Large Data Bases, pages 407-419, 1995.
[26] R. Srikant and R. Agrawal. Mining Quantitative Association Rules in Large Relational Tables. In Proc. of ACM SIGMOD, pages 1-12, 1996.
[27] R. Srikant and R. Agrawal. Mining Sequential Patterns: Generalizations and Performance Improvements. In Proc. of the 5th International Conference on Extending Database Technology (EDBT), Avignon, France, March 1996.
[28] J. Tang. Using Incremental Pruning to Increase the Efficiency of Dynamic Itemset Counting for Mining Association Rules. In Proc. of the 7th International Conference on Information and Knowledge Management (CIKM), pages 273-280, 1998.
[29] H. Toivonen. Sampling Large Databases for Association Rules. In Proc. of the International Conference on Very Large Data Bases, pages 134-145, 1996.
[30] P. S. M. Tsai, C. C. Lee, and A. L. P. Chen. An Efficient Approach for Incremental Association Rule Mining. In Proc. of the Third Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD-99), pages 74-83, 1999.
[31] J. T. L. Wang, G.-W. Chirn, T. G. Marr, B. Shapiro, D. Shasha, and K. Zhang. Combinatorial Pattern Discovery for Scientific Data: Some Preliminary Results. In Proc. of ACM SIGMOD, pages 115-125, 1994.
[32] S. J. Yen and A. L. P. Chen. An Efficient Algorithm for Deriving Compact Rules from Databases. In Proc. of the International Conference on Database Systems for Advanced Applications, pages 364-371, 1995.
[33] S. J. Yen and A. L. P. Chen. The Analysis of Relationships in Databases for Rule Derivation. Journal of Intelligent Information Systems, 7:1-24, 1996.
[34] S. J. Yen and A. L. P. Chen. An Efficient Approach to Discovering Knowledge from Large Databases. In Proc. of the IEEE/ACM International Conference on Parallel and Distributed Information Systems, pages 8-18, 1996.
Guanling Lee received the B.S. and M.S. degrees, both in computer science, from National Tsing Hua University, Taiwan, R.O.C., in 1995 and 1997, respectively. She is currently a Ph.D. candidate in the Department of Computer Science, National Tsing Hua University. Her research interests include location management in mobile environments and data scheduling on wireless channels.

K. L. Lee received the B.S. and M.S. degrees, both in computer science, from National Tsing Hua University, Taiwan, R.O.C., in 1995 and 1997, respectively. She is currently a computer teacher at Chupei Senior High School.

Arbee L. P. Chen received the B.S. degree in computer science from National Chiao-Tung University, Taiwan, R.O.C., in 1977, and the Ph.D. degree in computer engineering from the University of Southern California in 1984. He joined National Tsing Hua University, Taiwan, as a National Science Council (NSC) sponsored Visiting Specialist in August 1990, and became a Professor in the Department of Computer Science in 1991. He was a Member of Technical Staff at Bell Communications Research, New Jersey, from 1987 to 1990, an Adjunct Associate Professor in the Department of Electrical Engineering and Computer Science, Polytechnic University, New York, and a Research Scientist at Unisys, California, from 1985 to 1986. His current research interests include multimedia databases, data mining, and mobile computing. Dr. Chen organized (and served as a Program Co-Chair of) the 1995 IEEE Data Engineering Conference and the 1999 International Conference on Database Systems for Advanced Applications (DASFAA) in Taiwan. He is a recipient of the NSC Distinguished Research Award.