FIMI'03: Workshop on Frequent Itemset Mining Implementations

FIMI Repository: http://fimi.cs.helsinki.fi/
FIMI Repository Mirror: http://www.cs.rpi.edu/~zaki/FIMI03/

Mohammed J. Zaki
Department of Computer Science, Rensselaer Polytechnic Institute, Troy, New York, USA
zaki@cs.rpi.edu

Bart Goethals
HIIT Basic Research Unit, Department of Computer Science, University of Helsinki, Finland
bart.goethals@cs.helsinki.fi

1 Why Organize the FIMI Workshop?

Since the introduction of association rule mining in 1993 by Agrawal, Imielinski and Swami [3], the frequent itemset mining (FIM) tasks have received a great deal of attention. Within the last decade, a phenomenal number of algorithms have been developed for mining all [3-5, 10, 18, 19, 21, 23, 26, 28, 31, 33], closed [6, 12, 22, 24, 25, 27, 29, 30, 32] and maximal frequent itemsets [1, 2, 7, 11, 15-17, 20, 35]. Every new paper claims to run faster than previously existing algorithms, based on experimental testing that is oftentimes quite limited in scope, since many of the original algorithms are not available due to intellectual property and copyright issues. Zheng, Kohavi and Mason [34] observed that the performance of several of these algorithms is not always as claimed by their authors when tested on different datasets. From personal experience, we have also noticed that different implementations of the same algorithm can behave quite differently for various datasets and parameters.

Given this proliferation of FIM algorithms, and the sometimes contradictory claims, there is a pressing need to benchmark, characterize and understand the algorithmic performance space. We would like to understand why and under what conditions one algorithm outperforms another. This means testing the methods for a wide variety of parameters, and on different datasets spanning dense and sparse, real and synthetic, small and large, and so on.

Given the experimental, algorithmic nature of FIM (and of most of data mining in general), it is crucial that other researchers be able to independently verify the claims made in a new paper. Unfortunately, the FIM community (with few exceptions) has a very poor track record in this regard. Many new algorithms are not available even as an executable, let alone as source code. How many times have we heard "this is proprietary software, and not available"? This is not the way other sciences work. Independent verifiability is the hallmark of sciences like physics, chemistry, biology, and so on. One may argue that the nature of the research is different: those fields have detailed experimental procedures that can be replicated, while we have algorithms, and there is more than one way to code an algorithm. However, a good example to emulate is the bioinformatics community, which has espoused the open-source paradigm with more alacrity than we have. It is quite common for journals and conferences in bioinformatics to require that software be available. For example, here is a direct quote from the journal Bioinformatics (http://bioinformatics.oupjournals.org/): "Authors please note that software should be available for a full 2 YEARS after publication of the manuscript."
We organized the FIMI workshop to address the three main deficiencies in the FIM community:

• Lack of publicly available implementations of FIM algorithms
• Lack of publicly available "real" datasets
• Lack of any serious performance benchmarking of algorithms

1.1 FIMI Repository

The goals of this workshop are to identify the main implementation aspects of the FIM problem for the all, closed and maximal pattern mining tasks, and to evaluate the behavior of the proposed algorithms with respect to different types of datasets and parameters. One of the most important aspects is that only open-source code submissions are allowed and that all submissions become freely available (for research purposes only) on the online FIMI repository, along with several new datasets for benchmarking purposes. See the URL: http://fimi.cs.helsinki.fi/.

1.2 Some Recommendations

We strongly urge all new papers on FIM to provide access to source code, or at least an executable, immediately after publication. We request that researchers contribute to the FIMI repository, both in terms of algorithms and datasets. We also urge the data mining community to adopt the open-source strategy, which will serve to accelerate advances in the field. Finally, we would like to alert reviewers that the FIMI repository now exists and contains state-of-the-art FIM algorithms, so there is no excuse for a new paper not to do an extensive comparison with the methods in the FIMI repository. Such papers should, in our opinion, be rejected outright!

2 The Workshop

This is a truly unique workshop.
It consisted of a code submission as well as a paper submission describing the algorithm, with a detailed performance study by the authors on the publicly provided datasets and a detailed explanation of when and why their algorithm performs better than existing implementations. The submissions were tested independently by the co-chairs, and the papers were reviewed by members of the program committee. The algorithms were judged for three main tasks: all frequent itemset mining, closed frequent itemset mining, and maximal frequent itemset mining. The workshop proceedings contain 15 papers describing 18 different algorithms that solve the frequent itemset mining problems. The source code of the implementations of these algorithms is publicly available on the FIMI repository site.

The conditions for "acceptance" of a submission were as follows: i) a correct implementation for the given task, and ii) an efficient implementation compared with other submissions in the same category, or a submission that provides new insight into the FIM problem. The idea is to highlight both successful and unsuccessful but interesting ideas. One outcome of the workshop will be to outline the focus for research on new problems in the field.

In order to allow a fair comparison of these algorithms, we performed an extensive set of experiments on several real-life datasets, and a few synthetic ones. Among these are three new datasets: a supermarket basket dataset donated by Tom Brijs [9], a dataset containing click-stream data of a Hungarian on-line news portal donated by Ferenc Bodon [8], and a dataset containing Belgian traffic accident descriptions donated by Karolien Geurts [13].

2.1 Acknowledgments

We would like to thank the following program committee members for their useful suggestions and reviews:

• Roberto Bayardo, IBM Almaden Research Center, USA
• Johannes Gehrke, Cornell University, USA
• Jiawei Han, University of Illinois at Urbana-Champaign, USA
• Hannu Toivonen, University of Helsinki, Finland

We also thank Taneli Mielikäinen and Toon Calders for their help in reviewing the submissions. We extend our thanks to all the participants who made submissions to the workshop, since their willingness to participate and contribute source code in the public domain was essential in the creation of the FIMI Repository. For the same reason, thanks are due to Ferenc Bodon, Tom Brijs, and Karolien Geurts, who contributed new datasets, and to Zheng, Kohavi and Mason for the KDD Cup 2001 datasets.

3 The FIMI Tasks Defined

Assume we are given a set of items I. An itemset I ⊆ I is some subset of items. A transaction is a couple T = (tid, I), where tid is the transaction identifier and I is an itemset. A transaction T = (tid, I) is said to support an itemset X if X ⊆ I. A transaction database D is a set of transactions such that each transaction has a unique identifier. The cover of an itemset X in D consists of the set of transaction identifiers of transactions in D that support X: cover(X, D) := {tid | (tid, I) ∈ D, X ⊆ I}. The support of an itemset X in D is the number of transactions in the cover of X in D: support(X, D) := |cover(X, D)|. An itemset is called frequent in D if its support in D exceeds a given minimal support threshold σ. D and σ are omitted when they are clear from the context. The goal is now to find all frequent itemsets, given a database and a minimal support threshold.

The search space of this problem, all subsets of I, is clearly huge, and a frequent itemset of size k implies the presence of 2^k − 2 other frequent itemsets as well, i.e., its nonempty proper subsets. In other words, if frequent itemsets are long, it simply becomes infeasible to mine the set of all frequent itemsets. In order to tackle this problem, several solutions have been proposed that generate only a representative subset of all frequent itemsets. Among these, the collections of all closed or maximal itemsets are the most popular. A frequent itemset I is called closed if it has no frequent superset with the same support, i.e., if

I = ∩_{(tid, J) ∈ cover(I)} J.

A frequent itemset is called maximal if it has no superset that is frequent. Obviously, the collection of maximal frequent itemsets is a subset of the collection of closed frequent itemsets, which in turn is a subset of the collection of all frequent itemsets. Although the maximal itemsets characterize all frequent itemsets, the supports of all their subsets are not available, which might be necessary for some applications such as association rules. On the other hand, the closed frequent itemsets form a lossless representation of all frequent itemsets, since the support of those itemsets that are not closed is uniquely determined by the closed frequent itemsets. See [14] for a recent survey of FIM algorithms.
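To make the definitions above concrete, here is a minimal brute-force C++ sketch (the type and function names such as Transaction, cover and isClosed are our own illustrative choices, not part of any submitted implementation). It computes the cover and support of an itemset directly from the definitions and tests closedness by intersecting the transactions in the cover, exactly as in the closure equation above.

```cpp
#include <algorithm>
#include <cstddef>
#include <iterator>
#include <set>
#include <vector>

using Itemset = std::set<int>;
struct Transaction { int tid; Itemset items; };
using Database = std::vector<Transaction>;

// cover(X, D): tids of the transactions that contain X.
std::vector<int> cover(const Itemset& x, const Database& d) {
    std::vector<int> tids;
    for (const auto& t : d)
        if (std::includes(t.items.begin(), t.items.end(), x.begin(), x.end()))
            tids.push_back(t.tid);
    return tids;
}

// support(X, D) = |cover(X, D)|.
std::size_t support(const Itemset& x, const Database& d) {
    return cover(x, d).size();
}

// X is closed iff X equals the intersection of all transactions in its cover.
bool isClosed(const Itemset& x, const Database& d) {
    Itemset closure;
    bool first = true;
    for (const auto& t : d) {
        if (!std::includes(t.items.begin(), t.items.end(), x.begin(), x.end()))
            continue;
        if (first) { closure = t.items; first = false; }
        else {
            Itemset tmp;
            std::set_intersection(closure.begin(), closure.end(),
                                  t.items.begin(), t.items.end(),
                                  std::inserter(tmp, tmp.begin()));
            closure.swap(tmp);
        }
    }
    return !first && closure == x;
}
```

A maximality check can be layered on top of this by verifying that no single-item extension of a frequent itemset is itself frequent.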
4 Experimental Evaluation

We conducted an extensive set of experiments on different datasets, for all of the algorithms in the three categories (all, closed and maximal). Figure 1 shows the dataset characteristics.

Database                  #Items   Avg. Length   #Transactions
accidents                 468      33.8          340,183
bms1                      497      2.5           59,602
bms2                      3,341    5.6           77,512
bmspos                    1,658    7.5           515,597
chess                     75       37            3,196
connect                   129      43            67,557
kosarak                   41,270   8.1           990,002
mushroom                  119      23            8,124
pumsb*                    2,088    50.5          49,046
pumsb                     2,113    74            49,046
retail                    16,469   10.3          88,162
T10I5N1KP5KC0.25D200K     956      10.3          200,000
T20I10N1KP5KC0.25D200K    979      20.1          200,000
T30I15N1KP5KC0.25D200K    987      29.7          200,000

Figure 1. Database Characteristics

Our target platform was a Pentium 4, 3.2 GHz processor, with 1GB of memory, using a WesternDigital IDE 7200rpm, 200GB local disk. The operating system was Redhat Linux 2.4.22 and we used gcc 3.2.2 for the compilation. Other platforms were also tried, such as an older machine with dual 400MHz Pentium III processors and 256MB memory but a faster SCSI 10,000rpm disk. Independent tests were also run on quad 500MHz Pentium III processors with 1GB memory. There were some minor differences, which have been reported on the workshop website. Here we refer to the target platform (3.2GHz/1GB/7200rpm).

All times reported are real times, including system and user times, as obtained from the unix time command. All algorithms were run with the output flag turned on, which means that the mined results were written to a file. We made this decision since in the real world one wants to see the output, and the total wall-clock time is the end-to-end delay that one will see. There was one unfortunate consequence of this: we were not able to run algorithms for mining all frequent itemsets below a certain threshold, since the output file exceeded the 2GB file size limit on a 32-bit platform. For each algorithm we also recorded its memory consumption using the memusage command. Results on memory usage are available on the FIMI website.

We should also mention that some algorithms had problems on certain datasets. For instance, for mining all frequent itemsets, armor is not able to handle dense datasets very well (for low values of minimum support it crashed for chess, mushroom and pumsb); pie gives a segmentation fault for bms2, chess, retail and the synthetic datasets; cofi gets killed for bms1 and kosarak; and dftime/dfmem crash for accidents, bms1 and retail. For closed itemset mining, fpclose segfaults for bms1, bms2, bmspos and retail; eclat_borgelt also has problems with retail. Finally, for maximal itemset mining, apriori_borgelt crashes for bms1 for low values of support, and so does eclat_borgelt for pumsb.

For the experiments, each algorithm was allocated a maximum of 10 minutes to finish execution, after which point it was killed. We had to do this to finish the evaluation in a reasonable amount of time. We had a total of 18 algorithms in the all category, 6 in the closed category, and 8 in the maximal category, for a grand total of 32 algorithms. Please note that the algorithms eclat_zaki, eclat_goethals, charm and genmax were not technically submitted to the workshop; however, we included them in the comparison since their source code is publicly available. We used 14 datasets, with an average of 7 values of minimum support. With a 10-minute time limit per algorithm, one round of evaluation took 31,360 minutes of running time, which translates to an upper bound of 21 days! Since not all algorithms take the full 10 minutes, the actual time for one round was roughly 7-10 days.
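As a quick sanity check of the budget quoted above, the total is simply the product of the four factors stated in the text (this is our own arithmetic restatement, not additional data):

```latex
\[
  32\ \text{algorithms} \times 14\ \text{datasets} \times 7\ \text{support values}
  \times 10\ \text{min} = 31\,360\ \text{min}
  = \frac{31\,360}{60 \times 24}\ \text{days} \approx 21.8\ \text{days}.
\]
```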
4.1 Mining All Frequent Itemsets

Figures 5 and 6 show the timings of the algorithms for mining all frequent itemsets. Figure 2 shows the best and second-best algorithms for high and low values of support for each dataset. There are several interesting trends that we observe:

1. In some cases, we observe a high initial running time for the highest value of support, after which the time drops for the next value of minimum support. This is due to file caching. Each algorithm was run with multiple minimum support values before switching to another algorithm. Therefore the first time the database is accessed we observe higher times, but on subsequent runs the data is cached and the I/O time drops.

2. In some cases, we observe a cross-over in the running times as one goes from high to low values of support. An algorithm may be the best for high values of support, but the same algorithm may not be the best for low values.

3. There is no single best algorithm for either high or low values of support, but some algorithms are the best or runner-up more often than others. Looking at Figure 2, we can conclude that for high values the best algorithms are either kdci or patricia, across all databases we tested. For low values, the picture is not as clear; the algorithms likely to perform well are patricia, fpgrowth* or lcm. For the runner-up in the low support category, we once again see patricia and kdci showing up.

4.2 Mining Closed Frequent Itemsets

Figures 7 and 8 show the timings of the algorithms for mining closed frequent itemsets. Figure 3 shows the best and second-best algorithms for high and low values of support for each dataset. For high support values, fpclose is best for 7 out of the 14 datasets, and lcm, afopt and charm also perform well on some datasets. For low values of support the competition is between fpclose and lcm for the top spot. For the runner-up spot there is a mixed bag of algorithms: fpclose, afopt, lcm and charm. If one were to pick an overall best algorithm, it would arguably be fpclose, since it either performs best or shows up in the runner-up spot more often than any other algorithm. An interesting observation is that for the cases where fpclose does not appear in the table, it gives a segmentation fault (for bms1, bms2, bmspos and retail).

4.3 Mining Maximal Frequent Itemsets

Figures 9 and 10 show the timings of the algorithms for mining maximal frequent itemsets. Figure 4 shows the best and second-best algorithms for high and low values of support for each dataset. For high values of support, fpmax* is the dominant winner or runner-up; genmax, mafia and afopt are also worth mentioning. For the low support category, fpmax* again makes a strong showing as the best in 7 out of 14 databases, and when it is not best it appears as the runner-up 6 times. Thus fpmax* is the method of choice for maximal pattern mining.

4.4 Conclusions

We presented only some of the results in this report; we refer the reader to the FIMI repository for a more detailed experimental study. The study was also somewhat limited, since we performed only timing and memory usage experiments for the given datasets. Ideally, we would have liked to do a more detailed study of the scale-up of the algorithms for a variety of different parameters; our preliminary studies show that none of the algorithms is able to gracefully scale up to very large datasets with millions of transactions. One reason may be that most methods are optimized for in-memory datasets, which points to out-of-core FIM algorithms as an avenue for future research. In the experiments reported above, there were no clear winners, but some methods did show up as the best or second-best algorithms for both high and low values of support.
Both patricia and kdci represent the state of the art in all frequent itemset mining, whereas fpclose takes this spot for closed itemset mining, and fpmax* appears to be one of the best for maximal itemset mining. An interesting observation is that for the synthetic datasets, apriori_borgelt seems to perform quite well for all, closed and maximal itemset mining. We refer the reader to the actual papers in these proceedings to find out the details of each of the algorithms in this study. The results presented here should be taken in the spirit of experiments-in-progress, since we do plan to diversify our testing to include more parameters. We are confident that the workshop will generate a very healthy and critical discussion on the state of affairs in frequent itemset mining implementations.

To conclude, we hope that the FIMI workshop will serve as a model for the data mining community to hold more such open-source benchmarking tests, and we hope that the FIMI repository will continue to grow with the addition of new algorithms and datasets, and once again serve as a model for the rest of the data mining world.

References

[1] C. Aggarwal. Towards long pattern generation in dense databases. SIGKDD Explorations, 3(1):20-26, 2001.
[2] R. Agrawal, C. Aggarwal, and V. Prasad. Depth first generation of long patterns. In 7th Int'l Conference on Knowledge Discovery and Data Mining, Aug. 2000.
[3] R. Agrawal, T. Imielinski, and A. Swami. Mining association rules between sets of items in large databases. In Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data, pages 207-216. ACM Press, 1993.
[4] R. Agrawal, H. Mannila, R. Srikant, H. Toivonen, and A. I. Verkamo. Fast discovery of association rules. In U. Fayyad et al., editors, Advances in Knowledge Discovery and Data Mining, pages 307-328. AAAI Press, Menlo Park, CA, 1996.
[5] R. Agrawal and R. Srikant. Fast algorithms for mining association rules. In 20th VLDB Conference, Sept. 1994.
[6] Y. Bastide, R. Taouil, N. Pasquier, G. Stumme, and L. Lakhal. Mining frequent patterns with counting inference. SIGKDD Explorations, 2(2), Dec. 2000.
[7] R. J. Bayardo. Efficiently mining long patterns from databases. In ACM SIGMOD Conf. Management of Data, June 1998.
[8] F. Bodon. A fast apriori implementation. In Proceedings of the IEEE ICDM Workshop on Frequent Itemset Mining Implementations, 2003.
[9] T. Brijs, G. Swinnen, K. Vanhoof, and G. Wets. Using association rules for product assortment decisions: A case study. In Knowledge Discovery and Data Mining, pages 254-260, 1999.
[10] S. Brin, R. Motwani, J. Ullman, and S. Tsur. Dynamic itemset counting and implication rules for market basket data. In ACM SIGMOD Conf. Management of Data, May 1997.
[11] D. Burdick, M. Calimlim, and J. Gehrke. MAFIA: A maximal frequent itemset algorithm for transactional databases. In Intl. Conf. on Data Engineering, Apr. 2001.
[12] D. Cristofor, L. Cristofor, and D. Simovici. Galois connection and data mining. Journal of Universal Computer Science, 6(1):60-73, 2000.
[13] K. Geurts, G. Wets, T. Brijs, and K. Vanhoof. Profiling high frequency accident locations using association rules. In Proceedings of the 82nd Annual Transportation Research Board, page 18, 2003.
[14] B. Goethals. Efficient Frequent Pattern Mining. PhD thesis, transnational University of Limburg, Belgium, 2002.
[15] K. Gouda and M. J. Zaki. Efficiently mining maximal frequent itemsets. In 1st IEEE Int'l Conf. on Data Mining, Nov. 2001.
[16] G. Grahne and J. Zhu. High performance mining of maximal frequent itemsets. In 6th International Workshop on High Performance Data Mining, May 2003.
[17] D. Gunopulos, H. Mannila, and S. Saluja. Discovering all the most specific sentences by randomized algorithms. In Intl. Conf. on Database Theory, Jan. 1997.
[18] J. Han, J. Pei, and Y. Yin. Mining frequent patterns without candidate generation. In ACM SIGMOD Conf. Management of Data, May 2000.
[19] M. Houtsma and A. Swami. Set-oriented mining of association rules in relational databases. In 11th Intl. Conf. Data Engineering, 1995.
[20] D.-I. Lin and Z. M. Kedem. Pincer-search: A new algorithm for discovering the maximum frequent set. In 6th Intl. Conf. Extending Database Technology, Mar. 1998.
[21] J.-L. Lin and M. H. Dunham. Mining association rules: Anti-skew algorithms. In 14th Intl. Conf. on Data Engineering, Feb. 1998.
[22] F. Pan, G. Cong, A. Tung, J. Yang, and M. Zaki. CARPENTER: Finding closed patterns in long biological datasets. In ACM SIGKDD Int'l Conf. on Knowledge Discovery and Data Mining, Aug. 2003.
[23] J. S. Park, M. Chen, and P. S. Yu. An effective hash based algorithm for mining association rules. In ACM SIGMOD Intl. Conf. Management of Data, May 1995.
[24] N. Pasquier, Y. Bastide, R. Taouil, and L. Lakhal. Discovering frequent closed itemsets for association rules. In 7th Intl. Conf. on Database Theory, Jan. 1999.
[25] J. Pei, J. Han, and R. Mao. CLOSET: An efficient algorithm for mining frequent closed itemsets. In SIGMOD Int'l Workshop on Data Mining and Knowledge Discovery, May 2000.
[26] A. Savasere, E. Omiecinski, and S. Navathe. An efficient algorithm for mining association rules in large databases. In 21st VLDB Conf., 1995.
[27] P. Shenoy, J. Haritsa, S. Sudarshan, G. Bhalotia, M. Bawa, and D. Shah. Turbo-charging vertical mining of large databases. In ACM SIGMOD Intl. Conf. Management of Data, May 2000.
[28] H. Toivonen. Sampling large databases for association rules. In 22nd VLDB Conf., 1996.
[29] J. Wang, J. Han, and J. Pei. CLOSET+: Searching for the best strategies for mining frequent closed itemsets. In ACM SIGKDD Int'l Conf. on Knowledge Discovery and Data Mining, Aug. 2003.
[30] M. J. Zaki. Scalable algorithms for association mining. IEEE Transactions on Knowledge and Data Engineering, 12(3):372-390, May-June 2000.
[31] M. J. Zaki and K. Gouda. Fast vertical mining using diffsets. In 9th ACM SIGKDD Int'l Conf. Knowledge Discovery and Data Mining, Aug. 2003.
[32] M. J. Zaki and C.-J. Hsiao. CHARM: An efficient algorithm for closed itemset mining. In 2nd SIAM International Conference on Data Mining, Apr. 2002.
[33] M. J. Zaki, S. Parthasarathy, M. Ogihara, and W. Li. New algorithms for fast discovery of association rules. In 3rd Intl. Conf. on Knowledge Discovery and Data Mining, Aug. 1997.
[34] Z. Zheng, R. Kohavi, and L. Mason. Real world performance of association rule algorithms. In Proceedings of the 7th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 401-406. ACM Press, 2001.
[35] Q. Zou, W. Chu, and B. Lu. SmartMiner: A depth first algorithm guided by tail information for mining maximal frequent itemsets. In 2nd IEEE Int'l Conf. on Data Mining, Nov. 2002.
Database                  High Support 1st   High Support 2nd             Low Support 1st
accidents                 kdci               eclat_zaki                   fpgrowth*
bms1                      patricia           lcm                          patricia
bms2                      patricia           lcm                          lcm
bmspos                    kdci               patricia                     fpgrowth*
chess                     patricia           kdci                         lcm
connect                   kdci               aim                          lcm
kosarak                   kdci               patricia                     patricia
mushroom                  kdci               lcm                          lcm
pumsb                     patricia           fpgrowth*                    mafia
pumsb*                    kdci               aim/patricia                 patricia
retail                    patricia           afopt                        lcm
T10I5N1KP5KC0.25D200K     patricia           fpgrowth*                    fpgrowth*
T20I10N1KP5KC0.25D200K    kdci               apriori_borgelt              fpgrowth*
T30I15N1KP5KC0.25D200K    kdci               eclat_zaki/apriori_borgelt   apriori_borgelt

Figure 2. All FIM: Best (1st) and Runner-up (2nd) for High and Low Supports. (The low-support runner-up column could not be reliably recovered from the garbled source and is omitted; see the discussion in Section 4.1.)

Database                  High Support 1st   High Support 2nd   Low Support 1st   Low Support 2nd
accidents                 charm              fpclose            fpclose           afopt
bms1                      lcm                fpclose            lcm               fpclose
bms2                      lcm                apriori_borgelt    lcm               charm
bmspos                    apriori_borgelt    charm/afopt        lcm               charm
chess                     lcm                fpclose            lcm               fpclose
connect                   fpclose            afopt              lcm               fpclose
kosarak                   fpclose            charm              fpclose           afopt
mushroom                  fpclose            afopt              fpclose           lcm
pumsb                     fpclose/charm      afopt              lcm               fpclose
pumsb*                    fpclose            afopt/charm        fpclose           afopt
retail                    afopt              lcm                lcm               apriori_borgelt
T10I5N1KP5KC0.25D200K     fpclose            afopt              fpclose           lcm
T20I10N1KP5KC0.25D200K    apriori_borgelt    charm              fpclose           lcm
T30I15N1KP5KC0.25D200K    fpclose            apriori_borgelt    apriori_borgelt   fpclose

Figure 3. Closed FIM: Best (1st) and Runner-up (2nd) for High and Low Supports

Database                  High Support 1st   High Support 2nd   Low Support 1st   Low Support 2nd
accidents                 genmax             fpmax*             fpmax*            mafia/genmax
bms1                      fpmax*             lcm                lcm               fpmax*
bms2                      afopt              fpmax*             afopt             fpmax*
bmspos                    fpmax*             genmax             fpmax*            afopt
chess                     fpmax*             afopt              mafia             fpmax*
connect                   fpmax*             afopt              fpmax*            afopt
kosarak                   fpmax*             genmax             afopt             fpmax*
mushroom                  fpmax*             mafia              fpmax*            mafia
pumsb                     genmax             fpmax*             fpmax*            afopt
pumsb*                    fpmax*             mafia              mafia             fpmax*
retail                    afopt              lcm                afopt             lcm
T10I5N1KP5KC0.25D200K     fpmax*             afopt              fpmax*            afopt
T20I10N1KP5KC0.25D200K    apriori_borgelt    genmax             fpmax*            afopt
T30I15N1KP5KC0.25D200K    genmax             fpmax*             apriori_borgelt   fpmax*

Figure 4. Maximal FIM: Best (1st) and Runner-up (2nd) for High and Low Supports
Figure 5. Comparative Performance: All. (Plots of Total Time (sec) versus Minimum Support (%) for accidents, bms1, bms2, bmspos, chess and connect, comparing apriori_bodon, apriori_borgelt, apriori_brave, apriori-dftime, apriori-dfmem, afopt, aim, armor, cofi, eclat_borgelt, eclat_goethals, eclat_zaki, fpgrowth*, kdci, lcm, mafia, patricia and pie.)
Figure 6. Comparative Performance: All. (Plots of Total Time (sec) versus Minimum Support (%) for kosarak, mushroom, pumsb, pumsb*, retail, T10I5N1KP5KC0.25D200K, T20I10N1KP5KC0.25D200K and T30I15N1KP5KC0.25D200K; same algorithms as Figure 5.)

Figure 7. Comparative Performance: Closed. (Plots of Total Time (sec) versus Minimum Support (%) for accidents, bms1, bms2, bmspos, chess and connect, comparing afopt, apriori_borgelt, charm, eclat_borgelt, fpclose and lcm.)

Figure 8. Comparative Performance: Closed. (Plots of Total Time (sec) versus Minimum Support (%) for kosarak, mushroom, pumsb, pumsb*, retail, T10I5N1KP5KC0.25D200K, T20I10N1KP5KC0.25D200K and T30I15N1KP5KC0.25D200K; same algorithms as Figure 7.)
Figure 9. Comparative Performance: Maximal. (Plots of Total Time (sec) versus Minimum Support (%) for accidents, bms1, bms2, bmspos, chess and connect, comparing afopt, apriori_borgelt, eclat_borgelt, fpmax*, genmax, ibe, lcm and mafia.)

Figure 10. Comparative Performance: Maximal. (Plots of Total Time (sec) versus Minimum Support (%) for kosarak, mushroom, pumsb, pumsb*, retail, T10I5N1KP5KC0.25D200K, T20I10N1KP5KC0.25D200K and T30I15N1KP5KC0.25D200K; same algorithms as Figure 9.)

A fast APRIORI implementation

Ferenc Bodon*
Informatics Laboratory, Computer and Automation Research Institute, Hungarian Academy of Sciences
H-1111 Budapest, Lágymányosi u. 11, Hungary

* Research supported in part by OTKA grants T42706, T42481 and the EU-COE Grant of MTA SZTAKI.

Abstract

The efficiency of frequent itemset mining algorithms is determined mainly by three factors: the way candidates are generated, the data structure that is used, and the implementation details. Most papers focus on the first factor, some describe the underlying data structures, but implementation details are almost always neglected. In this paper we show that the effect of implementation can be more important than the selection of the algorithm. Ideas that seem to be quite promising may turn out to be ineffective if we descend to the implementation level. We theoretically and experimentally analyze APRIORI, which is the most established algorithm for frequent itemset mining.
Several implementations of the algorithm have been put forward in the last decade. Although they are implementations of the very same algorithm, they display large differences in running time and memory need. In this paper we describe an implementation of APRIORI that outperforms all implementations known to us. We analyze, theoretically and experimentally, the principal data structure of our solution. This data structure is the main factor in the efficiency of our implementation. Moreover, we present a simple modification of APRIORI that appears to be faster than the original algorithm.

1 Introduction

Finding frequent itemsets is one of the most investigated fields of data mining. The problem was first presented in [1]. The subsequent paper [3] is considered one of the most important contributions to the subject. Its main algorithm, APRIORI, not only influenced the association rule mining community, but it affected other data mining fields as well.

Association rule and frequent itemset mining became a widely researched area, and hence faster and faster algorithms have been presented. Numerous of them are APRIORI-based algorithms or APRIORI modifications. Those who adopted APRIORI as a basic search strategy tended to adopt the whole set of procedures and data structures as well [20][8][21][26]. Since the scheme of this important algorithm was used not only in basic association rule mining, but also in other data mining fields (hierarchical association rules [22][16][11], association rule maintenance [9][10][24][5], sequential pattern mining [4][23], episode mining [18] and functional dependency discovery [14][15]), it seems appropriate to critically examine the algorithm and clarify its implementation details.

A central data structure of the algorithm is the trie or the hash-tree. Concerning speed, memory need and sensitivity to parameters, tries were proven to outperform hash-trees [7]. In this paper we show a version of the trie that gives the best result in frequent itemset mining. In addition to the description and a theoretical and experimental analysis, we provide implementation details as well.

The rest of the paper is organized as follows. In Section 2 the problem is presented, and in Section 3 tries are described together with the modifications that lead to a much faster algorithm. Implementation is detailed in Section 4, and experimental results are given in Section 5. Finally, we mention further improvement possibilities.

2. Problem Statement

Frequent itemset mining came from efforts to discover useful patterns in customers' transaction databases. A customers' transaction database is a sequence of transactions (T = t1, ..., tn), where each transaction is an itemset (ti ⊆ I). An itemset with k elements is called a k-itemset. In the rest of the paper we make the (realistic) assumption that the items are from an ordered set, and transactions are stored as sorted itemsets. The support of an itemset X in T, denoted as supp_T(X), is the number of transactions that contain X, i.e., supp_T(X) = |{t_j : X ⊆ t_j}|. An itemset is frequent if its support is greater than a support threshold, originally denoted by min_supp. The frequent itemset mining problem is to find all frequent itemsets in a given transaction database.

The first, and maybe the most important, solution for finding frequent itemsets is the APRIORI algorithm [3].
Later, faster and more sophisticated algorithms were suggested, most of them being modifications of APRIORI [20][8][21][26]. Therefore, if we improve the APRIORI algorithm, we improve a whole family of algorithms. We assume that the reader is familiar with APRIORI [2] and we turn our attention to its central data structure.

3. Determining Support with a Trie

The data structure trie was originally introduced to store and efficiently retrieve words of a dictionary (see for example [17]). A trie is a rooted, (downward) directed tree, like a hash-tree. The root is defined to be at depth 0, and a node at depth d can point to nodes at depth d + 1. A pointer is also called an edge or link, and is labeled by a letter. There exists a special letter * which represents an "end" character. If node u points to node v, then we call u the parent of v, and v a child node of u. Every leaf ℓ represents a word which is the concatenation of the letters on the path from the root to ℓ. Note that if the first k letters are the same in two words, then the first k steps on their paths are the same as well.

Tries are suitable to store and retrieve not only words, but any finite ordered sets. In this setting a link is labeled by an element of the set, and the trie contains a set if there exists a path where the links are labeled by the elements of the set, in increasing order. In our data mining context the alphabet is the (ordered) set of all items I. A candidate k-itemset C = {i1 < i2 < ... < ik} can be viewed as the word i1 i2 ... ik composed of letters from I. We do not need the * symbol, because every inner node represents an important itemset (i.e., a meaningful word).

Figure 1 presents a trie that stores the candidates {A,C,D}, {A,E,G}, {A,E,L}, {A,E,M}, {K,M,N}. Numbers in the nodes serve as identifiers and will be used in the implementation of the trie. Building a trie is straightforward; we omit the details, which can be found in [17].

Figure 1. A trie containing 5 candidates (node identifiers 0-10)

In the support count method we take the transactions one-by-one. For a transaction t we take all ordered k-subsets X of t and search for them in the trie structure. If X is found (as a candidate), then we increase the support count of this candidate by one. Here, we do not generate all k-subsets of t; rather, we perform early quits if possible. More precisely, if we are at a node at depth d by following the j-th item link, then we move forward only on those links whose labels i ∈ t have index greater than j but less than |t| − k + d + 1.

In our approach, tries store not only candidates, but frequent itemsets as well. This has the following advantages (see also the sketch below):

1. Candidate generation becomes easy and fast. We can generate candidates from pairs of nodes that have the same parent (which means that, except for the last item, the two sets are the same).

2. Association rules are produced much faster, since retrieving the support of an itemset is quicker (remember the trie was originally developed to quickly decide whether a word is included in a dictionary).

3. Just one data structure has to be implemented, hence the code is simpler and easier to maintain.

4. We can immediately generate the so-called negative border, which plays an important role in many APRIORI-based algorithms (online association rules [25], sampling based algorithms [26], etc.).
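As a rough illustration of such a candidate trie, here is a minimal C++ sketch (the struct and function names are our own assumptions, not the actual class Trie described in Section 4): edges are kept in increasing label order and every node carries a support counter.

```cpp
#include <cstddef>
#include <vector>

struct TrieNode {
    std::vector<int>       labels;    // edge labels, strictly increasing
    std::vector<TrieNode*> children;  // child reached via labels[i]
    unsigned long          counter = 0;
};

// Insert a sorted candidate itemset into the trie, reusing shared prefixes.
void insert(TrieNode* root, const std::vector<int>& itemset) {
    TrieNode* node = root;
    for (int item : itemset) {
        std::size_t i = 0;
        while (i < node->labels.size() && node->labels[i] < item) ++i;
        if (i == node->labels.size() || node->labels[i] != item) {
            node->labels.insert(node->labels.begin() + i, item);
            node->children.insert(node->children.begin() + i, new TrieNode());
        }
        node = node->children[i];
    }
}
```

Inserting the five candidates of Figure 1 this way produces eleven nodes in total (the root plus ten descendants), matching the node identifiers 0-10 in the figure.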
3.1 Support Count Methods with a Trie

Support count is done by reading the transactions one-by-one and determining which candidates are contained in the actual transaction (denoted by t). Finding the candidates contained in a given transaction primarily determines the overall running time. There are two simple recursive methods to solve this task; both start from the root of the trie. The recursive step is the following (let us denote the number of edges of the actual node by m):

1. For each item in the transaction we determine whether there exists an edge whose label corresponds to the item. Since the edges are ordered according to their labels, this means a search in an ordered set.

2. We maintain two indices, one for the items in the transaction and one for the edges of the node. Each index is initialized to the first element. Next we check whether the elements pointed to by the two indices are equal. If they are, we call our procedure recursively; otherwise we increase the index that points to the smaller item. These steps are repeated until the end of the transaction or the last edge is reached.

In both methods, if item i of the transaction leads to a new node, then item j is considered in the next step only if j > i (more precisely, item j is considered if j < |t| + actual depth − m + 1).

Let us compare the running time of the two methods. Since both methods are correct, the same branches of the trie will be visited, so the running time difference is determined by the recursive step. The first method calls the subroutine that decides whether there exists an edge with a given label |t| times; if binary search is used, each call requires log2 m steps. Also note that a subroutine call needs as many value assignments as the subroutine has parameters. We can easily improve the first approach: if the number of edges is small (i.e., if |t|^m < m^|t|, equivalently m log2 |t| < |t| log2 m), we can do the inverse procedure, i.e., for all labels we check whether there exists a corresponding item. This way the overall running time is proportional to min{|t| log2 m, m log2 |t|}. The second method needs at least min{m, |t|} and at most m + |t| steps, and there is no subroutine call. Theoretically it can happen that the first method is the better solution (for example if |t| = 1, m is large, and the label of the last edge corresponds to the only item in the transaction), but in general the second method is faster. Experiments showed that the second method finished 50% faster on average.
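Here is a minimal sketch of the second method (the two-index merge step), continuing the TrieNode layout assumed in the earlier sketch; the early-quit pruning by |t| − k + d + 1 and the maxpath optimization of Section 3.2 are deliberately left out.

```cpp
#include <cstddef>
#include <vector>

struct TrieNode {                       // same assumed layout as before
    std::vector<int>       labels;      // edge labels, strictly increasing
    std::vector<TrieNode*> children;
    unsigned long          counter = 0;
};

// Walk the sorted transaction t and the ordered edge list with two indices,
// descending on every match; when 'remaining' reaches 0, a candidate of the
// target size k has been found in t, so its counter is increased.
void countSubsets(TrieNode* node, const std::vector<int>& t,
                  std::size_t from, int remaining) {
    if (remaining == 0) { ++node->counter; return; }
    std::size_t i = from, j = 0;
    while (i < t.size() && j < node->labels.size()) {
        if (t[i] == node->labels[j]) {
            countSubsets(node->children[j], t, i + 1, remaining - 1);
            ++i; ++j;
        } else if (t[i] < node->labels[j]) {
            ++i;                        // transaction item has no matching edge here
        } else {
            ++j;                        // this edge label does not occur in t
        }
    }
}

// Usage: for every (sorted) transaction t, call countSubsets(root, t, 0, k).
```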
The running time of support count can be further reduced if we modify the trie a little. These small modifications and tricks are described in the next subsections.

3.2 Storing the Length of Maximal Paths

Here we show how the time of finding supported candidates in a transaction can be significantly reduced by storing a little extra information. The point is that we often perform superfluous moves in the trie search, in the sense that there are no candidates in the direction we are about to explore. To illustrate this, consider the following example. Assume that after determining the frequent 4-itemsets only the candidate {A, B, C, D, E} was generated, and Figure 2 shows the resulting trie. If we search for 5-itemset candidates supported by the transaction {A, B, C, D, E, F, G, H, I}, then we must visit every node of the trie. This is unnecessary, since only one path leads to a node at depth 5, which means that only one path represents a 5-itemset candidate. Instead of visiting merely 6 nodes, we visit 32 of them. At each node, we also have to decide which link to follow, which can greatly affect running time if a node has many links.

Figure 2. A trie with a single 5-itemset candidate

To avoid this superfluous traveling, at every node we store the length of the longest directed path that starts from there. When searching for k-itemset candidates at depth d, we move downward only if the maximal path length at this node is at least k − d. Storing these counters needs memory, but as experiments proved, it can seriously reduce the search time for large itemsets.

3.3 Frequency Codes

It often happens that we have to find the node that represents a given itemset, for example during candidate generation, when the subsets of the generated candidate have to be checked, or when we want to obtain the association rules. Starting from the root we have to travel downward, and at depth d we have to find the edge whose label is the same as the d-th element of the itemset.

Theoretically, binary search would be the fastest way to find an item in an ordered list. But if we go down to the implementation level, we can easily see that if the list is small, linear search is faster than binary search (because an iteration step is faster). Hence the fastest solution is to apply linear search below a threshold and binary search above it. The threshold does not depend on the characteristics of the data, but on the ratio of the elementary operations (value assignment, increment, division, ...).

In linear search, we read the first item and compare it with the searched item. If it is smaller, then there is no edge with this item; if it is greater, we step forward; if they are equal, the search is finished. If we have bad luck, the most frequent item has the highest order, and we have to march to the end of the list whenever this item is searched for. On the whole, the search will be faster if the order of the items corresponds to their frequency order. We know the frequency order exactly after the first read of the whole database, thus everything is at hand to build the trie with frequency codes instead of the original codes. The frequency code of an item i is fc[i] if i is the fc[i]-th most frequent item. Storing the frequency codes and their inverses increases the memory need slightly; in return it increases the speed of retrieving the occurrences of itemsets. A theoretical analysis of the improvement can be read in the Appendix.

Frequency codes also affect the structure of the trie, and consequently the running time of support count. To illustrate this, let us suppose that 2 candidates of size 3 are generated: {A, B, C} and {A, B, D}. Different tries are generated if the items have different codes. Figure 3 presents the tries generated by two different codings (the item order A, B, C, D versus the order C, D, A, B).

Figure 3. Different coding results in different tries

If we want to find which candidates are stored in a basket {A, B, C, D}, then 5 nodes are visited in the first case and 7 in the second case. That does not mean that we will find the candidates faster in the first case, because the nodes are not so "fat" in the second case, i.e., they have fewer edges. Processing a node is faster in the second case, but more nodes have to be visited. In general, if the codes of the items correspond to the frequency codes, then the resulting trie will be unbalanced, while in the other case the trie will be rather balanced. Neither is clearly more advantageous than the other. We chose to build the unbalanced trie, because it travels through fewer nodes, which means fewer recursive steps, and a recursive step is a slow operation (a subroutine call with at least five parameters in our implementation) compared to finding the proper edges at a node.

In [12] it was shown that it is advantageous to recode the frequent items according to the ascending order of their frequencies (i.e., the inverse of the frequency codes), because candidate generation will be faster. The first step of candidate generation is to find siblings and take the union of the itemsets represented by them. It is easy to prove that there are fewer sibling relations in a balanced trie, therefore fewer unions are generated and the second step of candidate generation is evoked fewer times. For example, in our figure one union would be generated and then deleted in the first case and none would be generated in the second. Altogether, frequency codes have advantages and disadvantages. They accelerate retrieving the support of an itemset, which can be useful in association rule generation or in on-line frequent itemset mining, but they slow down candidate generation. Since candidate generation is many orders of magnitude faster than support count, the speed decrease is not noticeable.
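A tiny sketch of computing such frequency codes after the first database scan (the container and function names are our own; ranks here start at 0, so the most frequent item gets code 0):

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

// supportOfItem[i] = support of item i after the first scan of the database.
std::vector<int> frequencyCodes(const std::vector<unsigned long>& supportOfItem) {
    std::vector<int> order(supportOfItem.size());
    std::iota(order.begin(), order.end(), 0);
    std::sort(order.begin(), order.end(),
              [&](int a, int b) { return supportOfItem[a] > supportOfItem[b]; });
    std::vector<int> fc(supportOfItem.size());
    for (std::size_t rank = 0; rank < order.size(); ++rank)
        fc[order[rank]] = static_cast<int>(rank);  // fc[i]-th most frequent item
    return fc;
}
```

The inverse mapping (code back to original item) is obtained from the same order vector, which is what the text above refers to as storing the frequency codes and their inverses.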
3.4 Determining the Support of 1- and 2-itemset Candidates

We already mentioned that the support count of 1-element candidates can easily be done with a single vector. The same easy solution can be applied to 2-itemset candidates; here we can use an array [19]. Figure 4 illustrates the two solutions: supp(i) = vector[i] for single items, and supp(fc[i], fc[j]) = array[fc[i]][fc[j] − fc[i]] for item pairs, where the rows of the triangular array have lengths |L1| − 1, |L1| − 2, ..., 1.

Figure 4. Data structure to determine the support of the items and candidate item pairs

Note that this solution is faster than the trie-based solution, since increasing a support takes one step. Its second advantage is that it needs less memory. The memory need can be further reduced by applying on-the-fly candidate generation [12]. If the number of frequent items is |L1|, then the number of 2-itemset candidates is (|L1| choose 2), out of which a lot will never occur. Thus, instead of the array, we can use a trie: a 2-itemset candidate is inserted only if it occurs in a basket.
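A sketch of that counter layout in C++ (our own names; the indices are frequency codes, and the pair offsets are shifted to be 0-based relative to the array[fc[i]][fc[j] − fc[i]] formula above):

```cpp
#include <cstddef>
#include <vector>

struct PairCounters {
    std::vector<unsigned long>              single;  // single[c]       = supp of item with code c
    std::vector<std::vector<unsigned long>> pairs;   // pairs[c][d-c-1] = supp of pair {c, d}, c < d

    explicit PairCounters(std::size_t numFrequentItems)
        : single(numFrequentItems, 0), pairs(numFrequentItems) {
        for (std::size_t c = 0; c + 1 < numFrequentItems; ++c)
            pairs[c].assign(numFrequentItems - c - 1, 0);   // triangular rows |L1|-1, |L1|-2, ...
    }
    void addItem(int c)              { ++single[c]; }
    void addPair(int c, int d)       { ++pairs[c][d - c - 1]; }     // requires c < d
    unsigned long pairSupport(int c, int d) const { return pairs[c][d - c - 1]; }
};
```

Incrementing a counter is a single indexed access, which is why this beats the trie for sizes 1 and 2 as the text notes.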
3.5 Applying Hashing Techniques

Determining the support with a trie is relatively slow when we have to step forward from a node that has many edges. We can accelerate the search if hash tables are employed. This way the cost of a step down is calculating one hash value, thus a recursive step takes exactly |t| steps. We want to keep the property that a leaf represents exactly one itemset, hence we have to use perfect hashing. Frequency codes suit our needs again, because a trie stores only frequent items. Please note that applying hashing techniques to tries does not result in the hash-tree proposed in [3].

It is wasteful to change all inner nodes to hash tables, since a hash table needs much more memory than an ordered list of edges. We propose to alter into a hash table only those inner nodes which have more edges than a reasonable threshold (denoted by leaf_max_size). During trie construction, when a new leaf is added, we have to check whether the number of its parent's edges exceeds leaf_max_size. If it does, the node has to be altered into a hash table. The inverse of this transformation may be needed when infrequent itemsets are deleted. If the frequent itemsets are stored in the trie, then the number of edges cannot grow as we go down the trie. In practice, nodes at higher levels have many edges, and nodes at low levels have only a few. The structure of the trie will therefore be the following: nodes over a (not necessarily horizontal) line will be hash tables, while the others will be original nodes.

3.6 Brave Candidate Generation

It is typical that in the last phases of APRIORI there are only a small number of candidates. However, to determine their support the whole database is scanned, which is wasteful. This problem was also mentioned in the original APRIORI paper [3], where the algorithm APRIORI-HYBRID was proposed: if a certain heuristic holds, then APRIORI switches to APRIORI-TID, where for each candidate we store the transactions that contain it (and the support is immediately obtained). This results in a much faster execution of the latter phases of APRIORI. The hard point of this approach is to tell when to switch from APRIORI to APRIORI-TID. If the heuristic fails, the algorithm may need too much memory and become very slow.

Here we choose another approach that we call brave candidate generation; the algorithm is denoted by APRIORI-BRAVE. APRIORI-BRAVE keeps track of the memory need and stores the amount of the maximal memory need. After determining the frequent k-itemsets, it generates the (k + 1)-itemset candidates as APRIORI does. However, it does not carry on with the support count, but checks whether the memory need exceeds the maximal memory need. If not, the (k + 2)-itemset candidates are generated, otherwise the support count is evoked and the maximal memory need counter is updated. We carry on with memory checks and candidate generation as long as the memory need does not reach the maximal memory need. This procedure collects together the candidates of the latter phases and determines their support in one database read. For example, the candidates in database T40I10D100K with min_supp = 0.01 will be processed the following way: 1, 2, 3, 4-10, 11-14, which means 5 database scans instead of 14.

One may say that APRIORI-BRAVE does not consume more memory than APRIORI and can be faster, because there is a possibility that candidates of different sizes are collected and their support is determined in one read. However, the acceleration is not guaranteed: APRIORI-BRAVE may generate (k + 2)-itemset candidates from frequent k-itemsets, which can lead to more false candidates, and determining the support of false candidates takes time. Consequently, we cannot guarantee that APRIORI-BRAVE is faster than APRIORI; however, test results show that this heuristic works well in practice.

3.7 A Dead-end Idea: Support Lower Bound

The false-candidates problem of APRIORI-BRAVE could be avoided if only those candidates were generated that are surely frequent. Fortunately, we can give a lower bound on the support of a (k + j)-itemset candidate based on the supports of k-itemsets (j ≥ 0) [6]. Let X = X′ ∪ Y ∪ Z. The following inequality holds:

supp(X) ≥ supp(X′ ∪ Y) + supp(X′ ∪ Z) − supp(X′),

and hence

supp(X) ≥ max_{Y,Z ∈ X} { supp(X \ Y) + supp(X \ Z) − supp(X \ Y \ Z) }.
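For reference, here is a short derivation of the basic bound above, restated by us in the cover/support notation of the workshop introduction (Section 3 of the FIMI report):

```latex
% cover(X' \cup Y) and cover(X' \cup Z) are both subsets of cover(X'),
% and cover(X) is exactly their intersection, so by inclusion-exclusion:
\begin{align*}
supp(X) &= \lvert cover(X'\cup Y) \cap cover(X'\cup Z)\rvert \\
        &= supp(X'\cup Y) + supp(X'\cup Z)
           - \lvert cover(X'\cup Y) \cup cover(X'\cup Z)\rvert \\
        &\ge supp(X'\cup Y) + supp(X'\cup Z) - supp(X').
\end{align*}
```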
If we want to give a lower bound on the support of a (k + j)-itemset based on the supports of k-itemsets, we can use the generalization of the above inequality (X = X′ ∪ x1 ∪ ... ∪ xj):

supp(X) ≥ supp(X′ ∪ x1) + ... + supp(X′ ∪ xj) − (j − 1) · supp(X′).

To avoid false candidate generation we generate only those candidates that are surely frequent. This way, one could say that neither the memory need nor the running time is worse than APRIORI's. Unfortunately, this is not true! Test results proved that this method is not only slower than the original APRIORI-BRAVE, but that APRIORI also outperforms it. The reason is simple: determining the support threshold of a candidate is a slow operation (we have to find the supports of (k+j choose j) smaller itemsets) and has to be executed many times. We lose more time determining support thresholds than we win by generating some candidates sooner. The failure of support-threshold candidate generation is a nice example of a promising idea that turns out to be useless at the implementation level.

3.8 Storing Input

Many papers on frequent itemset mining focus on the number of whole database scans. They argue that reading data from disk is much slower than operating in memory, thus the speed is mainly determined by this factor. However, in most cases the database is not that big and it fits into memory: behind the scenes the operating system keeps it in memory and the algorithm reads the disk only once. For example, a database that stores 10^7 transactions with 6 items per transaction on average needs approximately 120 MB, which is a fraction of today's average memory capacity. Consequently, if we explicitly store the plain input data, the algorithm will not speed up, but will consume more memory (because of the double storage), which may result in swapping and a slowdown. Again, if we descend to the elementary operations of the algorithm, we may come to the opposite conclusion. Storing the input data is profitable if identical transactions are gathered up: if a transaction occurs ℓ times, the support count method is evoked once with counter increment ℓ, instead of calling the procedure ℓ times with counter increment 1.

In Borgelt's algorithm the input is stored in a prefix tree. This is dangerous, because the data file can be too large. We chose to store only reduced transactions. A reduced transaction keeps only the frequent items of the transaction. Reduced transactions carry all the information needed to discover frequent itemsets of larger sizes, but they are expected to need less memory (obviously, this depends on min_supp). Reduced transactions are stored in a tree for fast insertion (if reduced transactions are recoded with frequency codes, we almost get an FP-tree [13]). Optionally, when the support of the candidates of size k is determined, we can delete those reduced transactions that do not contain any candidate of size k.
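A minimal sketch of such a reduced-transaction store, mirroring the map<vector<...>, unsigned long> mentioned in the implementation section below (the function and type names are our own assumptions):

```cpp
#include <map>
#include <vector>

using ReducedBasket = std::vector<int>;               // frequent items only, kept sorted
using BasketStore   = std::map<ReducedBasket, unsigned long>;

// Drop infrequent items from a sorted transaction and count the result once more;
// isFrequentItem is assumed to be indexed by item identifier.
void storeTransaction(BasketStore& store, const std::vector<int>& transaction,
                      const std::vector<bool>& isFrequentItem) {
    ReducedBasket reduced;
    for (int item : transaction)
        if (isFrequentItem[item]) reduced.push_back(item);
    if (!reduced.empty()) ++store[reduced];            // identical baskets share one entry
}

// During support counting, each stored basket is then processed once, with a
// counter increment equal to its multiplicity (the mapped value).
```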
The class Trie can be programmed in many ways. We have chosen to implement it with vectors and arrays. This representation is simple, fast and minimal with respect to memory need. Each node is described by the same element of the vectors (or the same row of the arrays); the root belongs to the 0th element of each vector. Figure 5 shows the way the trie is represented by vectors and arrays.

[Figure 5. Implementation of a trie: an example trie together with the edge number, item array, state array, parent, maxpath and counter vectors that represent it.]

The vector edge_number stores the number of edges of each node. The row itemarray[i] stores the labels of the edges of node i, and statearray[i] stores the end nodes of those edges. The vectors parent[i] and maxpath[i] store the parent of node i and the length of the longest path starting from it, respectively. The number of occurrences of the itemset represented by each node can be found in the vector counter. For the vectors we use the vector class offered by the STL, but the arrays are stored in the traditional C way. Obviously they are not fixed-size arrays (which caused the ugly calloc and realloc commands in the code): each row is as long as the number of edges of the node, and new rows are inserted as the trie grows (during candidate generation). A more elegant way would be to implement the arrays as a vector of vectors. The code would be shorter and easier to understand, because the algorithms of the STL could also be used. However, the algorithm would be slower, because determining a value of the array would take more time. Tests showed that sacrificing a bit of readability leads to a 10-15% speed-up. In our implementation we do not adopt on-line 2-itemset candidate generation (see Section 3.4) but use a vector and an array (temp_counter_array) for determining the support of 1- and 2-itemset candidates efficiently.

The vector and array description of a trie makes it possible to give a fast implementation of the basic functions (candidate generation, support count, etc.). For example, deleting infrequent nodes and pulling the vectors together is achieved by a single scan of the vectors. For more details readers are referred to the html-based documentation.
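The vector/array layout of the class Trie can be sketched roughly as follows. This is a simplified illustration, assuming std::vector rows where the actual implementation uses raw C arrays for speed; the member names follow Figure 5.

```cpp
#include <cstddef>
#include <vector>

struct Trie {
    // One entry per node; the root is node 0. The number of edges of a node
    // (the edge number vector of Figure 5) is item[node].size() here.
    std::vector<std::vector<unsigned>> item;      // labels of the edges leaving each node
    std::vector<std::vector<unsigned>> state;     // end nodes of those edges
    std::vector<unsigned>              parent;    // parent of each node
    std::vector<unsigned>              maxpath;   // length of the longest path below each node
    std::vector<unsigned long>         counter;   // support counter of the represented itemset

    unsigned add_node(unsigned parent_node) {     // used during candidate generation
        unsigned id = static_cast<unsigned>(parent.size());
        item.emplace_back();
        state.emplace_back();
        parent.push_back(parent_node);
        maxpath.push_back(0);
        counter.push_back(0);
        return id;
    }

    // Linear search among the edges of a node; binary search or perfect hashing
    // (Section 3.5) are the alternatives for nodes with many edges.
    int child(unsigned node, unsigned item_code) const {
        for (std::size_t j = 0; j < item[node].size(); ++j)
            if (item[node][j] == item_code) return static_cast<int>(state[node][j]);
        return -1;
    }
};
```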
5. Experimental results

Here we present the experimental results of our implementation of APRIORI and APRIORI-BRAVE compared to the two well-known APRIORI implementations by Christian Borgelt (version 4.08, http://fuzzy.cs.uni-magdeburg.de/~borgelt/software.html#assoc) and Bart Goethals (release date: 01/06/2003, http://www.cs.helsinki.fi/u/goethals/software/index.html). Three databases were used: the well-known T40I10D100K and T10I4D100K, and a coded log of the click-stream data of a Hungarian on-line news portal (denoted by kosarak). This database contains 990002 transactions with an average size of 8.1. Tests were run on a PC with 2.8 GHz dual Xeon processors and 3 Gbyte of RAM. The operating system was Debian Linux; running times were obtained with the /usr/bin/time command.

The following three tables present the test results of the three different implementations of APRIORI and of APRIORI-BRAVE on the three databases. Each test was carried out three times; the tables contain the averages of the results. The two well-known implementations are denoted by the last names of their authors.

Table 1. Running time (sec.) on the T40I10D100K database

min supp   Bodon impl.   Borgelt impl.   Goethals impl.   APRIORI-BRAVE
0.05          8.57          10.53           25.1             8.3
0.030        10.73          11.5            41.07           10.6
0.020        15.3           13.9            53.63           14.0
0.010        95.17         155.83          207.67          100.27
0.009       254.33         408.33          458.7           178.93
0.0085      309.6          573.4           521.0           188.0

Table 2. Running time (sec.) on the T10I4D100K database

min supp   Bodon impl.   Borgelt impl.   Goethals impl.   APRIORI-BRAVE
0.0500        4.23           5.2            11.73            2.87
0.0100       10.27          14.03           30.5             6.6
0.0050       17.87          16.13           40.77           12.7
0.0030       34.07          18.8            53.43           23.97
0.0020       70.3           21.9            69.73           46.5
0.0015       85.23          25.1            86.37           87.63

Table 3. Running time (sec.) on the kosarak database

min supp   Bodon impl.   Borgelt impl.   Goethals impl.   APRIORI-BRAVE
0.050        14.43          32.8            28.27           14.1
0.010        17.9           41.87           44.3            17.5
0.005        24.0           52.4            58.4            21.67
0.003        35.9           67.03           76.77           28.5
0.002        81.83         199.07          182.53           72.13
0.001       612.67        1488.0          1101.0           563.33

Our APRIORI implementation beats Goethals' implementation almost all the time, and beats Borgelt's implementation many times. It performs best at low support thresholds. We can also see that on these three databases APRIORI-BRAVE outperforms APRIORI at most support thresholds.

Tables 4-6 show the results of Bodon's APRIORI implementation with the hashing technique. The notation leaf max size stands for the threshold above which a node applies the perfect hashing technique.

Table 4. Running time (sec.) on the T40I10D100K database for different leaf max size values

min supp      1       2       5       7      10      25      60     100
0.0500       8.1     8.1     8.1     8.1     8.1     8.1     8.1     8.4
0.0300       9.6     9.8     9.7     9.8     9.8     9.8     9.8     9.9
0.0200      13.8    14.4    13.6    13.9    13.6    13.9    13.9    14.1
0.0100     114.0    96.3    83.8    82.5    78.9    79.2    80.4    83.0
0.0090     469.8   339.6   271.8   258.1   253.0   253.0   251.0   253.8
0.0085     539.0   373.0   340.0   310.0   306.0   306.0   306.0   309.0

Table 5. Running time (sec.) on the T10I4D100K database for different leaf max size values

min supp      1       2       5       7      10      25      60     100
0.0500       2.8     2.8     2.8     2.8     2.8     2.8     2.8     2.8
0.0100       7.8     7.3     6.9     6.9     6.9     6.9     7.0     7.1
0.0050      24.2    14.9    13.4    13.1    13.0    12.8    13.0    13.2
0.0030      55.3    34.6    30.3    25.9    25.2    25.2    25.3    25.8
0.0020     137.3   100.2    76.2    77.0    75.2    78.0    69.5    64.5
0.0015     235.0   176.0   125.0   130.0   125.0   132.0   115.0   103.0

Table 6. Running time (sec.) on the kosarak database for different leaf max size values

min supp      1       2       5       7      10      25      60     100
0.0500      14.6    14.3    14.3    14.2    14.2    14.2    14.2    14.2
0.0100      17.5    17.5    17.5    17.6    17.6    17.6    18.0    18.1
0.0050      21.0    21.0    22.0    21.0    22.0    22.0    22.8    22.8
0.0030      26.3    26.1    26.3    26.5    27.2    27.4    28.5    29.6
0.0020      98.8    77.5    62.3    60.0    59.7    61.0    61.0    63.4
0.0010    1630.0  1023.0   640.0   597.0   574.0   577.0   572.0   573.0

Strangely, the hashing technique did not always result in faster execution. The reason for this might be that small vectors are kept in the cache, where linear search is very fast. If we enlarge the vector by turning it into a hash table, the vector may be moved out to main memory, where reading is a slower operation. The hashing technique is thus another example of an accelerating technique that does not necessarily result in an improvement.
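The two edge-lookup strategies compared in Tables 4-6 can be contrasted with the following illustrative fragments (names invented here, not the actual implementation): a linear scan of a node's edge list versus direct indexing by the item's frequency code ("perfect hashing").

```cpp
#include <cstddef>
#include <vector>

constexpr int kNoChild = -1;

// Linear search: cheap for nodes with few edges, since the short edge list
// stays in the cache.
int find_child_linear(const std::vector<unsigned>& edge_items,
                      const std::vector<int>& edge_targets, unsigned code) {
    for (std::size_t j = 0; j < edge_items.size(); ++j)
        if (edge_items[j] == code) return edge_targets[j];
    return kNoChild;
}

// Perfect hashing: one slot per possible item code, O(1) lookup, but the slot
// array grows with the number of frequent items and may fall out of the cache.
int find_child_hashed(const std::vector<int>& slot_per_code, unsigned code) {
    return code < slot_per_code.size() ? slot_per_code[code] : kNoChild;
}
```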
6. Further Improvement and Research Possibilities

Our APRIORI implementation can be further improved if a trie is used to store the reduced baskets, and if a reduced basket is removed as soon as it does not contain any candidate. We mentioned that there are two basic ways of finding the candidates contained in a given transaction. Further theoretical and experimental analysis may lead to the conclusion that a mixture of the two approaches gives the fastest execution. Theoretically, the hashing technique accelerates the support count. However, our experiments did not support this claim. Further investigation is needed to clarify the possibilities of this technique.

7. Conclusion

Determining frequent objects (itemsets, episodes, sequential patterns) is one of the most important fields of data mining. It is well known that the way candidates are defined has a great effect on running time and memory need, and this is the reason for the large number of algorithms. It is also clear that the applied data structure influences the efficiency parameters as well. However, the same algorithm using a certain data structure still admits a wide variety of implementations. In this paper we showed that different implementations result in different running times, and that the differences can exceed the differences between algorithms. We presented an implementation that solved the frequent itemset mining problem in most cases faster than other well-known implementations.

References

[1] R. Agrawal, T. Imielinski, and A. Swami. Mining association rules between sets of items in large databases. In Proc. of the ACM SIGMOD Conference on Management of Data, pages 207–216, 1993.
[2] R. Agrawal, H. Mannila, R. Srikant, H. Toivonen, and A. I. Verkamo. Fast discovery of association rules. In Advances in Knowledge Discovery and Data Mining, pages 307–328, 1996.
[3] R. Agrawal and R. Srikant. Fast algorithms for mining association rules. In Proc. of the International Conference on Very Large Databases, pages 487–499, 1994.
[4] R. Agrawal and R. Srikant. Mining sequential patterns. In P. S. Yu and A. L. P. Chen, editors, Proc. 11th Int. Conf. on Data Engineering (ICDE), pages 3–14. IEEE Press, 1995.
[5] N. F. Ayan, A. U. Tansel, and M. E. Arkun. An efficient algorithm to update large itemsets with early pruning. In Knowledge Discovery and Data Mining, pages 287–291, 1999.
[6] R. J. Bayardo, Jr. Efficiently mining long patterns from databases. In Proceedings of the 1998 ACM SIGMOD International Conference on Management of Data, pages 85–93. ACM Press, 1998.
[7] F. Bodon and L. Rónyai. Trie: an alternative data structure for data mining algorithms. To appear in Computers and Mathematics with Applications, 2003.
[8] S. Brin, R. Motwani, J. D. Ullman, and S. Tsur. Dynamic itemset counting and implication rules for market basket data. SIGMOD Record, 26(2):255, 1997.
[9] D. W.-L. Cheung, J. Han, V. Ng, and C. Y. Wong. Maintenance of discovered association rules in large databases: An incremental updating technique. In ICDE, pages 106–114, 1996.
[10] D. W.-L. Cheung, S. D. Lee, and B. Kao. A general incremental technique for maintaining discovered association rules. In Database Systems for Advanced Applications, pages 185–194, 1997.
[11] Y. Fu. Discovery of multiple-level rules from large databases, 1996.
[12] B. Goethals. Survey on frequent pattern mining. Technical report, Helsinki Institute for Information Technology, 2003.
[13] J. Han, J. Pei, and Y. Yin. Mining frequent patterns without candidate generation. In W. Chen, J. Naughton, and P. A. Bernstein, editors, Proc. 2000 ACM SIGMOD Intl. Conference on Management of Data, pages 1–12. ACM Press, 2000.
[14] Y. Huhtala, J. Kärkkäinen, P. Porkka, and H. Toivonen. TANE: An efficient algorithm for discovering functional and approximate dependencies. The Computer Journal, 42(2):100–111, 1999.
[15] Y. Huhtala, J. Kärkkäinen, P. Porkka, and H. Toivonen. Efficient discovery of functional and approximate dependencies using partitions. In ICDE, pages 392–401, 1998.
[16] J. Han and Y. Fu. Discovery of multiple-level association rules from large databases. In Proc. of the 21st International Conference on Very Large Databases (VLDB), Zurich, Switzerland, 1995.
[17] D. E. Knuth. The Art of Computer Programming, Vol. 3. Addison-Wesley, 1968.
[18] H. Mannila, H. Toivonen, and A. I. Verkamo. Discovering frequent episodes in sequences. In Proceedings of the First International Conference on Knowledge Discovery and Data Mining, pages 210–215. AAAI Press, 1995.
[19] B. Ozden, S. Ramaswamy, and A. Silberschatz. Cyclic association rules. In ICDE, pages 412–421, 1998.
[20] J. S. Park, M.-S. Chen, and P. S. Yu. An effective hash based algorithm for mining association rules. In M. J. Carey and D. A. Schneider, editors, Proceedings of the 1995 ACM SIGMOD International Conference on Management of Data, pages 175–186, San Jose, California, 1995.
[21] A. Sarasere, E. Omiecinsky, and S. Navathe. An efficient algorithm for mining association rules in large databases. In Proc. 21st International Conference on Very Large Databases (VLDB), Zurich, Switzerland, 1995. Also Gatech Technical Report No. GIT-CC-95-04.
[22] R. Srikant and R. Agrawal. Mining generalized association rules. In Proc. of the 21st International Conference on Very Large Databases (VLDB), Zurich, Switzerland, 1995.
[23] R. Srikant and R. Agrawal. Mining sequential patterns: Generalizations and performance improvements. Technical report, IBM Almaden Research Center, San Jose, California, 1995.
[24] S. Thomas, S. Bodagala, K. Alsabti, and S. Ranka. An efficient algorithm for the incremental updation of association rules in large databases. Page 263.
[25] S. Thomas, S. Bodagala, K. Alsabti, and S. Ranka. An efficient algorithm for the incremental updation of association rules in large databases. In Knowledge Discovery and Data Mining, pages 263–266, 1997.
[26] H. Toivonen. Sampling large databases for association rules. In The VLDB Journal, pages 134–145, 1996.

8. Appendix

Let us analyze formally how frequency codes accelerate the search. Suppose that the number of frequent items is m, that the j-th most frequent item has to be searched for n_j times (n_1 ≥ n_2 ≥ ... ≥ n_m), and let n = Σ_{i=1}^{m} n_i. If an item is in position j, then the cost of finding it is c · j, where c is a constant; for the sake of simplicity c is omitted. The total cost of the search based on frequency codes is therefore Σ_{j=1}^{m} j · n_j.

How much is the cost if the list is not ordered by frequencies? We cannot determine this precisely, because we do not know which item is in the first position, which is in the second, and so on. We can, however, calculate the expected total cost if we assume that each order occurs with the same probability. The probability of each permutation is then 1/m!, and thus

E[total cost] = (1/m!) Σ_π (cost of π) = (1/m!) Σ_π Σ_{j=1}^{m} π(j) · n_{π(j)}.

Here π runs through the permutations of 1, 2, ..., m, and the j-th item of π is denoted by π(j). Since each item gets to each position (m − 1)! times, we obtain

E[total cost] = (1/m!) (m − 1)! Σ_{j=1}^{m} n_j Σ_{k=1}^{m} k = ((m + 1)/2) Σ_{j=1}^{m} n_j = ((m + 1)/2) n.

It is easy to prove that E[total cost] is greater than or equal to the total cost of the search based on frequency codes (because of the condition n_1 ≥ n_2 ≥ ... ≥ n_m). We want to know more, namely how small the ratio

( Σ_{j=1}^{m} j · n_j ) / ( n (m + 1)/2 )                    (1)

can be. In the worst case (n_1 = n_2 = ... = n_m) it is 1; in the best case (n_1 = n − m + 1, n_2 = n_3 = ... = n_m = 1) it converges to 0 as n → ∞. We cannot say anything more unless the probability distribution of the frequent items is known.
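Ratio (1) is easy to evaluate numerically. The following small program is only an illustration with an arbitrary example frequency vector (not data from the paper); it compares the frequency-code cost with the expected cost (m+1)n/2 of an unordered list.

```cpp
#include <cstddef>
#include <cstdio>
#include <vector>

int main() {
    // Example occurrence counts n_1 >= n_2 >= ... >= n_m (invented values).
    std::vector<unsigned long> n = {100, 60, 30, 15, 8, 4, 2, 1};
    unsigned long total = 0, cost = 0;
    for (std::size_t j = 0; j < n.size(); ++j) {
        total += n[j];
        cost  += (j + 1) * n[j];          // the item in position j+1 costs j+1 steps
    }
    double expected = (n.size() + 1) * total / 2.0;  // E[total cost] for a random order
    std::printf("ratio (1) = %.3f\n", cost / expected);
    return 0;
}
```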
In many applications there are some very frequent items, and the probabilities of the rare items differ only slightly. This is why we opted for an exponential decrease. In our model the probability of occurrence of the j-th most frequent item is p_j = a · e^{bj}, where a > 0 and b < 0 are two parameters such that a · e^b ≤ 1 holds. Parameter b can be regarded as the gradient of the distribution, and parameter a determines the starting point. (Note that Σ_j p_j does not have to equal 1, since more than one item can occur in a basket.) We suppose that the ratio of the occurrences is the same as the ratio of the probabilities, hence n_1 : n_2 : ... : n_m = p_1 : p_2 : ... : p_m. From this and the condition n = Σ_{j=1}^{m} n_j we infer that n_j = n · p_j / Σ_{k=1}^{m} p_k. Using the formula for geometric series and the notation x = e^b, we obtain

n_j = n · x^j (x − 1) / (x^{m+1} − 1).

The total cost can then be determined:

Σ_{j=1}^{m} j · n_j = ( n (x − 1) / (x^{m+1} − 1) ) · Σ_{j=1}^{m} j · x^j.

Let us calculate Σ_{j=1}^{m} j · x^j:

Σ_{j=1}^{m} j · x^j = ( m x^{m+2} − (m + 1) x^{m+1} + x ) / (x − 1)^2.

The ratio of the total costs can thus be expressed in a closed formula:

cost ratio = 2x ( m x^{m+1} − (m + 1) x^m + 1 ) / ( (x^{m+1} − 1)(x − 1)(m + 1) ),        (2)

where x = e^b. We can see that the speed-up is independent of a. Figure 6 shows three different distributions of the items: the first is gently sloping, the second has a larger gradient, and the last one is quite steep.

[Figure 6. Three different probability distributions of the items: 0.07·e^{-0.06x}, 0.4·e^{-0.1x} and 0.7·e^{-0.35x}.]

If we substitute the parameters of these probability distributions into formula (2) (with m = 10), the results are 0.39, 0.73 and 0.8, which meets our expectations: by adopting frequency codes the search time drops sharply if the probabilities differ greatly from each other. We have to remember, however, that frequency codes do not have any effect on nodes where binary search is applied.

Efficient Implementations of Apriori and Eclat

Christian Borgelt
Department of Knowledge Processing and Language Engineering
School of Computer Science, Otto-von-Guericke-University of Magdeburg
Universitätsplatz 2, 39106 Magdeburg, Germany
Email: borgelt@iws.cs.uni-magdeburg.de

Abstract

Apriori and Eclat are the best-known basic algorithms for mining frequent item sets in a set of transactions. In this paper I describe implementations of these two algorithms that use several optimizations to achieve maximum performance, w.r.t. both execution time and memory usage. The Apriori implementation is based on a prefix tree representation of the needed counters and uses a doubly recursive scheme to count the transactions. The Eclat implementation uses (sparse) bit matrices to represent transaction lists and to filter closed and maximal item sets.

1. Introduction

Finding frequent item sets in a set of transactions is a popular method for so-called market basket analysis, which aims at finding regularities in the shopping behavior of customers of supermarkets, mail-order companies, on-line shops etc. In particular, the aim is to identify sets of products that are frequently bought together.
The main problem of finding frequent item sets, i.e., item sets that are contained in a user-specified minimum number of transactions, is that there are so many possible sets, which renders naïve approaches infeasible due to their unacceptable execution time. Among the more sophisticated approaches two algorithms known under the names of Apriori [1, 2] and Eclat [8] are most popular. Both rely on a top-down search in the subset lattice of the items. An example of such a subset lattice for five items is shown in Figure 1 (empty set omitted). The edges in this diagram indicate subset relations between the different item sets. To structure the search, both algorithms organize the subset lattice as a prefix tree, which for five items is shown in Figure 2. In this tree those item sets are combined in a node which have the same prefix w.r.t. some arbitrary, but fixed order of the items (in the five items example, this order is simply a, b, c, d, e). With this structure, the item sets contained in a node of the tree can be constructed easily in the following way: take all the items with which the edges leading to the node are labeled (this is the common prefix) and add an item that succeeds, in the fixed order of the items, the last edge label on the path. Note that in this way we need only one item to distinguish between the item sets represented in one node, which is relevant for the implementation of both algorithms.

The main differences between Apriori and Eclat are how they traverse this prefix tree and how they determine the support of an item set, i.e., the number of transactions the item set is contained in. Apriori traverses the prefix tree in breadth first order, that is, it first checks item sets of size 1, then item sets of size 2, and so on. Apriori determines the support of item sets either by checking for each candidate item set which transactions it is contained in, or by traversing for a transaction all subsets of the currently processed size and incrementing the corresponding item set counters. The latter approach is usually preferable. Eclat, on the other hand, traverses the prefix tree in depth first order. That is, it extends an item set prefix until it reaches the boundary between frequent and infrequent item sets and then backtracks to work on the next prefix (in lexicographic order w.r.t. the fixed order of the items). Eclat determines the support of an item set by constructing the list of identifiers of transactions that contain the item set. It does so by intersecting two lists of transaction identifiers of two item sets that differ only by one item and together form the item set currently processed.

2. Apriori Implementation

My Apriori implementation uses a data structure that directly represents a prefix tree as it is shown in Figure 2. This tree is grown top-down level by level, pruning those branches that cannot contain a frequent item set [4].

2.1. Node Organization

There are different data structures that may be used for the nodes of the prefix tree. In the first place, we may use simple vectors of integer numbers to represent the counters for the item sets. The items (note that we only need one item to distinguish between the counters of a node, see above) are not explicitly stored in this case, but are implicit in the vector index. Alternatively, we may use vectors, each element of which consists of an item identifier (an integer number) and a counter, with the vector elements being sorted by the item identifier. The first structure has the advantage that we do not need any memory to store the item identifiers and that we can very quickly find the counter for a given item (simply use the item identifier as an index), but it has the disadvantage that we may have to add "unnecessary" counters (i.e., counters for item sets of which we know from the information gathered in previous steps that they must be infrequent), because the vector may not have "gaps". This problem can only partially be mitigated by enhancing the vector with an offset to the first element and a size, so that unnecessary counters at the margins of the vector can be discarded. The second structure has the advantage that we only have the counters we actually need, but it has the disadvantage that we need extra memory to store the item identifiers and that we have to carry out a binary search in order to find the counter corresponding to a given item. A third alternative would be to use a hash table per node. However, although this reduces the time needed to access a counter, it increases the amount of memory needed, because for optimal performance a hash table must not be too full. In addition, it does not allow us to exploit easily the order of the items in the counting process (see below). Therefore I do not consider this alternative here. Obviously, if we want to optimize speed, we should choose simple counter vectors, despite the gap problem. If we want to optimize memory usage, we can decide dynamically which data structure is more efficient in terms of memory, accepting the higher counter access time due to the binary search if necessary.

It should also be noted that we need a set of child pointers per node, at least for all levels above the currently added one (in order to save memory, one should not create child pointers before one is sure that one needs them). For organizing these pointers there are basically the same options as for organizing the counters.
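The two counter layouts just described might look as follows in C++. This is a sketch under my own naming, not Borgelt's code; it only illustrates the trade-off between implicit item codes with possible gaps and explicit (item, counter) pairs found by binary search.

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// Variant 1: counters only; the item is implicit in the vector index. The
// offset/size trick discards unnecessary counters at the margins.
struct CounterVectorNode {
    unsigned offset = 0;                  // item code corresponding to counters[0]
    std::vector<unsigned long> counters;  // may contain "gap" counters that are never needed
};

// Variant 2: only the counters actually needed, located by binary search.
struct ItemCounterNode {
    std::vector<std::pair<unsigned, unsigned long>> entries;  // sorted by item code

    unsigned long* find(unsigned item) {
        auto it = std::lower_bound(entries.begin(), entries.end(),
                                   std::make_pair(item, 0ul));
        return (it != entries.end() && it->first == item) ? &it->second : nullptr;
    }
};
```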
However, if the counters have item identifiers attached, there is an additional possibility: we may draw on the organization of the counters, using the same order of the items and leaving child pointers nil if they are not needed. This can save memory, even though we may have unnecessary nil pointers, because we do not have to store item identifiers a second time.

2.2. Item Coding

It is clear that the way in which the items are coded (i.e., are assigned integer numbers as identifiers) can have a significant impact on the gap problem for pure counter vectors mentioned above. Depending on the coding we may need large vectors with a lot of gaps, or we may need only short vectors with few gaps. A good heuristic approach to minimize the number and the size of gaps seems to be this: it is clear that frequent item sets contain items that are frequent individually. Therefore it is plausible that we have only few gaps if we sort the items w.r.t. their frequency, so that the individually frequent items receive similar identifiers if they have similar frequency (and, of course, infrequent items are discarded entirely). In this case it can be hoped that the offset/size representation of a counter vector can eliminate the greater part of the unnecessary counters, because these can be expected to cluster at the vector margins.
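A possible way to recode the items by their frequency is sketched below. The function name and interface are assumptions of this example; whether the codes are assigned in ascending or descending frequency order is a tuning choice (the experiments in Section 4 use ascending order).

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// freq[i] is the support of original item i; items below min_supp receive no code.
std::vector<int> recode_items(const std::vector<unsigned long>& freq,
                              unsigned long min_supp) {
    std::vector<std::size_t> order;
    for (std::size_t i = 0; i < freq.size(); ++i)
        if (freq[i] >= min_supp) order.push_back(i);
    std::sort(order.begin(), order.end(),
              [&](std::size_t a, std::size_t b) { return freq[a] < freq[b]; });
    std::vector<int> code(freq.size(), -1);    // -1: infrequent item, discarded entirely
    for (std::size_t c = 0; c < order.size(); ++c)
        code[order[c]] = static_cast<int>(c);  // new identifier = rank by frequency
    return code;
}
```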
Extending this scheme, we may also consider coding the items w.r.t. the number of frequent pairs (or even triples etc.) they are part of, thus using additional information from the second (or third etc.) level to improve the coding. This idea can most easily be implemented for item pairs by sorting the items w.r.t. the sum of the sizes of the transactions they are contained in (with infrequent items discarded from the transactions, so that this sum gives a value that is similar to the number of frequent pairs, which, as these are heuristics anyway, is sufficient).

[Figure 1. A subset lattice for five items (empty set omitted). Figure 2. A prefix tree for five items (empty set omitted).]

2.3. Recursive Counting

The prefix tree is not only an efficient way to store the counters, it also makes processing the transactions very simple, especially if we sort the items in a transaction ascendingly w.r.t. their identifiers. Then processing a transaction is a simple doubly recursive procedure: to process a transaction for a node of the tree, (1) go to the child corresponding to the first item in the transaction and process the remainder of the transaction recursively for that child, and (2) discard the first item of the transaction and process the rest recursively for the node itself (of course, the second recursion is more easily implemented as a simple loop through the transaction). In a node on the currently added level, however, we increment a counter instead of proceeding to a child node. In this way, on the current level all counters for item sets that are part of a transaction are properly incremented.
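The doubly recursive counting scheme can be sketched as follows. The node layout is invented for this illustration (it is not the actual prefix-tree structure); the sketch also includes the skip and abort checks discussed in the next paragraph.

```cpp
#include <cstddef>
#include <vector>

struct Node {
    unsigned offset = 0;                  // item code corresponding to index 0
    std::vector<Node*> children;          // child nodes (nullptr where not needed)
    std::vector<unsigned long> counters;  // counters of the candidates one level below
};

// Count transaction t (sorted item codes) for the subtree rooted at 'node'.
// 'depth' is the number of edges still to descend before the new level is reached.
void count(Node* node, const unsigned* t, std::size_t len, std::size_t depth) {
    if (node == nullptr || len < depth) return;   // too short to reach the new level
    for (std::size_t i = 0; i < len; ++i) {       // second recursion, written as a loop
        if (t[i] < node->offset) continue;        // skip items before the first slot
        std::size_t j = t[i] - node->offset;
        if (depth == 1) {                         // children lie on the level being added
            if (j >= node->counters.size()) break;    // beyond the last item of this node
            ++node->counters[j];
        } else {
            if (j >= node->children.size()) break;
            count(node->children[j], t + i + 1, len - i - 1, depth - 1);  // first recursion
        }
    }
}
```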
By sorting the items in a transaction, we can also apply the following optimizations (this is a bit more difficult, or needs additional memory, if hash tables are used to organize the counters, which also explains why I am not considering hash tables): (1) we can directly skip all items before the first item for which there is a counter in the node, and (2) we can abort the recursion if the first item of (the remainder of) a transaction is beyond the last one represented in the node. Since we grow the tree level by level, we can even go a step further: we can terminate the recursion once (the remainder of) a transaction is too short to reach the level currently added to the tree.

2.4. Transaction Representation

The simplest way of processing the transactions is to handle them individually and to apply to each of them the recursive counting procedure described in the preceding section. However, the recursion is a very expensive procedure, and therefore it is worthwhile to consider how it can be improved. One approach is based on the fact that often there are several similar transactions, which lead to a similar program flow when they are processed. By organizing the transactions into a prefix tree (an idea that has also been used in [6] in a different approach), transactions with the same prefix can be processed together. In this way the procedure for the prefix is carried out only once, and thus considerable performance gains can result. Of course, the gains have to outweigh the additional costs of constructing such a transaction tree to lead to an overall gain.

2.5. Transaction Filtering

It is clear that in order to determine the counter values on the currently added level of the prefix tree, we only need the items that are contained in those item sets that are frequent on the preceding level. That is, to determine the support of item sets of size k, we only need those items that are contained in the frequent item sets of size k − 1. All other items can be removed from the transactions. This has the advantage that the transactions become smaller and thus can be counted more quickly, because the size of a transaction is a decisive factor for the time needed by the recursive counting scheme described above. However, this can only be put to work easily if the transactions are processed individually. If they are organized as a prefix tree, a possibly costly reconstruction of the tree is necessary. In this case one has to decide whether to continue with the old tree, accepting the higher counting costs resulting from unnecessary items, or whether rebuilding the tree is preferable, because the costs for the rebuild are outweighed by the savings resulting from the smaller and simpler tree. A good heuristic seems to be to rebuild the tree if

(n_new / n_curr) · (t_tree / t_count) < 0.1,

where n_curr is the number of items in the current transaction tree, n_new is the number of items that will be contained in the new tree, t_tree is the time that was needed for building the current tree, and t_count is the time that was needed for counting the transactions in the preceding step. The constant 0.1 was determined experimentally and on average seems to lead to good results (see also Section 4).
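The rebuild criterion can be written directly as a small helper; the names are mine, and the constant 0.1 is the experimentally determined value quoted above.

```cpp
#include <cstddef>

// Rebuild the transaction tree only if the expected savings outweigh its cost.
bool should_rebuild(std::size_t n_new, std::size_t n_curr,
                    double t_tree, double t_count) {
    return (static_cast<double>(n_new) / static_cast<double>(n_curr))
             * (t_tree / t_count) < 0.1;
}
```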
2.6. Filtering Closed and Maximal Item Sets

A frequent item set is called closed if there is no superset that has the same support (i.e., that is contained in the same number of transactions). Closed item sets capture all information about the frequent item sets, because from them the support of any frequent item set can be determined. A frequent item set is called maximal if there is no superset that is frequent. Maximal item sets define the boundary between frequent and infrequent sets in the subset lattice. Any frequent item set is often also called a free item set to distinguish it from closed and maximal ones. In order to find closed and maximal item sets with Apriori one may use a simple filtering approach on the prefix tree: the final tree is traversed top-down level by level (breadth first order). For each frequent item set, all subsets with one item less are traversed and marked as not to be reported if they have the same support (closed item sets) or unconditionally (maximal item sets).

3. Eclat Implementation

My Eclat implementation represents the set of transactions as a (sparse) bit matrix and intersects rows to determine the support of item sets. The search follows a depth first traversal of a prefix tree as it is shown in Figure 2.

3.1. Bit Matrices

A convenient way to represent the transactions for the Eclat algorithm is a bit matrix, in which each row corresponds to an item and each column to a transaction (or the other way round). A bit is set in this matrix if the item corresponding to the row is contained in the transaction corresponding to the column; otherwise it is cleared. There are basically two ways in which such a bit matrix can be represented: either as a true bit matrix, with one memory bit for each item and transaction, or using for each row a list of those columns in which bits are set. (Obviously the latter representation is equivalent to using a list of transaction identifiers for each item.) Which representation is preferable depends on the density of the dataset. On 32 bit machines the true bit matrix representation is more memory efficient if the ratio of set bits to cleared bits is greater than 1:31. However, it is not advisable to rely on this ratio in order to decide between a true and a sparse bit matrix representation, because in the search process, due to the intersections carried out, the number of set bits will decrease. Therefore a sparse representation should be used even if the ratio of set bits to cleared bits is greater than 1:31. In my current implementation a sparse representation is preferred if the ratio is greater than 1:7, but this behavior can be changed by the user. A more sophisticated option would be to switch to the sparse representation of a bit matrix during the search once the ratio of set bits to cleared bits exceeds 1:31. However, such an automatic switch, which involves a rebuild of the bit matrix, is not implemented in the current version.

3.2. Search Tree Traversal

As already mentioned, Eclat searches a prefix tree like the one shown in Figure 2 in depth first order. The transition of a node to its first child consists in constructing a new bit matrix by intersecting the first row with all following rows. For the second child the second row is intersected with all following rows, and so on. The item corresponding to the row that is intersected with the following rows is thus added to form the common prefix of the item sets processed in the corresponding child node. Of course, rows corresponding to infrequent item sets should be discarded from the constructed matrix, which can be done most conveniently if we store with each row the corresponding item identifier rather than relying on an implicit coding of this item identifier in the row index. Intersecting two rows can be done by a simple logical AND on a fixed-length integer vector if we work with a true bit matrix. During this intersection the number of set bits in the intersection is determined by looking up the number of set bits for given word values (i.e., 2 bytes, 16 bits) in a precomputed table. For a sparse representation the column indices of the set bits should be sorted ascendingly for efficient processing. Then the intersection procedure is similar to the merge step of merge sort, and in this case counting the set bits is straightforward.
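For the sparse representation, the row intersection that Eclat performs on each transition can be sketched as a merge of two sorted transaction-identifier lists; this is an illustration, not the actual implementation.

```cpp
#include <cstddef>
#include <vector>

// Intersect two sorted transaction-identifier lists; the length of the result
// is the support of the item set formed by combining the two rows' item sets.
std::vector<unsigned> intersect(const std::vector<unsigned>& a,
                                const std::vector<unsigned>& b) {
    std::vector<unsigned> out;
    std::size_t i = 0, j = 0;
    while (i < a.size() && j < b.size()) {
        if      (a[i] < b[j]) ++i;
        else if (b[j] < a[i]) ++j;
        else { out.push_back(a[i]); ++i; ++j; }   // transaction contains both prefixes
    }
    return out;
}
```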
3.3. Item Coding

As for Apriori, the way in which the items are coded has an impact on the execution time of the Eclat algorithm. The reason is that the item coding not only affects the number and the size of gaps in the counter vectors for Apriori, but also the structure of the pruned prefix tree and thus the structure of Eclat's search tree. Sorting the items usually leads to a better structure. For the sorting there are basically the same options as for Apriori (see Section 2.2).

3.4. Filtering Closed and Maximal Item Sets

Determining closed and maximal item sets with Eclat is slightly more difficult than with Apriori, because due to the backtrack Eclat "forgets" everything about a frequent item set once it has been reported. In order to filter for closed and maximal item sets, one needs a structure that records these sets and allows one to determine quickly whether it contains a superset of a newly found set (and whether that superset has the same support if closed item sets are to be found). In my implementation I use the following approach to solve this problem: frequent item sets are reported in a node of the search tree after all of its child nodes have been processed. In this way it is guaranteed that all possible supersets of an item set that is about to be reported have already been processed. Consequently, we can maintain a repository of already found (closed or maximal) item sets and only have to search this repository for a superset of the item set in question. The repository can only grow (we never have to remove an item set from it), because due to the report order a newly found item set cannot be a superset of an item set in the repository. For the repository one may use a bit matrix in the same way as it is used to represent the transactions: each row corresponds to an item, each column to a found (closed or maximal) frequent item set. The superset test consists in intersecting those rows of this matrix that correspond to the items in the frequent item set in question. If the result is empty, there is no superset in the repository; otherwise there is (at least) one. (Of course, the intersection loop is terminated as soon as an intersection gets empty.) To include the information about the support for closed item sets, an additional row of the matrix is constructed, which contains set bits in those columns that correspond to item sets having the same support as the one in question. With this additional row the intersection process is started. It should be noted that the superset test can be avoided if any direct descendant (intersection product) of an item set has the same support (closed item sets) or is frequent (maximal item sets). In my implementation the repository bit matrix uses the same representation as the matrix that represents the transactions: either both are true bit matrices or both are sparse bit matrices.

4. Experimental Results

I ran experiments with both programs on five data sets, which exhibit different characteristics, so that the advantages and disadvantages of the two approaches and the different optimizations can be observed. The data sets I used are: BMS-Webview-1 (a web click stream from a leg-care company that no longer exists, which has been used in the KDD cup 2000 [7, 9]), T10I4D100K (an artificial data set generated with IBM's data generator [10]), census (a data set derived from an extract of the US census bureau data of 1994, which was preprocessed by discretizing numeric attributes), chess (a data set listing chess end game positions for king vs. king and rook), and mushroom (a data set describing poisonous and edible mushrooms by different attributes). The last three data sets are available from the UCI machine learning repository [3]. The discretization of the numeric attributes in the census data set was done with a shell/gawk script that can be found on the WWW page mentioned below. For the experiments I used an AMD Athlon XP 2000+ machine with 756 MB main memory running S.u.S.E. Linux 8.2 and gcc version 3.3.

The results for these data sets are shown in Figures 3 to 7. Each figure consists of five diagrams, a to e, which are organized in the same way in each figure. Diagram a shows the decimal logarithm of the number of free (solid), closed (short dashes), and maximal item sets (long dashes) for different support values. From these diagrams it can already be seen that the data sets have clearly different characteristics; only census and chess appear to be similar.

Diagrams b and c show the decimal logarithm of the execution time in seconds for different parameterizations of Apriori (diagram b) and Eclat (diagram c). To ease the comparison of the two diagrams, the default parameter curve for the other algorithm (the solid curve in its own diagram) is shown in grey in the background. The curves in diagram b represent the following settings:

solid: items sorted ascendingly w.r.t. the sum of the sizes of the transactions they are contained in; prefix tree used to represent the transactions, which is rebuilt every time the heuristic criterion described in Section 2.5 is fulfilled.
short dashes: like the solid curve, prefix tree used to represent the transactions, but never rebuilt.
long dashes: like the solid curve, but transactions not organized as a prefix tree; items that are no longer needed are not removed from the transactions.
dense dots: like the long dashes curve, but items sorted ascendingly w.r.t. their frequency in the transactions.

In diagram b it is not distinguished whether free, closed, or maximal item sets are to be found, because the time for filtering the item sets is negligible compared to the time needed for counting the transactions (only a small difference would be visible in the diagrams, which derives mainly from the fact that less time is needed to write the smaller number of closed or maximal item sets).
In diagram c the solid, short dashes, and long dashes curves show the results for free, closed, and maximal item sets, respectively, with one representation of the bit matrix; the dense dots curve shows the results for free item sets with the other representation (cf. Section 3.1). Whether the solid, short dashes, and long dashes curves refer to a true bit matrix and the dense dots curve to a sparse one, or the other way round, depends on the data set and is indicated in the corresponding section below.

Diagrams d and e show the decimal logarithm of the memory in bytes used for different parameterizations of Apriori (diagram d) and Eclat (diagram e). Again the grey curve refers to the default parameter setting of the other algorithm (the solid curve in its own diagram). The curves in diagram d represent the following settings:

solid: items sorted ascendingly w.r.t. the sum of the sizes of the transactions they are contained in; transactions organized as a prefix tree; memory saving organization of the prefix tree nodes as described in Section 2.1.
short dashes: like solid, but no memory saving organization of the prefix tree nodes (always pure counter vectors).
long dashes: like short dashes, but items sorted descendingly w.r.t. the sum of the sizes of the transactions they are contained in.
dense dots: like long dashes, but items not sorted.

Again it is not distinguished whether free, closed, or maximal item sets are to be found, because this has no influence on the memory usage. The meaning of the line styles in diagram e is the same as in diagram c (see above).

BMS-Webview-1: Characteristic for this data set is the divergence of the number of free, closed, and maximal item sets for lower support values. W.r.t. the execution time of Apriori this data set shows perfectly the gains that can result from the different optimizations.
Sorting the items w.r.t. the sum of transaction sizes (long dashes in diagram b) improves over sorting w.r.t. simple frequency (dense dots), organizing the transactions as a prefix tree (short dashes) improves further, and removing no longer needed items yields another considerable speed-up (solid curve). However, for free and maximal item sets and a support of less than 44 transactions, Eclat with a sparse bit matrix representation (long dashes and solid curve in diagram c) is clearly better than Apriori, which also needs a lot more memory. Only for closed item sets is Apriori the method of choice (Eclat: short dashes in diagram c), which is due to the more expensive filtering with Eclat. Using a true bit matrix with Eclat is clearly not advisable, as it performs worse than Apriori and down to a support of 39 transactions even needs more memory (dense dots in diagrams c and e).

[Figure 3. Results on BMS-Webview-1.]

T10I4D100K: The numbers of all three types of item sets increase sharply for lower support values; there is no divergence as for BMS-Webview-1. For this data set Apriori outperforms Eclat, although for a support of 5 transactions Eclat takes the lead for free item sets. For closed and maximal item sets Eclat cannot challenge Apriori. It is remarkable that for this data set rebuilding the prefix tree for the transactions in Apriori slightly degrades performance (solid vs. short dashes in diagram b, with the dashed curve almost covered by the solid one). For Eclat a sparse bit matrix representation (solid, short dashes, and long dashes curves in diagrams c and e) is preferable to a true bit matrix (dense dots). (Remark: in diagram b the dense dots curve is almost identical to the long dashes curve and thus is covered.)

[Figure 4. Results on T10I4D100K.]

Census: This data set is characterized by an almost constant ratio of the numbers of free, closed, and maximal item sets, which do not increase as sharply as for T10I4D100K. For free item sets Eclat with a sparse bit matrix representation (solid curve in diagram c) always outperforms Apriori, while it clearly loses against Apriori for closed and maximal item sets (long and short dashes curves in diagrams c and e, the latter of which is not visible because it lies outside the diagram; the execution time is too large due to the high number of closed item sets). For higher support values, however, using a true bit matrix representation with Eclat to find maximal item sets (sparse dots curves in diagrams c and e) comes close to being competitive with Apriori. Again it is remarkable that rebuilding the prefix tree of transactions in Apriori slightly degrades performance.

[Figure 5. Results on census.]

Chess: W.r.t. the behavior of the number of free, closed, and maximal item sets this data set is similar to census, although the curves are bent the other way. The main difference to the results for census is that for this data set a true bit
matrix representation for Eclat (solid, short dashes, and long dashes curves in diagrams c and e) is preferable to a sparse one (dense dots), while for census it is the other way round. The true bit matrix representation also needs less memory, indicating a very dense data set. Apriori can compete with Eclat only when it comes to closed item sets, where it performs better due to its more efficient filtering of the fairly high number of closed item sets.

[Figure 6. Results on chess.]

Mushroom: This data set differs from the other four in the position of the number of closed item sets between the numbers of free and maximal item sets. Eclat with a true bit matrix representation (solid, short dashes, and long dashes curves in diagrams c and e) outperforms Eclat with a sparse bit matrix representation (dense dots), which in turn outperforms Apriori. However, the sparse bit matrix (dense dots in diagram c) gains ground towards lower support values, making it likely to take the lead for a minimum support of 100 transactions. Even for closed and maximal item sets Eclat is clearly superior to Apriori, which is due to the small number of closed and maximal item sets, so that the filtering is not a costly factor. (Remark: in diagram b the dense dots curve is almost identical to the long dashes curve and thus is covered. In diagram d the short dashes curve, which lies over the dense dots curve, is covered by the solid one.)

[Figure 7. Results on mushroom.]

5. Conclusions

For free item sets Eclat wins the competition w.r.t. execution time on four of the five data sets, and it always wins w.r.t. memory usage. On the only data set on which it loses the competition (T10I4D100K), it takes the lead for the lowest minimum support value tested, indicating that for lower minimum support values it is the method of choice, while for higher minimum support values its disadvantage is almost negligible (note that for this data set all execution times are less than 30s). For closed item sets the more efficient filtering gives Apriori a clear edge w.r.t. execution time, making it win on all five data sets. For maximal item sets the picture is less clear: if the number of maximal item sets is high, Apriori wins due to its more efficient filtering, while Eclat wins for a lower number of maximal item sets due to its more efficient search.

6. Programs

The implementations of Apriori and Eclat described in this paper (Windows and Linux executables as well as the source code) can be downloaded free of charge at http://fuzzy.cs.uni-magdeburg.de/~borgelt/software.html. The special program versions submitted to this workshop rely on the default parameter settings of these programs (solid curves in diagrams b to e of Section 4).

References

[1] R. Agrawal, T. Imielinski, and A. Swami. Mining Association Rules between Sets of Items in Large Databases. Proc. Conf. on Management of Data, 207–216. ACM Press, New York, NY, USA 1993.
[2] R. Agrawal, H. Mannila, R. Srikant, H. Toivonen, and A. Verkamo. Fast Discovery of Association Rules. In: [5], 307–328.
[3] C. L. Blake and C. J. Merz. UCI Repository of Machine Learning Databases. Dept. of Information and Computer Science, University of California at Irvine, CA, USA 1998. http://www.ics.uci.edu/~mlearn/MLRepository.html
[4] C. Borgelt and R. Kruse.
Induction of Association Rules: Apriori Implementation. Proc. 14th Conf. on Computational Statistics (COMPSTAT). Berlin, Germany 2002.
[5] U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, eds. Advances in Knowledge Discovery and Data Mining. AAAI Press / MIT Press, Cambridge, CA, USA 1996.
[6] J. Han, H. Pei, and Y. Yin. Mining Frequent Patterns without Candidate Generation. In: Proc. Conf. on the Management of Data (SIGMOD'00, Dallas, TX). ACM Press, New York, NY, USA 2000.
[7] R. Kohavi, C. E. Bradley, B. Frasca, L. Mason, and Z. Zheng. KDD-Cup 2000 Organizers' Report: Peeling the Onion. SIGKDD Explorations 2(2):86–93, 2000.
[8] M. Zaki, S. Parthasarathy, M. Ogihara, and W. Li. New Algorithms for Fast Discovery of Association Rules. Proc. 3rd Int. Conf. on Knowledge Discovery and Data Mining (KDD'97), 283–296. AAAI Press, Menlo Park, CA, USA 1997.
[9] Z. Zheng, R. Kohavi, and L. Mason. Real World Performance of Association Rule Algorithms. In: Proc. 7th Int. Conf. on Knowledge Discovery and Data Mining (SIGKDD'01). ACM Press, New York, NY, USA 2001.
[10] Synthetic Data Generation Code for Associations and Sequential Patterns. Intelligent Information Systems, IBM Almaden Research Center. http://www.almaden.ibm.com/software/quest/Resources/index.shtml

Detailed Description of an Algorithm for Enumeration of Maximal Frequent Sets with Irredundant Dualization

Takeaki Uno, Ken Satoh
National Institute of Informatics
2-1-2 Hitotsubashi, Chiyoda-ku, Tokyo, 101-8430, Japan
Email: uno, ksatoh@nii.ac.jp

Abstract

avoids checking all frequent item sets. However, this algorithm solves dualization problems many times, hence it is not fast for practical purposes. Moreover, the algorithm uses the dualization algorithm of Fredman and Khachiyan [Fredman96], which is said to be slow in practice. We improved the algorithm in [SatohUno03] by using the incremental dualization algorithms proposed by Kavvadias and Stavropoulos [Kavvadias99] and Uno [Uno02]. We developed an algorithm by interleaving dualization with finding maximal frequent sets. Roughly speaking, our algorithm solves one dualization problem of size |Bd+|, in which Bd+ is the set of maximal frequent sets, while the algorithm of Gunopulos et al. solves |Bd+| dualization problems with sizes from 1 through |Bd+|. This reduces the computation time by a factor of 1/|Bd+|. To reduce the computation time further, we used Uno's dualization algorithm [Uno02]. The experimental computation time of Uno's algorithm is linear in the number of outputs and O(|E|) per output, while that of Kavvadias and Stavropoulos seems to be O(|E|^2). This reduces the computation time by a factor of 1/|E|. Moreover, we add an improvement based on the sparseness of the input. By this, the experimental computation time per output is reduced to O(ave(Bd+)), where ave(Bd+) is the average size of the maximal frequent sets. In summary, we reduced the computation time by a factor of ave(Bd+) / (|Bd+| × |E|^2) compared with the combination of the algorithm of Gunopulos et al. and the algorithm of Kavvadias and Stavropoulos. In the following sections, we describe our algorithm and the computational results. Section 2 describes the algorithm of Gunopulos et al., and Section 3 describes our algorithm and Uno's algorithm. Section 4 explains our improvement using sparseness.
Computational experiments for FIMI’03 instances are shown in Section 5, and we conclude the paper in Section 6. We describe an implementation of an algorithm for enumerating all maximal frequent sets using irredundant dualization, which is an improved version of that of Gunopulos et al. The algorithm of Gunopulos et al. solves many dualization problems, and takes long computation time. We interleaves dualization with the main algorithm, and reduce the computation time for dualization by as long as one dualization. This also reduces the space complexity. Moreover, we accelerate the computation by using sparseness. 1. Introduction Let E be an item set and T be a set of transactions defined on E. For an item set S ⊆ E, we denote the set of transactions including S by X(S). We define the frequency of S by |X(S)|. For a given constant α, if an item set S satisfies |X(S)| ≥ α, then S is said to be frequent. A frequent item set included in no other frequent item set is said to be maximal. An item set not frequent is called infrequent. An infrequent item set including no other infrequent item set is said to be minimal. This paper describes an implementation of an algorithm for enumerating all maximal frequent sets using dualization in detail presented at [SatohUno03]. The algorithm is an improved version of that of Gunopulos et al. [Gunopulos97a, Gunopulos97b]. The algorithm computes maximal frequent sets based on computing minimal transversals of a hypergraph, computing minimal hitting set, or, in other words, computing a dualization of a monotone function [Fredman96]. The algorithm finds all minimal item sets not included in any current obtained maximal frequent set by dualization. If a frequent item set is in those minimal item sets, then the algorithm finds a new maximal frequent set including the frequent item set. In this way, the algorithm 1 Dualize and Advance[Gunopulos97a] 1 Bd+ := {go up(∅)} 2 Compute M HS(Bd+ ). 3 If no set in M HS(Bd+ ) is frequent, output M HS(Bd+ ). 4 If there exists a frequent set S in M HS(Bd+ ), Bd+ := Bd+ ∪ {go up(S)} and go to 2. Figure 1: Dualize and Advance Algorithm 2. Enumerating maximal frequent sets Basically, the algorithm of Gunopulos et al. solves dualization problems with sizes from 1 through |Bd+ |. Although we can terminate dualization when we find a new maximal frequent set, we may check each minimal infrequent item set again and again. This is one of the reasons that the algorithm of Gunopulos et al. is not fast in practice. In the next section, we propose a new algorithm obtained by interleaving gp up into a dualization algorithm. The algorithm basically solves one dualization problem of size |Bd+ |. by dualization In this section, we describe the algorithm of Gunopulos et al. Explanations are also in [Gunopulos97a, Gunopulos97b, SatohUno03], however, those are written with general terms. In this section, we explain in terms of frequent set mining. Let Bd− be the set of minimal infrequent sets. For a subset family H of E, a hitting set HS of H is a set such that for every S ∈ H, S ∩ HS = ∅. If a hitting set includes no other hitting set, then it is said to be minimal. We denote the set of all minimal hitting sets of H by M HS(H). We denote the complement of a subset S w.r.t. E by S. For a subset family H, we denote {S|S ∈ H} by H. There is a strong connection between the maximal frequent sets and the minimal infrequent sets by the minimal hitting set operation. 3. Description of our algorithm The key lemma of our algorithm is the following. 
+ Lemma 1 [SatohUno03] Let Bd+ 1 and Bd2 be sub+ + + sets of Bd . If Bd1 ⊆ Bd2 , + − − M HS(Bd+ 1 ) ∩ Bd ⊆ M HS(Bd2 ) ∩ Bd Proposition 1 [Mannila96] Bd− = M HS(Bd+ ) Suppose that we have already found minimal hitting sets corresponding to Bd+ of a subset Bd+ of the maximal frequent sets. The above lemma means that if we add a maximal frequent set to Bd+ , any minimal hitting set we found which corresponds to a minimal infrequent set is still a minimal infrequent set. Therefore, if we can use an algorithm to visit each minimal hitting set based on an incremental addition of maximal frequent sets one by one, we no longer have to check the same minimal hitting set again even if maximal frequent sets are newly found. The dualization algorithms proposed by Kavvadias and Stavropoulos [Kavvadias99] and Uno[Uno02] are such kinds of algorithms. Using these algorithms, we reduce the number of checks. Let us show Uno’s algorithm [Uno02]. This is an improved version of Kavvadias and Stavropoulos’s algorithm [Kavvadias99]. Here we introduce some notation. A set S ∈ H is called critical for e ∈ hs, if S ∩ hs = {e}. We denote a family of critical sets for e w.r.t. hs and H as crit(e, hs). Note that mhs is a minimal hitting set of H if and only if for every e ∈ mhs, crit(e, mhs) is not empty. Using the following proposition, Gunopulos et al. proposed an algorithm called Dualize and Advance shown in Fig. 1 to compute the maximal frequent sets [Gunopulos97a]. Proposition 2 [Gunopulos97a] Let Bd+ ⊆ Bd+ . Then, for every S ∈ M HS(Bd+ ), either S ∈ Bd− or S is frequent (but not both). In the above algorithm, go up(S) for a subset S of E is a maximal frequent set which is computed as follows. 1. Select one element e from S and check the frequency of S ∪ {e}. 2. If it is frequent, S := S ∪ {e} and go to 1. 3. Otherwise, if there is no element e in S such that S ∪ {e} is frequent, then return S. Proposition 3 [Gunopulos97a] The number of frequency checks in the “Dualize and Advance” algorithm to compute Bd+ is at most |Bd+ | · |Bd− | + |Bd+ | · |E|2 . 2 global S0 , ..., Sm; compute mhs(i, mhs) /* mhs is a minimal hitting set of S0 , ..., Si */ begin 1 if i == m then output mhs and return; 2 else if Si+1 ∩ mhs = ∅ then compute mhs(i + 1, mhs); else begin 3 for every e ∈ Si+1 do 4 if for every e′ ∈ mhs, there exists Sj ∈ crit(e′ , mhs), j ≤ i s.t. Sj does not contain e then 5 comupute mhs(i + 1, mhs ∪ {e}); end return; end Figure 2: Algorithm to Enumerate Minimal Hitting Sets Suppose that H = {S1 , ..., Sm}, and let M HSi be M HS({S0 , ..., Si})(1 ≤ i ≤ n). We simply denote M HS(H) by M HS. A hitting set hs for {S1 , ..., Si} is minimal if and only if crit(e, hs) ∩ {S1 , ..., Si} = ∅ for any e ∈ hs. following conditions. (1) mhs′ = mhs (2) mhs′ = mhs ∪ {e} In particular, no mhs has a child satisfying (1) and a child satisfying (2). Lemma 2 [Uno02] For any mhs ∈ M HSi (1 ≤ i ≤ n), there exists just one minimal hitting set mhs′ ∈ M HSi−1 satisfying either of the following conditions (but not both), If mhs ∩ Si+1 = ∅ then mhs ∈ M HSi+1 , and (1) holds. If mhs ∩ Si+1 = ∅, then mhs ∈ M HSi+1 , and (2) can hold for some e ∈ Si+1 . If mhs′ = mhs ∪ {e} is a child of mhs, then for any e′ ∈ mhs, there is Sj ∈ crit(e′ , mhs), j ≤ i such that e ∈ Sj . From these observations, we obtain the algorithm described in Fig. 2. An iteration of the algorithm in Fig. 2 takes: • mhs′ = mhs. • mhs′ = mhs \ {e} where crit(e, mhs) ∩ {S0 , ..., Si} = {Si }. • O(|mhs|) time for line 1. We call mhs′ the parent of mhs, and mhs a child of mhs′ . 
3. Irredundant Border Enumerator

Based on this dualization algorithm, we developed a maximal frequent set enumeration algorithm. First, the algorithm sets the input H of the dualization problem to the empty set. Then, the algorithm solves the dualization in the same way as the above algorithm. When a minimal hitting set mhs (of the complements of the sets in H) is found, the algorithm checks its frequency. If mhs is frequent, the algorithm finds a maximal frequent set S including it, and adds S to H as a new element of H. Now mhs is not a minimal hitting set, since mhs ⊆ S and hence mhs does not hit the complement of S. The algorithm continues by generating a recursive call to find a minimal hitting set of the updated H. In the case that mhs is not frequent, by Lemma 1, mhs continues to be a minimal hitting set even when H is updated. Hence, we backtrack and find other minimal hitting sets. When the algorithm terminates, H is the set of maximal frequent sets, and the set of all minimal hitting sets the algorithm found is the set of minimal infrequent sets. The recursive tree generated by the algorithm is a subtree of the recursive tree obtained by Uno's dualization algorithm when its input is the set of complements of the maximal frequent sets in Bd+. The algorithm is described in Fig. 3. We call the algorithm Irredundant Border Enumerator (IBE algorithm, for short).

global integer bdpnum; sets bd+_0, bd+_1, ...;

main()
begin
  bdpnum := 0;
  construct_bdp(0, ∅);
  output all the bd+_j (0 ≤ j ≤ bdpnum);
end

construct_bdp(i, mhs)
begin
  if i == bdpnum   /* a minimal hitting set for bd+_0, ..., bd+_bdpnum is found */
  then goto 1 else goto 2
1. if mhs is not frequent, return;     /* new Bd− element is found */
   bd+_bdpnum := go_up2(mhs);          /* new Bd+ element is found */
   bdpnum := bdpnum + 1;
   /* proceed to 2 */
2. if bd+_i ∩ mhs = ∅ then construct_bdp(i + 1, mhs);
   else begin
     for every e ∈ bd+_i do
       if mhs ∪ {e} is a minimal hitting set of {bd+_0, bd+_1, ..., bd+_i−1}
       then construct_bdp(i + 1, mhs ∪ {e});
   end
   return;
end

Figure 3: Algorithm to Check Each Minimal Hitting Set Only Once
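The go_up step (invoked as go_up2 in Fig. 3) boils down to a greedy loop of frequency checks. The following is a minimal C++ illustration, not the authors' code; the frequency oracle is a toy stand-in for a scan of the transaction data against the minimum support threshold.

#include <iostream>
#include <set>
#include <vector>

using ItemSet = std::set<int>;

// Toy frequency oracle for the demo below: a set is "frequent" iff it is a
// subset of {1, 2, 3}.  In the real algorithm this is a pass over the data.
static bool is_frequent(const ItemSet& X) {
    for (int e : X) if (e != 1 && e != 2 && e != 3) return false;
    return true;
}

// go_up(S): greedily add items that keep the set frequent until none can be
// added; the result is a maximal frequent set containing S (steps 1-3 above).
static ItemSet go_up(ItemSet S, const std::vector<int>& items) {
    bool extended = true;
    while (extended) {
        extended = false;
        for (int e : items) {
            if (S.count(e)) continue;             // try only items not in S
            ItemSet cand = S;
            cand.insert(e);
            if (is_frequent(cand)) {              // one frequency check per try
                S = cand;
                extended = true;
                break;                            // restart, as in step 2
            }
        }
    }
    return S;                                     // step 3: no extension is frequent
}

int main() {
    ItemSet maximal = go_up({2}, {1, 2, 3, 4, 5});
    for (int e : maximal) std::cout << e << ' ';  // prints 1 2 3
    std::cout << '\n';
    return 0;
}

With this restart loop a single go_up performs O(|E|²) frequency checks in the worst case, in line with the |Bd+| · |E|² term of Proposition 3; its total cost is the g factor in Theorem 1 below.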
Theorem 1 The computation time of IBE is O(Dual(Bd+) + |Bd+| · g), where Dual(Bd+) is the computation time of Uno's algorithm for dualizing Bd+, and g is the computation time of go_up.

Note also that the space complexity of the IBE algorithm is O(Σ_{S ∈ Bd+} |S|), since all we need to memorize is Bd+: once a set in Bd− has been checked, it no longer needs to be recorded. On the other hand, Gunopulos et al. [Gunopulos97a] suggest using Fredman and Khachiyan's algorithm [Fredman96], which needs O(Σ_{S ∈ Bd+ ∪ Bd−} |S|) space, since that algorithm needs both Bd+ and Bd− at the last stage.

4. Using sparseness

In this section, we speed up the dualization phase of our algorithm by using the sparseness of H. In real data, the sizes of the maximal frequent sets are usually small; they are often bounded by a constant. We use this sparse structure to accelerate the algorithm.

First, we consider a way to reduce the computation time of the iterations. Consider the algorithm described in Fig. 2. The bottleneck of an iteration is lines 3 to 5, which check the existence of a critical set Sj ∈ crit(e′, mhs), j ≤ i, such that e ∉ Sj. To check this condition for an item e ∉ mhs, we spend O(Σ_{e′ ∈ mhs} |crit(e′, mhs)|) time, hence this check for all such items takes O((|E| − |mhs|) × Σ_{e′ ∈ mhs} |crit(e′, mhs)|) time. Instead of this, we compute ∪_{S ∈ crit(e, mhs)} S for each e ∈ mhs. An item e′ satisfies the condition of the "if" at line 4 if and only if e′ ∈ ∪_{S ∈ crit(e, mhs)} S for all e ∈ mhs. To compute ∪_{S ∈ crit(e, mhs)} S for all e ∈ mhs, we take O(Σ_{e ∈ mhs} Σ_{S ∈ crit(e, mhs)} |S|) time. In the case of the IBE algorithm, S is a maximal frequent set, hence the average size of |S| is expected to be small. The sizes of the minimal infrequent sets are not greater than the maximum size of the maximal frequent sets, and they are usually smaller than the average size of the maximal frequent sets. Hence, |mhs| is also expected to be small.

Second, we reduce the number of iterations. For mhs ⊆ E, we define uncov(mhs) as the set of S ∈ H satisfying S ∩ mhs = ∅. If mhs ∩ Si ≠ ∅, the iteration inputting mhs and i does nothing but generate a recursive call with i increased by one. This type of iteration should be skipped; only the iterations executing lines 3 to 5 are crucial. Hence, in each iteration, we set i to the minimum index among uncov(mhs). As a result, we need not execute line 2, and the number of iterations is reduced from Σi |MHSi| to |∪i MHSi|. We describe the improved algorithm in Fig. 4.

global S0, ..., Sm;

compute_mhs(i, mhs)   /* mhs is a minimal hitting set of S0, ..., Si */
begin
1  if uncov(mhs) == ∅ then output mhs and return;
2  i := minimum index of uncov(mhs);
3  for every e ∈ mhs do
4    increase the counter of the items in ∪_{S ∈ crit(e, mhs)} S by one
   end
5  for every item e′ ∉ mhs whose counter was increased |mhs| times do
     /* items included in all the unions ∪_{S ∈ crit(e, mhs)} S */
6    compute_mhs(i + 1, mhs ∪ {e′});
   return;
end

Figure 4: Improved Dualization Algorithm Using Sparseness

In our implementation, when we generate a recursive call, we allocate memory for each variable used in the recursive call. Hence, the memory required by the algorithm can be up to O(|E| × m). However, experimentally the required memory is always linear in the input size. Note that the worst-case memory complexity can be reduced by more sophisticated algorithms.

5. Experiments

In this section, we show some results of computational experiments with our algorithm. We implemented the algorithm in the C programming language and examined the FIMI'03 test instances. For the instances of KDD Cup 2000 [KDDcup00], we compared our results with the computational experiments on CHARM [Zaki02], CLOSET [Pei00], FP-growth [Han00], and Apriori [Agrawal96] reported in [Zheng01]. The experiments in [Zheng01] were done on a PC with a Duron 550MHz CPU and 1GB of RAM; our experiments were done on a PC with a Pentium III 500MHz CPU and 256MB of RAM, which is a little slower than a Duron 550MHz. The results are shown in Figs. 4–14. Note that our algorithm uses at most 170MB on any of the following instances.

We also show the numbers of frequent sets, frequent closed/maximal item sets, and minimal infrequent sets. In our experiments, the IBE algorithm takes approximately O(|Bd−| × ave(Bd+)) time, where ave(Bd+) is the average size of the maximal frequent sets, while the computation time of the other algorithms depends heavily on the number of frequent sets, the number of frequent closed item sets, and the minimum support. On some instances, our IBE algorithm performs rather well compared to the other algorithms. In these cases, the number of maximal frequent item sets is smaller than the number of frequent item sets. The IBE algorithm seems to give good performance on difficult problems in which the number of maximal frequent sets is very small compared to the numbers of frequent item sets and frequent closed item sets.

6. Conclusion

In this paper, we described the detailed implementation of our algorithm proposed in [SatohUno03] and gave some experimental results on test data.

Acknowledgments

We are grateful to Heikki Mannila for useful discussions about this research.

References

[Agrawal96] Agrawal, R., Mannila, H., Srikant, R., Toivonen, H., and Verkamo, A. I., "Fast Discovery of Association Rules", in U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy (eds.), Advances in Knowledge Discovery and Data Mining, chapter 12, pp. 307–328 (1996).

[Fredman96] Fredman, M. L. and Khachiyan, L., "On the Complexity of Dualization of Monotone Disjunctive Normal Forms", Journal of Algorithms 21(3), pp. 618–628 (1996).

[KDDcup00] Kohavi, R., Brodley, C. E., Frasca, B., Mason, L., and Zheng, Z., "KDD-Cup 2000 Organizers' Report: Peeling the Onion", SIGKDD Explorations, 2(2), pp. 86–98, 2000.

[Mannila96] Mannila, H. and Toivonen, H., "On an Algorithm for Finding All Interesting Sentences", Cybernetics and Systems, Vol. II, The Thirteenth European Meeting on Cybernetics and Systems Research, pp. 973–978 (1996).

[Pei00] Pei, J., Han, J., and Mao, R., "CLOSET: An Efficient Algorithm for Mining Frequent Closed Itemsets", ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery 2000, pp. 21–30, 2000.

[SatohUno03] Satoh, K. and Uno, T., "Enumerating Maximal Frequent Sets using Irredundant Dualization", Lecture Notes in Artificial Intelligence (Proc. of Discovery Science 2003), Springer-Verlag, pp. 192–201, 2003.

[Uno02] Uno, T., "A Practical Fast Algorithm for Enumerating Minimal Set Coverings", SIGAL 83, Information Processing Society of Japan, pp. 9–16 (in Japanese) (2002).

[Zaki02] Zaki, M. J. and Hsiao, C., "CHARM: An Efficient Algorithm for Closed Itemset Mining", 2nd SIAM International Conference on Data Mining (SDM'02), pp. 457–473, 2002.

[Zheng01] Zheng, Z., Kohavi, R., and Mason, L., "Real World Performance of Association Rule Algorithms", KDD 2001, pp. 401–406, 2001.

[Gunopulos97a] Gunopulos, D., Khardon, R., Mannila, H.
and Toivonen, H., “Data mining, Hypergraph Transversals, and Machine Learning”, Proc. of PODS’97, pp. 209 – 216 (1997). [Gunopulos97b] Gunopulos, D., Mannila, H., and Saluja, S., “Discovering All Most Specific Sentences using Randomized Algorithms”, Proc. of ICDT’97, pp. 215 – 229 (1997). [Han00] Han, J., Pei, J., Yin, Y., “Mining Frequent Patterns without Candidate Generation,” SIGMOD Conference 2000, pp. 1-12, 2000 [Kavvadias99] Kavvadias, D. J., and Stavropoulos, E. C., “Evaluation of an Algorithm for the Transversal Hypergraph Problem”, Algorithm Engineering, pp 72 – 84 (1999). 6 time(sec) 10000 time(sec) BMS-WebView1 T10I4D100K 10000 1000 1000 Apriori FP-growth closet CHARM IBE 100 10 Apriori FP-growth closet CHARM IBE 100 10 1 0.01 0.02 10000) 0.04 time(sec BMS-WebView2 0.06 support 0.08 6 12 24 36 48 60 time(sec) 100000 0.1 1 support T40I10D100K 10000 Apriori FP-growth closet CHARM IBE 1000 100 IBE 1000 10 100 1 7 15 31 46 62 77 time(sec) support 1600 time(sec) BMS-POS 100000 1300 1000 700 support pumsb 10000 10000 1000 Apiori FP-growth CHARM IBE 1000 100 100 IBE 10 51 103 206 310 413 517 10 support 45000 44000 43000 42000 41000 40000 7 support time(sec) time(sec pumsb_star 10000 connect 10000) 1000 IBE 1000 100 IBE 10 time(sec time(sec) kosarak support chess 10000) 10000 45 00 0 10000 48 00 0 15000 51 00 0 20000 54 00 0 25000 57 00 0 support 30000 60 00 0 63 00 0 100 1 1000 IBE 100 1000 10 2200 IBE 100 support 3000 2500 time(sec ) 2000 1500 1000 mushroom 1000 IBE 100 10 30 20 10 5 2 support 8 1900 1600 1300 1000 support BMS-Web-View1: #item 497, #transactions, 59602, ave. size of transaction 2.51 support 60 48 36 24 12 6 Apriori 1.1 3.6 113 FP-growth 1.2 1.8 51 Closet 33 74 Charm 2.2 2.7 7.9 133 422 IBE 5.8 9.6 45 42 333 2982 #freq. sets 3992 10287 461522 #closed sets 3974 9391 64762 155651 422692 1240701 #max. freq. sets 2067 4028 15179 12956 84833 129754 #min. infreq. sets 66629 81393 150278 212073 579508 4320003 maximum use of memory: 45MB BMS-Web-View2: #items 3340, #transactions 77512, ave. size of transaction 4.62 support 77 62 46 31 15 7 Apriori 13.1 15 29.6 58.2 444 Fp-growth 7.03 10 17.2 29.6 131 763 Closet 1500 2250 3890 6840 25800 Charm 5.82 6.66 7.63 13.8 27.2 76 IBE 25 32 46 98 355 1426 #closed sets 22976 37099 60352 116540 343818 754924 #freq. sets 24143 42762 84334 180386 1599210 9897303 #max. freq. sets 3901 5230 7841 16298 43837 118022 #min. infreq. sets 657461 958953 1440057 2222510 3674692 5506524 maximal use of memory: 100MB BMS-POS: #items 1657, #transactions 517255, ave. size of transaction 6.5 support 517 413 310 206 103 51 Apriori 251 341 541 1000 2371 10000 Fp-growth 196 293 398 671 1778 6494 Closet Charm 100 117 158 215 541 3162 IBE 1714 2564 4409 9951 44328 #closed sets 121879 200030 378217 840544 1742055 21885050 #freq. sets 121956 200595 382663 984531 5301939 33399782 #max. freq. sets 30564 48015 86175 201306 891763 4280416 #min. infreq. sets 236274 337309 530946 1047496 3518003 maximum use of memory: 110MB T10I4D100K: #items 1000, #transactions 100000, ave. size of transaction 10 support 100 80 60 40 20 10 Apriori 33 39 45 62 117 256 Fp-growth 7.3 7.7 8.1 9.0 12 20 Closet 13 16 18 23 41 130 Charm 11 13 16 24 45 85 IBE 96 147 263 567 1705 #freq. sets 15010 28059 46646 84669 187679 335183 #closed sets 13774 22944 38437 67537 131342 229029 #max. freq. sets 7853 11311 16848 25937 50232 114114 #min. infreq. sets 392889 490203 736589 1462121 4776165 maximum use of memory: 60MB T40I10D100K: #items 1000, #transactions 100000, ave. 
size of transaction 39.6 support 1600 1300 1000 700 IBE 378 552 1122 2238 #freq. sets 4591 10110 65236 550126 #closed sets 4591 10110 65236 548349 #max. freq. sets 4003 6944 21692 41473 #min. infreq. sets 245719 326716 521417 1079237 9 maximum memory use: 74MB pumsb: #items 7117, #transactions 49046, ave. size of transaction 74 support 45000 44000 43000 42000 IBE 301 582 1069 1840 #freq. sets 1163 2993 7044 15757 #closed sets 685 1655 3582 7013 #max. freq. sets 144 288 541 932 #min. infreq. sets 7482 7737 8402 9468 maximum use of memory: 70MB pumsb star: #items 7117, #transactions 49046, ave. size of transaction 50 support 30000 25000 20000 15000 10000 5000 IBE 8 19 59 161 556 2947 #freq. sets 165 627 21334 356945 >2G #closed sets 66 221 2314 14274 111849 #max. freq. sets 4 17 81 315 1666 15683 #min. infreq. sets 7143 7355 8020 9635 19087 98938 maximum use of memory: 44MB kosarak: #items 41217, #transactions 990002, ave. size of transaction 8 support 3000 2500 2000 1500 1000 IBE 226 294 528 759 2101 #freq. sets 4894 8561 34483 219725 711424 #closed sets 4865 8503 31604 157393 496675 #max. freq. sets 792 1146 2858 4204 16231 #min. infreq. sets 87974 120591 200195 406287 875391 maximum use of memory: 170MB mushroom: #items 120, #transactions 8124, ave. size of transaction 23 support 30 20 10 5 2 IBE 132 231 365 475 433 #freq. sets 505205917 781458545 1662769667 >2G >2G #closed sets 91122 109304 145482 181243 230585 #max. freq. sets 15232 21396 30809 34131 27299 #min. infreq. sets 66085 79481 81746 69945 31880 maximum use of memory 47MB connect: #items 130, #transactions 67577, ave. size of transaction 43 support 63000 60000 57000 54000 51000 48000 45000 IBE 229 391 640 893 1154 1381 1643 #freq. sets 6327 41143 171239 541911 1436863 #closed sets 1566 4372 9041 15210 23329 #max. freq. sets 152 269 464 671 913 1166 1466 #min. infreq. sets 297 486 703 980 1291 1622 1969 maximum use of memory: 60MB chess: #items 76, #transactions 3196, ave. size of transaction 37 support 2200 1900 1600 1300 1000 IBE 19 61 176 555 2191 #freq. sets 59181 278734 1261227 5764922 29442848 #closed sets 28358 106125 366529 1247700 4445373 #max. freq. sets 1047 3673 11209 35417 114382 #min. infreq. sets 1725 5202 14969 46727 152317 maximum use of memory: 50MB 10 Probabilistic Iterative Expansion of Candidates in Mining Frequent Itemsets Attila Gyenesei and Jukka Teuhola Turku Centre for Computer Science, Dept. of Inf. Technology, Univ. of Turku, Finland Email: {gyenesei,teuhola}@it.utu.fi Abstract A simple new algorithm is suggested for frequent itemset mining, using item probabilities as the basis for generating candidates. The method first finds all the frequent items, and then generates an estimate of the frequent sets, assuming item independence. The candidates are stored in a trie where each path from the root to a node represents one candidate itemset. The method expands the trie iteratively, until all frequent itemsets are found. Expansion is based on scanning through the data set in each iteration cycle, and extending the subtries based on observed node frequencies. Trie probing can be restricted to only those nodes which possibly need extension. The number of candidates is usually quite moderate; for dense datasets 2-4 times the number of final frequent itemsets, for non-dense sets somewhat more. In practical experiments the method has been observed to make clearly fewer passes than the well-known Apriori method. 
As for speed, our non-optimised implementation is in some cases faster, in some others slower than the comparison methods. 1. Introduction We study the well-known problem of finding frequent itemsets from a transaction database, see [2]. A transaction in this case means a set of so-called items. For example, a supermarket basket is represented as a transaction, where the purchased products represent the items. The database may contain millions of such transactions. The frequent itemset mining is a task, where we should find those subsets of items that occur at least in a given minimum number of transactions. This is an important basic task, applicable in solving more advanced data mining problems, for example discovering association rules [2]. What makes the task difficult is that the number of potential frequent itemsets is exponential in the number of distinct items. In this paper, we follow the notations of Goethals [7]. The overall set of items is denoted by I. Any subset X ⊆ I is called an itemset. If X has k items, it is called a k- itemset. A transaction is an itemset identified by a tid. A transaction with itemset Y is said to support itemset X, if X ⊆ Y. The cover of an itemset X in a database D is the set of transactions in D that support X. The support of itemset X is the size of its cover in D. The relative frequency (probability) of itemset X with respect to D is P( X , D) = Support ( X , D) D (1) An itemset X is frequent if its support is greater than or equal to a given threshold σ. We can also express the condition using a relative threshold for the frequency: P(X, D) ≥ σrel , where 0 ≤ σrel ≤ 1. There are variants of the basic ‘all-frequent-itemsets’ problem, namely the maximal and closed itemset mining problems, see [1, 4, 5, 8, 12]. However, here we restrict ourselves to the basic task. A large number of algorithms have been suggested for frequent itemset mining during the last decade; for surveys, see [7, 10, 15]. Most of the algorithms share the same general approach: generate a set of candidate itemsets, count their frequencies in D, and use the obtained information in generating more candidates, until the complete set is found. The methods differ mainly in the order and extent of candidate generation. The most famous is probably the Apriori algorithm, developed independently by Agrawal et al. [3] and Mannila et al. [11]. It is a representative of breadth-first candidate generation: it first finds all frequent 1-itemsets, then all frequent 2-itemsets, etc. The core of the method is clever pruning of candidate k-itemsets, for which there exists a non-frequent k-1-subset. This is an application of the obvious monotonicity property: All subsets of a frequent itemset must also be frequent. Apriori is essentially based on this property. The other main candidate generation approach is depthfirst order, of which the best-known representatives are Eclat [14] and FP-growth [9] (though the ‘candidate’ concept in the context of FP-growth is disputable). These two are generally considered to be among the fastest algorithms for frequent itemset mining. However, we shall mainly use Apriori as a reference method, because it is technically closer to ours. Most of the suggested methods are analytical in the sense that they are based on logical inductions to restrict the number of candidates to be checked. Our approach (called PIE) is probabilistic, based on relative item frequencies, using which we compute estimates for itemset frequencies in candidate generation. 
More precisely, we generate iteratively improving approximations (candidate itemsets) to the solution. Our general endeavour has been to develop a relatively simple method, with fast basic steps and few iteration cycles, at the cost of somewhat increased number of candidates. However, another goal is that the method should be robust, i.e. it should work reasonably fast for all kinds of datasets. 0 1 2 1 2 3 3 3 2 2 3 3 3 3 2. Method description Our method can be characterized as a generate-and-test algorithm, such as Apriori. However, our candidate generation is based on probabilistic estimates of the supports of itemsets. The testing phase is rather similar to Apriori, but involves special book-keeping to lay a basis for the next generation phase. We start with a general description of the main steps of the algorithm. The first thing to do is to determine the frequencies of all items in the dataset, and select the frequent ones for subsequent processing. If there are m frequent items, we internally identify them by numbers 0, …, m-1. For each item i, we use its probability (relative frequency) P(i) in the generation of candidates for frequent itemsets. The candidates are represented as a trie structure, which is normal in this context, see [7]. Each node is labelled by one item, and a path of labels from the root to a node represents an itemset. The root itself represents the empty itemset. The paths are sorted, so that a subtrie rooted by item i can contain only items > i. Note also that several nodes in the trie can have the same item label, but not on a single path. A complete trie, storing all subsets of the whole itemset, would have 2m nodes and be structurally a binomial tree [13], where on level j there are ( mj ) nodes, see Fig. 1 for m = 4. The trie is used for book-keeping purposes. However, it is important to avoid building the complete trie, but only some upper part of it, so that the nodes (i.e. their root paths) represent reasonable candidates for frequent sets. In our algorithm, the first approximation for candidate itemsets is obtained by computing estimates for their probabilities, assuming independence of item occurrences. It means that, for example, for an itemset {x, y, z} the estimated probability is the product P(x)P(y)P(z). Nodes are created in the trie from root down along all paths as long as the path-related probability is not less that the threshold σrel. Note that the probability values are monotonically non-increasing on the way down. Fig. 2 3 Figure 1. The complete trie for 4 items. 3/4 1/2 0 3/8 1 2 3/16 3/32 2 1 1/8 3 1/16 3 2 1/4 3/16 3/8 3 3 2 1/2 1/4 3 1/8 3 3/32 3 3/64 3 Figure 2. An initial trie for the transaction set {(0, 3), (1, 2), (0, 1, 3), (1)}, with minimum support threshold σ = 1/6. The virtual nodes with probabilities < 1/6 are shown using dashed lines. shows an example of the initial trie for a given set of transactions (with m = 4). Those nodes of the complete trie (Fig. 1) that do not exist in the actual trie are called virtual nodes, and marked with dashed circles in Fig. 2. The next step is to read the transactions and count the true number of occurrences for each node (i.e. the related path support) in the trie. Simultaneously, for each visited node, we maintain a counter called pending support (PS), being the number of transactions for which at least one virtual child of the node would match. 
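As a concrete illustration of the initial candidate generation described above, the following C++ sketch builds the probability-based trie under the independence assumption. The structure and the names (TrieNode, buildInitial) are hypothetical and only mirror the description; the actual implementation discussed later uses child-pointer arrays indexed by renumbered items.

#include <cstdio>
#include <vector>

struct TrieNode {
    int item;                        // item label of this node (-1 for the root)
    double prob;                     // estimated probability of the root path
    int support;                     // true support, filled in by the counting passes
    int pending;                     // pending support (PS)
    std::vector<TrieNode*> child;    // children carry item labels > item
};

// Expand node n with items > n->item while the estimated path probability,
// computed as a product of item probabilities, stays at or above sigmaRel.
void buildInitial(TrieNode* n, const std::vector<double>& p, double sigmaRel) {
    for (int i = n->item + 1; i < static_cast<int>(p.size()); ++i) {
        double pathProb = n->prob * p[i];            // independence assumption
        if (pathProb >= sigmaRel) {                  // keep only promising candidates
            n->child.push_back(new TrieNode{i, pathProb, 0, 0, {}});
            buildInitial(n->child.back(), p, sigmaRel);
        }
    }
}

int main() {
    // Item probabilities of the Fig. 2 example: the transactions
    // {(0,3), (1,2), (0,1,3), (1)} give P = {1/2, 3/4, 1/4, 1/2}.
    std::vector<double> p = {0.5, 0.75, 0.25, 0.5};
    TrieNode root{-1, 1.0, 0, 0, {}};
    buildInitial(&root, p, 1.0 / 6.0);               // sigma_rel = 1/6
    std::printf("root has %zu children\n", root.child.size());
    return 0;                                        // nodes are leaked; fine for a demo
}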
The pending support will be our criterion for the expansion of the node: If PS(x) ≥ σ, then it is possible that a virtual child of node x is frequent, and the node must be expanded. If there are no such nodes, the algorithm is ready, and the result can be read from the trie: All nodes with support ≥ σ represent frequent itemsets. Trie expansion starts the next cycle, and we iterate until the stopping condition holds. However, we must be very careful in the expansion: which virtual nodes should we materialize (and how deep, recursively), in order to avoid trie ‘explosion’, but yet approach the final solution? Here we apply item probabilities, again. In principle, we could take advantage of all information available in the current trie (frequencies of subsets, etc.), as is done in the Apriori algorithm and many others. However, we prefer simpler calculation, based on global probabilities of items. Suppose that we have a node x with pending support PS(x) ≥ σ. Assume that it has virtual child items v0, v1, …, vs-1 with global probabilities P(v0), P(v1), …, P(vs-1). Every transaction contributing to PS(x) has a match with at least one of v0, v1, …, vs-1. The local probability (LP) for a match with vi is computed as follows: speeds up the local expansion growth by one level, on the average (k levels for αk). This acceleration restricts the number of iterations efficiently. The largest extensions are applied only to the ‘skewest’ subtries, so that the total size of the trie remains tolerable. Another approach to choose α would be to do a statistical analysis to determine confidence bounds for ES. However, this is left for future work. Fig. 3 shows an example of trie expansion, assuming that the minimum support threshold σ = 80, α = 0.8, and k = 1. The item probabilities are assumed to be P(y) = 0.7, P(z) = 0.5, and P(v) = 0.8. Node t has a pending support of 100, related to its two virtual children, y and z. This means that 100 transactions contained the path from root to t, plus either or both of items y and z, so we have to test for expansion. Our formula gives y a local probability LP(y) = 0.7 / (1−(1−0.7)(1−0.5)) ≈ 0.82, so the estimated support is 82 > α⋅σ = 64, and we expand y. However, the local probability of z is only ≈ 0.59, so its estimated support is 59, and it will not be expanded. PS=100 t LP (vi ) = P (vi matches | One of v 0 , v1 , … matches ) P ((vi matches) ∧ (One of v0 , v1 … matches)) = P(One of v0 , v1, … matches) P (vi matches) = P (One of v0 , v1, … matches) P (vi ) = 1− (1− P (v 0 ))(1− P (v1 ))K(1− P (v s )) ES=82 ES=41 z ES=59 z … ES=65.6 v (2) Figure 3. An example of expansion for probabilities P(y) = 0.7, P(z) = 0.5, and P(v) = 0.8. Using this formula, we get an estimated support ES(vi): ES (vi )= LP (vi ) PS ( Parent (Vi )) y x (3) If ES(vi) ≥ σ, then we conclude that vi is expected to be frequent. However, in order to guarantee a finite number of iterations in the worst case, we have to relax this condition a bit. Since the true distribution may be very skewed, almost the whole pending support may belong to only one virtual child. To ensure convergence, we apply the following condition for child expansion in the kth iteration, k (4) ES (vi ) ≥ α σ with some constant α between 0 and 1. In the worst case this will eventually (when k is high enough) result in expansion, to get rid of a PS-value ≥ σ. In our tests, we used the heuristic value α = average probability of frequent items. 
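The arithmetic of formulas (2) and (3) and the relaxed test (4) can be checked in a few lines. The following C++ sketch is only an illustration, not the authors' code; it reproduces, up to rounding, the numbers of the Fig. 3 example with PS(t) = 100, P(y) = 0.7, P(z) = 0.5, σ = 80, α = 0.8 and k = 1.

#include <cstddef>
#include <cstdio>
#include <vector>

// Formula (2): probability that virtual child i matches, given that at least
// one of the virtual children v0, ..., vs-1 matches.
static double localProb(const std::vector<double>& p, std::size_t i) {
    double noneMatches = 1.0;
    for (double pi : p) noneMatches *= (1.0 - pi);
    return p[i] / (1.0 - noneMatches);
}

int main() {
    std::vector<double> p = {0.7, 0.5};   // P(y), P(z): virtual children of node t
    double PS = 100.0;                    // pending support of t
    double sigma = 80.0, alpha = 0.8;     // support threshold and damping factor
    int k = 1;                            // iteration number

    double limit = sigma;
    for (int j = 0; j < k; ++j) limit *= alpha;    // formula (4): alpha^k * sigma

    for (std::size_t i = 0; i < p.size(); ++i) {
        double ES = localProb(p, i) * PS;          // formula (3): estimated support
        std::printf("child %zu: LP = %.2f, ES = %.1f -> %s\n",
                    i, localProb(p, i), ES,
                    ES >= limit ? "expand" : "keep virtual");
    }
    return 0;   // child 0 (y): ES ~ 82, expanded; child 1 (z): ES ~ 59, kept virtual
}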
The reasoning behind this choice is that it When a virtual node (y) has been materialized, we immediately test also its expansion, based on its ES-value, recursively. However, in the recursive steps we cannot apply formula (2), because we have no evidence of the children of y. Instead, we apply the unconditional probabilities of z and v in estimation: LP(z) = 82⋅0.5 = 41 < α⋅σ = 64, and LP(v) = 82⋅0.8 = 65.6 > 64. Node v is materialized, but z is not. Expansion test continues down from v. Thus, both in initialization of the trie and in its expansion phases, we can create several new levels (i.e. longer candidates) at a time, contrary to e.g. the base version of Apriori. It is true that also Apriori can be modified to create several candidate levels at a time, but at the cost of increased number of candidates. After the expansion phase the iteration continues with the counting phase, and new values for node supports and pending supports are determined. The two phases alternate until all pending supports are less than σ. We have given our method the name ‘PIE’, reflecting this Probabilistic Iterative Expansion property. 3. Elaboration The above described basic version does a lot of extra work. One observation is that as soon as the pending support of some node x is smaller than σ, we can often ‘freeze’ the whole subtrie, because it will not give us anything new; we call it ‘ready’. The readiness of nodes can be checked easily with a recursive process: A node x is ready if PS(x) < σ and all its real children are ready. The readiness can be utilized to reduce work both in counting and expansion phases. In counting, we process one transaction at a time and scan its item subsets down the trie, but only until the first ready node on each path. Also the expansion procedure is skipped for ready nodes. Finally, a simple stopping condition is when the root becomes ready. Another tailoring, not yet implemented, relates to the observation that most of the frequent itemsets are found in the first few iterations, and a lot of I/O effort is spent to find the last few frequent sets. For those, not all transactions are needed in solving the frequency. In the counting phase, we can distinguish between relevant and irrelevant transactions. A transaction is irrelevant, if it does not increase the pending support value of any nonready node. If the number of relevant transactions is small enough, we can store them separately (in main memory or temporary file) during the next scanning phase. Our implementation of the trie is quite simple; saving memory is considered, but not as the first preference. The child linkage is implemented as an array of pointers, and the frequent items are renumbered to 0, …, m-1 (if there are m frequent items) to be able to use them as indices to the array. A minor improvement is that for item i, we need only m-i-1 pointers, corresponding to the possible children i+1, …, m-1. The main restriction of the current implementation is the assumption that the trie fits in the main memory. Compression of nodes would help to some extent: Now we reserve a pointer for every possible child node, but most of them are null. Storing only non-null pointers saves memory, but makes the trie scanning slower. Also, we could release the ready nodes as soon as they are detected, in order to make room for expansions. Of course, before releasing, the related frequent itemsets should be reported. However, a fully general solution should work for any main memory and trie size. 
Some kind of external representation should be developed, but this is left for future work. A high-level pseudocode of the current implementation is given in the following. The recursive parts are not coded explicitly, but should be rather obvious. Algorithm PIE − Probabilistic iterative expansion of candidates in frequent itemset mining Input: A transaction database D, the minimum support threshold σ. Output: The complete set of frequent itemsets. 1. 2. 3. 4. // Initial steps. scan D and collect the set F of frequent items; α := average probability of items in F; iter := 0; 5. // The first generation of candidates, based on // item probabilities. 6. create a PIE-trie P so that it contains all such ordered subsets S ⊆ F for which Π(Prob(s∈S)) ⋅ |D| ≥ σ ; // Frequency test 7. set the status of all nodes of P to not-ready; 8. // The main loop: alternating count, test and // expand. 9. loop 10. // Scan the database and check readiness. 11. scan D and count the support and pending support values for non-ready nodes in P; 12. iter := iter + 1; 13. for each node p∈P do 14. if pending_support(p) < σ then 15. if p is a leaf then set p ready 16. else if the children of p are ready then 17. set p ready; 18. if root(P) is ready then exit loop; 19. 20. 21. 22. 23. 24. 25. 26. 27. // Expansion phase: Creation of subtries on // the basis of observed pending supports. for each non-ready node p in P do if pending_support(p) ≥ σ then for each virtual child v of p do compute local_prob(v) by formula (2); estim_support(v) := local_prob(v) ⋅ pending_support(p); iter if estim_support(v) ≥ α σ then create node v as the child of p; add such ordered subsets S ⊆ F\{1..v} as descendant paths of v, for which Π(Prob(s∈S)) ⋅ estim_support(v) iter ≥α σ; 28. // Gather up results from the trie 29. return the paths for nodes p in P such that support(p) ≥ σ; 30. end 4. Experimental results For verifying the usability of our PIE algorithm, we used four of the test datasets made available to the Workshop on Frequent Itemset Mining Implementations (FIMI’03) [6]. The test datasets and some of their properties are described in Table 1. They represent rather different kinds of domains, and we wanted to include both dense and non-dense datasets, as well as various numbers of items. Table 3. Development of the trie for dataset ‘Chess’, with three different values of σ. Table 1. Test dataset description Dataset Chess Mushroom T40I10D100K Kosarak #Transactions 3 196 8 124 100 000 900 002 frequent itemset dictates the number of iterations. This is roughly the same as the trie depth, as shown in Table 2. The PIE method can also be characterized by describing the development of the trie during the iterations. The most interesting figures are the number of nodes and the number of ready nodes, given in Table 3. Especially the number of ready nodes implies that even though we have rather many candidates (= nodes in the trie), large parts of them are not touched in the later iterations. σ #Items 75 119 942 41 270 2600 For the PIE method, the interesting statistics to be collected are the number of candidates, depth of the trie, and the number of iterations. These results are given in Table 2 for selected values of σ, for the ‘Chess’ dataset. We chose values of σ that keep the number of frequent itemsets reasonable (extremely high numbers are probably useless for any application). The table shows also the number of frequent items and frequent sets, to enable comparison with the number of candidates. 
For this dense dataset, the number of candidates varies between 2-4 times the number of frequent itemsets. For non-dense datasets the ratio is usually larger. Table 2 shows also the values of the ‘security parameter’ α, being the average probability of frequent items. Considering I/O performance, we can see that the number of iteration cycles (= number of file scans) is quite small, compared to the base version of the Apriori method, for which the largest 2400 2200 Iteration 1 2 3 4 1 2 3 4 1 2 3 4 5 #Frequent sets found 4 720 6 036 6 134 6 135 15 601 20 344 20 580 20 582 44 022 58 319 59 176 59 181 59 181 #Nodes 4 766 9 583 10 296 10 516 15 760 34 995 47 203 47 515 44 800 112 370 206 292 216 931 216 943 #Ready nodes 2 021 9 255 10 173 10 516 5 219 25 631 46 952 47 515 1 210 64 174 196 782 216 922 216 943 For speed comparison, we chose the Apriori and FPgrowth implementations, provided by Bart Goethals [6]. The results for the four test datasets and for different minimum support thresholds are shown in Table 4. The processor used in the experiments was a 1.5 GHz Pentium 4, with 512 MB main memory. We used a g++ compiler, using optimizing switch –O6. The PIE algorithm was coded in C. Table 2. Statistics from the PIE algorithm for dataset ‘Chess’. σ 3 000 2 900 2 800 2 700 2 600 2 500 2 400 2 300 2 200 #Frequent items 12 13 16 17 19 22 23 24 27 #Frequent sets 155 473 1 350 3 134 6 135 11 493 20 582 35 266 59 181 Alpha 0.970 0.967 0.953 0.947 0.934 0.914 0.907 0.900 0.877 #Candidates 400 1 042 2 495 5 218 10 516 18 709 47 515 131 108 216 943 Trie depth 6 8 8 9 10 11 12 13 14 #Iterations 3 4 4 4 4 4 4 4 5 #Apriori’s iterations 6 7 8 8 9 10 11 12 13 Table 4. Comparison of execution times (in seconds) of three frequent itemset mining programs for four test datasets. (a) Chess σ 3 000 2 900 2 800 2 700 2 600 2 500 2 400 2 300 2 200 #Freq. sets 155 473 1 350 3 134 6 135 11 493 20 582 35 266 59 181 (b) Mushroom #Freq. σ sets 5 000 41 4 500 97 4 000 167 3 500 369 3 000 931 2 500 2 365 2 000 6 613 1 500 56 693 (c) T40I10D100K #Freq. σ sets 20 000 5 18 000 9 16 000 17 14 000 24 12 000 48 10 000 82 8 000 137 6 000 239 4 000 440 Apriori 0.312 0.469 0.797 1.438 3.016 10.204 21.907 42.048 73.297 Apriori 0.375 0.437 0.578 0.797 1.062 1.781 3.719 55.110 Apriori 2.797 2.828 3.001 3.141 3.578 4.296 7.859 20.531 35.282 FPgrowth 0.250 0.266 0.297 0.344 0.438 0.610 0.829 1.156 1.766 FPgrowth 0.391 0.406 0.438 0.500 0.546 0.610 0.750 1.124 FPgrowth 6.328 6.578 7.250 8.484 14.750 23.874 41.203 72.985 114.953 PIE 0.125 0.265 1.813 6.938 14.876 26.360 78.325 203.828 315.562 PIE 0.062 0.094 0.141 0.297 1.157 6.046 27.047 153.187 PIE 0.797 1.110 1.156 1.187 1.906 4.344 11.796 29.671 68.672 (c) Kosarak σ 20 000 18 000 16 000 14 000 12 000 10 000 8 000 6 000 #Freq. sets 121 141 167 202 267 376 575 1 110 Apriori 27.970 28.438 29.016 29.061 29.766 34.906 35.891 39.656 FPgrowth 30.141 31.296 32.765 33.516 34.875 37.657 41.657 51.922 PIE 5.203 6.110 7.969 9.688 12.032 18.016 30.453 70.376 We can see that in some situations the PIE algorithm is the fastest, in some others the slowest. This is probably a general observation: the performance of most frequent itemset mining algorithms is highly dependent on the data set and threshold. It seems that PIE is at its best for sparse datasets (such as T40I10D100K and Kosarak), but not so good for very dense datasets (such as ‘Chess’ and ‘Mushroom’). Its speed for large thresholds probably results from the simplicity of the algorithm. 
For smaller thresholds, the trie gets large and the counting starts to consume more time, especially with a small main memory size. One might guess that our method is at its best for random data sets, because those would correspond to our assumption about independent item occurrences. We tested this with a dataset of 100 000 transactions, each of which contained 20 random items out of 30 possible. The results were rather interesting: For all tested thresholds for minimum support, we found all the frequent itemsets in the first iteration. However, verification of the completeness required one or two additional iterations, with a clearly higher number of candidates, consuming a majority of the total time. Table 5 shows the time and number of candidates both after the first and after the final iteration. The stepwise growth of the values reveals the levelwise growth of the trie. Apriori worked well also for this dataset, being in most cases faster than PIE. Results for FP-growth (not shown) are naturally much slower, because randomness prevents a compact representation of the transactions. We wish to point out that our implementation was an initial version, with no special tricks for speed-up. We are convinced that the code details can be improved to make the method still more competitive. For example, buffering of transactions (or temporary files) were not used to enhance the I/O performance. 5. Conclusions and future work A probability-based approach was suggested for frequent itemset mining, as an alternative to the ‘analytic’ methods common today. It has been observed to be rather robust, working reasonably well for various kinds of datasets. The number of candidate itemsets does not ‘explode’, so that the data structure (trie) can be kept in the main memory in most practical cases. The number of iterations is smallest for random datasets, because candidate generation is based on just that assumption. For skewed datasets, the number of iterations may somewhat grow. This is partly due to our simplifying premise that the items are independent. This point could be tackled by making use of the conditional probabilities obtainable from the trie. Initial tests did not show any significant advantage over the basic approach, but a more Table 5. Statistics from the PIE algorithm for a random dataset. σ #Freq. sets 50 000 44 000 43 800 43 700 43 600 43 500 40 000 28 400 28 300 28 200 28 100 28 000 30 42 124 214 331 413 465 522 724 1 270 2 223 3 357 PIE After iteration 1. After the last iteration (final) #IterTime #Freq. Time #Cand. #Cand. ations (sec.) sets (sec.) 30 0.500 30 464 2 2.234 42 2.016 465 509 3 2.704 124 1.875 465 1 247 3 10.579 214 1.876 465 1 792 3 20.250 331 1.891 465 2 775 3 37.375 413 1.860 465 3 530 3 48.953 465 1.844 465 4 443 2 62.000 522 60.265 4 525 4 900 3 64.235 724 61.422 4 525 5 989 3 82.140 1 270 61.469 4 525 8 697 3 115.250 2 223 61.734 4 525 13 608 3 167.047 3 357 60.969 4 525 19 909 3 219.578 Apriori #Iterations 2 3 3 3 3 3 3 4 4 4 4 4 Time (sec.) 3.953 5.173 6.015 7.235 9.657 11.876 13.875 15.016 15.531 19.265 31.266 69.797 sophisticated probabilistic analysis might imply some ways to restrict the number of candidates. The exploration of these elaborations, as well as tuning the buffering, data structure, and parameters, is left for future work. [8] K. Gouda and M.J. Zaki, “Efficiently Mining Maximal Frequent Itemsets”, In N. Cercone, T.Y. Lin, and X. Wu (eds.), Proc. of 2001 IEEE International Conference on Data Mining, Nov. 2001, pp. 163-170. References [9] J. 
Han, J. Pei, and Y. Yin, “Mining Frequent Patterns Without Candidate Generation”, In W. Chen, J. Naughton, and P.A. Bernstein (eds.), Proc. of ACM SIGMOD Int. Conf. on Management of Data, 2000, pp. 1-12. [1] R. Agrawal, C. Aggarwal, and V.V.V. Prasad, “Depth First Generation of Long Patterns”, In R. Ramakrishnan, S. Stolfo, R. Bayardo, and I. Parsa (eds.), Proc. of the Int. Conf. on Knowledge Discovery and Data Mining, ACM, Aug. 2000, pp. 108-118. [2] R. Agrawal, T. Imielinski, and A. Swami, “Mining Association Rules Between Sets of Items in Large Databases”, In P. Buneman and S. Jajodia (eds.), Proc. of ACM SIGMOD Int. Conf. of Management of Data, May 1993, pp. 207-216. [3] R. Agrawal and R. Srikant, “Fast Algorithms for Mining Association Rules in Large Databases”, In J.B. Bocca, M. Jarke, and C. Zaniolo (eds.), Proc. of the 20th VLDB Conf., Sept. 1994, pp. 487-499. [4] R.J. Bayardo, “Efficiently Mining Long Patterns from Databases”, In L.M. Haas and A. Tiwary (eds.), Proc. of the ACM SIGMOD Int. Conf. on Management of Data, June 1998, pp. 85-93. [5] D. Burdick, M. Calimlim, and J. Gehrke, “MAFIA: a Maximal Frequent Itemset Algorithm for Transactional Databases”, Proc. of IEEE Int. Conf. on Data Engineering, April 2001, pp. 443-552. [6] Frequent Itemset Mining Implementations (FIMI’03) Workshop website, http://fimi.cs.helsinki.fi, 2003. [7] B. Goethals, “Efficient Frequent Pattern Mining”, PhD thesis, University of Limburg, Belgium, Dec. 2002. [10] J. Hipp, U. Güntzer, and N. Nakhaeizadeh, “Algorithms for Association Rule Mining - a General Survey and Comparison”, ACM SIGKDD Explorations 2, July 2000, pp. 58-65. [11] H. Mannila, H. Toivonen, and A.I. Verkamo, “Efficient Algorithms for Discovering Association Rules”, In U.M. Fayyad and R. Uthurusamy (eds.), Proc. of the AAAI Workshop on Knowledge Discovery in Databases, July 1994, pp. 181-192. [12] J. Pei, J. Han, and R. Mao, “Closet: An Efficient Algorithm for Mining Frequent Closed Itemsets”, Proc. of ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery, May 2000, pp. 21-30. [13] J. Vuillemin, “A Data Structure for Manipulating Priority Queues”, Comm. of the ACM, 21(4), 1978, pp. 309-314. [14] M.J. Zaki, “Scalable Algorithms for Association Mining”, IEEE Transactions on Knowledge and Data Engineering 12 (3), 2000, pp. 372-390. [15] Z. Zheng, R. Kohavi, and L. Mason, “Real World Performance of Association Rule Algorithms”, In F. Provost and R. Srikant (eds.), Proc. of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2001, pp. 401-406. Intersecting Data to Closed Sets with Constraints Taneli Mielikäinen HIIT Basic Research Unit Department of Computer Science University of Helsinki, Finland Taneli.Mielikainen@cs.Helsinki.FI Abstract We describe a method for computing closed sets with data-dependent constraints. Especially, we show how the method can be adapted to find frequent closed sets in a given data set. The current preliminary implementation of the method is quite inefficient but more powerful pruning techniques could be used. Also, the method can be easily applied to wide variety of constraints. Regardless of the potential practical usefulness of the method, we hope that the sketched approach can shed some additional light to frequent closed set mining. 1 Introduction Much of the research in data mining has concentrated on finding from some given (finite) set R all subsets that satisfy some condition. 
(For the rest of the paper we assume, w.l.o.g., that R is a finite subset of N.) The most prominent example of this task is probably the task of finding all subsets X ⊆ R that are contained at least minsupp times in the sets of a given sequence d = d1 . . . dn of subsets di ⊆ R, i.e., to find the collection

F(minsupp, d) = {X ⊆ R : supp(X, d) ≥ minsupp}

where

supp(X, d) = |{i : X ⊆ di, 1 ≤ i ≤ n}|.

The collection F(minsupp, d) is known as the collection of frequent sets. (We could have defined the collection of frequent sets by the frequency of sets, which is a normalized version of the support: fr(X, d) = supp(X, d)/n.)

Recently one particular subclass of frequent sets, the frequent closed sets, has received quite a lot of attention. A set X is closed in d if supp(X, d) > supp(Y, d) for all proper supersets Y of X. The collection of closed sets (in d) is denoted by

C(d) = {X ⊆ R : Y ⊆ R, Y ⊃ X ⇒ supp(X, d) > supp(Y, d)}.

The collection of frequent closed sets consists of the sets that are frequent and closed, i.e.,

FC(minsupp, d) = F(minsupp, d) ∩ C(d).

Most of the closed set mining algorithms [3, 12, 13, 14, 16, 19, 20] are based on backtracking [10]. In this paper we describe an alternative approach based on alternating between closed set generation by intersections and pruning heuristics. The method can be adapted to many kinds of constraints and needs only a few passes over the data.

The paper is organized as follows. In Section 2 we sketch the method, in Section 3 we adapt the method for finding closed sets with frequency constraints, in Section 4 we describe some implementation details of the method, and in Section 5 we experimentally study the properties of the method. Section 6 concludes the work and suggests some improvements.

2 The Method

Let us assume that R = ∪_{i ∈ {1,...,n}} di, as sometimes R is not known explicitly. Furthermore, we shall use the shorthand di,j for the subsequence di . . . dj, 1 ≤ i ≤ j ≤ n. The elements of R are sometimes called items and the sets di transactions.

As noted in the previous section, a set X ⊆ R is closed in d if and only if supp(X, d) > supp(Y, d) for all proper supersets Y of X. However, the closed sets can also be defined as intersections of the transactions (see e.g. [11]):

Definition 1 A set X ⊆ R is closed in d if and only if there is I ⊆ {1, . . . , n} such that X = ∩_{i ∈ I} di. (By convention, ∩_{i ∈ ∅} di = R.)

A straightforward implementation of Definition 1,

C(d) = { ∩_{i ∈ I} di : I ⊆ {1, . . . , n} },

leads to a quite inefficient method for computing all closed sets:1

BRUTE-FORCE(d)
1  R ← ∪_{i ∈ {1,...,n}} di
2  supp(R) ← 0
3  for each I ⊆ {1, . . . , n}, I ≠ ∅
4    do X ← ∩_{i ∈ I} di
5       if supp(X) < |I|
6         then supp(X) ← |I|
7  return (supp : C → N)

A more efficient solution can be found using the following recursive definition of the closed sets:

C(d1) = {R, d1}
C(d1,i+1) = C(d1,i) ∪ {X ∩ di+1 : X ∈ C(d1,i)}

Thus the closed sets can be computed by initializing C = {R = ∪_{i ∈ {1,...,n}} di} (since R is always closed), initializing supp to {R ↦ 0}, and calling the following algorithm for each di (1 ≤ i ≤ n):

INTERSECT(supp : C → N, di)
1  for each X ∈ C
2    do C ← C ∪ {X ∩ di}
3       if supp(X ∩ di) < supp(X) + 1
4         then supp(X ∩ di) ← supp(X) + 1
5  return (supp : C → N)

Using the above algorithm the sequence d does not have to be stored, as each di is needed just for updating the current approximation of R and intersecting the current collection C of closed sets.
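The INTERSECT loop is easy to prototype. The C++ (C++17) sketch below is not the paper's intersector implementation, which streams the data and prunes with constraints; it keeps the whole map from closed sets to supports in memory and iterates over a snapshot of C for each transaction, so that sets created while processing di are not intersected with di again.

#include <algorithm>
#include <iostream>
#include <iterator>
#include <map>
#include <set>
#include <vector>

using ItemSet = std::set<int>;

int main() {
    std::vector<ItemSet> d = {{1, 2, 3}, {2, 3}, {1, 2}, {2, 3, 4}};

    // R = union of all transactions; R is always closed and starts with support 0.
    ItemSet R;
    for (const auto& t : d) R.insert(t.begin(), t.end());
    std::map<ItemSet, int> supp = {{R, 0}};

    for (const auto& di : d) {
        const auto snapshot = supp;                 // the collection C before seeing di
        for (const auto& [X, s] : snapshot) {
            ItemSet Y;                              // Y = X ∩ di is closed as well
            std::set_intersection(X.begin(), X.end(), di.begin(), di.end(),
                                  std::inserter(Y, Y.begin()));
            auto it = supp.find(Y);
            if (it == supp.end() || it->second < s + 1)
                supp[Y] = s + 1;                    // lines 3-4 of INTERSECT
        }
    }
    for (const auto& [X, s] : supp) {
        for (int e : X) std::cout << e << ' ';
        std::cout << "(support " << s << ")\n";
    }
    return 0;
}

Running it on the toy sequence prints every closed set with its support; keeping afterwards only the sets with supp(X) ≥ minsupp is exactly the "compute everything, then remove the infrequent sets" baseline discussed next.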
The closed sets can be very useful way to understand data sets that consist of only few different transactions and they have been studied in the field of Formal Concept Analysis [6]. However, many times all closed sets are not of interest but only frequent closed sets are needed. The simplest way to adapt the approach described above for finding the frequent closed sets is to first compute all closed sets C (d) and then remove the infrequent ones: FC (minsupp, d) = {X ∈ C (d) : supp (X, d) ≥ minsupp} by removing all closed sets that are not frequent. Unfortunately the collection of closed sets can be much larger than the collection of frequent closed sets. Thus the 1 If supp (X) is not defined then its value is interpreted to be 0. above approach can generate huge number of closed sets that do not have to be generated. A better approach to find the frequent closed sets is to prune the closed sets that cannot satisfy the constraints – such as the minimum support constraint – as soon as possible. If the sequence is scanned only once and nothing is known about the sequence d in advance then no pruning of infrequent closed sets can be done: the rest of the sequence can always contain each closed set at least minsupp times. If more than one pass can be afforded or something is known about the data d in advance then the pruning of closed sets that do not satisfy the constraints can be done as follows: I NTERSECTOR(d) 1 supp ← I NIT-C ONSTRAINTS (d) 2 for each di in d 3 do supp ← I NTERSECT (supp, di ) 4 U PDATE -C ONSTRAINTS (supp, di ) 5 supp ← P RUNE -B Y-C ONSTRAINTS (supp, di ) 6 return (supp : C → N) The function I NTERSECTOR is based on three subroutines: function I NIT-C ONSTRAINTS initializes the data structures used in pruning and computes the initial collection of closed sets, e.g. the the collection C = {R}, function U PDATE -C ONSTRAINTS updates the data structures by one transaction at a time, and function P RUNE -B YC ONSTRAINTS prunes those current closed sets that cannot satisfy the constraints. 3 Adaptation to Frequency Constraints The actual behaviors of the functions I NITC ONSTRAINTS, U PDATE -C ONSTRAINTS and P RUNE -B YC ONSTRAINTS depend on the constraints used to determine the closed sets that are interesting. We shall concentrate on implementing the minimum and the maximum support constraints, i.e., finding the closed sets X ∈ C (d) such that minsupp ≤ supp (X, d) ≤ maxsupp. The efficiency of pruning depends crucially on how much is known about the data. For example, if only the number of transactions in the sequence is known, then all possible pruning is essentially determined by Observation 1 and Observation 2. Observation 1 For all 1 ≤ i ≤ n holds: supp (X, d1,i )+n−i < minsupp ⇒ supp (X, d) < minsupp Observation 2 For all 1 ≤ i ≤ n holds: supp (X, d1,i ) > maxsupp ⇒ supp (X, d) > maxsupp Checking the constraints induced by Observation 1 and Observation 2 can be computed very efficiently. However, the pruning based on these observations might not be very effective: all closed sets in d1,n−minsupp can have frequency at least minsupp and all closed sets in d1,maxsupp can have frequency at most maxsupp. Thus all closed sets in d1,min{n−minsupp,maxsupp} are generated before the observations can be used to prune anything. To be able to do more extreme pruning we need more information about the sequence d. If we are able to know the number of transactions in the sequence, it might be possible to count the supports of items. In that case Observation 3 can be exploited. 
Observation 3 If there exists A ∈ X such that supp (X, d1,i ) + supp ({A} , di+1,n ) < minsupp then supp (X, d) < minsupp. Also, if we know the frequencies of some sets then we can make the following observation: Observation 4 If there exists Y supp (X, d1,i ) + supp (Y, di+1,n ) supp (X, d) < minsupp. ⊆ X such that < minsupp then Proposition 2 Let S be the collection of sets such that supp (Y, d1,i ) and supp (Y, d1,i ) are known for all Y ∈ S, and let S ′ consist of sets Y ∈ S, Y ⊆ X, such that supp (X, d1,i ) + supp (Y, di+1,n ) < minsupp. Then all frequent subsets of X ∈ C are the collection S ′′ of subsets Z ⊆ X such that Z 6⊆ Y for all Y ∈ S ′ , no W ⊂ Z ∈ S ′′ is contained in S ′′ . Proof. If Z ⊆ X is frequent then there is a set in S ′′ containing Z, or Z is contained in some set in S ′ but there is another set Y ∈ C such that supp (Y, d1,i ) > supp (X, d1,i ).  Proposition 3 Let S, S ′ and S ′′ be as in Proposition 2. Then X ∈ C can be replaced by the collection S ′′′ consisting of sets in S ′′ such that supp (Y, d1,i ) + supp (W, di+1,n ) < minsupp for some W ⊆ Z ⊆ Y with Y ∈ C and W ∈ S. Proof. If Z ⊆ X is frequent then it is subset of some set in S ′′ or there is Y ∈ C, Z ⊆ Y , such that supp (Y, d1,i ) + supp (W, di+1,n ) < minsupp for all W ∈ S, W ⊆ Z. If Z ∈ S ′′ is not closed then it is infrequent since none of its supersets is frequent.  Note that these observations do not mean that we could remove the infrequent closed sets from the collection since an intersection of an infrequent closed set with some transaction might still be frequent in the sequence d. However, we can do some pruning based on the observations as shown in Proposition 1. The efficiency of pruning depends crucially also on the ordering of the transactions. In Section 5 we experimentally evaluate some orderings with different data sets. Proposition 1 Let Z be the largest subset of X ∈ C such that for all A ∈ Z hold supp (X, d1,i ) + supp ({A} , di+1,n ) ≥ minsupp. Then X ∈ C can be removed from C if there is a W ⊆ Y ∈ C, Z ⊂ W , such that supp (Y, d1,i ) ≥ supp (X, d1,i ) and for all A ∈ W hold supp (Y, d1,i ) + supp ({A} , di+1,n ) ≥ minsupp, and replaced by Z otherwise. A preliminary adaptation of the algorithm I NTERSEC of Section 2 to minimum support and maximum support constraints is implemented as a program intersector. The main components of the implementation are classes Itemarray, ItemarrayInput and ItemarrayMap. The class Itemarray is a straightforward implementation consisting of int n expressing the number of items in the set and int* items that is a length (at least) n array of items (that are assumed to be nonnegative integers) in ascending order. One of the reasons why this very simple representation of a set is used is that Itemarrays are used also in the data sources, and although some more sophisticated data structures would enable to do some operations more efficiently, we believe that Itemarray reflects better what an arbitrary source of transactions could give. The class ItemarrayInput implements an interface to the data set d. The class handles the pruning of infrequent items from the input and maintaining the numbers of remaining occurrences of each item occurring in the data set. The data set d is accessed by a function pair<Itemarray*,int>* Proof. All frequent subsets of X are contained in Z. If there is a proper superset W ⊆ Y ∈ C of Z such that supp (Y, d1,i ) + supp ({A} , di+1,n ) ≥ minsupp then all frequent subsets of X are contained in W and thus X can be removed. 
Otherwise Z is the largest subset of X that can be frequent and there is no superset of Z that could be frequent. If Z is not closed, then its support is equal to some of its proper supersets’ supports. If Z is added to C then none of proper supersets is frequent and thus also Z is infrequent.  This idea of replacing infrequent sets based on the supports items can be generalized to the case where we know supports for some collection S of sets. 4 The Organization of the Implementation TOR getItemarray() which returns a pointer to next pair<Itemarray*,int>. The returned pointer is NULL if the previous pair were the last one in the data set d. The main difference to the reference implementation given at the home page of Workshop on Frequent Itemset Mining Implementations2 is that pair<Itemarray*,int>* is returned instead of Itemarray*. This change were made partly to reflect the attempt to have the closure property of inductive databases [9] but also because in some cases the data set is readily available in that format (or can be easily transformed into that format). The interface ItemarrayInput is currently implemented in two classes ItemarrayFileInput and ItemarrayMemoryInput. Both of the classes read the data set d from a file consisting of rows of integers with possible count in brackets. Multiple occurrences of same item in one row are taken into account only once. For example, the input file 1 2 4 3 2 5 (54) 1 1 1 1 is transformed into pairs h(1, 2, 3, 4, 5) , 54i and h(1) , 1i. The class ItemarrayFileInput maintains in the main memory only the item statistics (such as the number of remaining occurrences of each item) thus possibly reading the data set several times. The class ItemarrayMemoryInput reads the whole data set d into main memory. The latter one can be much faster since it can also reorder the data set and replace all transactions di , 1 ≤ i ≤ n, with same frequent items by one pair with appropriate count. The implementations of these classes are currently quite slow which might be seen as imitating the performance of real databases quite faithfully. The class ItemarrayMap represents a mapping from Itemarrays to supports. The class consists of a mapping map<Itemarray*,int,CardLex> itemarrays that maps the sets to supports, and a set set<Itemarray*,CardLex> forbidden consisting of the sets that are known to be infrequent or too frequent. The set set<Itemarray*,CardLex> forbidden is needed mainly because of the maximum frequency constraint. The class consists two methods: • The method intersect(const pair<Itemarray*,int>*) intersects the current collection sets represented by the mapping map<Itemarray*,int,CardLex> itemarrays by the given set pair<Itemarray*,int>* and updates the supports appropriately. • The method prune(ItemarrayInput&) prunes the sets that are already known to be infrequent or 2 http://fimi.cs.helsinki.fi/ too frequent based on the statistics maintained by the implementation of the interface ItemarrayInput. (The class CardLex defines a total ordering of sets (of integers) by their cardinality and lexicographically within of each group with same sizes.) The pruning rules used in the current implementation of the method prune(ItemarrayInput&) are Observation 1, Observation 2, and Observation 3. 5 The Experiments name T10I4D100K T40I10D100K chess connect internet kosarak mushroom pumsb pumsb* # of rows 100000 100000 3196 67557 10104 990002 8124 49046 49046 total # of items 1010228 3960507 118252 2904951 300985 8019015 186852 3629404 2475947 Table 1. 
5 The Experiments

Table 1. The data sets

name          # of rows   total # of items
T10I4D100K    100000      1010228
T40I10D100K   100000      3960507
chess         3196        118252
connect       67557       2904951
internet      10104       300985
kosarak       990002      8019015
mushroom      8124        186852
pumsb         49046       3629404
pumsb*        49046       2475947

We tested the efficiency and behavior of the implementation on the data sets listed in Table 1. All data sets except internet were provided by the Workshop on Frequent Itemset Mining Implementations. The data set internet is the Internet Usage data from the UCI KDD Repository (http://kdd.ics.uci.edu/).

If the data sequence is read into main memory, then it can easily be reordered. Even if this is not the case, there exist efficient external memory sorting algorithms that can be used to reorder the data [18]. The ordering of the data can affect the performance significantly. We experimented especially with two orderings: ordering in ascending cardinality and ordering in descending cardinality. The results are shown in Figures 1–9; each figure plots, for the increasing and decreasing cardinality orderings, the number of potential frequent closed sets against the number of seen transactions. Each point (|C|, i) in the figures corresponds to the number |C| of closed sets in the sequence d_{1,i} that could be frequent in the whole sequence d. The reason why there is no point for every number i (1 ≤ i ≤ n) of seen transactions is that the same set of items can occur several times in the sequence d.

Figure 1. T10I4D100K, minsupp = 4600
Figure 2. T40I10D100K, minsupp = 16000
Figure 3. chess, minsupp = 2300
Figure 4. connect, minsupp = 44000
Figure 5. internet, minsupp = 4200
Figure 6. kosarak, minsupp = 42000
Figure 7. mushroom, minsupp = 2000
Figure 8. pumsb, minsupp = 44000
Figure 9. pumsb*, minsupp = 32000

There is no clear winner between the ascending and descending orderings: with the data sets T10I4D100K, T40I10D100K, internet, kosarak, and mushroom the ascending order is better, whereas the descending order seems to be better with the data sets chess, connect, and pumsb. However, it is not clear whether this is due to the chosen minimum support thresholds.

One interpretation of the results is the following: a small set d_i cannot increase the size of C dramatically, since all new closed sets are subsets of d_i and d_i has at most 2^{|d_i|} closed subsets. However, the small sets do not decrease the remaining numbers of occurrences of items very much either. In the case of large sets d_j the situation is the opposite: each large set d_j decreases the supports supp({A}, d_{j+1,n}) of the items A ∈ d_j, but on the other hand it can generate several new closed sets.

Also, we experimented with the two data sets internet and mushroom to see how the behavior of the method changes with the minimum support threshold minsupp. The results are shown in Figure 10 and Figure 11, which plot the elapsed time in seconds against the minimum support threshold, and the number of potential frequent closed sets against the number of seen transactions.

Figure 10. internet, scalability
Figure 11. mushroom, scalability

The pruning seems to work satisfactorily if the minimum support threshold minsupp is high enough. However, it is not clear how much of this is due to the pruning of infrequent items in the class ItemarrayInput and how much is due to the pruning done by the class ItemarrayMap. Unfortunately, the performance collapses rapidly as the minimum support threshold decreases. It is possible that more aggressive pruning could help when the minimum support threshold minsupp is low.

6 Conclusions and Future Work

In this paper we have sketched an approach for finding closed sets with some constraints from data with only a few passes over the data. Also, we described a preliminary implementation of the method for finding frequent but not too frequent closed sets from data. The current version of the implementation is still quite inefficient, but it can hopefully shed some light on the interplay of data and closed sets. As the current implementation of the approach is still very preliminary, there is plenty of room for improvements, e.g., the following ones:

• The ordering of the input seems to play a crucial role in the efficiency of the method. Thus the favorable orderings should be detected and strategies for automatically finding them should be studied.

• The pruning heuristics described in this paper are still quite simplistic. Thus, more sophisticated pruning techniques such as inclusion-exclusion [5] should be tested. Also, the pruning co-operation between the closed set generation and the data source management should be tightened.

• The pruning done by the data source management could be improved. For example, the data source management could recognize consecutive redundancy in the data source.

• The intersection approach can be used to find all closed sets that are subsets of some given sets [11]. The method can be used to compute the closed sets from the maximal sets in one pass over the data. As there exist very efficient methods for computing maximal sets [1, 2, 4, 7, 8, 15], it is possible that the performance of the combination could be quite competitive. Also, supersets of the maximal frequent sets can be found with high probability from a small sample.
Using these estimates one could compute supersets of the frequent closed sets. This approach can be efficient if the supersets found from the sample are close enough to the actual maximal sets.

• After two passes over the data it is easy to do a third pass, or even more. Thus one could apply the intersections with several different minimum support thresholds to get a refining collection of frequent closed sets in the data: the already found frequent closed sets with high frequencies could be used to prune the less frequent closed sets more efficiently than, e.g., the occurrence counters for frequent items.

• If it is not necessary to find the exact collection of closed sets with exact supports, then sampling could be applied [17]. Also, if the data is generated by, e.g., an i.i.d. source, then one can sometimes obtain accurate bounds for the supports from relatively short prefixes d_{1,i} of the sequence d.

• Other kinds of constraints than frequency thresholds should be implemented and experimented with.

References

[1] R. J. Bayardo Jr. Efficiently mining long patterns from databases. In L. M. Haas and A. Tiwary, editors, SIGMOD 1998, Proceedings ACM SIGMOD International Conference on Management of Data, pages 85–93. ACM, 1998.
[2] E. Boros, V. Gurvich, L. Khachiyan, and K. Makino. On the complexity of generating maximal frequent and minimal infrequent sets. In H. Alt and A. Ferreira, editors, STACS 2002, volume 2285 of Lecture Notes in Computer Science, pages 133–141. Springer-Verlag, 2002.
[3] J.-F. Boulicaut and A. Bykowski. Frequent closures as a concise representation for binary data mining. In T. Terano, H. Liu, and A. L. P. Chen, editors, Knowledge Discovery and Data Mining, volume 1805 of Lecture Notes in Artificial Intelligence, pages 62–73. Springer-Verlag, 2000.
[4] D. Burdick, M. Calimlim, and J. Gehrke. MAFIA: A maximal frequent itemset algorithm for transactional databases. In Proceedings of the 17th International Conference on Data Engineering (ICDE'01), pages 443–452, 2001.
[5] T. Calders and B. Goethals. Mining all non-derivable frequent itemsets. In T. Elomaa, H. Mannila, and H. Toivonen, editors, Principles of Data Mining and Knowledge Discovery, volume 2431 of Lecture Notes in Artificial Intelligence, pages 74–85. Springer-Verlag, 2002.
[6] B. Ganter and R. Wille. Formal Concept Analysis: Mathematical Foundations. Springer-Verlag, 1999.
[7] K. Gouda and M. J. Zaki. Efficiently mining maximal frequent itemsets. In N. Cercone, T. Y. Lin, and X. Wu, editors, Proceedings of the 2001 IEEE International Conference on Data Mining, pages 163–170. IEEE Computer Society, 2001.
[8] D. Gunopulos, R. Khardon, H. Mannila, S. Saluja, H. Toivonen, and R. S. Sharma. Discovering all most specific sentences. ACM Transactions on Database Systems, 28(2):140–174, 2003.
[9] T. Imielinski and H. Mannila. A database perspective on knowledge discovery. Communications of the ACM, 39(11):58–64, 1996.
[10] D. L. Kreher and D. R. Stinson. Combinatorial Algorithms: Generation, Enumeration and Search. CRC Press, 1999.
[11] T. Mielikäinen. Finding all occurring sets of interest. In J.-F. Boulicaut and S. Džeroski, editors, 2nd International Workshop on Knowledge Discovery in Inductive Databases, pages 97–106, 2003.
[12] F. Pan, G. Cong, A. K. H. Tung, J. Yang, and M. J. Zaki. CARPENTER: Finding closed patterns in long biological datasets.
In Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2003.
[13] N. Pasquier, Y. Bastide, R. Taouil, and L. Lakhal. Discovering frequent closed itemsets for association rules. In C. Beeri and P. Buneman, editors, Database Theory – ICDT'99, volume 1540 of Lecture Notes in Computer Science, pages 398–416. Springer-Verlag, 1999.
[14] J. Pei, J. Han, and T. Mao. CLOSET: An efficient algorithm for mining frequent closed itemsets. In D. Gunopulos and R. Rastogi, editors, ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery, pages 21–30, 2000.
[15] K. Satoh and T. Uno. Enumerating maximal frequent sets using irredundant dualization. In G. Grieser, Y. Tanaka, and A. Yamamoto, editors, Discovery Science, volume 2843 of Lecture Notes in Artificial Intelligence, pages 256–268. Springer-Verlag, 2003.
[16] G. Stumme, R. Taouil, Y. Bastide, N. Pasquier, and L. Lakhal. Computing iceberg concept lattices with TITANIC. Data & Knowledge Engineering, 42:189–222, 2002.
[17] H. Toivonen. Sampling large databases for association rules. In T. Vijayaraman, A. P. Buchmann, C. Mohan, and N. L. Sarda, editors, VLDB'96, Proceedings of 22nd International Conference on Very Large Data Bases, pages 134–145. Morgan Kaufmann, 1996.
[18] J. S. Vitter. External memory algorithms and data structures: Dealing with massive data. ACM Computing Surveys, 33(2):209–271, 2001.
[19] J. Wang, J. Han, and J. Pei. CLOSET+: Searching for the best strategies for mining frequent closed itemsets. In Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2003.
[20] M. J. Zaki and C.-J. Hsiao. CHARM: An efficient algorithm for closed itemset mining. In R. Grossman, J. Han, V. Kumar, H. Mannila, and R. Motwani, editors, Proceedings of the Second SIAM International Conference on Data Mining. SIAM, 2002.

ARMOR: Association Rule Mining based on ORacle

Vikram Pudi
Intl. Inst. of Information Technology
Hyderabad, India
vikram@iiit.net

Jayant R. Haritsa
Database Systems Lab, SERC
Indian Institute of Science
Bangalore, India
haritsa@dsl.serc.iisc.ernet.in

Abstract

In this paper, we first focus our attention on the question of how much space remains for performance improvement over current association rule mining algorithms. Our strategy is to compare their performance against an "Oracle algorithm" that knows in advance the identities of all frequent itemsets in the database and only needs to gather their actual supports to complete the mining process. Our experimental results show that current mining algorithms do not perform uniformly well with respect to the Oracle for all database characteristics and support thresholds. In many cases there is a substantial gap between the Oracle's performance and that of the current mining algorithms. Second, we present a new mining algorithm, called ARMOR, that is constructed by making minimal changes to the Oracle algorithm. ARMOR consistently performs within a factor of two of the Oracle on both real and synthetic datasets over practical ranges of support specifications.

1 Introduction

We focus our attention on the question of how much space remains for performance improvement over current association rule mining algorithms. Our approach is to compare their performance against an "Oracle algorithm" that knows in advance the identities of all frequent itemsets in the database and only needs to gather the actual supports of these itemsets to complete the mining process. Clearly, any practical algorithm will have to do at least this much work in order to generate mining rules.
This "Oracle approach" permits us to clearly demarcate the maximal space available for performance improvement over the currently available algorithms. Further, it enables us to construct new mining algorithms from a completely different perspective, namely, as minimally-altered derivatives of the Oracle.

First, we show that while the notion of the Oracle is conceptually simple, its construction is not equally straightforward. In particular, it is critically dependent on the choice of data structures used during the counting process. We present a carefully engineered implementation of Oracle that makes the best choices for these design parameters at each stage of the counting process. Our experimental results show that there is a considerable gap in the performance between the Oracle and existing mining algorithms.

Second, we present a new mining algorithm, called ARMOR (Association Rule Mining based on ORacle), whose structure is derived by making minimal changes to the Oracle, and which is guaranteed to complete in two passes over the database. Although ARMOR is derived from the Oracle, it may be seen to share the positive features of a variety of previous algorithms such as PARTITION [9], CARMA [5], AS-CPA [6], VIPER [10] and DELTA [7]. Our empirical study shows that ARMOR consistently performs within a factor of two of the Oracle, over both real (BMS-WebView-1 [14] from Blue Martini Software) and synthetic databases (from the IBM Almaden generator [2]) over practical ranges of support specifications.

The environment we consider is one where the pattern lengths in the database are small enough that the size of the mining results is comparable to the available main memory. This holds when the mined data conforms to the sparse nature of market basket data, for which association rule mining was originally intended. It is perhaps inappropriate to apply the problem of mining all frequent itemsets to dense datasets. For ease of exposition, we will use the notation shown in Table 1 in the remainder of this paper.

Organization: The remainder of this paper is organized as follows: The design of the Oracle algorithm is described in Section 2 and is used to evaluate the performance of current algorithms in Section 3. Our new ARMOR algorithm is presented in Section 4 and its main memory requirements are discussed in Section 5. The performance of ARMOR is evaluated in Section 6. Finally, in Section 7, we summarize the conclusions of our study.

Table 1. Notation
– Database of customer purchase transactions
– User-specified minimum rule support
– Set of frequent itemsets in the database
– Set of itemsets in the negative border of the frequent itemsets
– Set of disjoint partitions of the database
– Transactions in the partitions scanned so far during algorithm execution, excluding the current partition
– Transactions in the partitions scanned so far during algorithm execution, including the current partition
– DAG structure to store candidates

2 The Oracle Algorithm

In this section we present the Oracle algorithm which, as mentioned in the Introduction, "magically" knows in advance the identities of all frequent itemsets in the database and only needs to gather the actual supports of these itemsets. Clearly, any practical algorithm will have to do at least this much work in order to generate mining rules.
Oracle in item-list format (which takes as input the database, is organized as a set of rows with each row storing an ordered list of item-identifiers (IID), representing the items purchased in the transaction), the set of frequent itemsets, , and its corresponding negative border,  , and outputs the supports of these itemsets by making one scan over the database. We first describe the mechanics of the Oracle algorithm below and then move on to discuss the rationale behind its design choices in Section 2.2. 2.1 The Mechanics of Oracle For ease of exposition, we first present the manner in which Oracle computes the supports of 1-itemsets and 2itemsets and then move on to longer itemsets. Note, however, that the algorithm actually performs all these computations concurrently in one scan over the database. 2.1.1 Counting Singletons and Pairs Data-Structure Description The counters of singletons (1-itemsets) are maintained in a 1-dimensional lookup array, ½ , and that of pairs (2-itemsets), in a lower triangular 2-dimensional lookup array,  ¾ (Similar arrays are also used in Apriori [2, 11] for its first two passes.) The  th entry in the array ½ contains two fields: (1) , the counter for the itemset  corresponding to the  th item, and (2)  , the number of frequent itemsets prior to  in  ½ , if  is frequent; null, otherwise. ArrayCount (  ½ ¾ ) Input: Transaction , 1-itemsets Array ½ , 2-itemsets Array ¾ Output: Arrays ½ and ¾ with their counts updated over 1. Itemset  = null; // to store frequent items from in Item-List format 2. for each item in transaction 3. ½ ℄  ; ℄  null 4. if ½ 5. append to  6. for = 1 to    // enumerate 2-itemsets 7. for =   to     ℄ ℄ 8. // row index ½ = ½  ℄ ℄ // column index 9. ¾ = ½ ¾ 10. ½ ¾ ℄  ;                             Figure 1. Counting Singletons and Pairs in Oracle Algorithm Description The ArrayCount function shown in Figure 1 takes as inputs, a transaction along with  ½ and ¾ , and updates the counters of these arrays over . In the ArrayCount function, the individual items in the transaction are enumerated (lines 2–5) and for each item, its corresponding count in  ½ is incremented (line 3). During this process, the frequent items in are stored in a separate itemset (line 5). We then enumerate all pairs of items contained in (lines 6–10) and increment the counters of the corresponding 2-itemsets in  ¾ (lines 8–10). 2.1.2 Counting  -itemsets,  ¾ Data-Structure Description Itemsets in   of length greater than 2 and their related information (counters, etc.) are stored in a DAG structure  , which is pictorially shown in Figure 2 for a database with items A, B, C, D. Although singletons and pairs are stored in lookup arrays, as mentioned before, for expository ease, we assume that they too are stored in  in the remainder of this discussion. Each itemset is stored in a separate node of  and is linked to the first two (in a lexicographic ordering) of its subsets. We use the terms “mother” and “father” of an itemset to refer to the (lexicographically) smaller subset and the (lexicographically) larger subset, respectively. E.g., A, B and A, C are the mother and father respectively of A, B, C. For each itemset  in  , we also store with it links to those supersets of  for which  is a mother. We call this list of links as childset. Since each itemset is stored in a separate node in the DAG, we use the terms “itemset” and “node” interchangeably in the remainder of this discussion. Also, we use  to denote the set of itemsets that are stored in the DAG structure  . 
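The direct-lookup counting of singletons and pairs described in Section 2.1.1 can be sketched in C++ as follows. This is only an illustration: it counts all pairs rather than only pairs of frequent items, it omits the bookkeeping field that records the number of frequent itemsets preceding an item, and the class and member names are ours rather than the paper's.

#include <cstddef>
#include <vector>

// Direct-lookup counting in the spirit of ArrayCount: a 1-d array for
// 1-itemsets and a flattened lower-triangular 2-d array for 2-itemsets.
// Item ids are assumed to lie in [0, numItems), with numItems >= 2.
class PairCounter {
public:
    explicit PairCounter(int numItems)
        : singles_(numItems, 0),
          pairs_(static_cast<std::size_t>(numItems) * (numItems - 1) / 2, 0) {}

    // Count one transaction given as a sorted list of distinct item ids.
    void countTransaction(const std::vector<int>& t) {
        for (int a : t) ++singles_[a];
        for (std::size_t i = 0; i < t.size(); ++i)      // enumerate 2-itemsets of t
            for (std::size_t j = i + 1; j < t.size(); ++j)
                ++pairs_[pairIndex(t[i], t[j])];
    }

    long long singleCount(int a) const { return singles_[a]; }
    long long pairCount(int a, int b) const { return pairs_[pairIndex(a, b)]; }

private:
    // Flattened lower-triangular index of the pair {a, b} with a < b.
    static std::size_t pairIndex(int a, int b) {
        return static_cast<std::size_t>(b) * (b - 1) / 2 + static_cast<std::size_t>(a);
    }
    std::vector<long long> singles_;
    std::vector<long long> pairs_;
};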
A AB AC ABC B C AD ABD BC ACD D BD BCD CD mother father ABCD Figure 2. DAG Structure Containing Power Set of A,B,C,D Algorithm Description We use a partitioning [9] scheme wherein the database is logically divided into  disjoint horizontal partitions ½  ¾    . In this scheme, itemsets being counted are enumerated only at the end of each partition and not after every tuple. Each partition is as large as can fit in available main memory. For ease of exposition, we assume that the partitions are equi-sized. However, we hasten to add that the technique is easily extendible to arbitrary partition sizes. The pseudo-code of Oracle is shown in Figure 3 and operates as follows: The ReadNextPartition function (line 3) reads tuples from the next partition and simultaneously creates tid-lists½ (within that partition) of singleton itemsets in  . Note that this conversion of the database to the tid-list (TL) format is an on-the-fly operation and does not change the complexity of Oracle by more than a (small) constant factor. Next, the Update function (line 5) is applied on each singleton in  . This function takes a node  in  as input and updates the counts of all descendants of  to reflect their counts over the current partition. The count of any itemset within a partition is equal to the length of its corresponding tidlist (within that partition). The tidlist of an itemset can be obtained as the intersection of the tidlists of its mother and father and this process is started off using the tidlists of frequent 1-itemsets. The exact details of tidlist computation are discussed later. We now describe the manner in which the itemsets in  are enumerated after reading in a new partition. The set of links,       , induce a spanning tree of  (e.g. consider only the solid edges in Figure 2). We perform a depth first search on this spanning tree to enumerate all its itemsets. When a node in the tree is visited, we compute the tidlists of all its children. This ensures that when an itemset is visited, the tidlists of its mother and father have already been computed. Ë ½ A tid-list of an itemset contain . is an ordered list of TIDs of transactions that Oracle ( ,  ) Input: Database , Itemsets to be Counted  Output: Itemsets in  with Supports  = Number of Partitions 1. 2. for  = 1 to  3. ReadNextPartition( ,  ); 4. for each singleton  in  5. Update( );    Figure 3. The Oracle Algorithm Update ( ) Input: DAG Node  Output:  and its Descendents with Counts Updated 1.  = convert    to Tid-vector format //  is statically allocated 2. for each node  in    3.    = Intersect( ,     );   +=    4. 5. for each node  in    6. Update( ); Figure 4. Updating Counts Intersect ( , ) Input: Tid-vector  , Tid-list Output:   1. Tid-list  =  2. for each  in 3.  =    (tid of first transaction in current partition) 4. if   ℄ = 1 then 5.  =    6. return  Figure 5. Intersection The above processing is captured in the function Update whose pseudo-code is shown in Figure 4. Here, the tidlist of a given node  is first converted to the tid-vector (TV) format¾ (line 1). Then, tidlists of all children of  are computed (lines 2–4) after which the same children are visited in a depth first search (lines 5–6). The mechanics of tidlist computation, as promised earlier, are given in Figure 5. The Intersect function shown here takes as input a tid-vector  and a tid-list . Each  in is added to the result if    ℄ is 1 (lines 2–5) where  is defined in line 3 and represents the position of the transaction relative to the current partition. 
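The tid-vector/tid-list intersection just described can be sketched as follows, under the assumption that transaction identifiers are numbered consecutively within the current partition; the function and type names are ours, not the paper's.

#include <vector>

using TidList = std::vector<int>;   // tids within the current partition, ascending

// Materialize a tid-list as a tid-vector: one bit per transaction of the partition.
std::vector<bool> toTidVector(const TidList& tids, int partitionStart, int partitionSize) {
    std::vector<bool> bits(partitionSize, false);
    for (int tid : tids)
        bits[tid - partitionStart] = true;
    return bits;
}

// Intersect a tid-vector with a tid-list: keep exactly the tids whose bit is set.
TidList intersect(const std::vector<bool>& bits, const TidList& tids, int partitionStart) {
    TidList result;
    for (int tid : tids)
        if (bits[tid - partitionStart])
            result.push_back(tid);
    return result;
}

// The count of an itemset within a partition is the length of its tid-list there,
// so a caller would do something like:
//   node.count += intersect(motherBits, fatherTids, partitionStart).size();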
¾ A tid-vector of an itemset is a bit-vector of 1’s and 0’s to represent the presence or absence respectively, of in the set of customer transactions. 2.2 Rationale for the Oracle Design 3 Performance Study We show that Oracle is optimal in two respects: (1) It enumerates only those itemsets in  that need to be enumerated, and (2) The enumeration is performed in the most efficient way possible. These results are based on the following two theorems. Due to lack of space we have deferred the proofs of theorems to [8]. In the previous section, we have described the Oracle algorithm. In order to assess the performance of current mining algorithms with respect to the Oracle algorithm, we have chosen VIPER [10] and FP-growth [4], among the latest in the suite of online mining algorithms. For completeness and as a reference point, we have also included the classical Apriori in our evaluation suite. Our experiments cover a range of database and mining workloads, and include the typical and extreme cases considered in previous studies – the only difference is that we also consider database sizes that are significantly larger than the available main memory. The performance metric in all the experiments is the total execution time taken by the mining operation. The databases used in our experiments were synthetically generated using the technique described in [2] and attempt to mimic the customer purchase behavior seen in retailing environments. The parameters used in the synthetic generator and their default values are described in Table 2. In particular, we consider databases with parameters T10.I4, T20.I12 and T40.I8 with 10 million tuples in each of them. Theorem 2.1 If the size of each partition is large enough   of length greater than 2 is that every itemset in present at least once in it, then the only itemsets being enumerated in the Oracle algorithm are those whose counts need to be incremented in that partition. Theorem 2.2 The cost of enumerating each itemset in Oracle is ¢´½µ with a tight constant factor. While Oracle is optimal in these respects, we note that there may remain some scope for improvement in the details of tidlist computation. That is, the Intersect function (Figure 5) which computes the intersection of a tid-vector  and a tid-list requires ¢´ µ operations.  itself was originally constructed from a tid-list, although this cost is amortized over many calls to the Intersect function. We plan to investigate in our future work whether the intersection of two sets can, in general, be computed more efficiently – for example, using diffsets, a novel and interesting approach suggested in [13]. The diffset of an itemset  is the setdifference of the tid-list of  from that of its mother. Diffsets can be easily incorporated in Oracle – only the Update function in Figure 4 of Section 2 is to be changed to compute diffsets instead of tidlists by following the techniques suggested in [13]. Advantages of Partitioning Schemes Oracle, as discussed above, uses a partitioning scheme. An alternative commonly used in current association rule mining algorithms, especially in hashtree [2] based schemes, is to use a tuple-by-tuple approach. A problem with the tuple-by-tuple approach, however, is that there is considerable wasted enumeration of itemsets. The core operation in these algorithms is to determine all candidates that are subsets of the current transaction. 
Given that a frequent itemset is present in the current transaction, we need to determine all candidates that are immediate supersets of it and are also present in the current transaction. In order to achieve this, it is often necessary to enumerate and check for the presence of many more candidates than those that are actually present in the current transaction.

Table 2. Parameter Table

Meaning                                          Default Values
Number of items                                  1000
Mean transaction length                          10, 20, 40
Number of potentially frequent itemsets          2000
Mean length of potentially frequent itemsets     4, 8, 12
Number of transactions in the database           10M

We set the rule support threshold values to as low as was feasible with the available main memory. At these low support values the number of frequent itemsets exceeded twenty five thousand! Beyond this, we felt that the number of rules generated would be enormous and the purpose of mining – to find interesting patterns – would not be served. In particular, we set the rule support threshold values for the T10.I4, T20.I12 and T40.I8 databases to the ranges (0.1%–2%), (0.4%–2%) and (1.15%–5%), respectively.

Our experiments were conducted on a 700-MHz Pentium III workstation running Red Hat Linux 6.2, configured with a 512 MB main memory and a local 18 GB SCSI 10000 rpm disk. For the T10.I4, T20.I12 and T40.I8 databases, the associated database sizes were approximately 500 MB, 900 MB and 1.7 GB, respectively. All the algorithms in our evaluation suite are written in C++. We implemented a basic version of the FP-growth algorithm (the original implementation by Han et al. was not available) wherein we assume that the entire FP-tree data structure fits in main memory. Finally, the partition size in Oracle was fixed to be 20K tuples.

3.1 Experimental Results for Current Mining Algorithms

We now report on our experimental results. We conducted two experiments to evaluate the performance of current mining algorithms with respect to the Oracle. Our first experiment was run on large (10M tuples) databases, while our second experiment was run on small (100K tuples) databases.

3.1.1 Experiment 1: Performance of Current Algorithms

In our first experiment, we evaluated the performance of the Apriori, VIPER and Oracle algorithms for the T10.I4, T20.I12 and T40.I8 databases, each containing 10M transactions; these results are shown in Figures 6a–c. The x-axis in these graphs represents the support threshold values while the y-axis represents the response times of the algorithms being evaluated. In these graphs, we see that the response times of all algorithms increase exponentially as the support threshold is reduced. This is only to be expected since the number of itemsets in the output increases exponentially with decrease in the support threshold. We also see that there is a considerable gap in the performance of both Apriori and VIPER with respect to Oracle. For example, in Figure 6a, at a support threshold of 0.1%, the response time of VIPER is more than 6 times that of Oracle whereas the response time of Apriori is more than 26 times! In this experiment, we could not evaluate the performance of FP-growth because it did not complete in any of our runs on large databases due to its heavy and database-size-dependent utilization of main memory. The reason for this is that FP-growth stores the database itself in a condensed representation in a data structure called the FP-tree. In [4], the authors briefly discuss the issue of constructing disk-resident FP-trees. We, however, did not take this into account in our implementation.
3.1.2 Experiment 2: Small Databases

Since, as mentioned above, it was not possible for us to evaluate the performance of FP-growth on large databases due to its heavy utilization of main memory, we evaluated the performance of FP-growth and the other current algorithms on small databases consisting of 100K transactions. The results of this experiment are shown in Figures 7a–c, which correspond to the T10.I4, T20.I12 and T40.I8 databases, respectively. In these graphs, we see that there continues to be a considerable gap in the performance of current mining algorithms with respect to Oracle. For example, for the T40.I8 database, the response time of FP-growth is more than 8 times that of Oracle for the entire support threshold range.

Figure 6. Performance of Current Algorithms (Large Databases): (a) T10.I4.D10M, (b) T20.I12.D10M, (c) T40.I8.D10M – response time in seconds of Oracle, Apriori and VIPER versus support (as a %).

Figure 7. Performance of Current Algorithms (Small Databases): (a) T10.I4.D100K, (b) T20.I12.D100K, (c) T40.I8.D100K – response time in seconds of Oracle, Apriori, VIPER and FP-growth versus support (as a %).

4 The ARMOR Algorithm

In the previous section, our experimental results have shown that there is a considerable gap in the performance between the Oracle and existing mining algorithms. We now move on to describe our new mining algorithm, ARMOR (Association Rule Mining based on ORacle). In this section, we overview the main features and the flow of execution of ARMOR – the details of candidate generation are deferred to [8] due to lack of space.

The guiding principle in our design of the ARMOR algorithm is that we consciously make an attempt to determine the minimal amount of change to Oracle required to result in an online algorithm. This is in marked contrast to the earlier approaches, which designed new algorithms by trying to address the limitations of previous online algorithms. That is, we approach the association rule mining problem from a completely different perspective.

ARMOR (D, I, minsup)
Input: Database D, Set of Items I, Minimum Support minsup
Output: Frequent itemsets with their supports
1.  n = Number of Partitions
    // ----- First Pass -----
2.  C = I   // candidate set (in a DAG)
3.  for i = 1 to n
4.      ReadNextPartition(P_i, C);
5.      for each singleton X in C
6.          X.count += count of X in P_i
7.          Update1(X, minsup);
    // ----- Second Pass -----
8.  RemoveSmall(C, minsup);
9.  OutputFinished(C, minsup);
10. for i = 1 to n
11.     if (all candidates in C have been output)
12.         exit
13.     ReadNextPartition(P_i, C);
14.     for each singleton X in C
15.         Update2(X, minsup);

Figure 8. The ARMOR Algorithm

In ARMOR, as in Oracle, the database is conceptually partitioned into n disjoint blocks P_1, P_2, ..., P_n. At most two passes are made over the database. In the first pass we form a set of candidate itemsets, C, that is guaranteed to be a superset of the set of frequent itemsets. During the first pass, the counts of candidates in C are determined over each partition in exactly the same way as in Oracle, by maintaining the candidates in a DAG structure.
The 1-itemsets and 2-itemsets are stored in lookup arrays as in Oracle. But unlike in Oracle, candidates are inserted and removed from  at the end of each partition. Generation and removal of candidates is done simultaneously while computing counts. The details of candidate generation and removal during the first pass are described in [8] due to lack of space. For ease of exposition we assume in the remainder of this section that all candidates (including 1-itemsets and 2-itemsets) are stored in the DAG. Along with each candidate  , we also store the following three integers as in the CARMA algorithm [5]: (1)   : the number of occurrences of  since  was last inserted in  . (2)      : the index of the partition at which  was inserted in  . (3)    : upper bound on the number of occurrences of  before  was inserted in  .      and    are computed when  is inserted into  in a manner identical to CARMA. While the CARMA algorithm works on a tuple-by-tuple basis, we have adapted the semantics of these fields to suit the partitioning approach. If the database scanned so far is (refer Table 1), then the support of any candidate  in  will lie in the range          ℄ [5]. These bounds are denoted by minSupport( ) and maxSupport( ), respectively. We define an itemset  to be -frequent if minSupport    . Unlike in the CARMA algorithm where only frequent itemsets are stored at any stage, the DAG structure in ARMOR contains other candidates, including the negative border of the -frequent itemsets, to ensure efficient candidate generation. The details are given in [8]. At the end of the first pass, the candidate set  is pruned to include only -frequent itemsets and their negative border. The counts of itemsets in  over the entire database are determined during the second pass. The counting process is again identical to that of Oracle. No new candidates are generated during the second pass. However, candidates may be removed. The details of candidate removal in the second pass is deferred to [8]. The pseudo-code of ARMOR is shown in Figure 8 and is explained below. 4.1 First Pass At the beginning of the first pass, the set of candidate itemsets  is initialized to the set of singleton itemsets (line 2). The ReadNextPartition function (line 4) reads tuples from the next partition and simultaneously creates tid-lists of singleton itemsets in  . After reading in the entire partition, the Update1 function (details in [8]) is applied on each singleton in  (lines 5–7). It increments the counts of existing candidates by their corresponding counts in the current partition. It is also responsible for generation and removal of candidates. At the end of the first pass,  contains a superset of the set of frequent itemsets. For a candidate in  that has been inserted at partition   , its count over the partitions     will be available. 4.2 Second Pass At the beginning of the second pass, candidates in  that are neither -frequent nor part of the current negative border are removed from  (line 8). For candidates that have been inserted in  at the first partition, their counts over the entire database will be available. These itemsets with their counts are output (line 9). The OutputFinished function also performs the following task: If it outputs an itemset  and  has no supersets left in  ,  is removed from  . During the second pass, the ReadNextPartition function (line 13) reads tuples from the next partition and creates tid-lists of singleton itemsets in  . 
After reading in the entire partition, the Update2 function (details in [8]) is applied on each singleton in C (lines 14–15). Finally, before reading in the next partition, we check to see if there are any more candidates. If not, the mining process terminates.

5 Memory Utilization in ARMOR

In the design and implementation of ARMOR, we have opted for speed in most decisions that involve a space-speed tradeoff. Therefore, the main memory utilization in ARMOR is certainly more than that of algorithms such as Apriori. However, in the following discussion, we show that the memory usage of ARMOR is well within the reach of current machine configurations. This is also experimentally confirmed in the next section.

The main memory consumption of ARMOR comes from the following sources: (1) the 1-d and 2-d arrays for storing counters of singletons and pairs, respectively; (2) the DAG structure for storing counters (and tidlists) of longer itemsets; and (3) the current partition.

The total number of entries in the 1-d and 2-d arrays and in the DAG structure corresponds to the number of candidates in ARMOR, which, as we have discussed in [8], is only marginally more than the size of the required mining output. For the moment, if we disregard the space occupied by the tidlists of itemsets, then the amortized amount of space taken by each candidate is a small constant (about 10 integers for the DAG and 1 integer for the arrays). E.g., if there are 1 million candidates in the DAG and 10 million in the arrays, the space required is about 80 MB. Since the environment we consider is one where the pattern lengths are small, the number of candidates will typically be well within the available main memory. [12] discusses alternative approaches when this assumption does not hold.

Regarding the space occupied by the tidlists of itemsets, note that ARMOR only needs to store tidlists of itemsets that are frequent over the data scanned so far, and the number of such itemsets is of the same order as the number of frequent itemsets. The total space occupied by tidlists while processing a partition is then bounded by the product of the partition size and the number of frequent itemsets (in integers). E.g., for the partition sizes and numbers of frequent itemsets considered here, the space occupied by tidlists is bounded by about 400 MB. We assume the number of frequent itemsets to be in the range of a few thousands at most, because otherwise the total number of rules generated would be enormous and the purpose of mining would not be served. Note that the above bound is very pessimistic. Typically, the lengths of tidlists are much smaller than the partition size, especially as the itemset length increases.

Main memory consumed by the current partition is small compared to the above two factors. E.g., if each transaction occupies 1 KB, a partition of size 20K would require only 20 MB of memory. Even in these extreme examples, the total memory consumption of ARMOR is 500 MB, which is acceptable on current machines. Therefore, in general we do not expect memory to be an issue for mining market-basket databases using ARMOR. Further, even if it does happen to be an issue, it is easy to modify ARMOR to free the space allocated to tidlists at the expense of time: the tidlists can be freed after line 3 in the Update function shown in Figure 4. A final observation is that the main memory consumption of ARMOR is proportional to the size of the output and does not "explode" as the input problem size increases.

6 Experimental Results for ARMOR

We evaluated the performance of ARMOR with respect to Oracle on a variety of databases and support characteristics. We now report on our experimental results for the same performance model described in Section 3. Since Apriori, FP-growth and VIPER have already been compared against Oracle in Section 3.1, we do not repeat those observations here, but focus on the performance of ARMOR. This lends to the visual clarity of the graphs. We hasten to add that ARMOR does outperform the other algorithms.

6.1 Experiment 3: Performance of ARMOR

In this experiment, we evaluated the response time performance of the ARMOR and Oracle algorithms for the T10.I4, T20.I12 and T40.I8 databases, each containing 10M transactions; these results are shown in Figures 9a–c. In these graphs, we first see that ARMOR's performance is close to that of Oracle for high supports. This is because of the following reasons: The frequent itemset distribution is sparse at high supports, resulting in only a few frequent itemsets with supports "close" to the minimum support. Hence, frequent itemsets are likely to be locally frequent within most partitions. Even if they are not locally frequent in a few partitions, it is very likely that they are still frequent over the data scanned so far, and hence their counters are updated even over these partitions. Therefore, the complete counts of most candidates would be available at the end of the first pass, resulting in a "light and short" second pass. Hence, it is expected that the performance of ARMOR will be close to that of Oracle for high supports.

Since the frequent itemset distribution becomes dense at low supports, the above argument does not hold in this support region. Hence we see that ARMOR's performance relative to Oracle decreases at low supports. But, what is far more important is that ARMOR consistently performs within a factor of two of Oracle. This is highlighted in Table 3, where we show the ratio of the performance of ARMOR to that of Oracle for the lowest support value considered for each of the databases.

Figure 9. Performance of ARMOR (Synthetic Datasets): (a) T10.I4.D10M, (b) T20.I12.D10M, (c) T40.I8.D10M – response time in seconds of Oracle and ARMOR versus support (as a %).

Table 3. Worst-case Efficiency of ARMOR w.r.t. Oracle

Database (10M tuples)   support (%)   ARMOR (seconds)   Oracle (seconds)   ARMOR / Oracle
T10.I4                  0.1           371.44            226.99            1.63
T20.I12                 0.4           1153.42           814.01            1.41
T40.I8                  1.15          2703.64           2267.26           1.19

6.2 Experiment 4: Memory Utilization in ARMOR

The previous experiments were conducted with the total number of items set to 1K. In this experiment we set the number of items to 20K for the T10.I4 database – this environment represents an extremely stressful situation for ARMOR with regard to memory utilization due to the very large number of items. Figure 10 shows the memory utilization of ARMOR as a function of support for the 1K-item and 20K-item cases. We see that the main memory utilization of ARMOR scales well with the number of items. For example, at the 0.1% support threshold, the memory consumption of ARMOR for 1K items was 104 MB while for 20K items it was 143 MB – an increase of less than 38% for a 20-fold increase in the number of items! The reason for this is that the main memory utilization of ARMOR does not depend directly on the number of items, but only on the size of the output, as discussed in Section 5.

Figure 10. Memory Utilization in ARMOR (T10.I4.D10M): memory used in MB versus support (as a %) for 1K and 20K items.

6.3 Experiment 5: Real Datasets

Despite repeated efforts, we were unable to obtain large real datasets that conform to the sparse nature of market basket data, since such data is not publicly available due to proprietary reasons.
The datasets in the UC Irvine public domain repository [3], which are commonly used in data mining studies, were not suitable for our purpose since they are dense and have long patterns. We could, however, obtain two datasets – BMS-WebView-1, clickstream data from Blue Martini Software [14], and EachMovie, a movie database from Compaq Equipment Corporation [1] – which we transformed to the format of boolean market basket data. The resulting databases had 59,602 and 61,202 transactions respectively, with 870 and 1648 distinct items. We set the rule support threshold values for the BMS-WebView-1 and EachMovie databases to the ranges (0.06%–0.1%) and (3%–10%), respectively. The results of these experiments are shown in Figures 11a–b. We see in these graphs that the performance of ARMOR continues to be within twice that of Oracle. The ratio of ARMOR's performance to that of Oracle at the lowest support value of 0.06% for the BMS-WebView-1 database was 1.83, whereas at the lowest support value of 3% for the EachMovie database it was 1.73.

Figure 11. Performance of ARMOR (Real Datasets): (a) BMS-WebView-1, (b) EachMovie – response time in seconds of Oracle and ARMOR versus support (as a %).

6.4 Discussion of Experimental Results

We now explain the reasons why ARMOR should typically perform within a factor of two of Oracle. First, we notice that the only difference between the single pass of Oracle and the first pass of ARMOR is that ARMOR continuously generates and removes candidates. Since the generation and removal of candidates in ARMOR is dynamic and efficient, this does not result in a significant additional cost for ARMOR. Further, since candidates in ARMOR that are neither frequent over the data scanned so far nor part of the current negative border are continuously removed, any itemset that is locally frequent within a partition, but not globally frequent in the entire database, is likely to be removed from the candidate set during the course of the first pass (unless it belongs to the current negative border). Hence the resulting candidate set in ARMOR is a good approximation of the required mining output. In fact, in our experiments, we found that in the worst case the number of candidates counted in ARMOR was only about ten percent more than the required mining output. The above two reasons indicate that the cost of the first pass of ARMOR is only slightly more than that of (the single pass in) Oracle.

Next, we notice that the only difference between the second pass of ARMOR and (the single pass in) Oracle is that in ARMOR, candidates are continuously removed. Hence the number of itemsets being counted in ARMOR during the second pass quickly reduces to much less than that of Oracle. Moreover, ARMOR does not necessarily perform a complete scan over the database during the second pass, since the second pass ends when there are no more candidates. Due to these reasons, we would expect that the cost of the second pass of ARMOR is usually less than that of (the single pass in) Oracle.
Since the cost of the first pass of ARMOR is usually only slightly more than that of (the single pass in) Oracle and that of the second pass is usually less than that of (the single pass in) Oracle, it follows that ARMOR will typically perform within a factor of two of Oracle. In summary, due to the above reasons, it appears unlikely for it to be possible to design algorithms that substantially reduce either the number of database passes or the number of candidates counted. These represent the primary bottlenecks in association rule mining. Further, since ARMOR utilizes the same itemset counting technique of Oracle, further overall improvement without domain knowledge seems extremely difficult. Finally, even though we have not proved optimality of Oracle with respect to tidlist intersection, we note that any smart intersection techniques that may be implemented in Oracle can also be used in ARMOR. 7 Conclusions A variety of novel algorithms have been proposed in the recent past for the efficient mining of association rules, each in turn claiming to outperform its predecessors on a set of standard databases. In this paper, our approach was to quantify the algorithmic performance of association rule mining algorithms with regard to an idealized, but practically infeasible, “Oracle”. The Oracle algorithm utilizes a partitioning strategy to determine the supports of itemsets in the required output. It uses direct lookup arrays for counting singletons and pairs and a DAG data-structure for counting longer itemsets. We have shown that these choices are optimal in that only required itemsets are enumerated and that the cost of enumerating each itemset is ¢´½µ. Our experimental results showed that there was a substantial gap between the performance of current mining algorithms and that of the Oracle. We also presented a new online mining algorithm called ARMOR (Association Rule Mining based on ORacle), that was constructed with minimal changes to Oracle to result in an online algorithm. ARMOR utilizes a new method of candidate generation that is dynamic and incremental and is guaranteed to complete in two passes over the database. Our experimental results demonstrate that ARMOR performs within a factor of two of Oracle. Acknowledgments This work was partially supported by a Swarnajayanti Fellowship from the Dept. of Science and Technology, Govt. of India. References [1] Eachmovie collaborative filtering data set. http://www.research.compaq.com/SRC/eachmovie/, 1997. [2] R. Agrawal and R. Srikant. Fast algorithms for mining association rules. In Proc. of Intl. Conf. on Very Large Databases (VLDB), Sept. 1994. [3] C. L. Blake and C. J. Merz. UCI repository of machine learning databases. http://www.ics.uci.edu/ mlearn/MLRepository. html, 1998. [4] J. Han, J. Pei, and Y. Yin. Mining frequent patterns without candidate generation. In Proc. of ACM SIGMOD Intl. Conf. on Management of Data, May 2000. [5] C. Hidber. Online association rule mining. In Proc. of ACM SIGMOD Intl. Conf. on Management of Data, June 1999. [6] J. Lin and M. H. Dunham. Mining association rules: Antiskew algorithms. In Proc. of Intl. Conf. on Data Engineering (ICDE), 1998. [7] V. Pudi and J. Haritsa. Quantifying the utility of the past in mining large databases. Information Systems, July 2000. [8] V. Pudi and J. Haritsa. On the optimality of association-rule mining algorithms. Technical Report TR-2001-01, DSL, Indian Institute of Science, 2001. [9] A. Savasere, E. Omiecinski, and S. Navathe. 
An efficient algorithm for mining association rules in large databases. In Proc. of Intl. Conf. on Very Large Databases (VLDB), 1995. [10] P. Shenoy, J. Haritsa, S. Sudarshan, G. Bhalotia, M. Bawa, and D. Shah. Turbo-charging vertical mining of large databases. In Proc. of ACM SIGMOD Intl. Conf. on Management of Data, May 2000. [11] R. Srikant and R. Agrawal. Mining generalized association rules. In Proc. of Intl. Conf. on Very Large Databases (VLDB), Sept. 1995. [12] Y. Xiao and M. H. Dunham. Considering main memory in mining association rules. In Proc. of Intl. Conf. on Data Warehousing and Knowledge Discovery (DAWAK), 1999. [13] M. J. Zaki and K. Gouda. Fast vertical mining using diffsets. Technical Report 01-1, Rensselaer Polytechnic Institute, 2001. [14] Z. Zheng, R. Kohavi, and L. Mason. Real world performance of association rule algorithms. In Proc. of Intl. Conf. on Knowledge Discovery and Data Mining (KDD), Aug. 2001. AIM: Another Itemset Miner Amos Fiat, Sagi Shporer School of Computer Science Tel-Aviv University Tel Aviv, Israel {fiat, shporer}@tau.ac.il Abstract 1.1. Contributions of this paper We present a new algorithm for mining frequent itemsets. Past studies have proposed various algorithms and techniques for improving the efficiency of the mining task. We integrate a combination of these techniques into an algorithm which utilize those techniques dynamically according to the input dataset. The algorithm main features include depth first search with vertical compressed database, diffset, parent equivalence pruning, dynamic reordering and projection. Experimental testing suggests that our algorithm and implementation significantly outperform existing algorithms/implementations. We combine several pre-existing ideas in a fairly straightforward way and get a new frequent itemset mining algorithm. In particular, we combine the sparse vertical bit vector technique along with the difference sets technique of [14], thus reducing the computation time when compared with [14]. The various techniques were put in use dynamically according to the input dataset, thus utilizing the advantages and avoiding the drawbacks of each technique. Experimental results suggest that for a given level of support, our algorithm/implementation is faster than the other algorithms with which we compare ourselves. This set includes the dEclat algorithm of [14] which seems to be the faster algorithm amongst all others. 1. Introduction 2. Related Work Finding association rules is one of the driving applications in data mining, and much research has been done in this field [10, 7, 4, 6]. Using the supportconfidence framework, proposed in the seminal paper of [1], the problem is split into two parts — (a) finding frequent itemsets, and (b) generating association rules. Let I be a set of items. A subset X ⊆ I is called an itemset. Let D be a transactional database, where each transaction T ∈ D is a subset of I : T ⊆ I. For an itemset X, support(X) is defined to be the number of transactions T for which X ⊆ T . For a given parameter minsupport, an itemset X is call a frequent itemset if support(X) ≥ minsupport. The set of all frequent itemsets is denoted by F. The remainder of this paper is organized as follows. Section 2 contains a short of related work. In section 3 we describe the AIM-F algorithm. Section 4 contains experimental results. In Section 5 we conclude this short abstact with a discussion. 
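As a small, self-contained illustration of the support and frequent-itemset definitions given in the introduction (not part of the AIM-F implementation; the type and function names are invented for the example):

#include <algorithm>
#include <set>
#include <vector>

// support(X) is the number of transactions T in D with X ⊆ T, and X is
// frequent iff support(X) >= minsupport.
using Itemset     = std::set<int>;
using Transaction = std::set<int>;
using Database    = std::vector<Transaction>;

int support(const Itemset& x, const Database& d) {
    int count = 0;
    for (const Transaction& t : d)
        if (std::includes(t.begin(), t.end(), x.begin(), x.end()))  // X ⊆ T
            ++count;
    return count;
}

bool isFrequent(const Itemset& x, const Database& d, int minsupport) {
    return support(x, d) >= minsupport;
}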
Since the introduction of the Apriori algorithm by [1, 2] many variants have been proposed to reduce time, I/O and memory. Apriori uses breath-first search, bottom-up approach to generate frequent itemsets. (I.e., constructs i + 1 item frequent itemsets from i item frequent itemsets). The key observation behind Apriori is that all subsets of a frequent itemset must be frequent. This suggests a natural approach to generating frequent itemsets. The breakthrough with Apriori was that the number of itemsets explored was polynomial in the number of frequent itemsets. In fact, on a worst case basis, Apriori explores no more than n itemsets to output a frequent itemset, where n is the total number of items. Subsequent to the publication of [1, 2], a great many variations and extensions were considered [3, 7, 13]. In [3] the number of passes over the database was reduced . [7] tried to reduce the search space by combining bottom-up and top-down search – if a set is infre- quent than so are supersets, and one can prune away infrequent itemsets found during the top-down search. [13] uses equivalence classes to skip levels in the search space. A new mining technique, FP-Growth, proposed in [12], is based upon representing the dataset itself as a tree. [12] perform the mining from the tree representation. We build upon several ideas appearing in previous work, a partial list of which is the following: {:123} {1:23} {12:3} {2:3} {3:} {13:} {23:} {123:} • Vertical Bit Vectors [10, 4] - The dataset is stored in vertical bit vectors. Experimentally, this has been shown to be very effective. Figure 1. Full lexicographic tree of 3 items • Projection [4] - A technique to reduce the size of vertical bit vectors by trimming the bit vector to include only transaction relevant to the subtree currently being searched. 3. The AIM-F algorithm In this section we describe the building blocks that make up the AIM-F algorithm. High level pseudo code for the AIM-F algorithm appears in Figure 7. • Difference sets [14] - Instead of holding the entire tidset at any given time, Diffsets suggest that only changes in the tidsets are needed to compute the support. 3.1. Lexicographic Trees Let < be some lexicographic order of the items in I such that for every two items i and j, i = j : i < j or i > j. Every node n of the lexicographic tree has two fields, n.head which is the itemset node n represent, and n.tail which is a list of items, possible extensions to n.head. A node of the lexicographic tree has a l evel. Itemsets for nodes at level k nodes contain k items. We will also say that such itemsets have length k. The root (level 0) node n.head is empty, and n.tail = I. Figure 1 is an example of lexicographic tree for 3 items. The use of lexicographic trees for itemset generation was proposed by [8]. • Dynamic Reordering [6] - A heuristic for reducing the search space - dynamically changing the order in which the search space is traversed. This attempts to rearrange the search space so that one can prune infrequent itemsets earlier rather than later. • Parent Equivalence Pruning [4, 13] - Skipping levels in the search space, when a certain item added to the itemset contributes no new information. To the best of our knowledge no previous implementation makes use of this combination of ideas, and some of these combinations are non-trivial to combine. For example, projection has never been previously used with difference sets and to do so requires some new observations as to how to combine these two elements. 
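A lexicographic-tree node as described in Section 3.1 can be sketched as follows. This struct is only an illustration under stated assumptions: the actual AIM-F nodes also carry the vertical data used for support counting, and the helper names are ours.

#include <cstddef>
#include <vector>

// head is the itemset the node represents; tail lists the items that may
// still extend it, in the chosen lexicographic order.
struct Node {
    std::vector<int> head;   // itemset represented by this node
    std::vector<int> tail;   // candidate extensions
};

// The root (level 0): head is empty and tail contains all items of I.
Node makeRoot(const std::vector<int>& allItems) {
    return Node{ {}, allItems };
}

// Extending a node with the tail item at position pos yields a child whose
// head is head ∪ {item} and whose tail is the remaining suffix of the tail
// (cf. the DFS traversal described in the next subsection).
Node extendWith(const Node& n, std::size_t pos) {
    Node child;
    child.head = n.head;
    child.head.push_back(n.tail[pos]);
    child.tail.assign(n.tail.begin() + pos + 1, n.tail.end());
    return child;
}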
3.2. Depth First Search Traversal

In the course of the algorithm we traverse the lexicographic tree in a depth-first order. At node n, for every element α in the node's tail, a new node n′ is generated such that n′.head = n.head ∪ {α} and n′.tail = n.tail − {α}. After the generation of n′, α is removed from n.tail, as it will no longer be needed (see Figure 3). Several pruning techniques, on which we elaborate later, are used in order to speed up this process.

DFS(n : node)
(1) t = n.tail
(2) while t ≠ ∅
(3)   Let α be the first item in t
(4)   remove α from t
(5)   n′.head = n.head ∪ {α}
(6)   n′.tail = t
(7)   DFS(n′)

Figure 3. Simple DFS

3.3 Vertical Sparse Bit-Vectors

A comparison between horizontal and vertical database representations done in [10] shows that the representation of the database has a high impact on the performance of the mining algorithm. In a vertical database the data is represented as a list of items, where every item holds a list of the transactions in which it appears. The list of transactions held by every item can be represented in many ways. In [13] the list is a tid-list, while [10, 4] use vertical bit vectors. Because the data tends to be sparse, vertical bit vectors hold many "0" entries for every "1", thus wasting memory and CPU time when processing the information. In [10] the vertical bit vector is compressed using an encoding called skinning, which shrinks the size of the vector.

We choose to use a sparse vertical bit vector. Every such bit vector is built from two arrays: one for values, and one for indexes. The index array gives the position in the full vertical bit vector, and the value array holds the value at that position; see Figure 8. The index array is sorted, to allow fast AND operations between two sparse bit vectors in a similar manner to the AND operation between tid-lists. Empty values are thrown away during the AND operation, saving space and computation time.
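The following Python sketch (ours, purely illustrative; the word width and names are arbitrary choices, not the authors' code) shows the index/value representation, a merge-style AND between two sparse bit vectors, and support counting via a precomputed translation table as described in Section 3.3.2.

    W = 16  # word width; illustrative choice
    POPCOUNT = [bin(v).count("1") for v in range(1 << W)]  # translation table

    def to_sparse(tids):
        # tids: set of transaction ids; returns parallel (index, value) arrays
        # keeping only the non-zero words, with indexes sorted ascending.
        words = {}
        for t in tids:
            words[t // W] = words.get(t // W, 0) | (1 << (t % W))
        idx = sorted(words)
        return idx, [words[i] for i in idx]

    def sparse_and(a, b):
        # Merge the two sorted index arrays; AND the values of matching words,
        # dropping words that become zero (empty values are thrown away).
        (ia, va), (ib, vb) = a, b
        idx, val = [], []
        i = j = 0
        while i < len(ia) and j < len(ib):
            if ia[i] < ib[j]:
                i += 1
            elif ia[i] > ib[j]:
                j += 1
            else:
                w = va[i] & vb[j]
                if w:
                    idx.append(ia[i])
                    val.append(w)
                i += 1
                j += 1
        return idx, val

    def support_of(vec):
        idx, val = vec
        return sum(POPCOUNT[w] for w in val)

    v1 = to_sparse({0, 3, 17, 40})   # transactions containing item 1 (toy data)
    v2 = to_sparse({3, 17, 99})      # transactions containing item 2 (toy data)
    print(support_of(sparse_and(v1, v2)))  # 2: transactions 3 and 17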
3.3.1 Bit-vector projection

In [4], a technique called projection was introduced. Projection is a sparse bit vector compression technique specifically useful in the context of mining frequent itemsets. The idea is to eliminate redundant zeros in the bit-vector: for an itemset P, all the transactions which do not include P are removed, leaving a vertical bit vector containing only 1s. For every itemset generated from P (a superset of P), PX, all the transactions removed from P are also removed. This way all the extraneous zeros are eliminated.

The projection is done directly from the vertical bit representation. At initialization a two-dimensional matrix of 2^w by 2^w is created, where w is the word length or some smaller value that we choose to work with. Every entry (i,j) is calculated to be the projection of j on i (thus covering all possible projections of a single word). For every row of the matrix, the number of bits being projected is constant (a row represents the word being projected upon). Projection is done by traversing both the vector to project upon, p, and the vector to be projected, v. For every word index we compute the projection by table lookup, and the resulting bits are then concatenated together. Thus, computing the projection takes no longer than the AND operation between two compressed vertical bit lists.

In [4] projection is used whenever a rebuilding threshold is reached. Our tests show that because we use sparse bit vectors anyway, the gain from projection is smaller, and the highest gains occur when we use projection only when calculating the 2-itemsets from the 1-itemsets. This is also because of the penalty of using projection with diffsets, as described later, for large k-itemsets. Even so, projection is used only if the sparse bit vector will shrink significantly; as a threshold we set 10%: if the sparse bit vector contains less than 10% of '1's it will be projected.

Project(p : vector, v : vector)
/* p - vector to be projected upon
   v - vector being projected */
(1) t = Empty Vector
(2) i = 0
(3) for each nonzero bit in p, at offset j, in ascending order of offsets:
(4)   Set the i'th bit of target vector t to be the j'th bit of v
(5)   i = i + 1
(6) return t

Figure 2. Projection
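A small illustrative Python sketch of projection by word-wise table lookup follows. It is our own reading of the technique, using an 8-bit word so that the full 2^w by 2^w table stays tiny; it is not the paper's implementation.

    W = 8  # small word width so the full 2^W x 2^W table fits easily; illustrative only

    def extract_bits(v_word, p_word):
        # Compact the bits of v_word that sit under the '1' bits of p_word
        # (scanned in ascending offset order) into the low-order bits of the result.
        out = bit = 0
        for j in range(W):
            if p_word >> j & 1:
                out |= ((v_word >> j) & 1) << bit
                bit += 1
        return out

    # PROJ[p_word][v_word]: projection of one word of v onto one word of p.
    PROJ = [[extract_bits(v, p) for v in range(1 << W)] for p in range(1 << W)]

    def project(p, v, n_words):
        # p, v: bitmaps given as Python ints, word k = bits [k*W, (k+1)*W).
        # The bits extracted from successive words are concatenated.
        out, pos = 0, 0
        for k in range(n_words):
            pw = (p >> (k * W)) & ((1 << W) - 1)
            vw = (v >> (k * W)) & ((1 << W) - 1)
            out |= PROJ[pw][vw] << pos
            pos += bin(pw).count("1")  # each word of p contributes popcount(pw) bits
        return out

    # p has 1s in transactions 1, 3, 8; v (a superset's tidset) has 1s in 3 and 8.
    p = (1 << 1) | (1 << 3) | (1 << 8)
    v = (1 << 3) | (1 << 8)
    print(bin(project(p, v, n_words=2)))  # 0b110: v restricted to p's transactions, renumbered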
3.3.2 Counting and support

To count the number of ones within a sparse bit vector, one can hold a translation table of 2^w values, where w is the word length. Counting the number of ones in a word then requires only one memory access to the translation table. This idea first appeared in the context of frequent itemsets in [4].

Figure 8. Sparse Bit-Vector data structure

3.4 Diffsets

Difference sets (Diffsets), proposed in [14], are a technique to reduce the size of the intermediate information needed in the traversal using a vertical database. Using Diffsets, only the differences between a candidate and its generating itemset are calculated and stored (if necessary). Using this method the intermediate vertical bit-vectors in every step of the DFS traversal are shorter, which results in faster intersections between those vectors.

Let t(P) be the tidset of P. The Diffset d(PX) is the set of tids that are in t(P) but not in t(PX); formally, d(PX) = t(P) − t(PX) = t(P) − t(X). By definition support(PXY) = support(PX) − |d(PXY)|, so only d(PXY) needs to be calculated. However d(PXY) = d(PY) − d(PX), so the Diffset for every candidate can be calculated from its generating itemsets.

Diffsets have one major drawback: in datasets where the support drops rapidly from k-itemsets to (k+1)-itemsets, the size of d(PX) can be larger than the size of t(PX) (for an example see Figure 9). In such cases the usage of diffsets should be delayed (in the depth of the DFS traversal) until the k-itemsets where the rapid drop in support stops. Theoretically the break-even point is 50%, i.e. |t(PX)| / |t(P)| = 0.5, where the size of d(PX) equals that of t(PX); however, experiments show only small differences for any value between 10% and 50%. For this algorithm we used 50%.

Figure 9. Diffset threshold

Diffsets and Projection: As d(PXY) is not a subset of d(PX), Diffsets cannot be used directly for projection. Instead, we notice that d(PXY) ⊆ t(PX) and t(PX) = t(P) − d(PX). However d(PX) is known, and t(P) can be calculated in the same way. For example t(ABCD) = t(ABC) − d(ABCD), t(ABC) = t(AB) − d(ABC), t(AB) = t(A) − d(AB), thus t(ABCD) = t(A) − d(AB) − d(ABC) − d(ABCD). Using this formula, t(PX) can be calculated using the intermediate data along the DFS trail. As the DFS goes deeper, the penalty of calculating the projection is higher.
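The diffset identities above can be checked on toy tidsets; the following Python fragment (ours, illustrative only) does exactly that. Here P, X, Y are single items and PX denotes P ∪ {X}, as in the text.

    t = {
        "P":   {1, 2, 3, 4, 5, 6},
        "PX":  {1, 2, 3, 4},   # t(P ∪ X)
        "PY":  {1, 2, 3, 5},   # t(P ∪ Y)
        "PXY": {1, 2, 3},      # t(P ∪ X ∪ Y)
    }

    # Diffsets of PX and PY with respect to their common generating itemset P.
    d = {key: t["P"] - t[key] for key in ("PX", "PY")}

    # Diffset of PXY with respect to its generating itemset PX.
    d_PXY = t["PX"] - t["PXY"]

    assert d_PXY == d["PY"] - d["PX"]                    # d(PXY) = d(PY) - d(PX)
    assert len(t["PXY"]) == len(t["PX"]) - len(d_PXY)    # support(PXY) = support(PX) - |d(PXY)|
    assert t["PX"] == t["P"] - d["PX"]                   # recovering t(PX) = t(P) - d(PX)
    print("diffset identities hold on the toy data")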
3.5 Pruning Techniques

3.5.1 Apriori

Proposed by [2], the Apriori pruning technique is based on the monotonicity property of support: support(P) ≥ support(PX), as PX is contained in no more transactions than P. Therefore, if for an itemset P, support(P) < minsupport, the support of any extension of P will also be lower than minsupport, and the subtree rooted at P can be pruned from the lexicographic tree. See Figure 4 for pseudo code.

Apriori(n : node, minsupport : integer)
(1) t = n.tail
(2) while t ≠ ∅
(3)   Let α be the first item in t
(4)   remove α from t
(5)   n′.head = n.head ∪ {α}
(6)   n′.tail = t
(7)   if (support(n′.head) ≥ minsupport)
(8)     Report n′.head as a frequent itemset
(9)     Apriori(n′)

Figure 4. Apriori

3.5.2 Parent Equivalence Pruning (PEP)

This is a pruning method based on the following property: if support(n.head) = support(n.head ∪ {α}), then all the transactions that contain n.head also contain n.head ∪ {α}. Thus, α can be moved from the tail to the head, saving the traversal of the intermediate subtree. This method was described by [4, 13]. Later, when the frequent itemsets are generated, the items which were moved from tail to head should be taken into account when listing all frequent itemsets. For example, if k items were pruned using PEP during the DFS traversal of frequent itemset X, then all 2^k subsets of those k items can be added to X without reducing the support. This gives 2^k new frequent itemsets. See Figure 5 for pseudo code.

PEP(n : node, minsupport : integer)
(1) t = n.tail
(2) while t ≠ ∅
(3)   Let α be the first item in t
(4)   remove α from t
(5)   n′.head = n.head ∪ {α}
(6)   n′.tail = t
(7)   if (support(n′.head) = support(n.head))
(8)     add α to the list of items removed by PEP
(9)   else if (support(n′.head) ≥ minsupport)
(10)    Report n′.head ∪ S as a frequent itemset for every subset S of the items removed by PEP
(11)    PEP(n′)

Figure 5. PEP

3.6 Dynamic Reordering

To increase the chance of early pruning, nodes are traversed not in lexicographic order but in an order determined by support. This technique was introduced by [6]. Instead of lexicographic order we reorder the children of a node as follows. At node n, for all α in the tail, we compute sα = support(n.head ∪ {α}), and the items are sorted by sα in increasing order. Items α in n.tail for which support(n.head ∪ {α}) < minsupport are trimmed away. This way, the rest of the subtree will benefit from a shortened tail. Items with smaller support, which are heuristically "likely" to be pruned earlier, will be traversed first. See Figure 6 for pseudo code.

DynamicReordering(n : node, minsupport : integer)
(1) t = n.tail
(2) for each α in t
(3)   Compute sα = support(n.head ∪ {α})
(4) Sort the items α in t by sα in ascending order
(5) while t ≠ ∅
(6)   Let α be the first item in t
(7)   remove α from t
(8)   n′.head = n.head ∪ {α}
(9)   n′.tail = t
(10)  if (support(n′.head) ≥ minsupport)
(11)    Report n′.head as a frequent itemset
(12)    DynamicReordering(n′)

Figure 6. Dynamic Reordering

AIM-F(n : node, minsupport : integer)
/* Uses DFS traversal of the lexicographic itemset tree
   Fast computation of small frequent itemsets for sparse datasets
   Uses difference sets to compute support
   Uses projection and bit vector compression
   Makes use of parent equivalence pruning
   Uses dynamic reordering */
(1) t = n.tail
(2) for each α in t
(3)   Compute sα = support(n.head ∪ {α})
(4)   if (sα = support(n.head))
(5)     add α to the list of items removed by PEP
(6)     remove α from t
(7)   else if (sα < minsupport)
(8)     remove α from t
(9) Sort the items in t by sα in ascending order
(10) while t ≠ ∅
(11)   Let α be the first item in t
(12)   remove α from t
(13)   n′.head = n.head ∪ {α}
(14)   n′.tail = t
(15)   Report n′.head ∪ S as a frequent itemset for every subset S of the items removed by PEP
(16)   AIM-F(n′)

Figure 7. AIM-F

3.7 Optimized Initialization

In sparse datasets, computing the frequent 2-itemsets can be done more efficiently than by performing n^2 itemset intersections. We use a method similar to the one described in [13]: as a preprocessing step, for every transaction in the database, all 2-itemsets are counted and stored in an upper-triangular matrix of dimensions n × n. This step may take up to O(n^2) operations per transaction. However, as this is done only for sparse datasets, experimentally one sees that the number of operations is small. After this initialization step, we are left with the frequent 2-itemsets, from which we can start the DFS procedure.
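A minimal Python sketch of this preprocessing step (ours, not the paper's code) follows; it counts the 2-itemsets of each transaction into an upper-triangular count matrix and keeps the frequent ones.

    from itertools import combinations

    def frequent_2_itemsets(D, n_items, minsupport):
        # counts[i][j] with i < j plays the role of the upper-triangular n x n matrix.
        counts = [[0] * n_items for _ in range(n_items)]
        for T in D:
            for i, j in combinations(sorted(T), 2):  # all 2-itemsets of this transaction
                counts[i][j] += 1
        return {(i, j): counts[i][j]
                for i in range(n_items) for j in range(i + 1, n_items)
                if counts[i][j] >= minsupport}

    # Toy sparse transactions over items 0..4; illustrative only.
    D = [{0, 1}, {0, 1, 3}, {1, 3}, {0, 3}, {2, 4}]
    print(frequent_2_itemsets(D, n_items=5, minsupport=2))
    # {(0, 1): 2, (0, 3): 2, (1, 3): 2}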
4. Experimental Results

The experiments were conducted on an Athlon 1.2GHz with 256MB DDR RAM running Microsoft Windows XP Professional. All algorithms were compiled with VC 7. In the experiments described herein, we only count frequent itemsets; we do not create output.

We used five datasets to evaluate the algorithms' performance. These datasets were studied extensively in [13].
1. connect - A database of game states in the game connect 4.
2. chess - A database of game states in chess.
3. mushroom - A database with information about various mushroom species.
4. pumsb* - This dataset was derived from the pumsb dataset and describes census data.
5. T10I4D100K - Synthetic dataset.
The first 3 datasets were taken from the UC Irvine ML Database Repository (http://www.ics.uci.edu/~mlearn/MLRepository). The synthetic dataset was created by the IBM Almaden synthetic data generator (http://www.almaden.ibm.com/cs/quest/demos.html).

4.1 Comparing Data Representation

We compare the memory requirements of the sparse vertical bit vector (with the projection described earlier) versus the standard tid-list. For every itemset length, the total memory requirement of all tid-sets is given in Figures 10, 11 and 12. We do not consider itemsets removed by PEP.

Figure 10. Chess - support 2000 (65%)
Figure 11. Connect - support 50000 (75%)
Figure 12. T10I4D100K - support 100 (0.1%)

As follows from the figures, our sparse vertical bit vector representation requires less memory than the tid-list for the dense datasets (chess, connect). However, for the sparse dataset (T10I4D100K) the sparse vertical bit vector representation requires up to twice as much memory as the tid-list. Tests to dynamically move from the sparse vertical bit vector representation to tid-lists showed no significant improvement in performance; however, this should be carefully verified in further experiments.

4.2 Comparing The Various Optimizations

We analyze the influence of the various optimization techniques on the performance of the algorithm. First we run the final algorithm on a given dataset, and then we return to the task with a single change in the algorithm, thus trying to isolate the influence of every optimization technique, as shown in Figures 13 and 14.

Figure 13. Influence of the various optimizations on the Connect dataset mining
Figure 14. Influence of the various optimizations on the T10I4D100K dataset mining

As follows from the graphs, there is much difference in the behavior between the datasets. In the dense dataset, Connect, the various techniques had a tremendous effect on the performance. PEP, dynamic reordering and diffsets behaved in a similar manner, and the performance improvement factor gained by each of them increased as the support dropped. On the other hand, the sparse bit vector gives a constant improvement factor over the tid-list for all the tested support values, and projection gives only a minor improvement.

In the second figure, for the sparse dataset T10I4D100K, the behavior is different. PEP gives no improvement, as can be expected in a sparse dataset, as every single item has a low support and does not contain existing itemsets. There is a drop in the support from k-itemsets to (k+1)-itemsets due to the low support, therefore diffsets also give no impact, and the same goes for projection. A large gain in performance is made by the optimized initialization; however, the performance gain is constant, and not by a factor. Last is the dynamic reordering, which contributes to early pruning much like in the dense dataset.

4.3 Comparing Mining Algorithms

For comparison, we used implementations of
1. Apriori [2] - horizontal database, BFS traversal of the candidate tree,
2. FP-growth [5] - tree projected database, searching for frequent itemsets directly without candidate generation, and
3. dEclat [13] - vertical database, DFS traversal using diffsets.
All of the above algorithm implementations were provided by Bart Goethals (http://www.cs.helsinki.fi/u/goethals/) and used for comparison with the AIM-F implementation.

Figures 15 to 19 give experimental results for the various algorithms and datasets.

Figure 15. Chess dataset
Figure 16. Connect dataset
Figure 17. Pumsb* dataset
Figure 18. Mushroom dataset
Figure 19. T10I4D100K dataset

Not surprisingly, Apriori [2] generally has the lowest performance amongst the algorithms compared, and in some cases the running time could not be computed as it did not finish even at the highest level of support checked. For these datasets, and compared with the specific algorithms and implementations described above, our algorithm/implementation, AIM-F, seemingly outperforms all others. In general, for the dense datasets (Chess, Connect, Pumsb* and Mushroom, Figures 15, 16, 17 and 18 respectively), the sparse bit vector gives AIM-F an order of magnitude improvement over dEclat. The diffsets give dEclat and AIM-F another order of magnitude improvement over the rest of the algorithms. For the sparse dataset T10I4D100K (Figure 19), the optimized initialization gives AIM-F a head start, which is combined at the lower supports with the advantage of the sparse vertical bit vector (see details in Figure 14).

5. Afterword

This paper presents a new frequent itemset mining algorithm, AIM-F. This algorithm is based upon a mixture of previously used techniques combined dynamically. It seems to behave quite well experimentally.

References

[1] R. Agrawal, T. Imielinski, and A. N. Swami. Mining association rules between sets of items in large databases. In SIGMOD, pages 207–216, 1993.
[2] R. Agrawal and R. Srikant. Fast algorithms for mining association rules. In J. B. Bocca, M. Jarke, and C. Zaniolo, editors, Proc. 20th Int. Conf. Very Large Data Bases, VLDB, pages 487–499. Morgan Kaufmann, 12–15 1994.
[3] S. Brin, R. Motwani, J. D. Ullman, and S. Tsur. Dynamic itemset counting and implication rules for market basket data. In SIGMOD, pages 255–264, 1997.
[4] D. Burdick, M. Calimlim, and J. Gehrke. Mafia: a maximal frequent itemset algorithm for transactional databases. In ICDE, 2001.
[5] J. Han, J. Pei, and Y. Yin. Mining frequent patterns without candidate generation.
In SIGMOD, pages 1– 12, 2000. [6] R. J. B. Jr. Efficiently mining long patterns from databases. In SIGMOD, pages 85–93, 1998. [7] D.-I. Lin and Z. M. Kedem. Pincer search: A new algorithm for discovering the maximum frequent set. In EDBT’98, volume 1377 of Lecture Notes in Computer Science, pages 105–119, 1998. [8] R. Rymon. Search through systematic set enumeration. In KR-92, pages 539–550, 1992. [9] M. Seno and G. Karypis. Slpminer: An algorithm for finding frequent sequential patterns using length decreasing support constraint. In ICDE, 2002. [10] P. Shenoy, J. R. Haritsa, S. Sundarshan, G. Bhalotia, M. Bawa, and D. Shah. Turbo-charging vertical mining of large databases. In SIGMOD, 2000. [11] H. Toivonen. Sampling large databases for association rules. In VLDB, pages 134–145, 1996. [12] S. Yen and A. Chen. An efficient approach to discovering knowledge from large databases. In 4th International Conference on Parallel and Distributed Information Systems. [13] M. J. Zaki. Scalable algorithms for association mining. Knowledge and Data Engineering, 12(2):372–390, 2000. [14] M. J. Zaki and K. Gouda. Fast vertical mining using diffsets. Technical Report 01-1, RPI, 2001. 9 LCM: An Efficient Algorithm for Enumerating Frequent Closed Item Sets 2 Takeaki Uno1 , Tatsuya Asai2 , Yuzo Uchida2 , Hiroki Arimura2 1 National Institute of Informatics 2-1-2 Hitotsubashi, Chiyoda-ku, Tokyo, 101-8430, Japan e-mail: uno@nii.jp Department of Informatics, Kyushu University, 6-10-1 Hakozaki, Fukuoka 812-0053, JAPAN e-mail:{t-asai,y-uchida,arim}@i.kyushu-u.ac.jp Abstract: an occurrence of X. For a given constant α ≥ 0, an item set X is called frequent if |T (X)| ≥ α. If a frequent item set is included in no other frequent set, it is said to bemaximal. For a transaction set S ⊆ T , let I(S) = T ∈S T . If an item set X satisfies I(T (X)) = X, then X is called a closed item set. We denote by F and C the sets of all frequent itemsets and all frequent closed item sets, respectively. In this paper, we propose three algorithms LCMfreq, LCM, and LCMmax for mining all frequent sets, frequent closed item sets, and maximal frequent sets, respectively, from transaction databases. The main theoretical contribution is that we construct treeshaped transversal routes composed of only frequent closed item sets, which is induced by a parent-child relationship defined on frequent closed item sets. By traversing the route in a depth-first manner, LCM finds all frequent closed item sets in polynomial time per item set, without storing previously obtained closed item sets in memory. Moreover, we introduce several algorithmic techniques using the sparse and dense structures of input data. Algorithms for enumerating all frequent item sets and maximal frequent item sets are obtained from LCM as its variants. By computational experiments on real world and synthetic databases to compare their performance to the previous algorithms, we found that our algorithms are fast on large real world datasets with natural distributions such as KDD-cup2000 datasets, and many other synthetic databases. In this paper, we propose an efficient algorithm LCM for enumerating all frequent closed item sets. LCM is an abbreviation of Linear time Closed item set Miner. Existing algorithms for this task basically enumerate frequent item sets with cutting off unnecessary frequent item sets by pruning. However, the pruning is not complete, hence the algorithms operate unnecessary frequent item sets, and do something more. 
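As a concrete illustration of the definitions of T(X), I(S) and closed item sets used above, the following small Python sketch (ours, not part of the paper) checks closedness on a toy database; the data and helper names are illustrative only.

    # Toy transaction database; illustrative only.
    T_db = [{1, 2, 3}, {1, 2, 3, 4}, {2, 3, 5}, {1, 2, 3}]

    def T(X):
        # T(X): the transactions that include the item set X
        return [t for t in T_db if X <= t]

    def I(S):
        # I(S): the items common to every transaction in S
        S = list(S)
        out = set(S[0])
        for t in S[1:]:
            out &= t
        return out

    def is_closed(X):
        # X is a closed item set iff I(T(X)) = X
        return I(T(X)) == X

    print(is_closed({1, 2}))     # False: every transaction containing {1, 2} also contains 3
    print(is_closed({1, 2, 3}))  # True
    print(is_closed({2, 3}))     # True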
In LCM, we define a parent-child relationship between frequent closed item sets. The relationship induces tree-shaped transversal routes composed only of all the frequent closed item sets. Our algorithm traverses the routes, hence takes linear time of the number of frequent closed item sets. This algorithm is obtained from the algorithms for enumerating maximal bipartite cliques [14, 15], which is designed based on reverse search technique [3, 16]. 1. Introduction In addition to the search tree technique for closed item sets, we use several techniques to speed-up the update of the occurrences of item sets. One technique is occurrence deliver, which simutaneously computes the occurrence sets of all the successors of the current item set during a single scan on the current occurrence set. The other is diffsets proposed in [18]. Since there is a trade-off between these two methods that the former is fast for sparse data while the latter is fast for dense data, we developed the hybrid algorithm combining them. In some iterations, we make Frequent item set mining is one of the fundamental problems in data mining and has many applications such as association rule mining [1], inductive databases [9], and query expansion [12]. Let E be the universe of items, consisting of items 1, ..., n. A subset X of E is called an item set. T is a set of transactions over E, i.e., each T ∈ T is composed of items of E. For an item set X, let T (X) = { t ∈ T | X ⊆ t } be the set of transactions including X. Each transaction of T (X) is called 1 a decision based of the estimation of their computation time, hence our algorithm can use appropriate one for dense parts and sparse parts of the input. We also consider the problems of enumerating all frequent sets, and maximal frequent sets, and derive two algorithms LCMfreq and LCMmax from LCM. LCMmax is obtained from LCM by adding the explicit check of maximality. LCMfreq is not merely a LCM without the check of closedness, but also achives substantial speed-up using closed itemset discovery techniques because it enumerates only the representatives of groups of frequent item sets, and generate other frequent item sets from the representatives. From computer experiments on real and artificial datasets with the previous algorithms, we observed that our algorithms LCMfreq, LCM, and LCMmax significantly outperform the previous algorithms on real world datasets with natural distributions such as BMS-Web-View-1 and BMS-POS datasets in the KDD-CUP 2000 datasets as well as large synthesis datasets such as IBM T10K4D100K. The performance of our algorithms is similar to other algorithms for hard datasets such as Connect and Chess datasets from UCI-Machine Learning Repository, but less significant than MAFIA, however LCM works with small memory rather than other algorithms. The organization of the paper is as follows. In Section 2, we explain our tree enumeration method for frequent closed item sets and our algorithm LCM. In Section 3, we describe several algorithmic techniques for speeding up and saving memory. Then, Section 4 and 5 give LCMmax and LCMfreq for maximal and all frequent item sets, respectively. Techniques for implementation is described in Section 6, and the results of computational experiments are reported in Section 7. Finally, we conclude in Section 8. T E i(X) X occurrences of X parent of X occurrences of the parent Figure 1: An example of the parent of X: The parent of X is obtained by deleting items larger than i(X) (in the gray area) and take a closure. 
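The following self-contained Python sketch (ours, illustrative only) mirrors the parent computation pictured in Figure 1, under our reading that i(X) is the minimum item i with T(X(i)) = T(X); the toy data and helper names are not from the paper.

    T_db = [{1, 2, 3}, {1, 2, 3, 4}, {2, 3, 5}, {1, 2, 3}]

    def T(X):               # transactions containing X
        return [t for t in T_db if X <= t]

    def I(S):               # items common to all transactions in S
        out = set(S[0])
        for t in S[1:]:
            out &= t
        return out

    def i_of(X):
        # i(X): minimum item i with T(X(i)) = T(X), where X(i) = X ∩ {1, ..., i}
        for i in sorted(X):
            if T({x for x in X if x <= i}) == T(X):
                return i
        raise ValueError("X must be non-empty")

    def parent(X):
        # P(X) = I(T(X(i(X) - 1))): delete the items larger than i(X) - 1, then take the closure.
        i = i_of(X)
        return I(T({x for x in X if x <= i - 1}))

    print(parent({1, 2, 3}))   # {2, 3}: the root I(T(∅)) of this toy database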
prefix of Y if X = Y (i) holds for i = tail(X). Then, the parent-child relation P for the set enumeration tree for F is define as X = P(Y ) iff Y = X ∪ {i} for some i > tail(X), or equivalently, X = Y \{tail(Y )}. Then, the whole search space for F forms a prefix tree (or trie) with this edge relation P. Now, we define the parent-child relation P for closed item sets in C as follows. For X ∈ C, we define the parent of X by P(X) = I(T (X(i(X) − 1))), where i(X) be the minimum item i such that T (X) = T (X(i)) but T (X) = T (X(i − 1)). If Y is the parent of X, we say X is a child of Y . Let ⊥ = I(T (∅)) be the smallest item set in C called the root. For any X ∈ C \ {⊥}, its parent P(X) is always defined and belongs to C. An illustration is given in Fig. 1. For any X ∈ C and its parent Y , the proper inclusion Y ⊂ X holds since T (X(i(X) − 1)) ⊂ T (X). Thus, the relation P is acyclic, and its graph representation forms a tree. By traversing the tree in a depth-first manner, we can enumerate all the frequent closed item sets in linear time in the size of the tree, which is equal to the number of the frequent closed item sets in C. In addition, we need not store the tree in memory. Starting from the root ⊥, we find a child X of the root, and go to X. In the same way, we go to a child of X. When we arrive at a leaf of the tree, we backtrack, and find another child. Repeating this process, we eventually find all frequent closed item set in C. To find the children of the current frequent closed item set, we use the following lemma. For an item set X and an index i, let X[i] = X ∪ H where H is the set of the items j ∈ I(T (X ∪ {i})) satisfying j ≥ i. 2. Enumerating Frequent Closed Item Sets In this section, we introduce a parent-child relationship between frequent closed item sets in C, and describe our algorithm LCM for enumeration them. Recent efficient algorithms for frequent item sets, e.g.,[4, 17, 18], use a tree-shaped search structure for F , called the set enumeration tree [4] defined as follows. Let X = {x1 , . . . , xn } be an itemset as an ordered sequence such that x1 < · · · < xn , where the tail of X is tail(X) = xn ∈ E. Let X, Y be item sets. For an index i, X(i) = X ∩ {1, . . . , i}. X is a Lemma 1 X ′ is a child of X ∈ C (X ′ ∈ C and the parent of X ′ is X) if and only if (cond1) X ′ = X[i] for some i > i(X), (cond2) X ′ = I(T (X ′ )) (X ′ is a closed item set) (cond3) X ′ is frequent 2 occurrences Proof : Suppose that X ′ = X[i] satisfies the conditions (cond1), (cond2) and (cond3). Then, X ′ ∈ C. Since T (X(i − 1)) = T (X) and T (X(i − 1) ∪ {i}) = T (X ′ ) holds thus i(X ′ ) = i holds. Hence, X ′ is a child of X. Suppose that X ′ is a child of X. Then, (cond2) and (cond3) hold. From the definition of i(X ′ ), T (X(i(X ′ )) ∪ {i(X ′ )}) = T (X ′ ) holds. Hence, X ′ = X[i(X ′ )] holds. We also have i(X ′ ) > i(X) since T (X ′ (i(X ′ ) − 1)) = T (X). Hence (cond1) holds. A B C D F G H T(X) A C D F G jQ[i] = T(X[i]) • • • ••• A C D F G A B C F jQ[|E|-2] jQ[|E|-1] A B C F H jQ[|E|] Right first sweep Clearly, T (X[i]) = T (X ∪ {i}) holds if (cond2) I(T (X[i])) = X[i] holds. Note that no child X ′ of X satisfies X[i] = X[i′ ], i = i′ , since the minimum item of X[i]\X and X[i′ ]\X are i and i′ , respectively. Using this lemma, we construct the following algorithm scheme for, given a closed itemset X, enumerating all descendants in the search tree for closed itemsets. Figure 2: Occurrence deliver and right first sweep: In the figure, J [i] is written as JQ[i]. 
For each occurrence T of X, occurrence deliver inserts T to J [i] such that i ∈ T. When the algorithm generates a recursive call respect to X[|E| − 2], the recursive calls respect to X[|E| − 1] and X[|E|] have been terminated, J [|E| − 1] and J [|E|] are cleared. The recursive call of X[|E|−2] uses only J [|E| − 1] and J [|E|], and hence the algorithm re-uses them in the recursive call. Algorithm LCM (X : frequent closed item set) 1. output X 2. For each i > i(X) do 3. If X[i] is frequent and X[i] = I(T (X[i])) then Call LCM( X[i] ) 4. End for 3. Reducing Computation Time The computation time of LCM described in the previous section is linear in |C|, with a factor depending on T (X) for each closed item set X ∈ C. However, this still takes long time if it is implemented in a straightforward way. In this section, we introduce some techniques based on sparse and dense structures of the input data. Occurrence Deliver. First, We introduce the technique called the occurrence deliver for reducing the construction time for T (X[i]), which is needed to check (cond3). This technique is particularly efficient in the case that |T (X[i])| is much smaller than |T (X)|. In a usual way, T (X[i]) is obtained from T (X) in O(|T (X)|) time by removing all transactions not including i based on the equiality T (X[i]) = T (X ∪ {i}) = T (X) ∩ T ({i}) (this method is known as down-project). Thus, the total computation for all children takes |E| scans and O(|T (X)| · |E|) time. Instead of this, we build for all i = i(X), . . . , |E| Theorem 1 Let 0 < σ < 1 be a minimum support. Algorithm LCM enumerates, given the root closed item set ⊥ = I(T (∅)), all frequent closed item sets in linear time in the number of frequent closed item sets in C. The existing enumeration algorithm for frequent closed item sets are based on backtrack algorithm, which traverse a tree composed of all frequent item sets in F , and skip some item sets by pruning the tree. Since the pruning is not complete, however, these algorithms generate unnecessary frequent item sets. On the other hand, the algorithm in [10] directly generates only closed item sets with the closure operation I(T (·)) as ours, but their method may generate duplicated closed item sets and needs expensive duplicate check. On the other hand, our algorithm traverses a tree composed only of frequent closed item sets, and each iteration is not as heavy as the previous algorithms. Hence, our algorithm runs fast in practice. If we consider our algorithm as a modification of usual backtracking algorithm, each iteration of our algorithm re-orders the items larger than i(X) such that the items not included in X follow the items included in X. Note that the parent X is not a prefix of X[i] in a recursive call. The check of (cond2) can be considered as a pruning of non-closed item sets. def the occurrence lists J [i] = T (X[i]) simultaneously by scanning the transactions in T (X) at once as follows. We initialize J [i] = ∅ for all i = i(X), . . . , |E|. For each T ∈ T (X) and for each i ∈ T (i > i(X)), we insert T to J [i]. See Fig. 2 for explanation, where we write jQ[i] for J [i]. This correctly computes J [i] for all i in the total time O(|T (X)|). Furthermore, we need not make recursive call of LCM for X[i] if T (X[i]) = ∅ (this is often called lookahead [4]). In 3 our experiments on BMS instances, the occurrence deliver reduces the computation time up to 1/10 in some cases. Right-first sweep. 
The occurrence deliver method needs eager computation of the occurrence sets J [i] = T (X[i]) for all children before expanding one of them. A simple implementation of it may require much memory than the ordinary lazy computation of T (X[i]) as in [17]. However, we can reduce the memory usage using a method called the rightfirst sweep as follows. Given a parent X, we make the recursive call for X[i] in the decreasing order for each i = |E|, . . . , i(X) (See Fig. 2). At each call of X[i], we collect the memory allocated before for J [i + 1], . . . , J [|E|] and then re-use it for J [i]. After terminating the call for X[i], the memory for J [i] is released for the future use in J [j] for j < i. Since |J [i]| = |T (X[i])| ≤  |J [i]| |T ({i})| for any i and X, the total memory i  is bounded by the input size ||T || = T ∈T |T |, and thus, it is sufficient to allocate the memory for J at once as a global variable. Diffsets. In the case that |T (X[i])| is nearly equal to |T (X)| we use the diffset technique proposed in [18]. The diffset for index i is DJ [i] = T (X)\T (X[i]), where T (X[i]) = T (X ∪ {i}). Then, the frequency of X[i] is obtained by |T (X[i])| = |T (X)| − |DJ [i]|. When we generate a recursive call respect to X[i], we update DJ [j], j  > i by setting DJ [j] to be DJ [j] \ DJ [i] in time O( i>i(X),X[i]∈F ((|T (X)|−|T (X[i])|). Diffsets are needed for only i such that X[i] is frequent. By diffsets, the computation time for instances such as connect, chess, pumsb are reduced to 1/100, where |T (X[i])| is as large as |T (X)|. Hybrid Computation. As we saw in the preceding subsections, our occurrence deliver is fast when |T (X[i])| is much smaller than |T (X)| while the diffset of [18] is fast when |T (X[i])| is nearly close to |T (X)|. Therefore, our LCM dinamically decides which of occurrence deliver and diffsets we will use. To do this, we compare two quantities on X: A(X) = i |T (X ∪ {i})| and B(X) = i:X∪{i}∈F (|T (X)| − |T (X ∪ {i})|). For some fixed constant α > 1, we decide to use the occurrence deliver if A(X) < αB(X) and the diffset otherwise. We make this decision only at the child iterations of the root set ⊥ since this decision takes much time. Empirically, restricting the range i ∈ {1, . . . , |E|} of the the index i in A(X) and B(X) to i ∈ {i(X) + 1, . . . , |E|} results significant speedup. By experiments on BMS instances, we observe that the hybrid technique reduces the computation time up to 1/3. The hybrid technique is also useful in reducing the memory space in diffset as follows. Although the memory B(X) used by diffsets is not bounded by the input size ||T || in the worst case, it is ensured in hybrid that B(X) does not exceed A(X) ≤ ||T || because the diffset is chosen only when A(X) ≥ αB(X). Checking the closedness in occrrence deliver. Another key is to efficiently check the closedness X[i] = I(T (X[i])) (cond 2). The straightforward computation of the closure I(T (X[i])) takes much time since it requires the access to the whole sets T (X[j]), j < i and i is usually as large as |E|. By definition, (cond 2) is violated iff there exists some j ∈ {1, . . . , i − 1} such that j ∈ T for every T ∈ T (X ∪ {i}). We first choose a transaction T ∗ (∪{i}) ∈ T (X ∪ {i}) of minimum size, and tests if j  ∈ T for increasing j ∈ T ∗ (∪{i}). 
This results O( j∈T ∗ (X∪{i}) m(X[i], j)) time algorithm, where m(X ′ , j) is the maximum index m such that all of the first m transactions of T (X ′ ) include j, which is much faster than the straightforward algorithm with O( j<i |T (X ∪ {i} ∪ {j}])|) time. In fact, the efficient check requires the adjacency matrix (sometime called bitmap) representing the inclusion relationship between items and transactions. However, the adjacency matrix requires O(|T | × |E|) memory, which is quite hard to store for large instances. Hence, we make columns of adjacency matrix for only transactions of size larger than  ( T ∈T |T |)/δ.  Here δ is a constant. This uses at most O(δ × T ∈T |T |), linear in the input size. Checking the closedness in diffsets. In the case that |T (X[i])| is nearly equal to |T (X)|, the above check is not done in short time. In this case, we keep diffset DJ [j] for all j < i, i ∈ X such that X[i] is frequent. To maintain DJ for all i is a heavy task, thus we discard unnecessary DJ ’s as follows. If T (X ∪ {j}) includes an item included in no T (X[i′ ]), i′ > i(X), then for any descendant X ′ of X, j ∈ I(T (X ′ [j ′ ])) for any j ′ > i(X ′ ). Hence, we no longer have to keep DJ [j] for such j. Let N C(X) be the set of items j such that X[j] is frequent and any item of T (X) \ T (X ∪ {j}) is included in some T (X[j ′ ]), j ′ > i(X). Then, the computation time for checking (cond2) is written  as O( j∈N C(X),j<i |T (X) \ T (X ∪ {j})|). By checking (cond2) in these ways, the computation time for checking (cond2) is reduced from 1/10 to 1/100. Detailed Algorithm. We present below the description of the algorithm LCM, which recursively computes (X, T (X), i(X)), simultaneously. 4 global: J , DJ 111 ••• 1 /* Global sets of lists */ Algorithm LCM() 1. X := I(T (∅)) /* The root ⊥ */ 2. For i := 1 to |E| 3. If X[i] satisfies (cond2) and (cond3) then Call LCM Iter( X[i], T (X[i]), i ) or Call LCMd Iter2( X[i], T (X[i]), i, DJ ) based on the decision criteria 4. End for LCM Iter( X, T (X), i(X) ) /* occurrence deliver */ 1. output X 2. For each T ∈ T (X) For each j ∈ T, j > i(X), insert t to J [j] 4. For each j, J [j] = ∅ in the decreasing order 5. If |J [j]| ≥ α and (cond2) holds then LCM Iter( T (J [j]), J [j], j ) 6. Delete J [j] 7. End for 000 ••• 0 Figure 3: Hypercube decomposition: LCMfreq decomposes a closed item set class into several sublattices (gray rectangles). complexity but increase the computation time. In the case of occurrence deliver, we generate T (X ∪{j}) for all j in the same way as the occurrence  deliver, and check the maximality. This takes O( j<i(X) |T (X ∪ {j}|) time. In the case of difference update, we do not discard diffsets unnecessary for closed item set enumeration. We keep diffsets DJ for all j such that X ∪ {j} is frequent. To update and maintain this,  we spend O( j,X∪{j}∈F |T (X) \ T (X ∪ {j})|) time. Note that we are not in need of check the maximality if X has a child. LCM Iter2( X, T (X), i(X), DJ ) /* diffset */ 1. output X 2. For each i, X[i] is frequent 3. If X[i] satisfies (cond2) then 4. For each j, X[i] ∪ {j} is frequent, DJ ′ [j] := DJ [j] \ DJ [i] 5. LCM Iter2( T (J [j]), J [j], j, DJ ′ ) 6. End if 7. End for Theorem 3 Algorithm LCMmax enumerates all maximal frequent item sets in O( i |T (X ∪ {i})|)  time, or O( i,X∪{i}∈F ((|T (X)|−|T (X ∪{i})|)) time for each frequent closed item set X, with memory linear in the input size. 
Theorem 2 Algorithm LCM  enumerates all frequent closed item sets in O( j>i(X) |T (X[j])| +   ′ time, j>i(X),X[j]∈F j ′ ∈T ∗ (X) m(X[j], j ))  or O( i>i(X),X[i]∈F ((|T (X)| − |T (X[i])|) +  j∈N C(X),j<i |T (X) \ T (X ∪ {j})|)) time for each frequent closed item set X, with memory linear to the input size. 4. closed item set class 01 lattice 5. Enumerating Frequent Sets In this section, we describe an enumeration algorithm for frequent item sets. The key idea of our algorithm is that we classify the frequent item sets into groups and enumerate the representative of each group. Each group is composed of frequent item sets included in the class of a closed item set. This idea is based on the following lemma. Enumerating Maximal Frequent Sets In this section, we explain an enumeration algorithm of maximal frequent sets with the use of frequent closed item set enumeration. The main idea is very simple. Since any maximal frequent item set is a frequent closed item set, we enumerate frequent closed item sets and output only those being maximal frequent sets. For a frequent closed item set X, X is a maximal frequent set if and only if X ∪ {i} is infrequent for any i ∈ X. By adding this check to LCM, we obtain LCMmax. This modification does not increase the memory Lemma 2 Suppose that frequent item sets X and S ⊃ X satisfy T (X) = T (S). Then, for any item set X ′ including X, T (X ′ ) = T (X ′ ∪ S). Particularly, T (X ′ ) = T (R) holds for any X ′ ⊆ R ⊆ X ′ ∪S, hence all R are included in the same class of a closed item set. Hence, any frequent item set X ′ 5 is generated from X ′ \ (S \ X). We call X ′ \ (S \ X) representative. Let us consider a backtracking algorithm finding frequent item sets which adds items one by one in lexicographical order. Suppose that we currently have a frequent item set X, and find another frequent item set X ∪ {i}. Let S = X[i]. Then, according to the above lemma, we can observe that for any frequent item set X ′ including X and not intersecting S \ X, any item set including X ′ and included in X ′ ∪ S is also frequent. Conversely, any frequent item set including X is generated from X ′ not intersecting S\X. Hence, we enumerate only representatives including X and not intersecting S \ X, and generate other frequent item sets by adding each subset of S \ X. This method can be considered that we “decompose” classes of closed item sets into several sublattices (hypercubes) each of whose maximal and minimal elements are S and X ′ , respectively (see Fig. 3). We call this technique hypercube decomposition. Suppose that we are currently operating a representative X ′ including X, and going to generate a recursive call respect to X ′ ∪ {j}. Then, if (X ′ [i] \ X ′ ) \ S = ∅, X ′ and S ∪ (X ′ [i] \ X ′ ) satisfies the condition of Lemma 2. Hence, we add X ′ [i] \ X ′ to S. We describe LCMfreq as follows. call. Hence, the algorithm first starts with occurrence  deliver, and compares them in each iteration. If j>i(X),X[j]∈F |T (X) \ T (X[j])| becomes smaller, then we change to diffsets. Note that these estimators can computed in short time by using the result of occurrence deliver. Theorem 4 LCMfreq enumerates all frequent sets of F in O( j>i(X) |T (X[j])|) time or  O( j>i(X),X[j]∈F |T (X) \ T (X[j])|) time for  each frequent set X, within O( T ∈T |T |) space. Particularly, LCMfreq requires one integer for each item of any transaction, which is required to store the input data. Other memory LCMfreq uses is bounded by O(|T | + |E|). 
Experimentally, an iteration of LCMfreq inputting frequent set X takes O(|T (X)| + |X|) or O((size of diffset) + |X|) steps in average. In some sense, this is optimal since we have to take O(|X|) time to output, and O(|T (X)|) time or O((size of diffset)) time to check the frequency of X. 6. Implementation Algorithm LCMfreq ( X : representative, S : item set, i : item ) 1. Output all item sets R, X ⊆ R ⊆ X ∪ S 2. For each j > i, j ∈ X ∪ S 3. If X ∪ {j} is frequent then Call LCMfreq ( X ∪ {j}, S ∪ (X[j] \ (X ∪ {j})), j) 4. End for For some synthetic instances such that frequent closed item sets are fewer than frequent item sets, the average size of S is up to 5. In these cases, the algorithm finds 2|S| = 32 frequent item sets at once, hence the computation time is reduced much by the improvement. To check the frequency of all X ∪ {j}, we can use occurrence deliver and diffsets used for LCM. LCMfreq does not require the check of (cond2), hence  The computation time of each iteration is O( j>i(X) |T (X[j])|) time for occurrence deliver,  and O( j>i(X),X[j]∈F |T (X) \ T (X[j])|) for diffsets. Since the computation time change, we use another estimators for hybrid. In almost all  cases, if once |T (X) \ T (X[j])| bej>i(X),X[j]∈F  comes smaller than j>i(X) |T (X[j])|, the condition holds in any iteration generated by a recursive In this section, we explain our implementation. First, we explain the data structure of our algorithm. A transaction T of input data is stored by an array with length |T |. Each cell of the array stores the index of an item of T. For example, t = {4, 2, 7} is stored in an array with 3 cells, [2, 4, 7]. We sort the elements of the array so that we can take {i, ..., |E|} ∩ T in linear time of {i, ..., |E|}∩T. J is also stored in arrays in the same way. We are not in need of doubly linked lists or binary trees, which take much time to be operated. To reduce the practical computation time, we sort the transactions by their sizes, and items by the number of transactions including them. Experimentally,  this reduces j>i(X) |T (X ∪{j})|. In some cases, the computation time has been reduced by a factor of 1/3. 7. Computational Experiments To examine the practical efficiency of our algorithms, we run the experiments on the real and synthetic datasets, which are made available on FIMI’03 site. In the following, we will report the results of the experiments. 6 Table 1: The datasets. AvTrSz means the average transaction size Dataset #items #Trans AvTrSz #FI #FCI #MFI Minsup (%) BMS-Web-View1 497 59,602 2.51 3.9K–NA 3.9K–1241K 2.1K–129.4K 0.1–0.01 BMS-Web-View2 3,340 77,512 4.62 24K–9897K 23K–755K 3.9K–118K 0.1–0.01 BMS-POS 1,657 517,255 6.5 122K–33400K 122K–21885K 30K–4280K 0.1–0.01 T10I4D100K 1,000 100,000 10.0 15K–335K 14K–229K 7.9K–114K 0.15–0.025 T40I10D100K 1,000 100,000 39.6 2–0.5 pumsb 7,117 49,046 74.0 95–60 pumsb star 7,117 49,046 50.0 50–10 mushroom 120 8,124 23.0 20–0.1 connect 130 67,577 43.0 95–40 chess 76 3196 37.0 90–30 7.1 Datasets and Methods 7.2 We implemented our algorithms our three algorithms LCMfreq (LCMfreq), LCM (LCM), LCMmax (LCMmax) in C and compiled with gcc3.2. The algorithms were tested on the datasets shown in Table 1. available from the FIMI’03 homepage1 , which include: T10I4D100K, T40I10D100K from IBM Almaden Quest research group; chess, connect, mushroom, pumsb, pumsb star from UCI ML repository2 and PUMSB; BMS-WebView-1, BMSWebView-2, BMS-POS from KDD-CUP 20003. 
We compare our algorithms LCMfreq, LCM, LCMmax with the following frequent item set mining algorithms: Implementations of Fp-growth [7], Eclat [17], Apriori [1, 2] by Bart Goethals 4 ; We also compare the LCM algorithms with the implementation of Mafia [6], a fast maximal frequent pattern miner, by University of Cornell’s Database group 5 . This versions of mafia with frequent item sets, frequent closed item sets, and maximal frequent item sets options are denoted by mafia-fi, mafia-fci, mafia-mfi, respectively. Although we have also planned to make the performance comparison with Charm, the state-ofthe-art frequent closed item set miner, we gave up the comparison in this time due to the time constraint. All experiments were run on a PC with the configuration of Pen4 2.8GHz, 1GB memory, and RPM 7200 hard disk of 180GB. In the experiments, LCMfreq, LCM and LCMmax use at most 123MB, 300MB, and 300MB of memory, resp. Note that LCM and LCMmax can save the memory use by decreasing δ. Figures 6 through Figure 12 show the running time with varying minimum supports for the seven algorithms, namely LCMfreq, LCM, LCMmax, FPgrowth, eclat, apriori, mafia-mfi on the nine datasets described in the previous subsection. In the following, we call all, maximal, closed frequent item set mining simply by all, maximal, closed. Results on Synthetic Data Figure 4 shows the running time with minimum support ranging from 0.15% to 0.025% on IBM-Artificial T10I4D100K datasets. From this plot, we see that most algorithms run within around a few 10 minutes and the behaviors are quite similar when minimum support increases. In Figure 4, All of LCMmax, LCM, and LCMfreq are twice faster than FP-growth on IBM T10I4D100K dataset. On the other hand, Mafia-mfi, Mafia-fci, and Mafia-fi are slower than every other algorithms. In Figure 5, Mafia-mfi is fastest for maximal, and LCMfreq is fastest for all, for minimum support less than 1% on IBM T10I4D100K dataset. Results on KDD-CUP datasets Figures 6 through Figure 8 show the running time with range of minimum supports from ranging from 0.1% to 0.01% on three real world datasets BMSWebView-1, BMS-WebView-2, BMS-POS datasets. In the figure, we can observe that LCM algorithms outperform others in almost cases, especially for lower minimum support. In particular, LCM was best among seven algorithms in a wide range of minimum support from 0.1% to 0.01% on all datasets. 
^1 http://fimi.cs.helsinki.fi/testdata.html
^2 http://www.ics.uci.edu/~mlearn/MLRepository.html
^3 http://www.ecn.purdue.edu/KDDCUP/
^4 http://www.cs.helsinki.fi/u/goethals/software/
^5 Cornell University Database group, Himalaya Data Mining Tools, http://himalaya-tools.sourceforge.net/

Figure 4: Running time of the algorithms on IBM-Artificial T10I4D100K
Figure 5: Running time of the algorithms on IBM-Artificial T40I10D100K
Figure 6: Running time of the algorithms on BMS-WebView-1
Figure 7: Running time of the algorithms on BMS-WebView-2
Figure 8: Running time of the algorithms on BMS-POS

For higher minimum supports, ranging from 0.1% to 0.06%, the performances of all algorithms are similar, and the LCM family has slightly better performance. For lower minimum supports, ranging from 0.04% to 0.01%, Eclat and Apriori are much slower than every other algorithm, and LCM outperforms the others. Some frequent item set miners such as Mafia-fi and Mafia-fci run out of 1GB of main memory for these minimum supports on the BMS-WebView-1, BMS-WebView-2 and BMS-POS datasets. LCMfreq works quite well for higher minimum supports, but takes more than 30 minutes for minimum supports above 0.04% on BMS-Web-View-1. In these cases, the number of frequent item sets is quite large, over 100,000,000,000. Interestingly, Mafia-mfi's performance is stable in a wide range of minimum supports from 0.1% to 0.01%. In summary, the LCM family algorithms perform significantly well on the real world datasets BMS-WebView-1, BMS-WebView-2 and BMS-POS.

Results on UCI-ML repository and PUMSB datasets

Figures 9 through 12 show the running time on the middle sized datasets pumsb, pumsb* and kosarak, and the small sized datasets connect, chess, and mushroom. These datasets, taken from machine learning domains, are small but hard datasets for the frequent pattern mining task since they have many frequent patterns even with high minimum supports, e.g., from 50% to 90%.
These datasets were originally built for classification tasks and have slightly different characteristics than large business datasets such as BMS-Web-View-1 or BMS-POS.

Figure 9: Running time of the algorithms on pumsb
Figure 10: Running time of the algorithms on pumsb*
Figure 11: Running time of the algorithms on kosarak
Figure 12: Running time of the algorithms on mushroom
Figure 13: Running time of the algorithms on connect
Figure 14: Running time of the algorithms on chess

In these figures, we see that Mafia-mfi constantly outperforms every other maximal frequent item set mining algorithm for a wide range of minimum supports, except on pumsb*. On the other hand, Apriori is much slower than the other algorithms. On the mining of all frequent item sets, LCMfreq is faster than the other algorithms. On the mining of frequent closed item sets, there seems to be no consistent tendency in the performance results. However, LCM does not store the obtained solutions in memory, while the other algorithms do. Thus, in the sense of memory saving, LCM has an advantage.

8. Conclusion

In this paper, we present an efficient algorithm LCM for mining frequent closed item sets, based on a parent-child relationship defined on frequent closed item sets. This technique is taken from the algorithms for enumerating maximal bipartite cliques [14, 15], which are based on reverse search [3]. In theory, we demonstrate that LCM exactly enumerates the set of frequent closed item sets within polynomial time per closed item set in the total input size. In practice, we show by experiments that our algorithms run fast on several real world datasets such as BMS-WebView-1. We also showed the variants LCMfreq and LCMmax of LCM for computing all and maximal frequent item sets. LCMfreq uses the new schemes hybrid and hypercube decomposition, and these schemes work well for many problems.

Acknowledgement

We gratefully thank Prof. Ken Satoh of the National Institute of Informatics. This research was supported by a group research fund of the National Institute of Informatics, Japan.

References

[1] R. Agrawal and R. Srikant, "Fast Algorithms for Mining Association Rules in Large Databases," In Proc. VLDB '94, pp. 487–499, 1994.
[2] R. Agrawal, H. Mannila, R. Srikant, H. Toivonen and A. I. Verkamo, "Fast Discovery of Associa
21–46, 1996. [13] S. Tsukiyama, M. Ide, H. Ariyoshi and I. Shirakawa, “A New Algorithm for Generating All the Maximum Independent Sets,” SIAM Journal on Computing, Vol. 6, pp. 505–517, 1977. [4] R. J. Bayardo Jr., Efficiently Mining Long Patterns from Databases, In Proc. SIGMOD’98, pp. 85–93, 1998. [14] Takeaki Uno, “A Practical Fast Algorithm for Enumerating Cliques in Huge Bipartite Graphs and Its Implementation,” 89th Special Interest Group of Algorithms, Information Processing Society Japan, 2003, [5] E. Boros, V. Gurvich, L. Khachiyan, and K. Makino, “On the Complexity of Generating Maximal Frequent and Minimal Infrequent Sets,” In Proc. STACS 2002, pp. 133-141, 2002. [15] Takeaki Uno, “Fast Algorithms for Enumerating Cliques in Huge Graphs,” Research Group of Computation, IEICE, Kyoto University, pp.5562, 2003 [6] D. Burdick, M. Calimlim, J. Gehrke, “MAFIA: A Maximal Frequent Itemset Algorithm for Transactional Databases,” In Proc. ICDE 2001, pp. 443-452, 2001. [16] Takeaki Uno, “A New Approach for Speeding Up Enumeration Algorithms,” In Proc. ISAAC’98, pp. 287–296, 1998. [7] J. Han, J. Pei, Y. Yin, “Mining Frequent Patterns without Candidate Generation,” In Proc. SIGMOD’00, pp. 1-12, 2000 [17] M. J. Zaki, “Scalable algorithms for association mining,” Knowledge and Data Engineering, 12(2), pp. 372–390, 2000. [8] R. Kohavi, C. E. Brodley, B. Frasca, L. Mason and Z. Zheng, “KDD-Cup 2000 Organizers’ Report: Peeling the Onion,” SIGKDD Explorations, 2(2), pp. 86-98, 2000. [18] M. J. Zaki, C. Hsiao, “CHARM: An Efficient Algorithm for Closed Itemset Mining,” In Proc. SDM’02, SIAM, pp. 457-473, 2002. [9] H. Mannila, H. Toivonen, “Multiple Uses of Frequent Sets and Condensed Representations,” In Proc. KDD’96, pp. 189–194, 1996. [19] Z. Zheng, R. Kohavi and L. Mason, “Real World Performance of Association Rule Algorithms,” In Proc. SIGKDD-01, pp. 401-406, 2001. [10] N. Pasquier, Y. Bastide, R. Taouil, L. Lakhal, Discovering frequent closed itemsets for association rules, In Proc. ICDT’99, pp. 398-416, 1999. [11] J. Pei, J. Han, R. Mao, “CLOSET: An Efficient Algorithm for Mining Frequent Closed Itemsets,” ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery 2000, pp. 21-30, 2000. 10 MAFIA: A Performance Study of Mining Maximal Frequent Itemsets Doug Burdick £ University of Wisconsin-Madison who0ps99@cs.wisc.edu Manuel Calimlim Cornell University calimlim@cs.cornell.edu Tomi Yiu Cornell University ty42@cornell.edu Johannes Gehrke Cornell University johannes@cs.cornell.edu Abstract We present a performance study of the MAFIA algorithm for mining maximal frequent itemsets from a transactional database. In a thorough experimental analysis, we isolate the effects of individual components of MAFIA, including search space pruning techniques and adaptive compression. We also compare our performance with previous work by running tests on very different types of datasets. Our experiments show that MAFIA performs best when mining long itemsets and outperforms other algorithms on dense data by a factor of three to thirty. 1 Introduction MAFIA uses a vertical bitmap representation for support counting and effective pruning mechanisms for searching the itemset lattice [6]. The algorithm is designed to mine maximal frequent itemsets (MFI), but by changing some of the pruning tools, MAFIA can also generate all frequent itemsets (FI) and closed frequent itemsets (FCI). 
MAFIA assumes that the entire database (and all data structures used for the algorithm) completely fits into main memory. Since all algorithms for finding association rules, including algorithms that work with disk-resident databases, are CPU-bound, we believe that our study sheds light on some important performance bottlenecks.

In a thorough experimental evaluation, we first quantify the effect of each individual pruning component on the performance of MAFIA. Because of our strong pruning mechanisms, MAFIA performs best on dense datasets where large subtrees can be removed from the search space. On shallow datasets, MAFIA is competitive though not always the fastest algorithm. On dense datasets, our results indicate that MAFIA outperforms other algorithms by a factor of three to thirty.

2 Search Space Pruning
MAFIA uses the lexicographic subset tree originally presented by Rymon [9] and adopted by both Agarwal [3] and Bayardo [4]. The itemset identifying each node will be referred to as the node's head, while possible extensions of the node are called the tail. In a pure depth-first traversal of the tree, the tail contains all items lexicographically larger than any element of the head. With a dynamic reordering scheme, the tail contains only the frequent extensions of the current node. Notice that all items that can appear in a subtree are contained in the subtree root's HUT (head union tail), the set formed by combining all elements of the head and tail.

In the simplest itemset traversal, we traverse the lexicographic tree in pure depth-first order. At each node n, each element in the node's tail is generated and counted as a 1-extension. If the support of n's head combined with a 1-extension is less than the minimum support, then we can stop exploring that branch by the Apriori principle, since any itemset containing that infrequent 1-extension would have an infrequent subset. For each candidate itemset, we need to check if a superset of the candidate itemset is already in the MFI. If no superset exists, then we add the candidate itemset to the MFI. It is important to note that with the depth-first traversal, itemsets already inserted into the MFI will be lexicographically ordered earlier.

2.1 Parent Equivalence Pruning (PEP)
One method of pruning involves comparing the transaction sets of each parent/child pair. Let x be a node n's head and y be an element in n's tail. If the set of transactions containing x is a subset of the set of transactions containing y, then any transaction containing x also contains y. Since we only want the maximal frequent itemsets, it is not necessary to count itemsets containing x but not y. Therefore, we can move item y from the node's tail to the node's head.

2.2 FHUT
Another type of pruning is superset pruning. We observe that at a node n, the largest possible frequent itemset contained in the subtree rooted at n is n's HUT (head union tail), as observed by Bayardo [4]. If n's HUT is discovered to be frequent, we never have to explore any subsets of the HUT and thus can prune out the entire subtree rooted at node n. We refer to this method of pruning as FHUT (Frequent Head Union Tail) pruning.

2.3 HUTMFI
There are two methods for determining whether an itemset is frequent: direct counting of the support of the itemset, and checking whether a superset of the itemset has already been declared frequent; FHUT uses the former method. The latter approach determines if a superset of the HUT is in the MFI.
If a superset does exist, then the HUT must be frequent and the subtree rooted at that node can be pruned away. We call this type of superset pruning HUTMFI.

2.4 Dynamic Reordering
The benefit of dynamically reordering the children of each node based on support, instead of following the lexicographic order, is significant. An algorithm that trims the tail to only the frequent extensions at a higher level will save a lot of computation. The order of the tail elements is also an important consideration: ordering the tail elements by increasing support will keep the search space as small as possible. This heuristic was first used by Bayardo [4]. In Section 5.3.1, we quantify the effects of the algorithmic components by analyzing different combinations of pruning mechanisms.

3 MAFIA Extensions
MAFIA is designed and optimized for mining maximal frequent itemsets, but the general framework can be used to mine all frequent itemsets and closed frequent itemsets.

The algorithm can easily be extended to mine all frequent itemsets. The main changes required are suppressing the pruning tools (PEP, FHUT, HUTMFI) and adding all frequent nodes in the itemset lattice to the set FI without any superset checking. Itemsets are counted using the same techniques as in the regular MAFIA algorithm.

MAFIA can also be used to mine closed frequent itemsets. An itemset is closed if there are no supersets with the same support. PEP is the only type of pruning used when mining for frequent closed itemsets (FCI). Recall from Section 2.1 that PEP moves all extensions with the same support from the tail to the head of each node. Any items remaining in the tail must have a lower support and thus lead to different closed itemsets. Note that we must still check for supersets in the previously discovered FCI.

4 Optimizations
4.1 Effective MFI Superset Checking
In order to enumerate the exact set of maximal frequent itemsets, before adding any itemset to the MFI we must check the entire MFI to ensure that no superset of the itemset has already been found. This check is done often, and significant performance improvements can be realized if it is done efficiently. To this end, we adopt the progressive focusing technique introduced by Gouda and Zaki [7]. The basic idea is that while the entire MFI may be large, at any given node only a fraction of the MFI are possible supersets of the itemset at the node. We therefore maintain for each node a LMFI (Local MFI), which is the subset of the MFI that contains supersets of the current node's itemset. For more details on the LMFI concept, please see the paper by Gouda and Zaki [7].

4.2 Support Counting and Bitmap Compression
MAFIA uses a vertical bitmap representation for the database [6]. In a vertical bitmap, there is one bit for each transaction in the database. If item i appears in transaction j, then bit j of the bitmap for item i is set to one; otherwise, the bit is set to zero. This naturally extends to itemsets. Generation of a new itemset bitmap involves bitwise-ANDing bitmap(X) with the bitmap for a 1-itemset {y} and storing the result in bitmap(X ∪ {y}). For each byte in bitmap(X ∪ {y}), the number of 1's in the byte is determined using a pre-computed table. Summing these lookups gives the support of X ∪ {y}.
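The byte-table lookup described above is easy to sketch. The fragment below is an illustration of the idea only (the names and container choices are ours, not MAFIA's actual code): it ANDs the bitmap of an itemset X with the bitmap of a 1-extension y and sums per-byte 1-bit counts taken from a precomputed table.

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // Number of 1-bits for every possible byte value (0..255), filled once at startup.
    static uint8_t ones_in_byte[256];

    static void init_ones_table() {
        for (int v = 0; v < 256; ++v) {
            int c = 0;
            for (int b = 0; b < 8; ++b) c += (v >> b) & 1;
            ones_in_byte[v] = static_cast<uint8_t>(c);
        }
    }

    // AND bitmap(X) with bitmap({y}) (one bit per transaction), store the result
    // in bitmap(X ∪ {y}), and return the support of X ∪ {y} via byte-table lookups.
    static std::size_t and_and_count(const std::vector<uint8_t>& bitmap_x,
                                     const std::vector<uint8_t>& bitmap_y,
                                     std::vector<uint8_t>& bitmap_xy) {
        bitmap_xy.resize(bitmap_x.size());
        std::size_t support = 0;
        for (std::size_t i = 0; i < bitmap_x.size(); ++i) {
            bitmap_xy[i] = bitmap_x[i] & bitmap_y[i];
            support += ones_in_byte[bitmap_xy[i]];  // per-byte popcount from the table
        }
        return support;
    }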
4.3 Compression and Projected Bitmaps
The weakness of a vertical representation is the sparseness of the bitmaps, especially at the lower support levels. Since every transaction has a bit in the vertical bitmaps, there are many zeros, because both the absence and the presence of the itemset in a transaction need to be represented. However, note that we only need information about the transactions containing the itemset X at a node n in order to count the support of itemsets in the subtree rooted at n. So, conceptually, we can remove the bit for transaction t from the tail bitmaps if t does not contain X. This is a form of lossless compression on the vertical bitmaps to speed up calculations.

4.3.1 Adaptive Compression
Determining when to compress the bitmaps is not as simple as it first appears. Each 1-extension bitmap in the tail of the node must be projected relative to the node's itemset X, and the cost of projection may outweigh the benefits of using the compressed bitmaps. The best approach is to compress only when we know that the savings from using the compressed bitmaps outweigh the cost of projection. We use an adaptive approach to determine when to apply compression. At each node, we estimate both the cost of compression and the benefits of using the compressed bitmaps instead of the full bitmaps. When the benefits outweigh the costs, compression is chosen for that node and the subtree rooted at that node.

5 Experimental Results
5.1 Datasets
To test MAFIA, we used three different types of data. The first group of datasets is sparse; the frequent itemset patterns are short and thus nodes in the itemset tree will have small tails and few branches. We first used artificial datasets that were created using the data generator from IBM Almaden [1]. Stats for these datasets can be found in Figure 1 under T10I4D100K and T40I10D100K. The distribution of maximal frequent itemsets is displayed in Figure 2. For all datasets, the minimum support was chosen to yield around 100,000 elements in the MFI. Note that both T10I4 and T40I10 have very high concentrations of itemsets around two and three items long, with T40I10 having another smaller peak around eight to nine items.

Figure 1. Dataset Statistics (T = number of transactions, I = number of items, ATL = average transaction length)
Dataset         T         I      ATL
T10I4D100K      100,000   1,000  10
T40I10D100K     100,000   1,000  40
BMS-POS         515,597   1,657  6.53
BMS-WebView-1   59,602    497    2.51
BMS-WebView-2   3,340     161    4.62
chess           3,196     76     37
connect4        67,557    130    43
pumsb           49,046    7,117  74
pumsb-star      49,046    7,117  50

Figure 2. Itemset Lengths for shallow, artificial datasets (MFI itemset distribution: frequency vs. itemset length for T10I4D100K and T40I10D100K)

Figure 3. Itemset Lengths for shallow, real datasets (MFI itemset distribution for BMS-POS, BMS-WebView-1 and BMS-WebView-2)

Figure 4. Itemset Lengths for dense, real datasets (MFI itemset distribution for chess, connect4, pumsb and pumsb_star)

The second dataset type is click-stream data from two different e-commerce websites (BMS-WebView-1 and BMS-WebView-2), where each transaction is a web session and each item is a product page view; this data was provided by Blue Martini [8]. BMS-POS contains point-of-sale data from an electronics retailer, with the item-ids corresponding to product categories. Figure 3 shows that BMS-POS and BMS-WebView-1 have very similar normal-curve itemset distributions, with the average length of a maximal frequent itemset around five to six items.
On the other hand, BMS-WebView-2 has a right-skewed distribution; there is a sharp incline until three items and then a more gradual decline on the right tail.

Finally, the last datasets used for analysis are the dense datasets. They are characterized by very long itemset patterns that peak around 10-25 items (see Figure 4). Chess and Connect4 are gathered from game state information and are available from the UCI Machine Learning Repository [5]. The Pumsb dataset is census data from PUMS (Public Use Microdata Sample). Pumsb-star is the same dataset as Pumsb except that all items with 80% support or more have been removed, making it less dense and easier to mine. Figure 4 shows that Chess and Pumsb have nearly identical itemset distributions that are normal around 10-12 items long. Connect4 and Pumsb-star are somewhat left-skewed, with a slower incline that peaks around 20-23 items and then a sharp decline in the length of the frequent itemsets.

5.2 Other Algorithms
5.2.1 DepthProject
DepthProject demonstrated an order of magnitude improvement over previous algorithms for mining maximal frequent itemsets [2]. MAFIA was originally designed with DepthProject as the primary benchmark for comparison, and we have implemented our own version of the DepthProject algorithm for testing. The primary differences between MAFIA and DepthProject are the database representation (and consequently the support counting) and the application of pruning tools. DepthProject uses a horizontal database layout while MAFIA uses a vertical bitmap format, and the supports of itemsets are counted very differently. Both algorithms use some form of compression when the bitmaps become sparse. However, DepthProject also utilizes a specialized counting technique called bucketing for the lower levels of the itemset lattice. When the tail of a node is small enough, bucketing will count the entire subtree with one pass over the data. Since bucketing counts all of the nodes in a subtree, many itemsets that MAFIA will prune out will be counted by DepthProject. For more details on the DepthProject algorithm, please refer to the paper by Agarwal and Aggarwal [2].

5.2.2 GenMax
GenMax is a new algorithm by Gouda and Zaki for finding maximal itemset patterns [7]. GenMax introduced a novel concept for finding supersets in the MFI called progressive focusing. The newest version of MAFIA has incorporated this technique with the LMFI update. GenMax also uses diffset propagation for fast support counting. Both algorithms use similar methods for itemset lattice exploration and pruning of the search space.

5.3 Experimental Analysis
We performed three types of experiments to analyze the performance of MAFIA. First, we analyze the effect of each pruning component of the MAFIA algorithm to demonstrate how the algorithm works to trim the search space of the itemset lattice. The second set of experiments examines the savings generated by using compression to speed up support counting. Finally, we compare the performance of MAFIA against other current algorithms on all three types of data (see Section 5.1). In general, MAFIA works best on dense data with long itemsets, though the algorithm is still competitive even on very shallow data. These experiments were conducted on a 1500 MHz Pentium with 1 GB of memory running Red Hat Linux 9.0. All code was written in C++ and compiled using gcc version 3.2 with all optimizations enabled.

5.3.1 Algorithmic Component Analysis
First, we present a full analysis of each pruning component of the MAFIA algorithm (see Section 2 for algorithmic details). There are three types of pruning used to trim the tree: FHUT, HUTMFI, and PEP. FHUT and HUTMFI are both forms of superset pruning and thus will tend to "overlap" in their efficacy for reducing the search space. In addition, dynamic reordering can significantly reduce the size of the search space by removing infrequent items from each node's tail.

Figures 5 and 6 show the effects of each component of the MAFIA algorithm on the Connect4 dataset at 40% minimum support. The components of the algorithm are represented in a cube format with the running times (in seconds) and the number of itemsets counted during the MAFIA search. The top of the cube shows the time for a simple traversal where the full search space is explored, while the bottom of the cube corresponds to all three pruning methods being used. Two separate cubes (with and without dynamic reordering), rather than one giant cube, are presented for readability.

Figure 5. Pruning Components for Connect4 at 40% support without reordering (running time and number of itemsets counted for NONE, FHUT, HUTMFI, PEP, their pairwise combinations, and ALL)

Figure 6. Pruning Components for Connect4 at 40% support with reordering (same combinations as Figure 5)

Note that all of the pruning components yield great savings in running time compared to using no pruning. Applying a single pruning mechanism runs two to three orders of magnitude faster, while using all of the pruning tools is four orders of magnitude faster than no pruning. Several of the pruning components seem to overlap in trimming the search space. In particular, HUTMFI and FHUT yield very similar results, since they use the same type of superset pruning but with different methods of implementation. It is interesting to see that adding FHUT when HUTMFI is already performed yields very little savings, i.e. from HUTMFI to FH+HM or from HM+PEP to ALL, the running times do not significantly change. HUTMFI first checks for the frequency of a node's HUT by looking for a frequent superset in the MFI, while FHUT will explore the leftmost branch of the subtree rooted at that node. Apparently, there are very few cases where a superset of a node's HUT is not in the MFI, but the HUT is frequent.

PEP has the largest impact of the three pruning methods. Most of the running time of the algorithm occurs at the lower levels of the tree, where the border between frequent and infrequent itemsets exists. Near this border, many of the itemsets have exactly the same support, right above the minimum support, and thus PEP is more likely to trim out large sections of the tree at the lower levels.

Dynamically reordering the tail also has dramatic savings (cf. Figure 5 with Figure 6). At the top of each cube, it is interesting to note that without any pruning mechanisms, dynamic reordering will actually run slower than static ordering. Fewer itemsets get counted, but the cost of reordering so many nodes outweighs the savings of counting fewer nodes. However, once pruning is applied, dynamic reordering runs nearly an order of magnitude faster than the static ordering.
PEP is more effective since the tail is trimmed as early in the tree as possible; all of the extensions with the same support are moved from the tail to the head in one step at the start of the subtree. Also, FHUT and HUTMFI have much more impact. With dynamic reordering, the subtrees generated from the end of the tail have the itemsets with the highest supports, and thus the HUT is more likely to be frequent.

5.3.2 Effects of Compression in MAFIA
Adaptive compression uses cost estimation to determine when it is appropriate to compress the bitmaps. Since the cost estimate adapts to each dataset, adaptive compression is always better than using no compression. Results on different types of data show that adaptive compression is at least 25% faster at higher supports and up to an order of magnitude faster at lower supports.

Figures 7 and 8 display the effect of compression on sparse data. First, we analyze the sparse, artificial datasets T10I4 and T40I10, which are characterized by very short itemsets, where the average length of a maximal frequent itemset is only 2-6 items. Because these datasets are so sparse with small subtrees, at higher supports compression is not often used and thus has a negligible effect.

Figure 7. Compression on sparse datasets (running time with and without adaptive compression vs. minimum support for T10I4D100K and T40I10D100K)

Figure 8. Compression on more sparse datasets (running time with and without adaptive compression vs. minimum support for BMS-POS, BMS-WebView-1 and BMS-WebView-2)
Still, even in the worst case adaptive compression will use the cost estimate to determine that compression should not be chosen and thus is at least as fast as never compressing at all. In the best case, compression can significantly speed up support 100 10 0.06 0.05 0.04 0.03 0.02 0.01 0 Min Sup (%) Figure 8. Compression on more sparse datasets Time Comparison on T10I4D100K Compression on Chess 10000 1000 MAFIA DP GENMAX NONE ADAPTIVE Time (s) 100 10 10 1 1 35 30 25 20 15 10 5 Time (s) 1000 100 0.06 0 0.05 0.04 0.03 0.02 0.01 0 Min Sup (%) Min Sup (%) Compression on Pumsb 10000 Figure 10. Performance on sparse datasets NONE ADAPTIVE Time (s) 1000 100 10 70 60 50 40 30 20 10 0 Min Sup (%) Compression on Connect4 10000 NONE ADAPTIVE 100 Time (s) 1000 10 1 35 30 25 20 15 10 5 0 Min Sup (%) Compression on Pumsb-star 10000 NONE ADAPTIVE Time (s) 1000 100 10 6 5 4 3 2 1 0 Min Sup (%) Figure 9. Compression on dense datasets counting by over an order of magnitude. 5.3.3 Performance Comparisons Figures 10 and 11 show the results of comparing MAFIA with DepthProject and GenMax on sparse data. MAFIA is always faster than DepthProject and grows from twice as fast at the higher supports to more than 20 times faster at the lowest supports tested. GenMax demonstrates the best performance of the three algorithms for higher supports and is around two to three times faster than MAFIA. However, note that as the support drops and the itemsets become longer, MAFIA passes Genmax in performance to become the fastest algorithm. The performances for sparse, real datasets are found in Figure 11. MAFIA has the worst performance on BMSWebView-2 for higher supports, though it eventually passes DepthProject as the support lowers. BMS-POS and BMSWebView-1 follow a similar pattern to the synthetic datasets where MAFIA is always better than DepthProject, and GenMax is better than MAFIA until the lower supports where they cross over. In fact, at the lowest supports for BMSWebView-1, MAFIA is an order of magnitude better than GenMax and over 50 times faster than DepthProject. It is clear that MAFIA performs best when the itemsets are longer, though even for sparse data MAFIA is within two to three times the running times of DepthProject and GenMax. The dense datasets in Figure 12 support the idea that MAFIA runs the fastest on longer itemsets. For all supports on the dense datasets, MAFIA has the best performance. MAFIA runs around two to five times faster than GenMax on Connect4, Pumsb, and Pumsb-star and over five to ten times faster on Chess. DepthProject is by far the slowest algorithm on all of the dense datasets and runs between ten to thirty times worse than MAFIA on all of the datasets across all supports. 
Figure 11. Performance on more sparse datasets (running time of MAFIA, DepthProject (DP) and GenMax vs. minimum support for BMS-POS, BMS-WebView-1 and BMS-WebView-2)

Figure 12. Performance on dense datasets (running time of MAFIA, DepthProject (DP) and GenMax vs. minimum support for Chess, Pumsb, Connect4 and Pumsb_star)

6 Conclusion
In this paper we present a detailed performance analysis of MAFIA. The breakdown of the algorithmic components shows that powerful pruning techniques such as parent-equivalence pruning and superset checking are very beneficial in reducing the search space. We also show that adaptive compression/projection of the vertical bitmaps dramatically cuts the cost of counting the supports of itemsets. Our experimental results demonstrate that MAFIA is highly optimized for mining long itemsets and that on dense data it consistently outperforms GenMax by a factor of two to ten and DepthProject by a factor of ten to thirty.

Acknowledgements: We would like to thank Ramesh Agarwal and Charu Aggarwal for discussing DepthProject and giving us advice on its implementation. We also thank Jayant Haritsa for his insightful comments on the MAFIA algorithm, Jiawei Han for helping in our understanding of CLOSET and providing us the executable of the FP-Tree algorithm, and Mohammed Zaki for making the source code of GenMax available.

References
[1] Data generator available at http://www.almaden.ibm.com/software/quest/Resources/.
[2] R. C. Agarwal, C. C. Aggarwal, and V. V. V. Prasad. Depth first generation of long patterns. In Knowledge Discovery and Data Mining, pages 108–118, 2000.
[3] R. C. Agarwal, C. C. Aggarwal, and V. V. V. Prasad. A tree projection algorithm for generation of frequent item sets. Journal of Parallel and Distributed Computing, 61(3):350–371, 2001.
[4] R. J. Bayardo. Efficiently mining long patterns from databases. In SIGMOD, pages 85–93, 1998.
[5] C. Blake and C. Merz. UCI repository of machine learning databases, 1998.
[6] D. Burdick, M. Calimlim, and J. Gehrke. MAFIA: A maximal frequent itemset algorithm for transactional databases. In ICDE 2001, Heidelberg, Germany, 2001.
[7] K. Gouda and M. J. Zaki. Efficiently mining maximal frequent itemsets. In ICDM, pages 163–170, 2001.
[8] R. Kohavi, C. Brodley, B. Frasca, L. Mason, and Z. Zheng. KDD-Cup 2000 organizers' report: Peeling the onion. SIGKDD Explorations, 2(2):86–98, 2000. http://www.ecn.purdue.edu/KDDCUP.
[9] R. Rymon. Search through systematic set enumeration. In International Conference on Principles of Knowledge Representation and Reasoning, pages 539–550, 1992.

kDCI: a Multi-Strategy Algorithm for Mining Frequent Sets

Claudio Lucchese(1), Salvatore Orlando(1), Paolo Palmerini(1,2), Raffaele Perego(2), Fabrizio Silvestri(2,3)
(1) Dipartimento di Informatica, Università Ca' Foscari di Venezia, Venezia, Italy ({orlando,clucches}@dsi.unive.it)
(2) ISTI-CNR, Consiglio Nazionale delle Ricerche, Pisa, Italy ({r.perego,p.palmerini}@isti.cnr.it)
(3) Dipartimento di Informatica, Università di Pisa, Pisa, Italy (silvestri@di.unipi.it)

Abstract
This paper presents the implementation of kDCI, an enhancement of DCI [10], a scalable algorithm for discovering frequent sets in large databases. The main contribution of kDCI resides in a novel counting inference strategy, inspired by previously known results by Bastide et al. [3]. Moreover, multiple heuristics and efficient data structures are used in order to adapt the algorithm's behavior to the features of the specific dataset mined and of the computing platform used. kDCI turns out to be effective in mining both short and long patterns from a variety of datasets. We conducted a wide range of experiments on synthetic and real-world datasets, both in-core and out-of-core. The results obtained allow us to state that kDCI's performance is not over-fitted to a special case, and its high performance is maintained on datasets with different characteristics.

1 Introduction
Despite the considerable number of algorithms proposed in the last decade for solving the problem of finding frequent patterns in transactional databases (among the many we mention [1] [11] [6] [13] [14] [4] [3] [7]), a single best approach still has to be found.
kDCI: a Multi-Strategy Algorithm for Mining Frequent Sets Claudio Lucchese1 , Salvatore Orlando1 , Paolo Palmerini1,2 , Raffaele Perego2 , Fabrizio Silvestri2,3 1 2 Dipartimento di Informatica ISTI-CNR Università Ca’ Foscari di Venezia Consiglio Nazionale delle Venezia, Italy Ricerche {orlando,clucches}@dsi.unive.it Pisa, Italy 3 Dipartimento di Informatica Università di Pisa Pisa, Italy silvestri@di.unipi.it {r.perego,p.palmerini}@isti.cnr.it Abstract This paper presents the implementation of kDCI, an enhancement of DCI [10], a scalable algorithm for discovering frequent sets in large databases. The main contribution of kDCI resides on a novel counting inference strategy, inspired by previously known results by Basted et al. [3]. Moreover, multiple heuristics and efficient data structures are used in order to adapt the algorithm behavior to the features of the specific dataset mined and of the computing platform used. kDCI turns out to be effective in mining both short and long patterns from a variety of datasets. We conducted a wide range of experiments on synthetic and real-world datasets, both in-core and out-of-core. The results obtained allow us to state that kDCI performances are not over-fitted to a special case, and its high performance is maintained on datasets with different characteristics. 1 Introduction Despite the considerable amount of algorithms proposed in the last decade for solving the problem of finding frequent patterns in transactional databases (among the many we mention [1] [11] [6] [13] [14] [4] [3] [7]), a single best approach still has to be found. The Frequent Set Counting (FSC) problem consists in finding all the set of items (itemsets) which occur in at least s% (s is called support) of the transactions of a database D, where each transaction is a variable length collection of items from a set I. Itemsets which verify the minimum support threshold are said to be frequent. The complexity of the FSC problem relies mainly in the potentially explosive growth of its full search space, whose dimension d is, in the worst case, d = P|tmax | |I| k=1 k , where tmax is the maximum transaction length. Taking into account the minimum support threshold, it is possible to reduce the search space, using the well known downward closure relation, which states that an itemset can only be frequent if all its subsets are frequent as well. The exploitation of this property, originally introduced in the Apriori algorithm [1], has transformed a potentially exponentially complex problem, into a more tractable one. Nevertheless, the Apriori property alone is not sufficient to permit to solve the FSC problem in a reasonable time, in all cases, i.e. on all possible datasets and for all possible interesting values of s. Indeed, another source of complexity in the FSC problem resides in the dataset internal correlation and statistical properties, which remain unknown until the mining is completed. Such diversity in the dataset properties is reflected in measurable quantities, like the total number of transactions, or the total number of distinct items |I| appearing in the database, but also in some other more fuzzy properties which, although commonly recognized as important, still lack a formal and univocal definition. It is the case, for example, of the notion of how dense a dataset is, i.e. how much its transactions tend to resemble among one another. Several important results have been achieved for specific cases. 
Dense datasets are effectively mined with compressed data structures [14], the explosion in the number of candidates can be avoided using effective projections of the dataset [7], and the support of itemsets in compact datasets can be inferred, without counting, using an equivalence-class-based partition of the dataset [3]. In order to take advantage of all these, and more specific, results, hybrid approaches have been proposed [5]. Critical to this point is when and how to adopt a given solution instead of another. In the absence of a complete theoretical understanding of the FSC problem, the only solution is to adopt a heuristic approach, where theoretical reasoning is supported by direct experience, leading to a strategy that tries to cover as wide a variety of cases as possible.

Starting from the previous DCI (Direct Count & Intersect) algorithm [10], we propose here kDCI, an enhanced version of DCI that extends its adaptability to the dataset-specific features and the hardware characteristics of the computing platform used for running the FSC algorithm. Moreover, in kDCI we introduce a novel counting inference strategy, based on a new result inspired by the work of Bastide et al. in [3]. kDCI is a multiple-heuristics hybrid algorithm, able to adapt its behavior during the execution. Since it originates from the already published DCI algorithm, we only outline in this paper how kDCI differs from DCI. A detailed description of the DCI algorithm can be found in [10].

2 The kDCI algorithm
Several considerations concerning the features of real datasets, the characteristics of modern hardware/software systems, as well as scalability issues of FSC algorithms have motivated the design of kDCI.

As already pointed out, transactional databases may have different characteristics in terms of the correlations among the items inside transactions and of the transactions among themselves [9]. A desirable feature of an FSC algorithm is the ability to adapt its behavior to these characteristics. Modern hardware/software systems need high locality for exploiting memory hierarchies effectively and achieving high performance; algorithms have to favor the exploitation of spatial and temporal locality in accessing in-core and out-of-core data. Scalability is the main concern in designing algorithms that aim to mine large databases efficiently; therefore, it is important to be able to handle datasets bigger than the available memory.

We designed and implemented our algorithm kDCI keeping in mind such performance issues. The pseudo code of kDCI is given in Algorithm 1. kDCI inherits from DCI the level-wise behavior and the hybrid horizontal-vertical dataset representation. As the computation starts, kDCI maintains the database in horizontal format and applies an effective pruning technique to remove infrequent items and short transactions.

Algorithm 1 kDCI
    Require: D, min_supp
    // During the first scan get optimization figures
    F1 = first_scan(D, min_supp)
    // Second and following scans work on a temporary db D'
    F2 = second_scan(D', min_supp)
    k = 2
    while (D'.vertical_size() > memory_available()) do
        k++
        // count-based iteration
        Fk = DCP(D', min_supp, k)
    end while
    k++
    // count-based iteration + create vertical database VD
    Fk = DCP(D', VD, min_supp, k)
    dense = VD.is_dense()
    while (Fk ≠ ∅) do
        k++
        if (use_key_patterns()) then
            if (dense) then
                Fk = DCI_dense_keyp(VD, min_supp, k)
            else
                Fk = DCI_sparse_keyp(VD, min_supp, k)
            end if
        else
            if (dense) then
                Fk = DCI_dense(VD, min_supp, k)
            else
                Fk = DCI_sparse(VD, min_supp, k)
            end if
        end if
    end while
A temporary dataset is therefore written to disk at every iteration. The first steps of the algorithm are described in [8] and [10] and remain unchanged in kDCI. In kDCI we only improved memory management by exploiting compressed and optimized data structures (see Sections 2.1 and 2.2). The effectiveness of pruning is related to the possibility of storing the dataset in main memory in vertical format, thanks to the dataset size reduction. This normally occurs at the first iterations, depending on the dataset, the support threshold, and the memory available on the machine, which is determined at run time.

Once the dataset can be stored in main memory, kDCI switches to the vertical representation and applies several heuristics in order to determine the most effective strategy for frequent itemset counting. The most important innovation introduced in kDCI is a novel technique to determine itemset supports, inspired by the work of Bastide et al. [3]. As we will discuss in Section 2.4, in some cases the support of candidate itemsets can be determined without actually counting transactions, but by a faster inference reasoning. Moreover, kDCI maintains the different strategies implemented in DCI for sparse and dense datasets. The result is a multiple-strategy approach: during the execution, kDCI collects statistical information on the dataset that allows it to determine which is the best approach for the particular case. In the following we detail such optimizations and improvements and the heuristics used to decide which optimization to use.

2.1 Dynamic data type selection
The first optimization is concerned with the amount of memory used to represent itemsets and their counters. Since such structures are extensively accessed during the execution of the algorithm, it is profitable to have such data occupy as little memory as possible. This not only reduces the spatial complexity of the algorithm, but also permits low-level processor optimizations to be effective at run time. During the first scan of the dataset, global properties are collected, like the total number of distinct frequent items (m1), the maximum transaction size, and the support of the most frequent item. Once this information is available, we remap the surviving (frequent) items to contiguous integer identifiers. This allows us to decide the best data type to represent such identifiers and their counters. For example, if the maximum support of any item is less than 65536, we can use an unsigned short int to represent the itemset counters. The same holds for the remapped identifiers of the items. The decision of which is the most appropriate type to use for items and counters is taken at run time, by means of a C++ template-based implementation of all the kDCI code. Before remapping item identifiers, we also reorder them in increasing support order: more frequent items are thus assigned larger identifiers. This also simplifies the intersection-based technique used for dense datasets (see Section 2.3).

2.2 Compressed data structures
Itemsets are often organized in collections in many FSC algorithms. Efficient representation of such collections can lead to important performance improvements. In [8] we pointed out the advantages of storing candidates in directly accessible data structures for the first passes of our algorithm. In kDCI we introduce a compressed representation of an itemset collection, used to store in main memory the collections of candidate and frequent itemsets. This representation takes advantage of prefix sharing among the lexicographically ordered itemsets of the collection.

The compressed data structure is based on three arrays (Figure 1). At each iteration k, the first array (prefix) stores the distinct prefixes of length k − 1. In the third array (suffix) all the length-1 suffixes are stored. Finally, in element i of the second array (index), we store the position in the suffix array of the section of suffixes that share the i-th prefix. Therefore, when the itemsets in the collection have to be enumerated, we first access the prefix array; then, from the corresponding entry in the index array, we get the section of suffixes stored in suffix that is needed to complete the itemsets.

Figure 1. Compressed data structure used for an itemset collection; it can reduce the amount of memory needed to store the itemsets (in the example, 9 prefix cells + 3 index cells + 10 suffix cells = 21 against 4 × 10 = 40 for the non-compressed representation).

From our tests we can say that, in all the interesting cases, i.e. when the number of candidate (or frequent) itemsets explodes, this data structure works well and achieves a compression ratio of up to 30%. For example, see the results reported in Figure 2.

Figure 2. Memory usage with the compressed itemset collection representation for BMS_View with min_supp=0.06% (a) and connect with min_supp=80% (b).
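A minimal sketch of this three-array layout is given below, under our own naming and with index[i] read as the position of the last suffix belonging to prefix i (one plausible reading of Figure 1; the paper does not pin the exact convention down). It is an illustration, not the kDCI source:

    #include <cstddef>
    #include <vector>

    // Compressed collection of k-itemsets sharing (k-1)-prefixes, cf. Figure 1.
    // Assumption: index[i] is the position of the last suffix that belongs to
    // prefix[i], so prefix i owns the suffix run (index[i-1], index[i]].
    struct CompressedItemsetCollection {
        std::vector<std::vector<int>> prefix;  // distinct (k-1)-prefixes, lexicographic order
        std::vector<std::size_t> index;        // one entry per prefix
        std::vector<int> suffix;               // all length-1 suffixes, run by run

        // Enumerate every itemset by expanding each prefix with its own suffixes.
        template <typename Visitor>
        void for_each_itemset(Visitor visit) const {
            std::size_t begin = 0;
            for (std::size_t i = 0; i < prefix.size(); ++i) {
                std::size_t last = index[i];
                for (std::size_t s = begin; s <= last; ++s) {
                    std::vector<int> itemset = prefix[i];  // shared (k-1)-prefix
                    itemset.push_back(suffix[s]);          // append the length-1 suffix
                    visit(itemset);
                }
                begin = last + 1;
            }
        }
    };

With this convention, the ten length-4 itemsets of Figure 1 occupy 9 + 3 + 10 = 21 cells instead of 40.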
2.3 Heuristics
One of the most important features of kDCI is its ability to adapt its behavior to the specific characteristics of the dataset. It is well known that being able to distinguish between sparse and dense datasets, for example, allows specific and effective optimizations to be adopted. Moreover, as we will explain in Section 2.4, if the number of frequent itemsets is much greater than the number of closed itemsets, it is possible to apply a counting inference procedure that dramatically reduces the time needed to determine itemset supports.

In kDCI we devised two main heuristics that allow us to distinguish between dense and sparse datasets and to decide whether or not to apply the counting inference procedure.

The first heuristic is simply based on a measure of the dataset density. Namely, we measure the correlation among the tidlists corresponding to the most frequent items. We require that the maximum number of frequent items for which such correlation is significant, weighted by the correlation degree itself, is above a given threshold.

Figure 3. Heuristic to establish whether a dataset is dense or sparse: a fraction f of the most frequent items shares a fraction p of the transactions, giving a density d = f·p that is compared against a threshold δ (δ = 0.2 in the example).
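The check itself is tiny once f and p are available; the sketch below only illustrates the comparison in the notation of Figure 3 and of the worked example that follows (the helper names are ours, not kDCI's API):

    // Density heuristic in the spirit of Figure 3: a fraction f of the most
    // frequent items shares a fraction p of the transactions; the dataset is
    // treated as dense when d = f * p exceeds the threshold delta.
    struct DensityEstimate {
        double f;  // fraction of frequent items with strongly overlapping tidlists
        double p;  // fraction of transactions shared by those tidlists
    };

    bool looks_dense(const DensityEstimate& e, double delta = 0.2) {
        return e.f * e.p > delta;
    }

    // Values from the worked example: f = 1/4, p = 0.90 gives d ≈ 0.23 > δ (dense),
    // while f ≈ 0.3, p = 0.50 gives d = 0.15 < δ (sparse).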
As an example, consider the two datasets in Figure 3, where tidlists are placed horizontally, i.e. rows correspond to items and columns to transactions. Suppose we choose a density threshold δ = 0.2. If we order the items according to their support, the densest region of the dataset is at the bottom of each figure. Starting from the bottom, we find the maximum number of items whose tidlists have a significant intersection. In the case of dataset (a), for example, a fraction f = 1/4 of the items shares p = 90% of the transactions, leading to a density of d = f·p = 0.25 × 0.9 ≈ 0.23, which is above the density threshold. For dataset (b), on the other hand, a smaller intersection of p = 50% is common to f = 1/3 of the items. In this case the density d = f·p ≈ 0.3 × 0.5 = 0.15 is lower than the threshold, and the dataset is considered sparse. It is worth noticing that, since this notion of density depends on the minimum support threshold, the same dataset can exhibit different behaviors when mined with different support thresholds.

Once the dataset density is determined, we adopt the same optimizations described in [10] for sparse and dense datasets. We review them briefly for completeness.

Sparse datasets. The main techniques used for sparse datasets can be summarized as follows:
– projection. Tidlists in sparse datasets are characterized by long runs of 0's. When intersecting the tidlists associated with the 2-prefix items belonging to a given candidate itemset, we keep track of such empty elements (words), in order to perform the following intersections faster. This can be considered a sort of raw projection of the vertical dataset, since some transactions, i.e. those corresponding to zero words, are not considered at all during the following tidlist intersections.
– pruning. We remove infrequent items from the dataset. This can result in some transactions remaining empty or with too few items. We therefore remove such transactions (i.e. columns in our bitvector vertical representation) from the dataset. Since this bitwise pruning may be expensive, we only perform it when the benefits introduced are expected to balance its cost.

Dense datasets. If the dataset is dense, we expect to deal with strong correlations among the most frequent items. This not only means that the tidlists associated with these most frequent items contain long runs of 1's, but also that they turn out to be very similar. The heuristic technique adopted by DCI, and consequently by kDCI, for dense datasets thus works as follows:
– we reorder the columns of the vertical dataset by moving identical segments of the tidlists associated with the most frequent items to the first consecutive positions;
– since each candidate is likely to include several of these most frequent items, we avoid repeated intersections of identical segments.

The heuristic for density evaluation is applied only once, as soon as the vertical dataset is built. After this decision is taken, we further check whether the counting inference strategy (see Section 2.4) can be profitable or not. The effectiveness of the inference strategy depends on the ratio between the total number of frequent itemsets and how many of them are key patterns. The closer to 1 this ratio is, the less advantage is introduced by the inference strategy. Since this ratio is not known until the computation is finished, we found that the same information can be derived from the average support of the frequent singletons (items) after the first scan.
The idea behind this is that if the average support of the single items that survived the first scan is high enough, then longer patterns can be expected to be frequent, and the number of key-pattern itemsets is likely to be lower than the number of frequent itemsets. We experimentally verified that this simple heuristic gives the correct output for all datasets, both real and synthetic.

To summarize the rationale behind kDCI's multiple-strategy approach: if the key-pattern optimization can be adopted, we use the counting inference method that allows many intersections to be avoided. For the intersections that cannot be avoided, and in the cases where the key-pattern inference method cannot be applied, we further distinguish between sparse and dense datasets and apply the two strategies explained above.

2.4 Pattern Counting Inference
In this section we describe the counting inference method, which constitutes the most important innovation introduced in kDCI. We exploit a technique inspired by the theoretical results presented in [3], where the PASCAL algorithm was introduced. PASCAL is able to infer the support of an itemset without actually counting its occurrences in the database. In this seminal work, the authors introduced the concept of key pattern (or key itemset). Given a generic pattern Q, it is possible to determine an equivalence class [Q], which contains the set of patterns that have the same support and are included in the same set of database transactions. Moreover, if we define min[P] as the set of the smallest itemsets in [P], a pattern P is a key pattern if P ∈ min[P], i.e. no proper subset of P is in the same equivalence class. Note that we can have several key patterns for each equivalence class.

Figure 4 shows an example of a lattice of frequent itemsets, taken from [3], where equivalence classes and key patterns are highlighted.

Figure 4. Example lattice of frequent itemsets.

Given an equivalence class [P], we can also define a corresponding closed set [12]: the closed set c of [P] is equal to max[P], so that no proper superset of c can belong to the same equivalence class [P]. Among the results illustrated in [3] we have the following important theorems:

Theorem 1. Q is a key pattern iff supp(Q) ≠ min_{p∈Q}(supp(Q \ {p})).

Theorem 2. If P is not a key pattern, and P ⊆ Q, then Q is a non-key pattern as well.

From Theorem 1 it is straightforward to observe that if Q is a non-key pattern, then:

    supp(Q) = min_{p∈Q}(supp(Q \ {p})).    (1)

Moreover, Theorem 1 says that we can check whether Q is a key pattern by comparing its support with the minimum support of its proper subsets, i.e. min_{p∈Q}(supp(Q \ {p})). We will show in the following how to use this property to make candidate support counting faster.

Theorems 1 and 2 give the theoretical foundations for the PASCAL algorithm, which finds the support of a non-key k-candidate Q by simply searching for the minimum support among all its (k−1)-subsets. Note that such a search can be performed during the pruning phase of the Apriori candidate generation. DCI does not perform candidate pruning because its intersection technique is comparatively faster; for this reason we do not adopt the PASCAL counting inference in kDCI. The following theorem, partially inspired by the proof of Theorem 2, suggests a faster way to compute the support of a non-key k-candidate Q. Before introducing the theorem, we need to define the function f, which assigns to each pattern P the set of all the transactions that include it.
We can define the support of a pattern in terms of f: supp(P) = |f(P)|. Note that f is a monotonically decreasing function, i.e. P1 ⊆ P2 ⇒ f(P2) ⊆ f(P1). This is obvious, because every transaction containing P2 surely contains all the subsets of P2.

Theorem 3. If P is a non-key pattern and P ⊆ Q, the following holds:

    f(Q) = f(Q \ (P \ P′)),

where P′ ⊂ P, and P and P′ belong to the same equivalence class, i.e. P, P′ ∈ [P].

Proof. Note that, since P is a non-key pattern, it is surely possible to find a pattern P′, P′ ⊂ P, belonging to the same equivalence class [P]. In order to prove the theorem we first show that f(Q) ⊆ f(Q \ (P \ P′)) and then that f(Q) ⊇ f(Q \ (P \ P′)) also holds. The first assertion, f(Q) ⊆ f(Q \ (P \ P′)), holds because (Q \ (P \ P′)) ⊆ Q and f is a monotonically decreasing function. To prove the second assertion, f(Q) ⊇ f(Q \ (P \ P′)), we can rewrite f(Q) as f((Q \ (P \ P′)) ∪ (P \ P′)), which is equivalent to f(Q \ (P \ P′)) ∩ f(P \ P′). Since f is decreasing, f(P) ⊆ f(P \ P′). But, since P, P′ ∈ [P], we can write f(P) = f(P′) ⊆ f(P \ P′). Therefore f(Q) = f(Q \ (P \ P′)) ∩ f(P \ P′) ⊇ f(Q \ (P \ P′)) ∩ f(P′). The last inequality is equivalent to f(Q) ⊇ f((Q \ (P \ P′)) ∪ P′). Since P′ ⊆ (Q \ (P \ P′)) clearly holds, it follows that f((Q \ (P \ P′)) ∪ P′) = f(Q \ (P \ P′)). So we can conclude that f(Q) ⊇ f(Q \ (P \ P′)), which completes the proof. □

The following corollary is trivial, since we defined supp(Q) = |f(Q)|.

Corollary 1. If P is a non-key pattern and P ⊆ Q, the support of Q can be computed as follows:

    supp(Q) = supp(Q \ (P \ P′)),

where P′ and P, P′ ⊂ P, belong to the same equivalence class, i.e. P, P′ ∈ [P].

Finally, we can introduce Corollary 2, which is a particular case of the previous one.

Corollary 2. If Q is a k-candidate (i.e. Q ∈ Ck) and P, P ⊂ Q, is a frequent non-key (k−1)-pattern (i.e. P ∈ Fk−1), there must exist P′ ∈ Fk−2, P′ ⊂ P, such that P and P′ belong to the same equivalence class, i.e. P, P′ ∈ [P], and P and P′ differ by a single item: {pdiff} = P \ P′. The support of Q can thus be computed as:

    supp(Q) = supp(Q \ (P \ P′)) = supp(Q \ {pdiff}).

Corollary 2 says that to find the support of a non-key candidate pattern Q, we can simply check whether or not Q \ {pdiff} belongs to Fk−1. If Q \ {pdiff} ∈ Fk−1, then Q inherits the same support as Q \ {pdiff} and is therefore frequent. Otherwise Q \ {pdiff} is not frequent, and hence neither is Q.

Using the theoretical result of Corollary 2, we adopted the following strategy to determine the support of a candidate Q at step k. In kDCI, we store with each itemset P ∈ Fk−1 the following information:
• supp(P);
• a flag indicating whether P is a key pattern or not;
• if P is a non-key pattern, also the item pdiff such that P \ {pdiff} = P′ ∈ [P].
Note that pdiff must be one of the items that we can remove from P to obtain a proper subset P′ of P belonging to the same equivalence class.
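A minimal sketch of this bookkeeping and of the resulting inference step is shown below. It is illustrative only: the container types and function names are ours, whereas kDCI actually keeps this information inside its compressed itemset collections. The point is simply that a candidate Q can sometimes copy its support from Q \ {pdiff} instead of being counted.

    #include <algorithm>
    #include <map>
    #include <optional>
    #include <vector>

    using Itemset = std::vector<int>;  // items kept in sorted order

    // Bookkeeping stored with every pattern in F_{k-1}.
    struct PatternInfo {
        int support = 0;
        bool is_key = true;  // no proper subset lies in the same equivalence class
        int p_diff = -1;     // for non-key patterns: P \ {p_diff} belongs to [P]
    };

    // F_{k-1} as a searchable collection (std::map stands in for the compressed,
    // binary-searchable itemset collection of Section 2.2).
    using FrequentSets = std::map<Itemset, PatternInfo>;

    // Given a candidate Q and the stored info of a non-key (k-1)-subset P of Q,
    // try to infer supp(Q) via Corollary 2 by looking up Q \ {p_diff}.
    // Returns nothing when Q \ {p_diff} is infrequent, in which case Q is
    // infrequent too, by downward closure.
    std::optional<PatternInfo> infer_support(const Itemset& Q,
                                             const PatternInfo& p_info,
                                             const FrequentSets& Fk_minus_1) {
        Itemset reduced = Q;
        reduced.erase(std::remove(reduced.begin(), reduced.end(), p_info.p_diff),
                      reduced.end());               // Q \ {p_diff}
        auto it = Fk_minus_1.find(reduced);
        if (it == Fk_minus_1.end()) return std::nullopt;
        PatternInfo q_info;
        q_info.support = it->second.support;        // supp(Q) = supp(Q \ {p_diff})
        q_info.is_key = false;                      // Q is non-key by Theorem 2
        q_info.p_diff = p_info.p_diff;              // Q \ {p_diff} belongs to [Q]
        return q_info;
    }

When all of Q's (k−1)-subsets are key patterns, this shortcut does not apply and the support is counted as usual, as the text below explains.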
During the generation of a generic candidate Q ∈ Ck, as soon as kDCI discovers that one of the subsets of Q, say P, is a non-key pattern, kDCI searches Fk−1 for the pattern Q \ {pdiff}, where pdiff is the item stored with P. If Q \ {pdiff} is found, then Q is a frequent non-key pattern (see Theorem 2), its support is supp(Q \ {pdiff}), and the item to store with Q is exactly pdiff. In fact, Q′ = Q \ {pdiff} ∈ [Q], i.e. pdiff is one of the items that we can remove from Q to obtain a subset Q′ belonging to the same equivalence class.

The worst case is when all the subsets of Q in Fk−1 are key patterns and the support of Q cannot be inferred from its subsets. In this case kDCI counts the support of Q as usual, and applies Theorem 1 to determine whether Q is a non-key pattern. If Q is a non-key pattern, its support becomes supp(Q) = min_{p∈Q}(supp(Q \ {p})) (see Theorem 1), while the item to be stored with Q is pdiff, i.e. the item to be subtracted from Q to obtain the pattern with the minimum support.

The impact of this counting inference technique on the performance of an FSC algorithm becomes evident if one considers the Apriori-like candidate generation strategy adopted by kDCI. From the combination of every pair of itemsets Pa and Pb ∈ Fk−1 that share the same (k−2)-prefix (we call them generators), kDCI generates a candidate k-itemset Q. For very dense datasets, most of the frequent patterns belonging to Fk−1 are non-key patterns. Therefore one or both of the patterns Pa and Pb used to generate Q ∈ Ck are likely to be non-key patterns. In such cases, in order to find a non-key pattern and then apply Corollary 2, it is not necessary to check the existence of further subsets of Q. For most of the candidates, a single binary search in Fk−1, to look for the pattern Q \ {pdiff}, is thus sufficient to compute supp(Q). Moreover, Q \ {pdiff} is often exactly equal to one of the two (k−1)-itemsets belonging to the generating pair (Pa, Pb): in this case kDCI does not need to perform any search at all to compute supp(Q).

We conclude this section with some examples of how the counting inference technique works. Let us consider Figure 4. Itemset Q = {A, B, E} is a non-key pattern because P = {B, E} is a non-key pattern as well. So, if P′ = {B}, kDCI will store pdiff = E with P. We have that supp({A, B, E}) = supp({A, B, E} \ ({B, E} \ {B})) = supp({A, B, E} \ {pdiff}) = supp({A, B}).
From the figure one can see that {A, B, E} and {A, B} both belong to the same equivalence class. Another example is itemset Q = {A, B, C, E}, which is generated by the two non-key patterns {A, B, C} and {A, B, E}. Suppose that P = {A, B, C}, i.e. the first generator, while P′ = {A, B}. In this case kDCI will store pdiff = C with P. We have that supp({A, B, C, E}) = supp({A, B, C, E} \ ({A, B, C} \ {A, B})) = supp({A, B, C, E} \ {pdiff}) = supp({A, B, E}), where {A, B, E} is exactly the second generator. In this case, no search is necessary to find {A, B, E}. Looking at the figure, it is possible to verify that {A, B, C, E} and {A, B, E} both belong to the same equivalence class.

3 Experimental Results
We experimentally evaluated kDCI's performance by comparing its execution time with the original implementations of state-of-the-art FSC algorithms, namely FP-growth (FP) [6], Opportunistic Projection (OP) [7] and Eclat with diffsets (Eclatd) [14], provided by their respective authors. We used an MS Windows XP workstation equipped with a Pentium IV 1800 MHz processor, 368 MB of RAM and an EIDE hard disk.

For the tests, we used both synthetic and real-world datasets. All the synthetic datasets used were created with the IBM dataset generator [1], while all the real-world datasets but one were downloaded from the UCI KDD Archive (http://kdd.ics.uci.edu/). We also extracted a real-world dataset from the TREC WT10g corpus [2]. The original corpus contained about 1.69 million Web documents. The dataset for our tests was built by considering the set of all the terms contained in each document as a transaction. Before generating the transactional dataset, the collection of documents was filtered by removing HTML tags and the most common words (stopwords), and by applying a stemming algorithm. The resulting trec dataset is huge: it is about 1.3 GB and contains 1.69 million short and long transactions, where the maximum length of a transaction is 71,473 items.

kDCI performance and comparisons. Figures 5 and 6 report the total execution time obtained running FP, Eclatd, OP, and kDCI on various datasets as a
On all the datasets of Figure 5 (chess, connect, pumsb, and pumsb_star), kDCI runs faster than the other algorithms. On pumsb its execution time is very similar to that of OP. For high support thresholds kDCI can drastically prune the dataset and build a compact vertical dataset whose tidlists present large similarities. Such similarity among tidlists is effectively exploited by our strategy for compact datasets. For smaller supports, the benefits introduced by the counting inference strategy become more evident, particularly for the pumsb_star and connect datasets. In these cases the number of frequent itemsets is much higher than the number of key patterns, thus allowing kDCI to drastically reduce the number of intersections needed to determine candidate supports. On the datasets mushroom and T30I16D400K (see Figure 6), kDCI outperforms the other competitors, and this also holds on the real-world dataset BMS_View_1 when mined with very small support thresholds (see Figure 6). Only on one dataset, namely T25I10D10K, are FP and OP faster than kDCI for all supports. The reason for this behavior is the size of the candidate set C3, which for this dataset is much larger than F3. While kDCI has to carry out a lot of useless work to determine the support of many candidate itemsets which are not frequent, FP-growth and OP take advantage of the fact that they do not require candidate generation.
Furthermore, differently from FP, Eclatd, and OP, kDCI can efficiently mine huge datasets such as trec and USCensus1990. Figure 7 reports the total execution time required by kDCI to mine these datasets with different support thresholds. The other algorithms failed to mine these datasets due to memory shortage, even when very large support thresholds were used. On the other hand, kDCI was able to mine such huge datasets since it adapts its behavior to both the size of the dataset and the main memory available.

Figure 7. Total execution time of kDCI on the datasets trec (a) and USCensus1990 (b) as a function of the support threshold.

4 Conclusions and Future Work

Due to the complexity of the problem, a good algorithm for FSC has to implement multiple strategies and some level of adaptiveness in order to successfully manage diverse and differently featured inputs. kDCI uses different approaches for extracting frequent patterns: count-based during the first iterations and intersection-based for the following ones. Moreover, a new counting inference strategy, together with adaptiveness and resource awareness, are the main innovative features of the algorithm. On the basis of the characteristics of the mined dataset, kDCI chooses which optimizations to adopt for reducing the cost of mining at run time. Dataset pruning and effective out-of-core techniques are exploited during the count-based phase, while the intersection-based phase, which starts only when the pruned dataset can fit into main memory, exploits a novel technique based on the notion of key pattern that in many cases allows kDCI to infer the support of an itemset without any counting.
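As a rough illustration of the resource-aware switch between the two phases, the sketch below captures the kind of run-time test involved; the size estimate and the memory-budget parameter are placeholders of our own, not kDCI's actual heuristics.

    #include <cstddef>

    // Stay count-based (out-of-core, Apriori-like) while the pruned dataset is
    // too large; switch to the in-core intersection-based phase once a vertical
    // representation of it would fit in the memory budget.
    struct DatasetStats {
        std::size_t num_transactions;    // transactions surviving pruning
        std::size_t num_frequent_items;  // frequent items surviving pruning
    };

    std::size_t vertical_bitmap_bytes(const DatasetStats& s) {
        // one bit per (item, transaction) pair in a vertical bitmap
        return (s.num_frequent_items * s.num_transactions + 7) / 8;
    }

    bool switch_to_intersection_phase(const DatasetStats& s,
                                      std::size_t memory_budget_bytes) {
        return vertical_bitmap_bytes(s) <= memory_budget_bytes;
    }

Under this reading, the test would be re-evaluated after each count-based iteration, since pruning keeps shrinking the dataset.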
kDCI also adopts compressed data structures and dynamic type selection to adapt itself to the characteristics of the dataset being mined. The experimental evaluation demonstrated that kDCI outperforms FP, OP, and Eclatd in most cases. Moreover, differently from the other FSC algorithms tested, kDCI can efficiently manage very large datasets, even on machines with limited physical memory. Although the variety of datasets used and the large number of tests conducted permit us to state that the performance of kDCI is not highly influenced by dataset characteristics, and that our optimizations are very effective and general, some further optimizations and future work will reasonably improve kDCI performance. More optimized data structures could be used to store itemset collections in order to make searches in such collections faster. Note that such fast searches are very important in kDCI, which bases its counting inference technique at level k on searching for frequent itemsets in Fk−1. Finally, we could benefit from a higher level of adaptiveness to the memory available on the machine, either with fully memory-mapped data structures or with out-of-core ones, depending on the data size. This should allow better scalability and wider applicability of the algorithm.

5 Acknowledgments

We acknowledge J. Han, Y. Pan, M. J. Zaki, and J. Liu for kindly providing us with the latest versions of their FSC software.

References

[1] R. Agrawal and R. Srikant. Fast Algorithms for Mining Association Rules in Large Databases. In Proc. of the 20th VLDB Conf., pages 487–499, 1994.
[2] P. Bailey, N. Craswell, and D. Hawking. Engineering a multi-purpose test collection for Web retrieval experiments. Information Processing and Management, to appear.
[3] Y. Bastide, R. Taouil, N. Pasquier, G. Stumme, and L. Lakhal. Mining frequent patterns with counting inference. ACM SIGKDD Explorations Newsletter, 2(2):66–75, December 2000.
[4] S. Brin, R. Motwani, J. D. Ullman, and S. Tsur. Dynamic itemset counting and implication rules for market basket data. In J. Peckham, editor, SIGMOD 1997, Proceedings ACM SIGMOD International Conference on Management of Data, May 13-15, 1997, Tucson, Arizona, USA. ACM Press, 1997.
[5] B. Goethals. Efficient Frequent Itemset Mining. PhD thesis, Limburg University, Belgium, 2003.
[6] J. Han, J. Pei, and Y. Yin. Mining Frequent Patterns without Candidate Generation. In Proc. of the ACM SIGMOD Int. Conf. on Management of Data, pages 1–12, Dallas, Texas, USA, 2000.
[7] J. Liu, Y. Pan, K. Wang, and J. Han. Mining Frequent Item Sets by Opportunistic Projection. In Proc. 2002 Int. Conf. on Knowledge Discovery in Databases (KDD'02), Edmonton, Canada, 2002.
[8] S. Orlando, P. Palmerini, and R. Perego. Enhancing the Apriori Algorithm for Frequent Set Counting. In Proc. of 3rd Int. Conf. on Data Warehousing and Knowledge Discovery (DaWaK 01), Munich, Germany, volume 2114 of LNCS, pages 71–82. Springer, 2001.
[9] S. Orlando, P. Palmerini, and R. Perego. On Statistical Properties of Transactional Datasets. In 2004 ACM Symposium on Applied Computing (SAC 2004), 2004. To appear.
[10] S. Orlando, P. Palmerini, R. Perego, and F. Silvestri. Adaptive and Resource-Aware Mining of Frequent Sets. In Proc. of the 2002 IEEE International Conference on Data Mining (ICDM'02), pages 338–345, 2002.
[11] J. S. Park, M.-S. Chen, and P. S. Yu. An Effective Hash Based Algorithm for Mining Association Rules. In Proc. of the 1995 ACM SIGMOD Int. Conf. on Management of Data, pages 175–186, 1995.
[12] N. Pasquier, Y. Bastide, R. Taouil, and L. Lakhal. Discovering frequent closed itemsets for association rules. Lecture Notes in Computer Science, 1540:398–416, 1999.
[13] J. Pei, J. Han, H. Lu, S. Nishio, S. Tang, and D. Yang. H-Mine: Hyper-Structure Mining of Frequent Patterns in Large Databases. In Proc. of the 2001 IEEE International Conference on Data Mining (ICDM'01), San Jose, CA, USA, 2001.
[14] M. J. Zaki and K. Gouda. Fast Vertical Mining Using Diffsets. In 9th Int. Conf. on Knowledge Discovery and Data Mining, Washington, DC, 2003.

APRIORI, A Depth First Implementation

Walter A. Kosters
Leiden Institute of Advanced Computer Science
Universiteit Leiden
P.O. Box 9512, 2300 RA Leiden, The Netherlands
kosters@liacs.nl

Wim Pijls
Department of Computer Science
Erasmus University
P.O. Box 1738, 3000 DR Rotterdam, The Netherlands
pijls@few.eur.nl
Abstract

We will discuss DF, the depth first implementation of APRIORI as devised in 1999 (see [8]). Given a database, this algorithm builds a trie in memory that contains all frequent itemsets, i.e., all sets that are contained in at least minsup transactions from the original database. Here minsup is a threshold value given in advance. In the trie, which is constructed by adding one item at a time, every path corresponds to a unique frequent itemset. We describe the algorithm in detail, derive theoretical formulas, and provide experiments.

1 Introduction

In this paper we discuss the depth first (DF, see [8]) implementation of APRIORI (see [1]), one of the fastest known data mining algorithms to find all frequent itemsets in a large database, i.e., all sets that are contained in at least minsup transactions from the original database. Here minsup is a threshold value given in advance. There exist many implementations of APRIORI (see, e.g., [6, 11]). We would like to focus on algorithms that assume that the whole database fits in main memory, this often being the state of affairs; among these, DF and FP (the FP-growth implementation of APRIORI, see [5]) are the fastest. In most papers so far little attention has been given to theoretical complexity. In [3, 7] a theoretical basis for the analysis of these two algorithms was presented.
The depth first algorithm is a simple algorithm that proceeds as follows. After some preprocessing, which involves reading the database and sorting the single items with respect to their support, DF builds a trie in memory, where every path from the root downwards corresponds to a unique frequent itemset; in consecutive steps items are added to this trie one at a time. Both the database and the trie are kept in main memory, which might cause memory problems: both are usually very large, and in particular the trie gets much larger as the support threshold decreases. Finally the algorithm outputs all paths in the trie, i.e., all frequent itemsets. Note that once completed, the trie allows for fast itemset retrieval in the case of online processing. We formerly had two implementations of the algorithm, one being time efficient, the other being memory efficient (called dftime.cc and dfmemory.cc, respectively), where the time efficient version could not handle low support thresholds. The newest version (called dffast.cc) combines them into one even faster implementation, and runs for all support thresholds. In this paper we first describe DF, we then give some formal definitions and theoretical formulas, we discuss the program, provide experimental results, and conclude with some remarks.

2 The Algorithm

An appropriate data structure to store the frequent itemsets of a given database is a trie. As a running example in this section we use the dataset of Figure 1. Each line represents a transaction. The trie of frequent patterns is shown in Figure 2. The entries (or cells) in a node of a trie are usually called buckets, as is also the case for a hash-tree. Each bucket can be identified with its path to the root and hence with a unique frequent itemset. The example trie has 9 nodes and 18 buckets, representing 18 frequent itemsets. As an example, the frequent itemset ABEF can be seen as the leftmost path in the trie, whereas a set such as BCF is not present.
One of the oldest algorithms for finding frequent patterns is APRIORI, see [1]. This algorithm successively finds all frequent 1-itemsets, all frequent 2-itemsets, all frequent 3-itemsets, and so on. (A k-itemset has k items.) The frequent k-itemsets are used to generate candidate (k+1)-itemsets, where the candidates are only known to have two frequent subsets with k elements. After a pruning step, where candidates still having infrequent subsets are discarded, the support of the candidates is determined. The way APRIORI finds the frequent patterns implies that the trie is built layer by layer: first the nodes in the root (depth 0) are constructed, next the trie nodes at depth 1 are constructed, and so on. So, APRIORI can be thought of as an algorithm that builds the pattern trie in a breadth first way. We propose an algorithm that builds the trie in a depth first way.
We will explain the depth first construction of the trie using the dataset of Figure 1. Note that the trie grows from right to left. The algorithm proceeds as follows. In a preprocessing step, the support of each single item is counted and the infrequent items are eliminated. Let the n frequent items be denoted by a1, a2, ..., an. Next, the code from Figure 3 is executed.

Dataset (transaction number: items):
1: BCD
2: ABEF
3: ABEF
4: ABCF
5: ABCEF
6: CDEF

Frequent itemsets when minsup = 3:
support 5: B, F
support 4: A, AB, AF, ABF, BF, C, E, EF
support 3: AE, ABE, ABEF, AEF, BC, BE, BEF, CF

Figure 1. An example of a dataset along with its frequent itemsets.

Figure 2. An example of a trie (without support counts).

(1) T := the trie including only bucket an;
(2) for k := n − 1 downto 1 do
(3)   T′ := T;
(4)   T := T′ with ak added to the left and a copy of T′ appended to ak;
(5)   S := the subtrie rooted in ak;
(6)   count(S, k);
(7)   delete the infrequent itemsets from S;
(8) end for
(9) procedure count(S, k)::
(10)   for every transaction t including item ak do
(11)     for every itemset I in S do
(12)       if t supports I then increment I.support;

Figure 3. The algorithm.

The procedure count(S, k) determines the support of each itemset (bucket) in the subtrie S. This is achieved by a database pass, in which each transaction including item ak is considered. Any such transaction is "pushed" through S one at a time, where it only traverses a subtrie if it includes the root of this subtrie, meanwhile updating the support fields in the buckets. In the last paragraph of Section 4 a refinement of this part of the algorithm is presented. On termination of the algorithm, T contains exactly the frequent itemsets.
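To make the loop of Figure 3 concrete, the following C++ sketch implements it under simplifying assumptions: the n frequent items are already renumbered 0 .. n-1 in ascending order of frequency and the database is held as sorted integer vectors. The type and function names are ours, and none of dffast.cc's refinements (auxiliary fields, bit-packed database) are included.

    #include <algorithm>
    #include <cstddef>
    #include <memory>
    #include <vector>

    // One bucket = one item plus a support counter and an optional subtrie of
    // supersets; a node is a list of buckets.  A path from the root to a bucket
    // spells one itemset, as in Figure 2.
    struct Node;
    struct Bucket {
        int item;
        int support = 0;
        std::unique_ptr<Node> child;
    };
    struct Node {
        std::vector<Bucket> buckets;
    };

    // Deep copy of a subtrie: the "copy of T' appended to a_k" of line (4).
    std::unique_ptr<Node> clone(const Node* n) {
        if (!n) return nullptr;
        auto out = std::make_unique<Node>();
        for (const Bucket& b : n->buckets) {
            Bucket c;
            c.item = b.item;               // supports are re-counted, so they start at 0
            c.child = clone(b.child.get());
            out->buckets.push_back(std::move(c));
        }
        return out;
    }

    // Push one transaction (sorted item ranks) through a subtrie, incrementing the
    // support of every bucket whose path is contained in the transaction (line (12)).
    void count_one(Node* node, const std::vector<int>& t, std::size_t from) {
        if (!node) return;
        for (Bucket& b : node->buckets) {
            auto it = std::lower_bound(t.begin() + from, t.end(), b.item);
            if (it == t.end() || *it != b.item) continue;   // path not supported: backtrack
            ++b.support;
            count_one(b.child.get(), t, static_cast<std::size_t>(it - t.begin()) + 1);
        }
    }

    // Line (7): remove infrequent buckets (and their subtries) from a subtrie.
    void prune(Node* node, int minsup) {
        if (!node) return;
        auto& bs = node->buckets;
        bs.erase(std::remove_if(bs.begin(), bs.end(),
                                [minsup](const Bucket& b) { return b.support < minsup; }),
                 bs.end());
        for (Bucket& b : bs) prune(b.child.get(), minsup);
    }

    // The loop of Figure 3.  Item n-1 is the most frequent and seeds the trie;
    // single items are frequent by preprocessing, so only new subtries are pruned.
    std::unique_ptr<Node> build_trie(const std::vector<std::vector<int>>& db,
                                     int n, int minsup) {
        auto trie = std::make_unique<Node>();
        trie->buckets.push_back(Bucket{n - 1});          // line (1)
        for (int k = n - 2; k >= 0; --k) {               // line (2)
            Bucket fresh{k};
            fresh.child = clone(trie.get());             // lines (3)-(4): candidates
            trie->buckets.insert(trie->buckets.begin(), std::move(fresh));
            Bucket& bk = trie->buckets.front();
            for (const auto& t : db)                     // lines (5)-(6): one database pass
                if (std::binary_search(t.begin(), t.end(), k)) {
                    ++bk.support;
                    count_one(bk.child.get(), t, 0);
                }
            prune(bk.child.get(), minsup);               // line (7)
        }
        return trie;
    }

The clone step is what dffast.cc avoids by counting in auxiliary fields first and copying only the frequent buckets (see Section 4).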
Figure 4 illustrates the consecutive steps of the algorithm applied to our example. The single items surpassing the minimum support threshold 3 are A, B, C, E and F. In the figure, the shape of T after each iteration of the for loop is shown. Also the infrequent itemsets to be deleted at the end of an iteration are mentioned. At the start of the iteration with index k, the root of the trie T consists of the 1-itemsets ak+1, ..., an. (We denote a 1-itemset by the name of its only item, omitting curly braces and commas as in Figure 1 and Figure 4.) By the statement in line (3) from Figure 3, this trie may also be referred to as T′. A new trie T is composed by adding bucket ak to the root and by appending a copy of T′ (the former value of T) to ak. The newly added buckets are the new candidates and they make up a subtrie S. In Figure 4, the candidate set is in the left part of each trie and is drawn in bold.

Figure 4. Illustrating the algorithm: in consecutive iterations CE and CEF, then BCF, and finally ABC, AC and ACF turn out to be infrequent and are deleted.

Notice that the final trie (after deleting the infrequent itemsets) is identical to Figure 2. The number of iterations in the for loop is one less than the number of frequent 1-itemsets. Consequently, the number of database passes is also one less than the number of frequent 1-itemsets. This makes the algorithm tractable only if the database under consideration is memory-resident. Given present-day memory sizes, this is not a real constraint any more.
As stated above, our algorithm has a preprocessing step which counts the support of each single item. After this preprocessing step, the items may be re-ordered. The most favorable execution time is achieved if we order the items by increasing frequency (see Section 3 for a more formal motivation). It is better to have low support at the top of the deeper side (towards the lower left) of the trie and hence high support at the top of the shallow part (towards the upper right) of the trie.
We may distinguish between "dense" and "sparse" datasets. A dense dataset has many frequent patterns of large size and high support, as is the case for test sets such as chess and mushroom (see Section 5). In those datasets, many transactions are similar to each other. Datasets with mainly short patterns are called sparse; longer patterns may exist, but with relatively small support. Real-world transaction databases of supermarkets mostly belong to this category. The synthetic datasets from Section 5 also have similar properties: interesting support thresholds are much lower than in the dense case.
Algorithms for finding frequent patterns may be divided into two types: algorithms with and without candidate generation, respectively. Any APRIORI-like instance belongs to the first type; Eclat (see [9]) may also be considered an instance of this type. The FP-growth algorithm FP from [5] is the best-known instance of the second type (though one can also defend the point of view that it does generate candidates). For dense datasets, FP performs better than candidate generating algorithms: FP stores the dataset in a way that is very efficient especially when the dataset has many similar transactions. In the case of algorithms that do apply candidate generation, dense sets produce a large number of candidates. Since each new candidate has to be related to each transaction, the database passes take a lot of time. However, for sparse datasets, candidate generation is a very suitable method for finding frequent patterns. In our experience, the instances of the APRIORI family are very useful when searching transaction databases. According to the results in [7], the depth first algorithm DF outperforms FP-growth (FP) on the synthetic transaction sets (see Section 5 for a description of these sets). Finally, note that technically speaking DF is not a full implementation of APRIORI, since every candidate itemset is known to have only one frequent subset (resulting from the part of the trie which has already been completed) instead of two. Apart from this, its underlying candidate generation mechanism strongly resembles the one from APRIORI.

3 Theoretical Complexity

Let m denote the number of transactions (also called customers), and let n denote the number of products (also called items).
Usually is much larger than . For a nonempty itemset      we define:      is the support of : the number of customers that buy all products from (and possibly more), or equivalently the number of transactions that contain ;     is the smallest number in ; is the largest number in . In line with this we let    . We also put     and      . A set      is called frequent if     , where the so-called support threshold minsup is a fixed number given in advance. We assume every 1-itemset to be frequent; this can be effected by the first step of the algorithms we are looking at, which might be considered as preprocessing. A “database query” is defined as a question of the form “Does customer  buy product  ?” (or “Does transaction has item  ?”), posed to the original database. Note that we have  database queries in the “preprocessing” phase in which the supports of the 1-itemsets are computed and ordered: every field of the database is inspected once. (By the way, the sorting, in which the items are assigned the  time.) The number numbers    , takes   of database queries for  equals:                 (1) For a proof, see [3]. It relies on the fact that in order for a node to occur in the trie the path to it (except for the root) should be frequent, and on the observation that this particular node is “questioned” every time a transaction follows this same path. In [3] a simple version of  is described in a similar style, leading to      (2)        database queries in “local databases” (FP-trees), except for the preprocessing phase. Note the extra condition on the inner summation (which is “good” for  : we have less summands there), while on the other hand the summands are larger (which is “good” for  : we have a smaller contribution there). It makes also sense to look at the total number of nodes of the trie during its construction, which is connected to the effort of maintaining and using the datastructures. Counting each trie-node with the number of buckets it contains, the total is computed to be:               ℄ (3) When the trie is finally ready the number of remaining buckets equals the number of frequent sets, each item in a node being the end of the path that represents the corresponding itemset. Notice that the complexity heavily depends on the sorting order of the items at the top level. It turns out that an increasing order of items is beneficial here. This is suggested by the contribution of the 1-itemsets in Equation (1):        (4) which happens to be minimal in that case. This 1-itemset contribution turns out to be the same for both  and  : see [3, 7], where also results for  are presented in more detail. 4 Implementation Issues In this section we discuss some implementation details of our program. As mentioned in Section 2, the database is traversed many times. It is therefore necessary that the database is memory-resident. Fortunately, only the occurrences of frequent items need to be stored. The database is represented by a two-dimensional boolean array. For efficiency reasons, one array entry corresponds to one bit. Since the function count in the algorithm considers the database transaction by transaction, a horizontal layout is chosen, cf. [4]. We have four preprocessing steps before the algorithm of Section 2 actually starts. 1 The range of the item values is determined. This is necessary, because some test sets, e.g., the BMS-WebView sets, have only values   . 2 This is an essential initial step. First, for each item the support is counted. 
Next, the frequent items are selected and sorted by frequency. This process is relevant, since the frequency order also prescribes the order in the root of the trie, as stated before. The sorted frequent items along with their supports are retained in an array.
3 If a transaction has zero or one frequent item, it need not be stored in the memory-resident representation of the database. For constructing the other buckets, only transactions with at least two frequent items are relevant. In this step, we count the relevant transactions.
4 During this step the database is stored in a two-dimensional array with horizontal layout. Each item is given a new number, according to its rank in the frequency order. The length of the array equals the result of step 3; the width is determined by the number of frequent items.
The root of the trie is constructed according to the information gathered in step 2. After this preparatory work, which in practice usually takes a few seconds, the code as described in Section 2 is executed. The cells of the root are constructed using the result of initial step 2. In line (12) from Figure 3 in Section 2, backtracking is applied to inspect each path of S. Inspecting a path is aborted as soon as an item outside the current transaction is found. Obviously, processing one transaction during the count procedure is a relatively expensive task, which is unfortunately inevitable, whichever version of APRIORI is used.
As mentioned in the introduction, we used to have two implementations, one being time efficient, the other being memory efficient. These two have been used in the overall FIMI'03 comparisons. The newest implementation (called dffast.cc) combines these versions by using the following refinement. Instead of appending a copy T′ of T to ak (see Figure 3 in Section 2), first the counting is done in auxiliary fields in the original T, after which only the frequent buckets are copied underneath ak. This makes the deletion of infrequent itemsets (line (7) from Figure 3) unnecessary and leads to better memory management. Another improvement might be achieved by using more auxiliary fields while adding two root items simultaneously in each iteration, thereby halving the number of database passes at the cost of more bookkeeping.
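A minimal sketch of the horizontal bit-matrix layout described in this section follows; the class name, the 64-bit packing and the step-3 filter are our own illustration rather than the actual dffast.cc data structure.

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // Horizontal layout: one row of bits per stored transaction, one bit per
    // frequent item (items renumbered 0 .. m-1 by frequency rank).  Only
    // transactions with at least two frequent items are stored (step 3).
    class BitDatabase {
    public:
        explicit BitDatabase(std::size_t num_frequent_items)
            : width_((num_frequent_items + 63) / 64) {}

        // append one transaction given as renumbered frequent-item ids
        void add_transaction(const std::vector<int>& items) {
            if (items.size() < 2) return;                 // step 3: skip short transactions
            rows_.resize(rows_.size() + width_, 0);
            std::uint64_t* row = &rows_[rows_.size() - width_];
            for (int i : items) row[i / 64] |= (std::uint64_t{1} << (i % 64));
            ++num_rows_;
        }

        // the test used by the count procedure: does transaction t contain item i?
        bool contains(std::size_t t, int i) const {
            return (rows_[t * width_ + i / 64] >> (i % 64)) & 1;
        }

        std::size_t size() const { return num_rows_; }

    private:
        std::size_t width_;                // 64-bit words per transaction
        std::size_t num_rows_ = 0;
        std::vector<std::uint64_t> rows_;  // num_rows_ * width_ words
    };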
5 Experiments

Using the relatively small database chess (342 kB, with 3,196 transactions; available from the FIMI'03 website at http://fimi.cs.helsinki.fi/testdata.html), the database mushroom (570 kB, with 8,124 transactions; also available from the FIMI'03 website) and the well-known IBM-Almaden synthetic databases (see [2]), we shall examine the complexity of the algorithm. These databases have either few, but coherent, records (chess and mushroom) or many records (the synthetic databases). The parameters for generating a synthetic database are the number of transactions (in thousands), the average transaction size and the average length of the so-called maximal potentially large itemsets. The number of items was set following the design in [2]. We use T10I4D100K (4.0 MB) and T40I10D100K (15.5 MB), both also available from the FIMI'03 website mentioned above; they both contain 100,000 transactions.
The experiments were conducted on a Pentium IV machine with 512 MB memory at 2.8 GHz, running Red Hat Linux 7.3. The program was developed under the GNU C++ compiler, version 2.96. The following statistics are plotted in the graphs: the execution time in seconds of the algorithm (see Section 4), and the total number of frequent itemsets; in all figures the corresponding axis is on the right hand side and scales 0–5,500,000 (0–8,000,000 for T10I4D100K). The execution time excludes preprocessing: in this phase the database is read three times in order to detect the frequent items (see before); also excluded is the time needed to print the resulting itemsets. These actions together usually take only a few seconds.

Figure 5. Experimental results for database chess.
Figure 6. Experimental results for database mushroom.
Figure 7. Experimental results for database T10I4D100K.
Figure 8. Experimental results for database T40I10D100K.

The number of frequent 1-itemsets (n from the previous sections, where we assumed all 1-itemsets to be frequent) has range 31–39 for the experiments on the database chess, 54–76 for mushroom, 844–869 for T10I4D100K and 610–862 for T40I10D100K. Note the very high support thresholds for mushroom (at least 5%) and chess (at least 44%); for T10I4D100K a support threshold as low as 0.003% was even feasible. The largest output files produced are of size 110.6 MB (chess, minsup = 1,400, having 3,771,728 frequent sets with 13 frequent 17-itemsets), 121.5 MB (mushroom, minsup = 400, having 3,457,747 frequent sets with 24 frequent 17-itemsets), 131.1 MB (T10I4D100K, minsup = 3, having 6,169,854 frequent sets with 30 frequent 13-itemsets and 1 frequent 14-itemset) and 195.9 MB (T40I10D100K, minsup = 300, having 5,058,313 frequent sets, with 21 frequent 19-itemsets and 1 frequent 20-itemset). The final trie in the T40I10D100K case occupies approximately 65 MB of memory, while the output file in this case is 3 times as large.
Note that the frequent sets for the chess database with minsup = 1,400 require 829 seconds to find, whereas those for mushroom with minsup = 400 take 158 seconds, differing by approximately a factor of 5 in time. This difference in runtime is probably caused by the difference in the absolute minsup value: each cell corresponding to a frequent itemset is visited at least 1,400 times in the former case against 400 times in the latter case. A similar phenomenon is observed when comparing T40I10D100K with absolute minsup value 300 and T10I4D100K with minsup = 3: this takes 378 versus 88 seconds. Although the outputs have the same orders of magnitude, the runtimes differ substantially. We see that, besides the number of frequent itemsets and the sizes of these sets, the absolute minsup value is a major factor determining the runtime.

6 Conclusions

In this paper, we addressed DF, a depth first implementation of APRIORI.
To our experience,  competes with any other well-known algorithm, especially when applied to large databases with transactions. Storing the database in the primary memory is no longer a problem. On the other hand, storing the candidates causes trouble in situations, where a dense database is considered with a small support threshold. This is the case for any algorithm using candidates. Therefore, it would be desirable to look for a method which stores candidates in secondary memory. This is an obvious topic for future research. To our knowledge,  is the only algorithm that can cope with memory limitations. However, for real world retail databases this algorithm is surpassed by  , as we showed in [7]. Other optimizations might also be possible. Besides improving the C  code, ideas from, e.g., [10] on diffsets with vertical layouts might be used. Our conclusion is that  is a simple, practical, straightforward and fast algorithm for finding all frequent itemsets. References [1] R. Agrawal, H. Mannila, R. Srikant, H. Toivonen, and A.I. Verkamo. Fast discovery of association rules. In U.M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, editors, Advances in Knowledge Discovery and Data Mining, pages 307–328. AAAI/MIT Press, 1996. [2] R. Agrawal and R. Srikant. Fast algorithms for mining association rules. In J.B. Bocca, M. Jarke, and C. Zaniolo, editors, Proceedings 20th International Conference on Very Large Data Bases, VLDB, pages 487– 499. Morgan Kaufmann, 1994. [3] J.M. de Graaf, W.A. Kosters, W. Pijls, and V. Popova. A theoretical and practical comparison of depth first and FP-growth implementations of Apriori. In H. Blockeel and M. Denecker, editors, Proceedings of the Fourteenth Belgium-Netherlands Artificial Intelligence Conference (BNAIC 2002), pages 115–122, 2002. [4] B. Goethals. Survey on frequent pattern mining. Helsinki, 2003.        . [5] J. Han, J. Pei, and Y. Yin. Mining frequent patterns without candidate generation. In Proceedings 2000 ACM SIGMOD International Conference on Management of Data (SIGMOD’00), pages 1–12, 2000. [6] J. Hipp, U. Günther, and G. Nakhaeizadeh. Mining association rules: Deriving a superior algorithm by analyzing today’s approaches. In D.A. Zighed, J. Komorowski, and J. Żytkov, editors, Principles of Data Mining and Knowledge Discovery, Proceedings of the 4th European Conference (PKDD 2000), Springer Lecture Notes in Computer Science 1910, pages 159– 168. Springer Verlag, 2000. [7] W.A. Kosters, W. Pijls, and V. Popova. Complexity analysis of depth first and FP-growth implementations of Apriori. In P. Perner and A. Rosenfeld, editors, Machine Learning and Data Mining in Pattern Recognition, Proceedings MLDM 2003, Springer Lecture Notes in Artificial Intelligence 2734, pages 284– 292. Springer Verlag, 2003. [8] W. Pijls and J.C. Bioch. Mining frequent itemsets in memory-resident databases. In E. Postma and M. Gyssens, editors, Proceedings of the Eleventh Belgium-Netherlands Conference on Artificial Intelligence (BNAIC1999), pages 75–82, 1999. [9] M.J. Zaki. Scalable algorithms for association mining. IEEE Transactions on Knowledge and Data Engineering, 12:372–390, 2000. [10] M.J. Zaki and K. Gouda. Fast vertical mining using diffsets. In Proceedings 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2003. [11] Z. Zheng, R. Kohavi, and L. Mason. Real world performance of association rule algorithms. In F. Provost and R. 
Srikant, editors, Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-2001), pages 401–406, 2001. AFOPT: An Efficient Implementation of Pattern Growth Approach∗ Guimei Liu Hongjun Lu Department of Computer Science Hong Kong University of Science & Technology Hong Kong, China {cslgm, luhj}@cs.ust.hk Jeffrey Xu Yu Department of Systems Engineering and Engineering Management The Chinese University of Hong Kong Hong Kong, China {yu}@se.cuhk.edu.hk Abstract Wei Wang Xiangye Xiao Department of Computer Science Hong Kong University of Science & Technology Hong Kong, China {fervvac, xiaoxy}@cs.ust.hk candidate itemsets, many of which are proved to be infrequent after scanning the database; and (3) subset checking is a cost operation, especially when itemsets are very long. The pattern growth approach avoids the cost of generating and testing a large number of candidate itemsets by growing a frequent itemset from its prefix. It constructs a conditional database for each frequent itemset t such that all the itemsets that have t as prefix can be mined only using the conditional database of t. The basic operations in the pattern growth approach are counting frequent items and new conditional databases construction. Therefore, the number of conditional databases constructed during the mining process, and the mining cost of each individual conditional database have a direct effect on the performance of a pattern growth algorithm. The total number of conditional databases mainly depends on in what order the search space is explored. The traversal cost and construct cost of a conditional database depends on the size, the representation format (tree-based or array-based) and construction strategy (physical or pseudo) of the conditional database. If the conditional databases are represented by tree structure, the traversal strategy of the tree structure also matters. In this paper, we investigate various aspects of the pattern growth approach, and try to find out what are good strategies for a pattern growth algorithm. The rest of the paper is organized as follows: Section 2 revisits the FIM problem and introduces some related works; In Section 3, we describe an efficient pattern growth algorithm—AFOPT; Section 4 and Section 5 extend the AFOPT algorithm to mine frequent closed itemsets and maximal frequent itemsets respectively; Section 6 shows experiment results; finally, Section 7 concludes this paper. In this paper, we revisit the frequent itemset mining (FIM) problem and focus on studying the pattern growth approach. Existing pattern growth algorithms differ in several dimensions: (1) item search order; (2) conditional database representation; (3) conditional database construction strategy; and (4) tree traversal strategy. They adopted different strategies on these dimensions. Several adaptive algorithms were proposed to try to find good strategies for general situations. In this paper, we described the implementation techniques of an adaptive pattern growth algorithm, called AFOPT, which demonstrated good performance on all tested datasets. We also extended the algorithm to mine closed and maximal frequent itemsets. Comprehensive experiments were conducted to demonstrate the efficiency of the proposed algorithms. 1 Introduction Since the frequent itemset mining problem (FIM) was first addressed [2], a large number of FIM algorithms have been proposed. 
There is a pressing need to completely characterize and understand the algorithmic performance space of FIM problem so that we can choose and integrate the best strategies to achieve good performance in general cases. Existing FIM algorithms can be classified into two categories: the candidate generate-and-test approach and the pattern growth approach. In each iteration of the candidate generate-and-test approach, pairs of frequent k-itemsets are joined to form candidate (k+1)-itemsets, then the database is scanned to verify their supports. The resultant frequent (k+1)-itemsets will be used as the input for next iteration. The drawbacks of this approach are: (1) it needs scan database multiple times, in worst case, equal to the maximal length of the frequent itemsets; (2) it needs generate lots of 2 Problem Revisit and Related Work In this section, we first briefly review FIM problem and the candidate generate-and-test approach, then focus on studying the algorithmic performance space of the pattern growth approach. ∗ This work was partly supported by the Research Grant Council of the Hong Kong SAR, China (Grant HKUST6175/03E, CUHK4229/01E). 1 2.1 Problem revisit that are estimated as frequent or themselves are not estimated as frequent but all of its subsets are frequent or estimated as frequent into next block. The problem with AIS algorithm is that it does not fully utilize the pruning power of the Apriori property, thus many unnecessary candidate itemsets are generated and tested. DIC algorithm [3] makes improvements based on Apriori algorithm. It starts counting the support of an itemset shortly after all the subsets of that itemset are determined to be frequent rather than wait until next pass. However, DIC algorithm cannot guarantee the full utilization of memory. The candidate generate-andtest approach faces a trade-off: on one hand, the memory is not fully utilized and it is desirable to put as many as possible candidate itemsets into memory to reduce the number of database scans; on the other hand, set containment test is a costly operation, putting itemsets into memory in earlier stage has the risk of counting support for many unnecessary candidate itemsets. Given a transactional database D, let I be the set of items appearing in it. Any combination of the items in I can be frequent in D, and they form the search space of FIM problem. The search space can be represented using set enumeration tree [14, 1, 4, 5, 7]. For example, given a set of items I = {a, b, c, d, e} sorted in lexicographic order, the search space can be represented by a tree as shown in Figure 1. The root of the search space tree represents the empty set, and each node at level l (the root is at level 0, and its children are at level 1, and so on) represents an l-itemset. The candidate extensions of an itemset p is defined as the set of items after the last item of p. For example, items d and e are candidate extensions of ac, while b is not a candidate extension of ac because b is before c. The frequent extensions of p are those candidate extensions of p that can be appended to p to form a longer frequent itemset. In the rest of this paper, we will use cand exts(p) and f req exts(p) to denote the set of candidate extensions and frequent extensions of p respectively. 2.3 The pattern growth approach adopts the divide-andconquer methodology. The search space is divided into disjoint sub search spaces. 
For example, the search space shown in Figure 1 can be divided into 5 disjoint sub search spaces: (1) itemsets containing a; (2) itemsets containing b but no a; (3) itemsets containing c but no a, b; (4) itemsets containing d but no a, b and c; and (5) itemsets containing only e. Accordingly, the database is divided into 5 partitions, and each partition is called a conditional database. The conditional database of item i, denoted as Di , includes all the transactions containing item i. All the items before i are eliminated from each transaction. All the frequent itemsets containing i can be mined from Di without accessing other information. Each conditional database is divided recursively following the same procedure. The pattern growth approach not only reduces the number of database scans, but also avoids the costly set-containment-test operation. Two basic operations in pattern growth approach are counting frequent items and new conditional databases construction. Therefore, the total number of conditional databases constructed and the mining cost of each individual conditional database are key factors that affect the performance of a pattern growth algorithm. The total number of conditional databases mainly depends on in what order the search space is explored. This order is called item search order in this paper. Some structures for representing conditional databases can also help reduce the total number of conditional databases. For example, if a conditional database is represented by tree-structure and there is only one branch, then all the frequent itemsets in the conditional database can be enumerated directly from the branch. There is no need to construct new conditional databases. The mining cost of a conditional database depends on the size, the NULL{a,b ,c,d ,e} a ab ab c ab cd b c e d ac ad ae bc bd be cd ab d ab e acd ace ad e b cd b ce bde ab ce ab d e acd e ce de cd e b cd e ab cd e Figure 1. Search space tree 2.2 Pattern growth approach Candidate generate-and-test approach Frequent itemset mining can be viewed as a set containment join between the transactional database and the search space of FIM. The candidate generate-and-test approach essentially uses block nested loop join, i.e. the search space is the inner relation and it is divided into blocks according to itemset length. Different from simple block nested loop join, in candidate generate-and-test approach the output of the previous pass is used as seeds to generate next block. For example, in the k-th pass of the Apriori algorithm, the transaction database and the candidate k-itemsets are joined to generate frequent k-itemsets. The frequent kitemsets are then used to generate next block—candidate (k+1)-itemsets. Given the large amount of memory available nowadays, it is a waste of memory to put only a single length of itemsets into memory. It is desirable to fully utilize available memory by putting some longer and possibly frequent itemsets into memory in earlier stage to reduce the number of database scans. 
The first FIM algorithm AIS [2] tries to estimate the frequencies of longer itemsets using the output of current pass, and includes those itemsets 2 Datasets T10I4D100k (0.01%) T40I10D100k (0.5%) BMS-POS (0.05%) BMS-WebView-1 (0.06%) chess (45%) connect-4 (75%) mushroom (5%) pumsb (70%) #cdb 53688 311999 115202 33186 312202 12242 9838 272373 Asc time 4.52s 30.42s 27.83s 0.69s 2.68s 1.31s 0.34s 3.87s max mem 5199 kb 17206 kb 17294 kb 731 kb 574 kb 38 kb 1072 kb 383 kb #cdb 47799 310295 53495 65378 617401 245663 258068 649096 Lex time 4.89s 33.83s 127.45s 1.12s 8.46s 2.65s 3.11s 12.22s max mem 5471 kb 20011 kb 38005 kb 901 kb 1079 kb 57 kb 676 kb 570 kb #cdb 36725 309895 39413 79571 405720 266792 464903 469983 Des time 5.32s 43.37s 147.01s 2.16s 311.19s 14.27s 272.30s 16.62s max mem 5675 kb 21980 kb 40206 kb 918 kb 2127 kb 113 kb 2304 kb 1225 kb Table 1. Comparison of Three Item Search Orders (Bucket Size=0) representation and construction strategy of the conditional database. The traversal strategy also matters if the conditional database is represented using a tree-structure. Item Search Order. When we divide the search space, all items are sorted in some order. This order is called item search order. The sub search space of an item contains all the items after it in item search order but no item before it. Two item search orders were proposed in literature: static lexicographic order and dynamic ascending frequency order. Static lexicographic order is to order the items lexicographically. It is a fixed order—all the sub search spaces use the same order. Tree projection algorithm [15] and HMine algorithm[12] adopted this order. Dynamic ascending frequency order reorders frequent items in every conditional database in ascending order of their frequencies. The most infrequent item is the first item, and all the other items are its candidate extensions. The most frequent item is the last item and it has no candidate extensions. FP-growth [6], AFOPT [9] and most of maximal frequent itemsets mining algorithms [7, 1, 4, 5] adopted this order. The number of conditional databases constructed by an algorithm can differ greatly using different item search orders. Ascending frequency order is capable of minimizing the number and/or the size of conditional databases constructed in subsequent mining. Intuitively, an itemset with higher frequency will possibly have more frequent extensions than an itemset with lower frequency. We put the most infrequent item in front, though the candidate extension set is large, the frequent extension set cannot be very large. The frequencies of successive items increase, at the same time the size of candidate extension set decreases. Therefore we only need to build smaller and/or less conditional databases in subsequent mining. Table 1 shows the total number of conditional databases constructed (#cdb column), total running time and maximal memory usage when three orders are adopted in the framework of AFOPT algorithm described in this paper. The three item search orders compared are: dynamic ascending frequency order (Asc column), lexicographic order (Lex column) and dynamic descending frequency order (Des column). The minimum support threshold on each dataset is shown in the first column. On the first three datasets, ascending frequency order needs to build more conditional databases than the other two orders, but its total running time and maximal memory usage is less than the other two orders. 
It implies that the conditional databases constructed using ascending frequency order are smaller. On the remaining datasets, ascending frequency order requires to build less conditional databases and needs less running time and maximal memory usage, especially on dense datasets connect-4 and mushroom. Agrawal et al proposed an efficient support counting technique, called bucket counting, to reduce the total number of conditional databases[1]. The basic idea is that if the number of items in a conditional database is small enough, we can maintain a counter for every combination of the items instead of constructing a conditional database for each frequent item. The bucket counting can be implemented very efficiently compared with conditional database construction and traversal operation. Conditional Database Representation. The traversal and construction cost of a conditional database heavily depends on its representation. Different data structures have been proposed to store conditional databases, e.g. tree-based structures such as FP-tree [6] and AFOPT-tree [9], and array-based structure such as Hyper-structure [12]. Tree-based structures are capable of reducing traversal cost because duplicated transactions can be merged and different transactions can share the storage of their prefixes. But they incur high construction cost especially when the dataset is sparse and large. Array-based structures incur little construction cost but they need much more traversal cost because the traversal cost of different transactions cannot be shared. It is a trade-off in choosing tree-based structures or array-based structures. In general, tree-based structures are suitable for dense databases because there can be lots of prefix sharing among transactions, and array-based structures are suitable for sparse databases. Conditional Database Construction Strategy Constructing every conditional database physically can be expensive especially when successive conditional databases do not shrink much. An alternative is to pseudo-construct them, i.e. using pointers pointing to transactions in upper 3 Algorithms Tree-Projection [15] FP-growth [6] H-mine [12] OP [10] PP-mine [17] AFOPT [9] CLOSET+ [16] Item Search Order static lexicographic dynamic frequency static lexicographic adaptive static lexicographic dynamic frequency dynamic frequency CondDB Format array FP-tree hyper-structure adaptive PP-tree adaptive FP-tree CondDB Construction adaptive physical pseudo adaptive pseudo physical adaptive Tree Traversal bottom-up bottom-up top-down top-down adaptive Table 2. Pattern Growth Algorithms level conditional databases. However, pseudo-construction cannot reduce traversal cost as effectively as physical construction. The item ascending frequency search order can make the subsequent conditional databases shrink rapidly, consequently it is beneficial to use physical construction strategy with item ascending frequency order together. Tree Traversal Strategy The traversal cost of a tree is minimal using top-down traversal strategy. FP-growth algorithm [6] uses ascending frequency order to explore the search space, while FP-tree is constructed according to descending frequency order. Hence FP-tree has to be traversed using bottom-up strategy. As a result, FP-tree has to maintain parent links and node links at each node for bottom-up traversal. which increases the construction cost of the tree. 
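To make the per-node overhead just described concrete, here is a schematic comparison of the node fields implied by the two traversal strategies; the field names and layout are our own illustration, and real implementations differ in detail (for example, in how children are indexed).

    // Bottom-up traversal (FP-tree style): besides the downward links, every
    // node carries a parent pointer and a node-link to the next node with the
    // same item, so conditional pattern bases can be collected upwards.
    struct FPNode {
        int     item;
        int     count;
        FPNode* parent;     // for walking up towards the root
        FPNode* node_link;  // header-table chain of nodes with the same item
        FPNode* child;      // first child
        FPNode* sibling;    // next sibling
    };

    // Top-down traversal (AFOPT-tree style): only the downward links are kept,
    // which is two pointers fewer per node.
    struct AFOPTNode {
        int        item;
        int        count;
        AFOPTNode* child;    // first child
        AFOPTNode* sibling;  // next sibling
    };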
AFOPT algorithm [9] uses ascending frequency order both for search space exploration and prefix-tree construction, so it can use the top-down traversal strategy and do not need to maintain additional pointers at each node. The advantage of FP-tree is that it can be more compact than AFOPT-tree because descending frequency order increases the possibility of prefix sharing. The ascending frequency order adopted by AFOPT may lead to many single branches in the tree. This problem was alleviated by using arrays to store single branches in AFOPT-tree. Existing pattern growth algorithms mainly differ in the several dimensions aforementioned. Table 2 lists existing pattern growth algorithms and their strategies on four dimensions. AFOPT [9] is an efficient FIM algorithm developed by our group. We will discuss its technical details in next three sections. above three implications into consideration. The distinct features of our AFOPT algorithm include: (1) It uses three different structures to represent conditional databases: arrays for sparse conditional databases, AFOPT-tree for dense conditional databases, and buckets for counting frequent itemsets containing only top-k frequent items, where k is a parameter to control the number of buckets used. Several parameters are introduced to control when to use arrays or AFOPT-tree. (2) It adopts the dynamic ascending frequency order. (3) The conditional databases are constructed physically on all levels no matter whether the conditional databases are represented by AFOPT-tree or arrays. 3.1 Framework Given a transactional database D and a minimum support threshold, AFOPT algorithm scans the original database twice to mine all frequent itemsets. In the first scan, all frequent items in D are counted and sorted in ascending order of their frequencies, denoted as F = {i1 , i2 , · · · , im }. We perform another database scan to construct a conditional database for each ij ∈ F , denoted as Dij . During the second scan, infrequent items in each transaction t are removed and the remaining items are sorted according to their orders in F . Transaction t is put into Dij if the first item of t after sorting is ij . The remaining mining will be performed on conditional databases only. There is no need to access the original database. We first perform mining on Di1 to mine all the itemsets containing i1 . Mining on individual conditional database follows the same process as mining on the original database. After the mining on Di1 is finished, Di1 can be discarded. Because Di1 also contains other items, the transactions in it will be inserted into the remaining conditional databases. Given a transaction t in Di1 , suppose the next item after i1 in t is ij , then t will be inserted into Dij . This step is called push-right. Sorting the items in ascending order of their frequencies ensures that every time, a small conditional database is pushed right. The pseudo-code of AFOPT-all algorithm is shown in Algorithm 1. 3 Mining All Frequent Itemsets We discussed several trade-offs faced by a pattern growth algorithm in last section. Some implications from above discussions are: (1) Use tree structure on dense database and use array structure on sparse database. (2) Use dynamic ascending frequency order on dense databases and/or when minimum support threshold is low. It can dramatically reduce the number and/or the size of the successive conditional databases. 
(3) If dynamic ascending frequency order is adopted, then use physical construction strategy because the size of conditional databases will shrink quickly. In this section, we describe our algorithm AFOPT which takes the 3.2 Conditional database representation Algorithm 1 is independent of the representation of conditional databases. We choose proper representations ac4 ets of size 4 (=2bucket size ). The conditional databases of f and p are represented by AFOPT-tree. The conditional databases of item c and d are represented using arrays. From our experience, the bucket size parameter can choose a value around 10. A value between 20 and 200 will be safe for tree alphabet size parameter. We set tree min sup to 5% and tree avg sup to 10% in our experiments. Table 3 shows the size, construction time (build column) and push-right time if applicable, of the initial structure constructed from original database by AFOPT, H-Mine and FP-growth algorithms. We set bucket size to 8 and tree alphabet size to 20 for AFOPT algorithm. The initial structure of AFOPT includes all three structures. The array structure in AFOPT algorithm simply stores all items in a transaction. Each node in hyper-structure stores three pieces of information: an item, a pointer pointing to the next item in the same transaction and a pointer pointing to the same item in another transaction. Therefore the size of hyper-structure is approximately 3 times larger than the array structure used in AFOPT. A node in AFOPT-tree maintains only a child pointer and a sibling pointer, while a FPtree node maintains two more pointers for bottom-up traversal: a parent pointer and a node link. AFOPT consumes the least amount of space on almost all tested datasets. Algorithm 1 AFOPT-all Algorithm Input: p is a frequent itemset Dp is the conditional database of p min sup is the minimum support threshold; Description: 1: Scan Dp count frequent items, F ={i1 , i2 ,· · ·, in }; 2: Sort items in F in ascending order of their frequencies; 3: for all item i ∈ F do 4: Dp S{i} = φ; 5: for all transaction t ∈ Dp do 6: remove infrequent items from t, and sort remaining items according to their orders in F ; 7: let i be the first item of t, insert t into Dp S{i} . 8: for all item i ∈ S F do 9: Output s = p {i}; 10: AFOPT-all(s, Ds , min sup); 11: PushRight(Ds ); TID Transactions TID Transactions 1 a, b, c, f, m, p 1 c, p, f, m, a 2 a, d, e, f, g 2 d, f, a 3 a, b, f, m, n 3 f, m, a 4 a, c, e, f, m, p 4 c, p, f, m, a 5 d, f, n, p 5 d, p, f 6 a, c, h, m, p 6 c, p, m, a 7 a, d, m, s 7 d, m, a (a) D (b) header table c:3 d:3 p:4 f:5 m:5 4 4 3 2 2 2 p p f p f m f a a f f m m m a a a a:6 m:1 a:1 (c) Figure 2. Conditional DB Representation 4 Mining Frequent Closed Itemsets The complete set of frequent itemsets can be very large. It has been shown that it contains many redundant information [11, 18]. Some works [11, 18, 13, 16, 8] put efforts on mining frequent closed itemsets to reduce output size. An itemset is closed if all of its supersets have a lower support than it. The set of frequent closed itemsets is the minimum informative set of frequent itemsets. In this section, we describe how to extend Algorithm 1 to mine only frequent closed itemsets. For more details, please refer to [8]. cording to the density of conditional databases. Three structures are used: (1) array, (2) AFOPT-tree, and (3) buckets. As aforementioned, these three structures are suitable for different situations. 
Bucket counting technique is proper and extremely efficient when the number of distinct frequent items is around 10. Tree structure is beneficial when conditional databases are dense. Array structure is favorable when conditional databases are sparse. We use four parameters to control when to use these three structures as follows: (1) frequent itemsets containing only top-bucket size frequent items are counted using buckets; (2) if the minimum support threshold is greater than tree min sup or average support of all frequent items is no less than tree avg sup, then all the rest conditional databases are represented using AFOPT-tree; otherwise (3) the conditional databases of the next tree alphabet size most frequent items are represented using AFOPT-tree, and the rest conditional databases are represented using arrays. Figure 2 shows a transactional database D and the initial conditional databases constructed with min sup=40%. There are 6 frequent items {c:3, d:3, p:4, f :5, m:5,a:6}. Figure 2(b) shows the projected database after removing infrequent items and sorting. The values of the parameters for conditional database construction are set as follows: bucket size=2, tree alphabet size=2, tree min sup=50%, tree avg sup=60%. The frequent itemsets containing only m and a are counted using buck- 4.1 Removing non-closed itemsets Non-closed itemsets can be removed either in a postprocessing phase, or during mining process. The second strategy can help avoid unnecessary mining cost. Non-closed frequent itemsets are removed based on the following two lemmas (see [8] for proof of these two lemmas). Lemma 1 In Algorithm 1, an itemset p is closed if and only if two conditions hold: (1) no existing frequent itemsets is a superset of p and is as frequent as p; (2) all the items in Dp have a lower support than p. Lemma 2 In Algorithm 1, if a frequent itemset p is not closed because condition (1) in Lemma 1 does not hold, then none of the itemsets mined from Dp can be closed. We check whether there exists q such that p ⊂ q and sup(p)=sup(q) before mining Dp . If such q exists, then there is no need to mine Dp based on Lemma 2 (line 10). 5 Datasets AFOPT build 0.55s 1.85s 2.11s 0.12s 0.04s 0.73s 0.08s 0.82s size 5116 kb 16535 kb 17264 kb 711 kb 563 kb 35 kb 1067 kb 375 kb T10I4D100k (0.01%) T40I10D100k (0.5%) BMS-POS (0.05%) BMS-WebView-1 (0.06%) chess (45%) connect-4 (75%) mushroom (5%) pumsb (70%) pushright 0.37s 1.91s 1.43s 0.01s 0.01s 0.01s 0.04s 0.02s Size 11838 kb 46089 kb 38833 kb 1736 kb 1150 kb 22064 kb 2120 kb 17374 kb H-Mine build 0.68s 2.10s 2.58s 0.17s 0.05s 1.15s 0.10s 1.15s pushright 0.19s 1.42s 1.00s 0.01s 0.03s 0.55s 0.03s 0.43s FP-growth size 20403 kb 104272 kb 47376 kb 1682 kb 1339 kb 92 kb 988 kb 1456 kb build 1.83s 6.16s 6.64s 0.27s 0.12s 2.08s 0.17s 2.26s Table 3. Comparison of Initial Structures Thus the identification of a non-closed itemsets not only reduces output size, but also avoids unnecessary mining cost. Based on pruning condition (2) in Lemma 1, we can check whether an item i ∈ F appears in every transaction of Dp . If such i exists, then there is no need to consider the frequent itemsets that do not contain i when mining Dp . In other words, we can directly perform mining on Dp S{i} instead of Dp (line 3-4). The efforts for mining Dp S{j} , j 6= i are saved. The pseudo-code for mining frequent closed itemsets is shown in Algorithm 2. CFP-tree to check whether an itemset is closed. 
4.2 Closed itemset checking

During the mining process, we store all existing frequent closed itemsets in a tree structure, called the Condensed Frequent Pattern tree, or CFP-tree for short [8]. We use the CFP-tree to check whether an itemset is closed. An example CFP-tree is shown in Figure 3(b): it stores all the frequent closed itemsets listed in Figure 3(a), which are mined from the database of Figure 2(a) with minimum support 40%. Each CFP-tree node is a variable-length array, and all the items in the same node are sorted in ascending order of their frequencies. A path in the tree starting from an entry of the root node represents a frequent itemset.

The CFP-tree has two properties: the left containment property and the Apriori property. The Apriori property states that the support of any child of a CFP-tree entry cannot be greater than the support of that entry. The left containment property states that the item of an entry E can only appear in the subtrees pointed to by entries before E or in E itself. The superset of an itemset p with support s can be searched efficiently in the CFP-tree based on these two properties: the Apriori property can be exploited to prune subtrees pointed to by entries with support less than s, and the left containment property can be utilized to prune subtrees that do not contain all the items of p. We also maintain a hash-bitmap in each entry, indicating whether an item appears in the subtree pointed to by that entry, to further reduce the search cost. The superset search algorithm is shown in Algorithm 3; BinarySearch(cnode, s) returns the first entry of a CFP-tree node with support no less than s. Algorithm 3 does not require the whole CFP-tree to reside in main memory, because it is also very efficient on disk. Moreover, the CFP-tree is a compact representation of the frequent closed itemsets, so it has a higher chance of being held in memory than a flat representation.

Algorithm 3 Search_Superset
Input: l is a frequent itemset; cnode is the CFP-tree node pointed to by l; s is the minimum support threshold; I is the set of items to be contained in the superset.
Description:
1: if I = ∅ then
2:   return true;
3: Ē = the first entry of cnode such that Ē.item ∈ I;
4: E′ = BinarySearch(cnode, s);
5: for all entries E of cnode between E′ and Ē do
6:   l′ = l ∪ {E.item};
7:   if E.child ≠ NULL and all items of I − {E.item} are in E.subtree then
8:     found = Search_Superset(l′, E.child, s, I − {E.item});
9:     if found then
10:      return true;
11:  else if I − {E.item} = ∅ then
12:    return true;
13: return false;

Figure 3. CFP-tree and Two-layer Hash Map ((a) the frequent closed itemsets of the example database: a:6, f:5, ma:5, p:4, fa:4, d:3, pf:3, fma:3, pma:3, cpma:3; (b) the CFP-tree storing them; (c) the two-layer hash map before mining Df; the drawings are not reproduced here)

Although searching the CFP-tree is efficient, it is still costly when the CFP-tree is large. Inspired by the two-layer structure adopted by the CLOSET+ algorithm [16] for subset checking, we use a two-layer hash map, shown in Figure 3(c), to check whether an itemset is closed before searching the CFP-tree. We maintain a hash map for each item; the hash map of item i is denoted i.hashmap. The length of the hash map of item i is set to min{sup(i) − min_sup, max_hashmap_len}, where max_hashmap_len is a parameter controlling the maximal size of the hash maps and min_sup = min{sup(i) | i is frequent}. Given an itemset p = {i1, i2, ..., il}, p is mapped to ij.hashmap[(sup(p) − min_sup) % max_hashmap_len], for j = 1, 2, ..., l. An entry of a hash map records the maximal length of the itemsets mapped to it. For example, itemset {c, p, m, a} sets the first entry of c.hashmap, p.hashmap, m.hashmap and a.hashmap to 4. Figure 3(c) shows the status of the two-layer hash map before mining Df. An itemset p must be closed if any of the entries it is mapped to contains a value lower than its length; in that case there is no need to search the CFP-tree. The hash map of an item i can be released after all the frequent itemsets containing i have been mined, because it will not be used in later mining. For example, when mining Df, the hash maps of items c, d and p can be deleted.
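A minimal sketch of the two-layer hash map pre-check described above, using Python dictionaries in place of the fixed-length per-item arrays; min_sup and max_hashmap_len follow the definitions in the text, and the function names are ours.

def _slot(sup, min_sup, max_hashmap_len):
    return (sup - min_sup) % max_hashmap_len

def record_closed(itemset, sup, hashmaps, min_sup, max_hashmap_len):
    # each item's hash map remembers the maximal length of the closed
    # itemsets mapped to every slot
    slot = _slot(sup, min_sup, max_hashmap_len)
    for i in itemset:
        entry = hashmaps.setdefault(i, {})
        entry[slot] = max(entry.get(slot, 0), len(itemset))

def surely_closed(itemset, sup, hashmaps, min_sup, max_hashmap_len):
    # if any entry that `itemset` is mapped to recorded a length smaller
    # than |itemset|, no equally frequent superset has been stored, so the
    # CFP-tree search can be skipped and the itemset reported as closed
    slot = _slot(sup, min_sup, max_hashmap_len)
    return any(hashmaps.get(i, {}).get(slot, 0) < len(itemset) for i in itemset)

# e.g. record_closed(('c', 'p', 'm', 'a'), 3, hashmaps, 3, 8) sets the
# corresponding first entries to 4, as in the {c, p, m, a} example above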
5 Mining Maximal Frequent Itemsets

The problem of mining maximal frequent itemsets can be viewed as follows: given a minimum support threshold min_sup, find a border through the search space tree such that all the nodes below the border are infrequent and all the nodes above the border are frequent. The goal of maximal frequent itemset mining is to find this border while counting the support of as few itemsets as possible. Existing algorithms for maximal itemset mining [19, 7, 1, 4, 5] adopt various pruning techniques to reduce the search space that has to be explored.

5.1 Pruning techniques

The most effective techniques are based on the following two lemmas, which prune whole branches from the search space tree.

Lemma 3 Given a frequent itemset p, if p ∪ cand_exts(p) is frequent but not maximal, then none of the frequent itemsets mined from Dp or from the conditional databases of p's right siblings can be maximal, because all of them are subsets of p ∪ cand_exts(p).

Lemma 4 Given a frequent itemset p, if p ∪ freq_exts(p) is frequent but not maximal, then none of the frequent itemsets mined from Dp can be maximal, because all of them are subsets of p ∪ freq_exts(p).

Based on Lemma 3, before mining Dp we first check whether p ∪ cand_exts(p) is frequent but not maximal. This can be done with two techniques.

Superset pruning technique: check whether there exists some maximal frequent itemset that is a superset of p ∪ cand_exts(p). As in frequent closed itemset mining, this subset checking can be challenging when the number of maximal itemsets is large; we discuss this issue in the next subsection.

Lookahead technique: check whether p ∪ cand_exts(p) is frequent while counting the frequent items of Dp. If Dp is represented by an AFOPT-tree, the lookahead operation can be accomplished by simply looking at the leftmost branch of the AFOPT-tree: if p ∪ cand_exts(p) is frequent, then the length of the leftmost branch equals |cand_exts(p)| and the support of the leaf node of the leftmost branch is no less than min_sup.

If the superset pruning and lookahead techniques fail, then based on Lemma 4 we can use the superset pruning technique to check whether p ∪ freq_exts(p) is frequent but not maximal. Two other techniques are adopted in our algorithm.

Excluding items that appear in every transaction of Dp from subsequent mining: as in frequent closed itemset mining, if an item i appears in every transaction of Dp, then a frequent itemset q mined from Dp that does not contain i cannot be maximal, because q ∪ {i} is frequent.

Single path trimming: if Dp is represented by an AFOPT-tree whose root has only one child i, then we can append i to p and remove it from subsequent mining.
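The lookahead test on an AFOPT-tree can be sketched as follows. We assume a minimal node layout (item, count, first child, right sibling) with children kept in the item order of Dp, so that the leftmost branch is reached by repeatedly taking the first child; the class and function names are ours.

class AfoptNode:
    def __init__(self, item, count):
        self.item, self.count = item, count
        self.child, self.sibling = None, None   # first child, right sibling

def lookahead(root, cand_exts, min_sup):
    # walk the leftmost branch of Dp's AFOPT-tree; p U cand_exts(p) is
    # frequent when the branch contains |cand_exts(p)| nodes and the count
    # of its leaf is at least min_sup
    node, depth, leaf = root.child, 0, None
    while node is not None:
        depth, leaf = depth + 1, node
        node = node.child
    if leaf is None:
        return len(cand_exts) == 0
    return depth == len(cand_exts) and leaf.count >= min_sup

# if lookahead(...) returns True and p U cand_exts(p) is not subsumed by a
# known maximal itemset, it can be output directly and the whole subtree
# below p pruned (Lemma 3)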
5.2 Subset checking

When doing superset pruning, checking against all maximal frequent itemsets can be costly when the number of maximal itemsets is large. Zaki et al. proposed the progressive focusing technique for subset checking [5]. The observation behind progressive focusing is that only the maximal frequent itemsets containing p can be supersets of p ∪ cand_exts(p) or p ∪ freq_exts(p). The set of maximal frequent itemsets containing p is called the local maximal frequent itemsets with respect to p, denoted LMFIp. When checking whether p ∪ cand_exts(p) or p ∪ freq_exts(p) is a subset of some existing maximal frequent itemset, we therefore only need to check it against LMFIp. The frequent itemsets in LMFIp can come either from the LMFI of p's parent or from the LMFIs of p's left siblings. The construction of LMFIs is very similar to the construction of conditional databases and consists of two steps: (1) projecting: after all frequent items F of Dp have been counted, every s ∈ LMFIp is put into LMFI_{p ∪ {i}}, where i is the first item of F that appears in s; (2) push-right: after all the maximal frequent itemsets containing p have been mined, every s ∈ LMFIp is put into LMFIq, where q is the first right sibling of p that contains an item of s. In our implementation, we use the pseudo-projection technique to generate LMFIs, i.e., LMFIp is a collection of pointers to the maximal itemsets containing p.
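A rough Python sketch of the progressive focusing bookkeeping: LMFIs are kept as lists of indices into a global list of maximal itemsets, mirroring the pseudo-projection mentioned above. Only the projecting step and the subset test against LMFIp are shown; the push-right step, which moves the same references to the next right sibling after p's subtree has been mined, is analogous. The names and the data layout are ours.

def project_lmfi(lmfi_parent, local_order, mfi_store):
    # step (1), projecting: each maximal itemset s in the parent's LMFI is
    # handed to the LMFI of p U {i}, where i is the first item of F (in the
    # local order) that appears in s
    lmfi = {i: [] for i in local_order}
    for ref in lmfi_parent:
        s = mfi_store[ref]
        for i in local_order:
            if i in s:
                lmfi[i].append(ref)
                break
    return lmfi

def subsumed_by_lmfi(candidate, lmfi_p, mfi_store):
    # superset pruning restricted to LMFIp: is the candidate itemset
    # (p U cand_exts(p) or p U freq_exts(p)) contained in a known MFI?
    cand = set(candidate)
    return any(cand <= mfi_store[ref] for ref in lmfi_p)

# mfi_store is a list of frozensets; lmfi_parent and the returned lists hold
# indices into it, so no itemsets are copied during projection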
6 Experimental results

In this section, we compare the performance of our algorithms with other FIM algorithms. All the experiments were conducted on a 1GHz Pentium III with 256MB of memory running Mandrake Linux. Table 4 shows some statistics of the datasets used for the performance study. All the datasets were downloaded from the FIMI'03 workshop web site. The fifth and sixth columns are the maximal and the average transaction length. These statistics give a rough indication of the density of each dataset.

Data Set (output threshold)  Size    #Trans  #Items  MaxTL  AvgTL
T10I4D100k (0.01%)           3.93M   100000  870     30     10.10
T40I10D100k (0.5%)           15.12M  100000  942     78     39.61
BMS-POS (0.05%)              11.62M  515597  1657    165    6.53
BMS-WebView-1 (0.06%)        1.28M   59601   497     267    2.51
chess (45%)                  0.34M   3196    75      37     37.00
connect-4 (75%)              9.11M   67557   129     43     43.00
mushroom (5%)                0.56M   8124    119     23     23.00
pumsb (70%)                  16.30M  49046   2113    74     74.00
Table 4. Datasets

6.1 Mining all frequent itemsets

We compared the efficiency of the AFOPT-all algorithm with the Apriori, DCI, FP-growth, H-Mine and Eclat algorithms. The Apriori and Eclat implementations we used are by Christian Borgelt. DCI was downloaded from its web site. We obtained the source code of FP-growth from its authors. H-Mine was implemented by ourselves; we ran it only on several sparse datasets, since it was designed for sparse datasets and switches to an FP-tree on dense datasets.

Figure 4 shows the running time of all algorithms on the datasets of Table 4. When the minimum support threshold is very low, an intolerable number of frequent itemsets can be generated, so when the minimum support threshold reached some very low value we turned off the output. This minimum support value is called the output threshold; the output thresholds used are given in parentheses after each dataset name in Table 4.

Figure 4. Performance Comparison of FI Mining Algorithms (running time versus minimum support of Apriori, DCI, Eclat, H-Mine, FP-growth and AFOPT on the eight datasets of Table 4; plots not reproduced here)

With a high minimum support threshold, all algorithms showed comparable performance. When the minimum support threshold was lowered, the gaps between the algorithms increased. The two candidate generate-and-test approaches, Apriori and DCI, showed satisfactory performance on several sparse datasets, but took thousands of seconds to terminate on the dense datasets, due to the high cost of generating and testing a large number of candidate itemsets. H-Mine showed performance similar to FP-growth on dataset T10I4D100k, but it was slower than FP-growth on the other three sparse datasets; H-Mine uses the pseudo construction strategy, which cannot reduce traversal cost as effectively as the physical construction strategy. Eclat uses vertical mining techniques: support counting is performed efficiently by joining transaction-id lists, but Eclat does not scale well with respect to the number of transactions in a database. The running time of AFOPT-all was rather stable over all tested datasets, and it outperformed the other algorithms.

6.2 Mining frequent closed itemsets

We compared AFOPT-close with the MAFIA [4] and Apriori algorithms. Both algorithms have an option to generate only closed itemsets; we denote them by MAFIA-close and Apriori-close in the figures. MAFIA was downloaded from its web site. We compared with Apriori-close only on sparse datasets, because Apriori-close requires a very long time to terminate on dense datasets. On several sparse datasets, AFOPT-close and Apriori-close showed comparable performance, and both were orders of magnitude faster than MAFIA-close. MAFIA-close uses a vertical mining technique with bitmaps representing tid-lists. AFOPT-close showed better performance on the tested dense datasets due to its adaptive nature and the efficient subset checking techniques described in Section 4. On dense datasets, AFOPT-close uses a tree structure to store conditional databases; the tree structure has apparent advantages on dense datasets because many transactions share their prefixes.

Figure 6. Performance Comparison of FCI Mining Algorithms (running time versus minimum support of Apriori-close, MAFIA-close and AFOPT-close on the eight datasets of Table 4; plots not reproduced here)

6.3 Mining maximal frequent itemsets

We compared AFOPT-max with the MAFIA and Apriori algorithms. The Apriori algorithm also has an option to produce only maximal frequent itemsets; it is denoted by Apriori-max in the figures, and again we compare with it only on sparse datasets. Apriori-max explores the search space in breadth-first order and finds short frequent itemsets first; maximal frequent itemsets are generated in a post-processing phase. Therefore Apriori-max is infeasible when the number of frequent itemsets is large, even though it adopts some pruning techniques during the mining process. AFOPT-max and MAFIA generate frequent itemsets in depth-first order, so long frequent itemsets are mined first, and all the subsets of a long maximal frequent itemset can be pruned from further consideration using the superset pruning and lookahead techniques. AFOPT-max uses a tree structure to represent dense conditional databases. The AFOPT-tree provides more pruning capability than a tid-list or a tid-bitmap: for example, if a conditional database can be represented by a single branch of an AFOPT-tree, then that single branch is the only possible maximal itemset in the conditional database. AFOPT-max also benefits from the progressive focusing technique for superset pruning. MAFIA was very efficient on the small datasets, e.g. chess and mushroom, where the bitmaps are short.

Figure 7. Performance Comparison of MFI Mining Algorithms (running time versus minimum support of Apriori-max, MAFIA-max and AFOPT-max on the eight datasets of Table 4; plots not reproduced here)

6.4 Scalability

We studied the scalability of our algorithm by varying the parameters of the IBM synthetic data generator along two dimensions: the number of transactions was varied from 200k to 1000k, and the average transaction length was varied from 10 to 50; the default values of these two parameters were 1000k and 40, respectively. We compared our algorithm with DCI; the other algorithms took a long time to finish on the large datasets, so we excluded them from this comparison. Figure 5 shows the results when varying the two parameters.

Figure 5. Scalability Study (running time of DCI and AFOPT at minimum support 0.1%, (a) varying the number of transactions, (b) varying the average transaction length; plots not reproduced here)

7 Conclusions

In this paper, we revisited the frequent itemset mining problem and focused on investigating the algorithmic performance space of the pattern growth approach. We identified four dimensions in which existing pattern growth algorithms differ: (1) item search order: static lexicographic order or dynamic ascending frequency order; (2) conditional database representation: tree-based or array-based structure; (3) conditional database construction strategy: physical or pseudo construction; and (4) tree traversal strategy: bottom-up or top-down. Existing algorithms adopt different strategies along these four dimensions in order to reduce the total number of conditional databases and the mining cost of each individual conditional database. We described an efficient pattern growth algorithm, AFOPT, which adaptively uses three different structures (arrays, AFOPT-trees and buckets) to represent conditional databases according to the density of each conditional database. Several parameters were introduced to control which structure should be used for a specific conditional database.
We showed that the adaptive conditional database representation strategy requires less space than using an array-based or a tree-based structure alone. We also extended the AFOPT algorithm to mine closed and maximal frequent itemsets, and described how to incorporate pruning techniques into the AFOPT framework. Efficient subset checking techniques for both closed and maximal frequent itemset mining were presented. A set of experiments was conducted to show the efficiency of the proposed algorithms.

References
[1] R. Agrawal, C. Aggarwal, and V. Prasad. Depth first generation of long patterns. In SIGKDD, 2000.
[2] R. Agrawal, T. Imielinski, and A. N. Swami. Mining association rules between sets of items in large databases. In SIGMOD, 1993.
[3] S. Brin, R. Motwani, J. D. Ullman, and S. Tsur. Dynamic itemset counting and implication rules for market basket data. In SIGMOD, 1997.
[4] D. Burdick, M. Calimlim, and J. Gehrke. MAFIA: A maximal frequent itemset algorithm for transactional databases. In ICDE, 2001.
[5] K. Gouda and M. J. Zaki. GenMax: Efficiently mining maximal frequent itemsets. In ICDM, 2001.
[6] J. Han, J. Pei, and Y. Yin. Mining frequent patterns without candidate generation. In SIGMOD, 2000.
[7] R. J. Bayardo, Jr. Efficiently mining long patterns from databases. In SIGMOD, 1998.
[8] G. Liu, H. Lu, W. Lou, and J. X. Yu. On computing, storing and querying frequent patterns. In SIGKDD, 2003.
[9] G. Liu, H. Lu, Y. Xu, and J. X. Yu. Ascending frequency ordered prefix-tree: Efficient mining of frequent patterns. In DASFAA, 2003.
[10] J. Liu, Y. Pan, K. Wang, and J. Han. Mining frequent item sets by opportunistic projection. In SIGKDD, 2002.
[11] N. Pasquier, Y. Bastide, R. Taouil, and L. Lakhal. Discovering frequent closed itemsets for association rules. In ICDT, 1999.
[12] J. Pei, J. Han, H. Lu, S. Nishio, S. Tang, and D. Yang. H-Mine: Hyper-structure mining of frequent patterns in large databases. In ICDM, 2001.
[13] J. Pei, J. Han, and R. Mao. CLOSET: An efficient algorithm for mining frequent closed itemsets. In DMKD, 2000.
[14] R. Raymon. Search through systematic set enumeration. In Proc. of KR Conf., 1992.
[15] R. C. Agarwal, C. C. Aggarwal, and V. V. V. Prasad. A tree projection algorithm for finding frequent itemsets. Journal on Parallel and Distributed Computing, 61(3):350-371, 2001.
[16] J. Wang, J. Pei, and J. Han. CLOSET+: Searching for the best strategies for mining frequent closed itemsets. In SIGKDD, 2003.
[17] Y. Xu, J. X. Yu, G. Liu, and H. Lu. From path tree to frequent patterns: A framework for mining frequent patterns. In ICDM, pages 514-521, 2002.
[18] M. J. Zaki and C. Hsiao. CHARM: An efficient algorithm for closed itemset mining. In SDM, 2002.
[19] M. J. Zaki, S. Parthasarathy, M. Ogihara, and W. Li. New algorithms for fast discovery of association rules. In SIGKDD, 1997.

Efficiently Using Prefix-trees in Mining Frequent Itemsets

Gösta Grahne and Jianfei Zhu
Concordia University
Montreal, Canada
{grahne, j zhu}@cs.concordia.ca

Abstract

Efficient algorithms for mining frequent itemsets are crucial for mining association rules. Methods for mining frequent itemsets and for iceberg data cube computation have been implemented using a prefix-tree structure, known as an FP-tree, for storing compressed information about frequent itemsets. Numerous experimental results have demonstrated that these algorithms perform extremely well.
In this paper we present a novel array-based technique that greatly reduces the need to traverse FP-trees, thus obtaining significantly improved performance for FP-tree based algorithms. Our technique works especially well for sparse datasets. Furthermore, we present new algorithms for a number of common data mining problems. Our algorithms use the FP-tree data structure efficiently in combination with our array technique, and incorporate various optimization techniques. We also present experimental results which show that our methods outperform not only the existing methods that use the FP-tree structure, but also all other available algorithms, on all the common data mining problems considered.

1. Introduction

A fundamental problem for mining association rules is to mine frequent itemsets (FI's). In a transaction database, if we know the support of all frequent itemsets, the generation of association rules is straightforward. However, when a transaction database contains a large number of large frequent itemsets, mining all frequent itemsets might not be a good idea. For example, if there is a frequent itemset of size ℓ, then all 2^ℓ nonempty subsets of that itemset have to be generated. Thus, a lot of work has focused on discovering only the maximal frequent itemsets (MFI's). Unfortunately, mining only MFI's has the following deficiency: from an MFI and its support s, we know that all its subsets are frequent and that the support of any of its subsets is not less than s, but we do not know the exact values of those supports. To solve this problem, another type of frequent itemset, the closed frequent itemset (CFI), has been proposed. In most cases the number of CFI's is greater than the number of MFI's, but still far less than the number of FI's.

In this work we mine FI's, MFI's and CFI's by efficiently using the FP-tree, the data structure that was first introduced in [6]. The FP-tree has been shown to be one of the most efficient data structures for mining frequent patterns and for "iceberg" data cube computations [6, 7, 9, 8]. The most important contribution of our work is a novel technique that uses an array to greatly improve the performance of the algorithms operating on FP-trees. We first demonstrate that the use of our array-based technique drastically speeds up the FP-growth method, since it now needs to scan each FP-tree only once for each recursive call emanating from it. We then use this technique in a new algorithm FPmax*, which extends our previous algorithm FPmax, for mining maximal frequent itemsets. In FPmax*, we use a variant of the FP-tree structure for subset testing, and give a number of optimizations that further reduce the running time. We also present an algorithm, FPclose, for mining closed frequent itemsets. FPclose uses yet another variation of the FP-tree structure for checking the closedness of frequent itemsets. Finally, we present experimental results which demonstrate that all of our FP-algorithms outperform previously known algorithms practically always.

The remainder of the paper is organized as follows. In Section 2, we briefly review the FP-growth method and present our novel array technique, which results in the greatly improved method FPgrowth*. Section 3 gives algorithm FPmax*, an extension of our previous algorithm FPmax, for mining MFI's; here we also introduce our approach to the subset testing needed in mining MFI's and CFI's. In Section 4 we give algorithm FPclose, for mining CFI's. Experimental results are given in Section 5.
Section 6 concludes and outlines directions of future research.

2. Discovering FI's

2.1. The FP-tree and FP-growth method

The FP-growth method by Han et al. [6] uses a data structure called the FP-tree (Frequent Pattern tree). The FP-tree is a compact representation of all relevant frequency information in a database. Every branch of the FP-tree represents a frequent itemset, and the nodes along the branches are stored in decreasing order of frequency of the corresponding items, with leaves representing the least frequent items. Compression is achieved by building the tree in such a way that overlapping itemsets share prefixes of the corresponding branches.

The FP-tree has a header table associated with it. Single items and their counts are stored in the header table in decreasing order of their frequency. The entry for an item also contains the head of a list that links together all the corresponding nodes of the FP-tree.

Compared with Apriori [1] and its variants, which need several database scans, the FP-growth method needs only two database scans when mining all frequent itemsets. The first scan counts the number of occurrences of each item. The second scan constructs the initial FP-tree, which contains all frequency information of the original dataset. Mining the database then becomes mining the FP-tree.

Figure 1. An Example FP-tree (minsup = 20%) ((a) the example database with transactions abcefo, acg, ei, acdeg, acegl, ej, abcefp, acd, acegm, acegn; (b) the corresponding FP-tree, whose header table contains e:8, c:8, a:8, g:5, b:2, f:2, d:2; the tree drawing is not reproduced here)

To construct the FP-tree, first find all frequent items by an initial scan of the database. Then insert these items into the header table, in decreasing order of their count. In the next (and last) scan, as each transaction is scanned, the set of frequent items in it is inserted into the FP-tree as a branch. If an itemset shares a prefix with an itemset already in the tree, the new itemset shares a prefix of the branch representing that itemset. In addition, a counter is associated with each node in the tree. The counter stores the number of transactions containing the itemset represented by the path from the root to the node in question. This counter is updated during the second scan, when a transaction causes the insertion of a new branch. Figure 1(a) shows an example of a database and Figure 1(b) the FP-tree for that database. Note that there may be more than one node corresponding to an item in the FP-tree. The frequency of any one item i is the sum of the counts associated with all nodes representing i, and the frequency of an itemset equals the sum of the counts of the least frequent item in it, restricted to those branches that contain the itemset. For instance, from Figure 1(b) we can see that the frequency of the itemset {c, a, g} is 5. Thus the constructed FP-tree contains all frequency information of the database, and mining the database becomes mining the FP-tree.

The FP-growth method relies on the following principle: if X and Y are two itemsets, the count of itemset X ∪ Y in the database is exactly the count of Y in the restriction of the database to those transactions containing X. This restriction of the database is called the conditional pattern base of X, and the FP-tree constructed from the conditional pattern base is called X's conditional FP-tree, which we denote by TX. We can view the FP-tree constructed from the initial database as T∅, the conditional FP-tree for ∅.
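A minimal Python sketch of the two-scan FP-tree construction just described, assuming the database fits in memory; the header table simply keeps a list of nodes per item instead of threaded node-links, and all names are ours.

from collections import Counter

class FPNode:
    def __init__(self, item, parent):
        self.item, self.count, self.parent = item, 1, parent
        self.children = {}                     # item -> FPNode

def build_fptree(transactions, min_count):
    # first scan: count single items and build the header table in
    # decreasing order of count
    counts = Counter(i for t in transactions for i in set(t))
    freq = {i: c for i, c in counts.items() if c >= min_count}
    header = {i: [] for i in sorted(freq, key=freq.get, reverse=True)}
    root = FPNode(None, None)
    # second scan: insert the frequent items of each transaction as a
    # branch, sharing prefixes with branches already in the tree
    for t in transactions:
        items = sorted((i for i in set(t) if i in freq), key=lambda i: -freq[i])
        node = root
        for i in items:
            child = node.children.get(i)
            if child is None:
                child = FPNode(i, node)
                node.children[i] = child
                header[i].append(child)        # stands in for the node-links
            else:
                child.count += 1
            node = child
    return root, header

# the database of Figure 1(a) with minsup = 20% of 10 transactions
db = ["abcefo", "acg", "ei", "acdeg", "acegl",
      "ej", "abcefp", "acd", "acegm", "acegn"]
root, header = build_fptree([list(t) for t in db], 2)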
Note that for any itemset Y that is frequent in the conditional pattern base of X, the set X ∪Y is a frequent itemset for the original database. Given an item i in the header table of an FP-tree TX , by following the linked list starting at i in the header table of TX , all branches that contain item i are visited. These branches form the conditional pattern base of X ∪ {i}, so the traversal obtains all frequent items in this conditional pattern base. The FP-growth method then constructs the conditional FP-tree TX∪{i} , by first initializing its header table based on the found frequent items, and then visiting the branches of TX along the linked list of i one more time and inserting the corresponding itemsets in TX∪{i} . Note that the order of items can be different in TX and TX∪{i} . The above procedure is applied recursively, and it stops when the resulting new FP-tree contains only one single path. The complete set of frequent itemsets is generated from all single-path FP-trees. 2.2. An array technique The main work done in the FP-growth method is traversing FP-trees and constructing new conditional FP-trees after the first FP-tree is constructed from the original database. From numerous experiments we found out that about 80% of the CPU time was used for traversing FP-trees. Thus, the question is, can we reduce the traversal time so that the method can be sped up? The answer is yes, by using a simple additional data structure. Recall that for each item i in the header of a conditional FP-tree TX , two traversals of TX are needed for constructing the new conditional FP-tree TX∪{i} . The first traversal finds all frequent items in the conditional pattern base of X ∪ {i}, and initializes the FP-tree TX∪{i} by constructing its header table. The second traversal constructs the new tree TX∪{i} . We can omit the first scan of TX by constructing an array AX while building TX . The following example will explain the idea. In Figure 1 (a), supposing that the minimum support is 20%, after the first scan of the original database, we sort the frequent items as e:8, c:8, a:8, g:5, b:2, f :2, d:2. This order is also the order of items in the header table of T∅ . During the second scan of the database we will construct T∅ , and an array A∅ . This array will store the counts of all 2-itemsets. All cells in the array are initialized as 0. Figure 2. Two array examples In A∅ , each cell is a counter of a 2-itemset, cell A∅ [d, e] is the counter for itemset {d, e}, cell A∅ [d, c] is the counter for itemset {d, c}, and so forth. During the second scan for constructing T∅ , for each transaction, first all frequent items in the transaction are extracted. Suppose these items form itemset I. To insert I into T∅ , the items in I are sorted according to the order in header table of T∅ . When we insert I into T∅ , at the same time A∅ [i, j] is incremented by 1 if {i, j} is contained in I. For example, for the first transaction, {a, b, c, e, f } is extracted (item o is infrequent) and sorted as e, c, a, b, f . This itemset is inserted into T∅ as usual, and at the same time, A∅ [f, e], A∅ [f, c], A∅ [f, a], A∅ [f, b], A∅ [b, a], A∅ [b, c], A∅ [b, e], A∅ [a, e], A∅ [a, c], A∅ [c, e] are all incremented by 1. After the second scan, array A∅ keeps the counts of all pairs of frequent items, as shown in table (a) of Figure 2. Next, the FP-growth method is recursively called to mine frequent itemsets for each item in header table of T∅ . 
However, now for each item i, instead of traversing T∅ along the linked list starting at i to get all frequent items in i’s conditional pattern base, A∅ gives all frequent items for i. For example, by checking the third line in the table for A∅ , frequent items e, c, a for the conditional pattern base of g can be obtained. Sorting them according to their counts, we get a, c, e. Therefore, for each item i in T∅ the array A∅ makes the first traversal of T∅ unnecessary, and T{i} can be initialized directly from A∅ . For the same reason, from a conditional FP-tree TX , when we construct a new conditional FP-tree for X ∪ {i}, for an item i, a new array AX∪{i} is calculated. During the construction of the new FP-tree TX∪{i} , the array AX∪{i} is filled. For instance, in Figure 1, the cells of array A{g} is shown in table (b) of Figure 2. This array is constructed as follows. From the array A∅ , we know that the frequent items in the conditional pattern base of {g} are, in order, a, c, e. By following the linked list of g, from the first node we get {e, c, a} : 4, so it is inserted as (a : 4, c : 4, e : 4) into the new FP-tree T{g} . At the same time, A{g} [e, c], A{g} [e, a] and A{g} [c, a] are incremented by 4. From the second node in the linked list, {c, a} : 1 is extracted, and it is inserted as (a : 1, c : 1) into T{g} . At the same time, A{g} [c, a] is incremented by 1. Since there are no other nodes in the linked list, the construction of T{g} is finished, and array A{g} is ready to be used for construction of FP-trees in next level of recursion. The construction of arrays and FP-trees continues until the FP-growth method terminates. Based on above discussion, we define a variation of the FP-tree structure in which besides all attributes given in [6], an FP-tree also has an attribute, array, which contains the corresponding array. Now let us analyze the size of an array. Suppose the number of frequent items in the first FP-tree is n. Then Pn−1 the size of the associated array is i=1 i = n(n − 1)/2. We can expect that FP-trees constructed from the first FPtree have fewer frequent items, so the sizes of the associated arrays decrease. At any time, since an array is an attribute of an FP-tree, when the space for the FP-tree is freed, the space for the array is also freed. 2.3. Discussion The array technique works very well especially when the dataset is sparse. The FP-tree for a sparse dataset and the recursively constructed FP-trees will be big and bushy, due to the fact that they do not have many shared common prefixes. The arrays save traversal time for all items and the next level FP-trees can be initialized directly. In this case, the time saved by omitting the first traversals is far greater than the time needed for accumulating counts in the associated array. However, when a dataset is dense, the FP-trees are more compact. For each item in a compact FP-tree, the traversal is fairly rapid, while accumulating counts in the associated array may take more time. In this case, accumulating counts may not be a good idea. Even for the FP-trees of sparse datasets, the first levels of recursively constructed FP-trees are always conditional FPtrees for the most common prefixes. We can therefore expect the traversal times for the first items in a header table to be fairly short, so the cells for these first items are unnecessary in the array. 
As an example, in Figure 2 table (a), since e, c, and a are the first 3 items in the header table, the first two lines do not have to be calculated, thus saving counting time.

Note that the datasets (the conditional pattern bases) change during the different depths of the recursion. In order to estimate whether a dataset is sparse or dense, during the construction of each FP-tree we count the number of nodes in each level of the tree. Based on experiments, we found that if the upper quarter of the tree contains fewer than 15% of the total number of nodes, we are most likely dealing with a dense dataset; otherwise the dataset is likely to be sparse. If the dataset appears to be dense, we do not calculate the array for the next level of the FP-tree. Otherwise, we calculate the array for each FP-tree in the next level, but the cells for the first several (say 5) items in its header table are not set.

2.4. FPgrowth*: an improved FP-growth method

Figure 3 contains the pseudocode for our new method FPgrowth*. The procedure has an FP-tree T as parameter. The tree has attributes base, header and array: T.base contains the itemset X for which T is a conditional FP-tree, the attribute header contains the header table, and T.array contains the array AX.

Procedure FPgrowth*(T)
Input: A conditional FP-tree T
Output: The complete set of FI's corresponding to T
Method:
1. if T only contains a single path P
2.   then for each subpath Y of P
3.     output pattern Y ∪ T.base with count = smallest count of the nodes in Y
4.   else for each i in T.header
5.     output Y = T.base ∪ {i} with i.count
6.     if T.array is not NULL
7.       construct a new header table for Y's FP-tree from T.array
8.     else construct a new header table from T;
9.     construct Y's conditional FP-tree TY and its array AY;
10.    if TY ≠ ∅
11.      call FPgrowth*(TY);
Figure 3. Algorithm FPgrowth*

In FPgrowth*, line 6 tests whether the array of the current FP-tree is NULL. If the FP-tree corresponds to a sparse dataset, its array is not NULL, and line 7 is used to construct the header table of the new conditional FP-tree directly from the array; one FP-tree traversal is saved for this item compared with the FP-growth method in [6]. In line 9, during the construction, we also count the nodes in the different levels of the tree, in order to estimate whether we shall really calculate the array, or just set TY.array = NULL.

From our experimental results we found that an FP-tree can have millions of nodes, so allocating and deallocating those nodes one by one takes plenty of time. In our implementation, we used our own main memory management for allocating and deallocating nodes. Since all memory for the nodes of an FP-tree is deallocated after the current recursion ends, a chunk of memory is allocated for each FP-tree when we create the tree. The chunk size is changeable. After generating all frequent itemsets from the FP-tree, the chunk is discarded. Thus we avoid freeing the nodes of the FP-tree one by one, which would be more time-consuming.

3. FPmax*: Mining MFI's

In [5] we developed FPmax, a variation of the FP-growth method, for mining maximal frequent itemsets. Since the array technique speeds up the FP-growth method for sparse datasets, we can expect that it will be useful in FPmax too. This gives us an improved method, FPmax*. Compared to FPmax, the improved method FPmax* also has a more efficient subset test, as well as some other optimizations. It turns out that FPmax* outperforms GenMax [4] and MAFIA [3] for all cases we discussed in [5].
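Since FPmax* reuses the array technique of Section 2.2, a small sketch of it may be helpful before turning to the MFI-tree. A dictionary keyed by item pairs stands in for the triangular array AX, the transactions are assumed to be already restricted to frequent items and sorted in header-table order, and the names are ours.

from collections import Counter
from itertools import combinations

def accumulate_pairs(sorted_transactions):
    # A_X: while the transactions are inserted into the FP-tree T_X, count
    # every 2-itemset they contain
    array = Counter()
    for items in sorted_transactions:
        for x, y in combinations(items, 2):
            array[(x, y)] += 1
    return array

def frequent_items_for(i, array, min_count):
    # instead of a first traversal of T_X along the node-links of i, read
    # the frequent items of i's conditional pattern base directly from A_X
    items = Counter()
    for (x, y), c in array.items():
        if x == i:
            items[y] += c
        elif y == i:
            items[x] += c
    return {j: c for j, c in items.items() if c >= min_count}

# with A accumulated from the sorted transactions of Figure 1(a),
# frequent_items_for('g', A, 2) yields {'e': 4, 'c': 5, 'a': 5}, matching the
# conditional pattern base of g discussed in Section 2.2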
3.1. The MFI-Tree

Since FPmax is a depth-first algorithm, a frequent itemset can be a subset only of an already discovered MFI. In FPmax we introduced a global data structure, the Maximal Frequent Itemset tree (MFI-tree), to keep track of MFI's. A newly discovered frequent itemset is inserted into the MFI-tree, unless it is a subset of an itemset already in the tree. However, for large datasets the MFI-tree becomes quite large, and sometimes one itemset needs thousands of comparisons for subset testing. Inspired by the way subset checking is done in [4], in FPmax* we still use the MFI-tree structure, but for each conditional FP-tree TX a small MFI-tree MX is created. The tree MX contains all maximal itemsets in the conditional pattern base of X. To see whether a local MFI Y generated from a conditional FP-tree TX is maximal, we only need to compare Y with the itemsets in MX. This achieves a significant speedup of FPmax.

Each MFI-tree is associated with a particular FP-tree. Children of the root of the MFI-tree are item prefix subtrees. In an MFI-tree, each node has three fields: item-name, level and node-link. The level field is useful for subset testing. All nodes with the same item-name are linked together, as in an FP-tree. The MFI-tree also has a header table. However, unlike the header table of an FP-tree, which is constructed by traversing the previous FP-tree or by using the associated array, the header table of an MFI-tree is constructed based on the item order in the header table of the FP-tree it is associated with. Each entry in the header table consists of two fields, the item-name and the head of a linked list; the head points to the first node with that item-name in the MFI-tree.

Figure 4. Construction of MFI-Tree (two snapshots of the MFI-tree associated with the FP-tree of Figure 1(b); a node x:ℓ denotes a node for item x at level ℓ; the tree drawings are not reproduced here)

The insertion of an MFI into an MFI-tree is similar to the insertion of a frequent itemset into an FP-tree. Figure 4 shows the insertion of three MFI's into the MFI-tree associated with the FP-tree in Figure 1(b). In Figure 4, a node x:ℓ means that the node is for item x and its level is ℓ. Figure 4(a) shows the tree after (c, a, d) and (e, c, a, b, f) have been inserted. In Figure 4(b), since the new MFI (e, c, a, g) shares the prefix (e, c, a) with (e, c, a, b, f), only one new node, for g, is inserted.

3.2. FPmax*

Figure 5 gives algorithm FPmax*. The first call is for the FP-tree constructed from the original database, together with an empty MFI-tree. Before a recursive call FPmax*(T, M), we already know from line 10 that the set containing T.base and the items in the current FP-tree is not a subset of any existing MFI. During the recursion, if there is only a single path in T, this single path together with T.base is an MFI of the database; in line 2, the MFI is inserted into M. If the FP-tree is not a single-path tree, then for each item i in the header table we prepare the recursive call FPmax*(TY, MY), for Y = T.base ∪ {i}. The items in the header table of T are processed in increasing order of frequency, so that maximal frequent itemsets are found before any of their frequent subsets. Lines 5 to 8 use the array technique, and line 10 calls the function subset_checking to check whether Y together with all frequent items in Y's conditional pattern base is a subset of any existing MFI in M (thus we do superset pruning here).
If subset_checking returns false, FPmax* is called recursively with (TY, MY). The implementation of the function subset_checking will be explained shortly.

Note that before and after calling subset_checking, if Y ∪ Tail is not a subset of any MFI, we still do not know whether Y ∪ Tail is frequent. If, by constructing Y's conditional FP-tree TY, we find out that TY has only a single path, we can conclude that Y ∪ Tail is frequent. Since Y ∪ Tail was not a subset of any previously discovered MFI, it is a new MFI and is inserted into MY.

Procedure FPmax*(T, M)
Input: T, an FP-tree; M, the MFI-tree for T.base
Output: Updated M
Method:
1. if T only contains a single path P
2.   insert P into M
3. else for each i in T.header
4.   set Y = T.base ∪ {i};
5.   if T.array is not NULL
6.     Tail = {frequent items for i in T.array}
7.   else
8.     Tail = {frequent items in i's conditional pattern base}
9.   sort Tail in decreasing order of the items' counts
10.  if not subset_checking(Y ∪ Tail, M)
11.    construct Y's conditional FP-tree TY and its array AY;
12.    initialize Y's conditional MFI-tree MY;
13.    call FPmax*(TY, MY);
14.    merge MY with M
Figure 5. Algorithm FPmax*

3.3. Implementation of subset testing

The function subset_checking works as follows. Suppose Tail = i1, i2, ..., ik, in decreasing order of frequency according to the header table of M. By following the linked list of ik, for each node n in the list we test whether Tail is a subset of the ancestors of n. Here, the level of n can be used to save comparison time: first we test whether the level of n is smaller than k; if it is, the comparison stops, because there are not enough ancestors of n to match the rest of Tail. This pruning technique is also applied as we move up the branch and towards the front of Tail.

Unlike an FP-tree, which is not changed during the execution of the algorithm, an MFI-tree is dynamic. At line 12, for each Y, a new MFI-tree MY is initialized from the predecessor MFI-tree M. Then, after the recursive call, M is updated on line 14 to contain all newly found frequent itemsets. In the actual implementation we found that it was more efficient to update all MFI-trees along the recursive path, instead of merging only at the current level. In other words, we omitted line 14 and instead, on line 2, P is inserted into the current M and also into all predecessor MFI-trees that the implementation of the recursion needs to keep in main memory in any case.

Since FPmax* is a depth-first algorithm, it is straightforward to show that the above subset checking is correct. Based on the correctness of the FP-growth method, we can conclude that FPmax* returns all and only the maximal frequent itemsets in a given dataset.

3.4. Optimizations

In the method FPmax*, one more optimization is used. Suppose that at some level of the recursion the header table of the current FP-tree is i1, i2, ..., im. Then, starting from im, for each item in the header table we may need to do the work of lines 4 to 14. If for some item, say ik with k ≤ m, its maximal frequent itemset contains the items i1, i2, ..., ik−1, i.e., all the items that have not yet been given a recursive call of FPmax*, then these recursive calls can be omitted. This is because for those items their Tails must be subsets of {i1, i2, ..., ik−1}, so subset_checking(Y ∪ Tail) would always return true. FPmax* also uses the memory management described in Section 2.4 for allocating and deallocating space for FP-trees and MFI-trees.

3.5. Discussion

One may wonder whether the space required for all the MFI-trees of a recursive branch is too large. Actually, before the first call of FPmax*, the first FP-tree has to fit in main memory; this is also required by the FP-growth method. The corresponding MFI-tree is initialized as empty. During recursive calls of FPmax*, new conditional FP-trees are constructed from the first FP-tree or from an ancestor's FP-tree. From the experience of [6], we know that the recursively constructed FP-trees are relatively small, and we can expect that the total size of these FP-trees is not greater than the final size of the MFI-tree for ∅. Similarly, the MFI-trees constructed from ancestors are also small. All MFI-trees grow gradually. Thus we can conclude that the total main memory requirement for running FPmax* on a dataset is proportional to the sum of the sizes of the FP-tree and the MFI-tree for ∅.

4. FPclose: Mining CFI's

For mining frequent closed itemsets, FPclose works similarly to FPmax*: both mine frequent patterns from FP-trees. Whereas FPmax* needs to check that a newly found frequent itemset is maximal, FPclose needs to verify that the new frequent itemset is closed. For this we use a CFI-tree, which is another variation of an FP-tree. One of the first attempts to use FP-trees in CFI mining was the algorithm CLOSET+ [9]. This algorithm uses one global prefix-tree for keeping track of all closed itemsets. As we pointed out before, one global tree will be quite big and thus slows down searches. In FPclose we therefore use multiple, conditional CFI-trees for checking the closedness of itemsets. We can thus expect that FPclose outperforms CLOSET+.

4.1. The CFI-tree and algorithm FPclose

Similar to an MFI-tree, a CFI-tree is related to an FP-tree and an itemset X, and we denote the CFI-tree by CX. The CFI-tree CX always stores all already found CFI's containing itemset X, together with their counts. A newly found frequent itemset Y that contains X only needs to be compared with the CFI's in CX: if CX contains no superset of Y with the same count as Y, then Y is closed. In a CFI-tree, each node has four fields: item-name, count, node-link and level. Here, the count field is needed because, when comparing Y with a set Z in the tree, we are trying to verify that it is not the case that Y ⊂ Z with Y and Z having the same count. The order of the items in a CFI-tree's header table is the same as the order of the items in the header table of its corresponding FP-tree.

The insertion of a CFI into a CFI-tree is similar to the insertion of a transaction into an FP-tree, except that now the count of a node is not incremented; it is always replaced by the maximal count seen so far. Figure 6 shows some snapshots of the construction of a CFI-tree with respect to the FP-tree in Figure 1(b). The item order in the two trees is the same because they are both for base ∅. Note that insertions of CFI's into the top-level CFI-tree occur only after recursive calls have been made; in the following example, the insertions would in actuality be performed during various stages of the execution, not in bulk as the example might suggest. In Figure 6, a node x:ℓ:c means that the node is for item x, its level is ℓ and its count is c.

Figure 6. Construction of CFI-Tree (two snapshots of the CFI-tree associated with the FP-tree of Figure 1(b); the tree drawings are not reproduced here)
In Figure 6 (a), after inserting (c, a, d) and (e, c, a, b, f ) with count 2, then we insert (c, a, g) with count 5. Since (c, a, g) shares the prefix (c, a) with (c, a, d), only node g is appended, and at the same time, the counts for nodes c and a are both changed to be 5. In part (b) of Figure 6, the CFI’s (e, c, a, g) : 4, (c, a) : 8, (c, a, e) : 6 and (e) : 8 are inserted. At this stage the tree contains all CFI’s for the dataset in Figure 1 (a). Procedure FPclose(T, C) Input: T , an FP-tree C, the CFI-tree for T.base Output: Updated C Method: 1. if T only contains a single path P 2. generate all CFI’s from P 3. for each CFI X generated 4. if not closed checking(X, C) 5. insert X into C 6. else for each i in T.header 7. set Y = T.base ∪ {i}; 8. if not closed checking(Y, C) 9. if T.array is not NULL 10. Tail = {frequent items for i in T.array} 11. else 12. Tail={frequent items in i’s conditional pattern base} 13. sort Tail in decreasing order of items’ counts 14. construct the FP-tree TY and its array AY ; 15. initialize Y ’s conditional CFI-tree CY ; 16. call FPclose(TY , CY ); 17. merge CY with C Figure 7. Algorithm FPclose Figure 7 gives algorithm FPclose. Before calling FPclose with some (T, C), we already know from line 8 that there is no existing CFI X such that T.base ⊂ X, and T.base and X have the same count. If there is only one single path in T , the nodes and their counts in this single path can be easily used to list the T.base-local closed frequent itemsets. These itemsets will be compared with the CFI’s in C. If an itemset is closed, it is inserted into C. If the FP-tree T is not a single-path tree, we execute line 6. Lines 9 to 12 use the array technique. Lines 4 and 8 call function closed checking(Y, C) to check if a frequent itemset Y is closed. If it is, the function returns true, otherwise, false is returned. Lines 14 and 15 construct Y ’s conditional FP-tree and CFI-tree. Then FPclose is called recursively for TY and CY . Note that line 17 is not implemented as such. As in algorithm FPmax*, we found it more efficient to do the insertion of lines 3–5 into all CFI-trees currently in main memory. CFI-trees are initialized similarly to MFI-trees, described in Section 3.3. The implementation of function closed checking is almost the same as the implementation of function subset checking, except now we also consider the count of an itemset. Given an itemset Y = {i1 , i2 , . . . , ik } with count c, suppose the order of the items in header table of the current CFI-tree is i1 , i2 , . . . , ik . Following the linked list of ik , for each node in the list, first we check if its count is equal to or greater than c. If it is, we then test if Y is a subset of the ancestors of that node. The function closed checking returns true only when there is no existing CFI Z in the CFI-tree such that Z is a superset of Y and the count of Y is equal to or greater than the count of Z. Memory management allocating and deallocating space for FP-trees and CFI-trees is similar to the memory management of FPgrowth* and FPmax*. By a similar reasoning as in Section 3.5, we conclude that the total main memory requirement for running FPclose on a dataset is approximately sum of the size of the first FP-tree and its CFI-tree. 5. Experimental Evaluation We now present a performance comparison of our FPalgorithms with algorithms dEclat, GenMax, CHARM and MAFIA. Algorithm dEclat is a depth-first search algorithm proposed by Zaki and Gouda in [10]. 
dEclat uses a linked list to organize frequent patterns, however, each itemset now corresponds to an array of transaction IDs (the “TIDarray”). Each element in the array corresponds to a transaction that contains the itemset. Frequent itemset mining and candidate frequent itemset generation are done by TIDarray intersections. A technique called diffset, is used for reducing the memory requirement of TID-arrays. The diffset technique only keeps track of differences in the TID’s of a candidate itemsets when it is generating frequent itemsets. GenMax, also proposed by Gouda and Zaki [4], takes an approach called progressive focusing to do maximality testing. CHARM is proposed by Zaki and Hsiao [11] for CFI mining. In all three algorithms, the main operation is the intersection of TID-arrays. Each of them has been shown as one of the best algorithms for mining FI’s, MFI’s or CFI’s. MAFIA is introduced in [3] by Burdick et al. for mining maximal frequent itemsets. It also has options for mining FI’s and CFI’s. We give the results of three different sets of experiments, one set for FI’s, one for MFI’s and one for CFI’s. The source codes for dEclat, CHARM, GenMax and MAFIA were provided by their authors. We ran all algorithms on many synthetic and real datasets. Due to the lack of space, only the results for two synthetic datasets and two real datasets are shown here. These datasets should be representative, as recent research papers [2, 3, 4, 11, 10, 8, 9], use these or similar datasets. The two synthetic datasets, T40I10D100K and T100I20D100K, were generated from the application on the website of IBM 1 . They both use 100,000 transactions and 1000 items. The two real datasets, pumsb* and connect-4, were also downloaded from the IBM website 2 . Dataset connect-4 is compiled from game state information. Dataset pumsb* is produced from census data of Public Use Microdata Sample (PUMS). These two real datasets are both quite dense, so a large number of frequent itemsets can be mined even for very high values of minimum support. All experiments were performed on a 1Ghz Pentium III with 512 MB of memory running RedHat Linux 7.3. All times in the figures refer to CPU time. 5.1. FI Mining In [6], the original FPgrowth method has been shown to be an efficient and scalable algorithm for mining frequent itemsets. FPgrowth is about an order of magnitude faster than the Apriori. Subsequently, it was shown in [10], that the algorithm dEclat outperforms FPgrowth on most datasets. Thus, in the first set of experiments, FP-growth* is compared with the original FP-growth method and with dEclat. The original FP-growth method is implemented on the basis of the paper [6]. In this set of experiments we also included with MAFIA [3], which has an option for mining all FI’s. The results of the first set of experiments are shown in Figure 8. Figure 8 (a) shows the CPU time of the four algorithms running on dataset T40I10D100K. We see that FPgrowth* is the best algorithm for this dataset. It outperforms dEclat and MAFIA at least by a factor of two. Main memory is used up by dEclat when the minimum support goes down to 0.25%, while FPgrowth* can still run for even smaller levels of minimum support. MAFIA is the slowest algorithm for this dataset and its CPU time increases rapidly. Due to the use of the array technique, and the fact that T40I10D100K is a sparse dataset, FPgrowth* turns out to be faster than FPgrowth. 
However, when the minimum support is very low, we can expect the FP-tree to achieve good compression already at the initial recursion level, so the array technique does not offer a big gain. Consequently, as verified in Figure 8 (a), for very low levels of minimum support, FPgrowth* and FPgrowth have almost the same running time.

Figure 8 (b) shows the CPU time for running the four algorithms on dataset T100I20D100K. The result is similar to the result in Figure 8 (a). FPgrowth* is again the best. Since the dataset T100I20D100K is sparser than T40I10D100K, the speedup of FPgrowth* over FPgrowth increases.

From Figure 8 (c) and (d), we can see that the FP-methods are faster than dEclat by an order of magnitude in both experiments. Since pumsb* and connect-4 are both very dense datasets, FPgrowth* and FPgrowth have almost the same running time, as the array technique does not achieve a significant speedup for dense datasets. In Figure 8 (c), the CPU time increases drastically when the minimum support goes below 25%. This is not a problem for FPgrowth and FPgrowth*, which are still able to produce results. The main reason the CPU time nevertheless increases steeply is that a long time has to be spent listing frequent itemsets. Recall that if there is a frequent "long" itemset of size ℓ, then we have to generate 2^ℓ frequent sets from it.

We also ran the four algorithms on many other datasets, and we found that FPgrowth* was always the fastest. To see why FPgrowth* is the fastest, let us consider the main operations in the algorithms. As discussed before, FPgrowth* spends most of its time on constructing and traversing FP-trees. The main operation in dEclat is to generate new candidate FI's by TID-array intersections. In MAFIA, generating new candidate FI's by bitvector and-operations is the main work. Since FPgrowth* uses the compact FP-tree, further boosted by the array technique, the time it spends constructing and traversing the trees is less than the time needed for TID-array intersections and bitvector and-operations. Moreover, the main memory space needed for storing FP-trees is far less than that needed for storing diffsets or bitvectors. Thus FPgrowth* runs faster than the other two algorithms, and it scales to very low levels of minimum support. Figure 11 (a) shows the main memory consumption of three algorithms running on dataset connect-4. We can see that FPgrowth* always uses the least main memory, and even for very low minimum support it still uses only a small amount of main memory.

5.2. MFI Mining

In our paper [5], we analyzed and verified the performance of algorithm FPmax. We learned that FPmax outperformed GenMax and MAFIA in some, but not all, cases. To see the impact of the new array technique and the new subset checking function used in FPmax*, in the second set of experiments we compared FPmax* with FPmax, GenMax, and MAFIA. Figure 9 (a) gives the result for running these algorithms on the sparse dataset T40I10D100K. We can see that FPmax is slower than GenMax for all levels of minimum support, while FPmax* outperforms GenMax by a factor of at least two. Figure 9 (b) shows the results for the very sparse dataset T100I20D100K: FPmax is the slowest algorithm, while FPmax* is the fastest.
Figure 9 (c) shows that FPmax* is the fastest algorithm for the dense dataset pumsb*, even though FPmax is the slowest algorithm on this dataset for very low levels of minimum support. In Figure 9 (d), FPmax outperforms GenMax and MAFIA for high levels of minimum support, but it is slow for very low levels. FPmax*, on the other hand, is about one to two orders of magnitude faster than GenMax and MAFIA for all levels of minimum support. All experiments in this second set show that the array technique and the new subset checking function are indeed very effective.

[Figure 8. Mining FI's: CPU time (s) versus minimum support (%) for FP-growth*, dEclat, MAFIA and FP-growth on (a) T40I10D100K, (b) T100I20D100K, (c) Pumsb_star, (d) Connect-4.]

[Figure 9. Mining MFI's: CPU time (s) versus minimum support (%) for FPMAX*, GenMax, MAFIA and FPMAX on the same four datasets.]

[Figure 10. Mining CFI's: CPU time (s) versus minimum support (%) for FPclose, MAFIA and Charm on the same four datasets.]

[Figure 11. Main memory used by the algorithms on Connect-4: (a) FI mining, (b) MFI mining, (c) FP-tree, MFI-tree and total memory of FPmax*, (d) CFI mining.]

Figure 11 (b) shows the main memory used by three algorithms when running them on dataset connect-4. From the figure, we can see that FPmax* uses less main memory than the other algorithms. Figure 11 (c) shows the main memory used by FP-trees, MFI-trees and the whole algorithm when running FPmax* on dataset connect-4. The minimum support was set to 10%. In the figure, the last point of the line for FP-tree is for the main memory of the first FP-tree (T∅), since at this point the space for all conditional FP-trees has been freed.
The last point of the line for MFI-tree is for the main memory of the MFI-tree that contains the whole set of MFI's, i.e., M∅. The figure confirms our analysis of the main memory used by FPmax* in Section 3.5. We also ran these four algorithms on many other datasets, and we found that FPmax* was always the fastest algorithm.

5.3. CFI Mining

In the third set of experiments, the performance of FPclose, CHARM and MAFIA, with the option of mining closed frequent itemsets, was compared. Figure 10 shows the results of running FPclose, CHARM and MAFIA on datasets T40I10D100K, T100I20D100K, pumsb* and connect-4. FPclose shows good performance on all datasets, due to the fact that it uses the compact FP-tree and the array technique. However, for very low levels of minimum support, FPclose has performance similar to CHARM and MAFIA. By analyzing the three algorithms, we found that FPclose generates more non-closed frequent itemsets than the other algorithms. For each of the generated frequent itemsets, the function closed checking must be called. Although the closed checking function is very efficient, the increased number of calls to it means a higher total running time. For high levels of minimum support, the time saved by using the compact FP-tree and the array technique compensates for the time FPclose spends on closed checking. In all cases, FPclose uses less main memory for mining CFI's than CHARM and MAFIA. Figure 11 (d) shows the memory used by the three algorithms when running them on dataset connect-4. We can see that for very low levels of minimum support, CHARM and MAFIA were aborted because they ran out of memory, while FPclose was still able to run and produce output.

6. Conclusions

We have introduced a novel array-based technique that allows using FP-trees more efficiently when mining frequent itemsets. Our technique greatly reduces the time spent traversing FP-trees, and works especially well for sparse datasets. Furthermore, we presented new algorithms for mining maximal and closed frequent itemsets. The FPgrowth* algorithm, which extends the original FPgrowth method, also uses the novel array technique to mine all frequent itemsets. For mining maximal frequent itemsets, we extended our earlier algorithm FPmax to FPmax*. FPmax* not only uses the array technique, but also a new subset-testing algorithm. For the subset testing, a variation of the FP-tree, an MFI-tree, is used for storing all already discovered MFI's. In FPmax*, a newly found FI is always compared with a small set of MFI's kept in an MFI-tree, thus making subset testing much more efficient. For mining closed frequent itemsets we give the FPclose algorithm. In this algorithm, a CFI-tree, another variation of the FP-tree, is used for testing the closedness of frequent itemsets. For all of our algorithms we have presented several optimizations that further reduce their running time.

Our experimental results showed that FPgrowth* and FPmax* always outperform existing algorithms. FPclose also demonstrates extremely good performance. All of the algorithms need less main memory because of the compact FP-trees, MFI-trees, and CFI-trees. Though the experimental results given in this paper show the success of our algorithms, in the future we will test them on more applications to further study their performance. We are also planning to explore ways to improve the FPclose algorithm by reducing the number of closedness tests needed.

References

[1] R. Agrawal and R. Srikant. Fast algorithms for mining association rules.
In Proceedings of VLDB'94, pages 487–499, 1994. [2] R. J. Bayardo, Jr. Efficiently mining long patterns from databases. In Proceedings of ACM SIGMOD'98, pages 85–93, 1998. [3] D. Burdick, M. Calimlim, and J. Gehrke. MAFIA: A maximal frequent itemset algorithm for transactional databases. In Proceedings of ICDE'01, pages 443–452, Apr. 2001. [4] K. Gouda and M. J. Zaki. Efficiently mining maximal frequent itemsets. In Proceedings of ICDM'01, San Jose, CA, Nov. 2001. [5] G. Grahne and J. Zhu. High performance mining of maximal frequent itemsets. In SIAM'03 Workshop on High Performance Data Mining: Pervasive and Data Stream Mining, May 2003. [6] J. Han, J. Pei, and Y. Yin. Mining frequent patterns without candidate generation. In Proceedings of ACM SIGMOD'00, pages 1–12, May 2000. [7] J. Han, J. Wang, Y. Lu, and P. Tzvetkov. Mining top-k frequent closed patterns without minimum support. In Proceedings of ICDM'02, pages 211–218, Dec. 2002. [8] J. Pei, J. Han, and R. Mao. CLOSET: An efficient algorithm for mining frequent closed itemsets. In ACM SIGMOD'00 Workshop on Research Issues in Data Mining and Knowledge Discovery, pages 21–30, 2000. [9] J. Wang, J. Han, and J. Pei. CLOSET+: Searching for the best strategies for mining frequent closed itemsets. In Proceedings of ACM SIGKDD'03, Washington, DC, 2003. [10] M. Zaki and K. Gouda. Fast vertical mining using diffsets. In Proceedings of ACM SIGKDD'03, Washington, DC, Aug. 2003. [11] M. Zaki and C. Hsiao. CHARM: An efficient algorithm for closed itemset mining. In Proceedings of SIAM'02, Arlington, Apr. 2002.

COFI-tree Mining: A New Approach to Pattern Growth with Reduced Candidacy Generation

Mohammad El-Hajj, Department of Computing Science, University of Alberta, Edmonton, AB, Canada, mohammad@cs.ualberta.ca
Osmar R. Zaïane, Department of Computing Science, University of Alberta, Edmonton, AB, Canada, zaiane@cs.ualberta.ca

Abstract

Existing association rule mining algorithms suffer from many problems when mining massive transactional datasets. Some of these major problems are: (1) the repetitive I/O disk scans, (2) the huge computation involved during the candidacy generation, and (3) the high memory dependency. This paper presents the implementation of our frequent itemset mining algorithm, COFI, which achieves its efficiency by applying four new ideas. First, it can mine using a compact memory-based data structure. Second, for each frequent item, a relatively small independent tree is built summarizing co-occurrences. Third, clever pruning reduces the search space drastically. Finally, a simple and non-recursive mining process reduces the memory requirements, as only minimal candidacy generation and counting is needed to generate all relevant frequent patterns.

1 Introduction

Frequent pattern discovery has become a common topic of investigation in the data mining research area. Its main theme is to discover the sets of items that occur together more than a given threshold defined by the decision maker. A well-known application domain that relies on frequent pattern discovery is market basket analysis. In most cases, when the support threshold is low and the number of frequent patterns "explodes", the discovery of these patterns becomes problematic for reasons such as high memory dependencies, a huge search space, and massive I/O. Recently, new studies have been proposed to reduce the memory requirements [8] and to decrease the I/O dependencies [7]; still, more promising issues need to be investigated, such as pruning techniques to reduce the search space.
In this paper we introduce a new method for frequent pattern discovery that is based on the Co-Occurrence Frequent Item tree concept [8, 9]. The newly proposed method uses a pruning technique that dramatically saves memory space. These relatively small trees are constructed based on a memory-based structure called the FP-Tree [11]. This data structure is studied in detail in the following sections. In short, we introduced in [8] the COFI-tree structure and an algorithm to mine it. In [7] we presented a disk-based data structure, the inverted matrix, that replaces the memory-based FP-tree and significantly improves the scalability of interactive frequent pattern mining. Our contributions in this paper are the introduction of a clever pruning technique based on an interesting property drawn from our top-down approach, and some implementation tricks and issues. We included the pruning in the tree-building algorithm so that the pruning is done on the fly.

1.1 Problem Statement

The problem of mining association rules over market basket analysis was introduced in [2]. The problem consists of finding associations between items or itemsets in transactional data. The data could be retail sales in the form of customer transactions or even medical images [16]. Association rules have been shown to be useful for other applications such as recommender systems, diagnosis, decision support, telecommunication, and even supervised classification [5]. Formally, as defined in [3], the problem is stated as follows: Let I = {i1, i2, ..., im} be a set of literals, called items, where m is considered the dimensionality of the problem. Let D be a set of transactions, where each transaction T is a set of items such that T ⊆ I. A unique identifier, TID, is given to each transaction. A transaction T is said to contain X, a set of items in I, if X ⊆ T. An association rule is an implication of the form "X ⇒ Y", where X ⊆ I, Y ⊆ I, and X ∩ Y = ∅. An itemset X is said to be large or frequent if its support s is greater than or equal to a given minimum support threshold σ. An itemset X satisfies a constraint C if and only if C(X) is true. The rule X ⇒ Y has support s in the transaction set D if s% of the transactions in D contain X ∪ Y. In other words, the support of the rule is the probability that X and Y hold together among all the possible presented cases. The rule X ⇒ Y holds in the transaction set D with confidence c if c% of the transactions in D that contain X also contain Y. In other words, the confidence of the rule is the conditional probability that the consequent Y is true under the condition of the antecedent X. The problem of discovering all association rules from a set of transactions D consists of generating the rules that have support and confidence greater than a given threshold. These rules are called strong rules. This association-mining task can be broken into two steps: 1. A step for finding all frequent k-itemsets, known for its extreme I/O scan expense and massive computational cost; 2. A straightforward step for generating strong rules. In this paper and our attached code, we focus exclusively on the first step: generating frequent itemsets.
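For concreteness, the support and confidence defined above can be written as relative frequencies (a standard formulation consistent with the definitions in this section, not taken verbatim from the paper):

\[
\mathrm{supp}(X \Rightarrow Y) \;=\; \frac{|\{\,T \in D : X \cup Y \subseteq T\,\}|}{|D|},
\qquad
\mathrm{conf}(X \Rightarrow Y) \;=\; \frac{\mathrm{supp}(X \cup Y)}{\mathrm{supp}(X)} .
\]

A rule is then strong when its support meets the minimum support threshold σ and its confidence meets the minimum confidence threshold.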
1.2 Related Work

Several algorithms have been proposed in the literature to address the problem of mining association rules [12, 10]. One of the key algorithms, which seems to be the most popular in many applications for enumerating frequent itemsets, is the Apriori algorithm [3]. This Apriori algorithm also forms the foundation of most known algorithms. It uses an anti-monotone property stating that for a k-itemset to be frequent, all its (k-1)-itemsets have to be frequent. The use of this fundamental property reduces the computational cost of candidate frequent itemset generation. However, in the case of extremely large input sets with many frequent 1-itemsets, the Apriori algorithm still suffers from two main problems: repeated I/O scanning and high computational cost. One major hurdle observed with most real datasets is the sheer size of the candidate frequent 2-itemsets and 3-itemsets.

TreeProjection is an efficient algorithm presented in [1]. This algorithm builds a lexicographic tree in which each node represents a frequent pattern. The authors report that their algorithm is one order of magnitude faster than the existing techniques in the literature. Another innovative approach to discovering frequent patterns in transactional databases, FP-Growth, was proposed by Han et al. in [11]. This algorithm creates a compact tree structure, the FP-Tree, representing frequent patterns, which alleviates the multi-scan problem and improves candidate itemset generation. The algorithm requires only two full I/O scans of the dataset to build the prefix tree in main memory, and then mines this structure directly. The authors of this algorithm report that it is faster than the Apriori and TreeProjection algorithms. Mining the FP-tree structure is done recursively by building conditional trees that are of the same order of magnitude in number as the frequent patterns. This massive creation of conditional trees makes the algorithm not scalable to mining large datasets beyond a few million transactions. In [14] the same authors propose a new algorithm, H-mine, that invokes FP-Tree mining on condensed data. This algorithm is still not scalable, as reported by its authors in [13].

1.3 Preliminaries, Motivations and Contributions

The Co-Occurrence Frequent Item tree (or COFI-tree for short) and the COFI algorithm presented in this paper are based on our previous work in [7, 8]. The main motivation of our current research is the pruning technique that reduces the memory space needed by the COFI-trees. The presented algorithm proceeds in two phases, in which phase 1 requires two full I/O scans of the transactional database to build the FP-Tree structure [11]. The second phase starts by building a small Co-Occurrence Frequent tree for each frequent item. These trees are pruned first to eliminate any non-frequent items with respect to the base frequent item of the COFI-tree. Finally the mining process is executed.

The remainder of this paper is organized as follows: Section 2 describes the Frequent Pattern tree, its design and construction. Section 3 illustrates the design, construction, pruning, and mining of the Co-Occurrence Frequent Item trees. Section 4 presents the implementation procedure of this algorithm. Experimental results are given in Section 5. Finally, Section 6 concludes by discussing some issues and highlighting our future work.

2 Frequent Pattern Tree: Design and Construction

The COFI-tree approach we propose consists of two main stages. Stage one is the construction of a modified Frequent Pattern tree.
Stage two is the repetitive building of small data structures, the actual mining of these data structures, and their release.

2.1 Construction of the Frequent Pattern Tree

The goal of this stage is to build the compact data structure called the Frequent Pattern Tree [11]. This construction is done in two phases, where each phase requires a full I/O scan of the dataset. A first initial scan of the database identifies the frequent 1-itemsets. The goal is to generate an ordered list of frequent items that will be used when building the tree in the second phase. This phase starts by enumerating the items appearing in the transactions. After enumerating these items (i.e. after reading the whole dataset), infrequent items with a support less than the support threshold are weeded out and the remaining frequent items are sorted by their frequency. This list is organized in a table, called the header table, where the items and their respective supports are stored along with pointers to the first occurrence of the item in the frequent pattern tree. Phase 2 then constructs the frequent pattern tree.

Phase 2 of constructing the Frequent Pattern tree structure is the actual building of this compact tree. It requires a second complete I/O scan of the dataset. For each transaction read, only the set of frequent items present in the header table is collected and sorted in descending order according to their frequency. These sorted transaction items are used in constructing the FP-Tree as follows: for the first item of the sorted transaction, check if it exists as one of the children of the root. If it exists, then increment the support of this node. Otherwise, add a new node for this item as a child of the root node with a support of 1. Then consider the current item node as the new temporary root and repeat the same procedure with the next item of the sorted transaction. During the process of adding any new item-node to the FP-Tree, a link is maintained between this item-node in the tree and its entry in the header table. The header table holds one pointer per item that points to the first occurrence of this item in the FP-Tree structure.
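The insertion step just described can be sketched as follows. This is a minimal illustration with hypothetical type and field names (the actual submission uses the FPTTree structure with child/brother links listed in Section 5), not the authors' code; headerFirst is assumed to be indexed by item id.

#include <vector>

// Illustrative FP-tree node; the submitted code uses child/brother pointers
// instead of a child vector, but the insertion logic is the same.
struct FPNode {
    int item = -1;
    int count = 0;
    FPNode* parent = nullptr;
    FPNode* next = nullptr;              // header-table chain of nodes with the same item
    std::vector<FPNode*> children;
};

// Insert one transaction (already filtered to frequent items and sorted by
// descending frequency) into the tree rooted at `root`.
void insertTransaction(FPNode* root, const std::vector<int>& sortedItems,
                       std::vector<FPNode*>& headerFirst) {
    FPNode* cur = root;
    for (int item : sortedItems) {
        FPNode* child = nullptr;
        for (FPNode* c : cur->children)                  // existing child labelled `item`?
            if (c->item == item) { child = c; break; }
        if (child == nullptr) {                          // no: create it and thread it
            child = new FPNode;                          //     into the header-table chain
            child->item = item;
            child->parent = cur;
            child->next = headerFirst[item];
            headerFirst[item] = child;
            cur->children.push_back(child);
        }
        child->count += 1;                               // bump the support of the node
        cur = child;                                     // descend and continue
    }
}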
Table 1. Transactional database
T1: A G D C B    T7: A C H I G     T13: M D C G O
T2: B C H E D    T8: L E F K B     T14: C F P Q J
T3: B D E A M    T9: A F M N O     T15: B D E F I
T4: C E F A N    T10: C F P G R    T16: J E B A D
T5: A B N O P    T11: A D B H I    T17: A K E F C
T6: A C Q R G    T12: D E B K L    T18: C D L B A

2.2 Illustrative Example

For illustration, we use an example with the transactions shown in Table 1. Let the minimum support threshold be set to 4. Phase 1 starts by accumulating the support of all items that occur in the transactions. Step 2 of phase 1 removes all non-frequent items, in our example (G, H, I, J, K, L, M, N, O, P, Q and R), leaving only the frequent items (A, B, C, D, E, and F). Finally, all frequent items are sorted according to their support to generate the sorted frequent 1-itemsets. This last step ends phase 1 of the COFI-tree algorithm (Figure 1) and starts the second phase.

[Figure 1. Steps of phase 1: Step 1 accumulates the support of every item; Step 2 removes the infrequent items, leaving A:11, B:10, C:10, D:9, E:8 and F:7; Step 3 sorts the frequent items by support.]

In phase 2, the first transaction (A, G, D, C, B) is filtered to consider only the frequent items that occur in the header table (i.e. A, D, C and B). This frequent list is sorted according to the items' supports (A, B, C and D). This ordered transaction generates the first path of the FP-Tree, with all item-node supports initially equal to 1. A link is established between each item-node in the tree and its corresponding item entry in the header table. The same procedure is executed for the second transaction (B, C, H, E, and D), which yields the sorted frequent item list (B, C, D, E) that forms the second path of the FP-Tree. Transaction 3 (B, D, E, A, and M) yields the sorted frequent item list (A, B, D, E), which shares the prefix (A, B) with an existing path in the tree. The supports of item-nodes A and B are incremented by 1, making the support of A and of B equal to 2, and a new sub-path is created with the remaining items of the list (D, E), all with support equal to 1. The same process occurs for all transactions until we have built the FP-Tree for the transactions given in Table 1. Figure 2 shows the result of the tree-building process. Notice that in our tree structure, contrary to the original FP-tree [11], the links are bi-directional. This, and other differences presented later, are used by our mining algorithm.

[Figure 2. The Frequent Pattern Tree built from the transactions in Table 1 (header table F:7, E:8, D:9, C:10, B:10, A:11).]

3 Co-Occurrence Frequent-Item-trees: Construction, Pruning and Mining

Our approach for computing frequencies relies first on building independent, relatively small trees for each frequent item in the header table of the FP-Tree, called COFI-trees. A pruning technique is applied to remove all non-frequent items with respect to the main frequent item of the tested COFI-tree. Then we mine each one of the trees separately as soon as it is built, minimizing the candidacy generation and without building conditional sub-trees recursively. The trees are discarded as soon as they are mined. At any given time, only one COFI-tree is present in main memory. In the following examples we always assume that we are building the COFI-trees based on the modified FP-Tree data structure presented above.

3.1 Pruning the COFI-trees

Pruning can be done after building a tree or, even better, while building it. We opted for pruning on the fly, since the overhead is minimal but the consequence is a drastic reduction in memory requirements. We will discuss the pruning idea, then present the building algorithm that performs the pruning on the fly. In this section we introduce a new anti-monotone property called the global frequent/local non-frequent property. This property is similar to the Apriori one in the sense that, at the i-th level, it eliminates all non-frequent items that will not participate in the (i+1) level of candidate itemset generation. The difference between the two properties is that our property also eliminates frequent items that appear among the i-itemsets but that we are sure will not participate in the (i+1) candidate set. The Apriori property states that all nonempty subsets of a frequent itemset must also be frequent. An example is given later in this section to illustrate both properties. In our approach, we are trying to find all frequent patterns with respect to one frequent item, which is the base item of the tested COFI-tree. We already know that all items that participate in the creation of the COFI-tree are frequent with respect to the global transaction database, but that does not mean that they are also locally frequent with respect to the base item of the COFI-tree.
The global frequent/local non-frequent property states that all nonempty subsets of a frequent itemset with respect to the item A of the A-COFI-tree must also be frequent with respect to item A. For each frequent item A we traverse the FP-Tree to find all frequent items that occur with A in at least one transaction (or branch of the FP-Tree), together with their number of occurrences. All items that are locally frequent with item A will participate in building the A-COFI-tree; the other globally frequent but locally non-frequent items will not participate in the creation of the A-COFI-tree. In our example we find that all items that participate in the creation of the F-COFI-tree are locally non-frequent with respect to item F, as the support of these items with respect to F is not greater than the support threshold σ, which is equal to 4 (Figure 3). Knowing this, there is no need to mine the F-COFI-tree: we already know that no frequent patterns other than item F itself will be generated. We can extend our knowledge at this stage to conclude that item F will not appear in any of the frequent patterns. The COFI-tree for item E indicates that only items D and B are frequent with respect to item E, which means that there is no need to test patterns such as EC and EA. The COFI-tree for item D indicates that item C is eliminated, as it is not frequent with respect to item D. The C-COFI-tree ignores item B for the same reason. To sum up, for the 6 frequent 1-itemsets of our example, the Apriori property would require generating 15 candidate 2-itemsets, namely (A,B), (A,C), (A,D), (A,E), (A,F), (B,C), (B,D), (B,E), (B,F), (C,D), (C,E), (C,F), (D,E), (D,F), (E,F). Using our property we eliminate (neither generate nor count) 9 of these patterns, namely (A,E), (A,F), (B,C), (B,F), (C,D), (C,E), (C,F), (D,F), (E,F), leaving only 6 patterns to test: (A,B), (A,C), (A,D), (B,D), (B,E), (D,E).

3.2 Construction of the Co-Occurrence Frequent-Item-trees

The small COFI-trees we build are similar to the conditional FP-Trees [11] in general, in the sense that they have a header with ordered frequent items and horizontal pointers pointing to a succession of nodes containing the same frequent item, and the prefix tree per se with paths representing sub-transactions. However, the COFI-trees have bi-directional links in the tree, allowing bottom-up scanning as well, and the nodes contain not only the item label and a frequency counter, but also a participation counter, as explained later in this section. The COFI-tree for a given frequent item x contains only nodes labeled with items that are more frequent than or as frequent as x.

To illustrate the idea of the COFI-trees, we explain step by step the process of creating COFI-trees for the FP-Tree of Figure 2. With our example, the first Co-Occurrence Frequent Item tree is built for item F, as it is the least frequent item in the header table. In this tree for F, all frequent items which are more frequent than F and which share transactions with F participate in building the tree. They can be found by following the chain of item F in the FP-Tree structure. The F-COFI-tree starts with the root node containing the item in question; then a scan of part of the FP-Tree is performed, following the chain of the F item in the FP-Tree. The first branch (F, A) has a frequency of 1, as the frequency of a branch is the frequency of the test item, which is F. The goal of this traversal is to count the frequency of each frequent item with respect to item F.
By doing so we find that item E occurs 4 times, D occurs 2 times, C occurs 4 times, B 2 times, and A 3 times. By applying the anti-monotone constraint property we can predict that item F will never appear in any frequent pattern except by itself. Consequently there is no need to continue building the F-COFI-tree.

The next frequent item to test is E. The same process is carried out to compute the frequency of each frequent item with respect to item E. From this we find that only two globally frequent items are also locally frequent, namely D:5 and B:6. For each sub-transaction or branch in the FP-Tree containing item E together with other locally frequent items that are more frequent than E (i.e., parent nodes of E), a branch is formed starting from the root node E. The support of this branch is equal to the support of the E node in its corresponding branch of the FP-Tree. If multiple frequent items share the same prefix, they are merged into one branch and a counter for each node of the tree is adjusted accordingly. Figure 3 illustrates all COFI-trees for the frequent items of Figure 2.

[Figure 3. The COFI-trees for the frequent items of Figure 2 (the F-, E-, D-, C- and B-COFI-trees), each node showing its support count and participation count.]

In Figure 3, the rectangular nodes are nodes from the tree, with an item label and two counters. The first counter is a support-count for that node, while the second counter, called the participation-count, is initialized to 0 and is used by the mining algorithm discussed later. Each node also has a horizontal link which points to the next node that has the same item-name in the tree, and a bi-directional vertical link that links a child node with its parent and a parent with its child. The bi-directional pointers facilitate the mining process by making the traversal of the tree easier. The squares are cells from the header table, as with the FP-Tree. This is a list made of all frequent items that participate in building the tree structure, sorted in ascending order of their global support. Each entry in this list contains the item-name, an item-counter, and a pointer to the first node in the tree that has the same item-name.

To explain the COFI-tree building process, we highlight the building steps for the E-COFI-tree in Figure 3. Frequent item E is read from the header table and its first location in the FP-Tree is found using the pointer in the header table. The first occurrence of item E indicates that it shares a branch with items C and A, with support = 2; since none of these items is locally frequent, only the support of the E root node is incremented by 2. The second node of item E indicates that it shares items D, B and A, with a support equal to 2 for this branch, as the support of the E node is considered the support of the branch (following the upward links for this item). Two nodes are created for items D and B with support equal to 2; D is a child node of B, and B is a child node of E. The third occurrence of E yields EDB:1, which shares an existing branch in the E-COFI-tree, so all counters are adjusted accordingly. A new branch EB:1 is created, as the support of E is 1 for the fourth occurrence of E. The final occurrence, EDB:2, uses an existing branch and only counters are adjusted. As with FP-Trees, the header constitutes a list of all frequent items, maintaining the location of the first entry for each item in the COFI-tree. A link is also made for each node in the tree that points to the next location of the same item in the tree, if it exists. The mining process is the last step performed on the E-COFI-tree before removing it and creating the next COFI-tree for the next item in the header table.
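The first step of building an A-COFI-tree, counting how often each globally frequent item co-occurs with the base item A, can be sketched as follows. This is an illustration only, reusing the hypothetical FPNode type from the sketch in Section 2.1; it is not the submitted implementation.

#include <map>

// Count, for base item A, the support of every other frequent item with respect
// to A, by following A's node-link chain and walking each occurrence up to the root.
// Items whose local count is below the threshold sigma are pruned from the A-COFI-tree.
std::map<int, int> localFrequencies(FPNode* firstA /* head of A's chain */) {
    std::map<int, int> localCount;
    for (FPNode* n = firstA; n != nullptr; n = n->next) {
        int branchSupport = n->count;                    // support of this occurrence of A
        for (FPNode* p = n->parent; p != nullptr && p->parent != nullptr; p = p->parent)
            localCount[p->item] += branchSupport;        // every ancestor co-occurs with A
    }
    return localCount;                                   // keep only entries >= sigma
}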
3.3 Mining the COFI-trees

The COFI-trees of all frequent items are not constructed together. Each tree is built, mined, then discarded before the next COFI-tree is built. The mining process is done for each tree independently, with the purpose of finding all frequent k-itemset patterns in which the item at the root of the tree participates. The steps to produce the frequent patterns related to item E, for example, are illustrated in Figure 4 (the F-COFI-tree is not mined, based on the pruning results found in the previous step). From each branch of the tree, using the support-count and the participation-count, candidate frequent patterns are identified and stored temporarily in a list. The non-frequent ones are discarded at the end, when all branches have been processed.

The mining process for the E-COFI-tree starts from the most locally frequent item in the header table of the tree, which is item B. Item B exists in two branches of the E-COFI-tree, namely (B:5, D:5, E:8) and (B:1, E:8). The frequency of each branch is the frequency of the first item in the branch minus the participation value of the same node. Item B in the first branch has a frequency value of 5 and a participation value of 0, which gives the first pattern, EDB, a frequency of 5. The participation values of all nodes in this branch are incremented by 5, which is the frequency of this pattern. From the first pattern, EDB:5, we need to generate all sub-patterns in which item E participates, which are ED:5, EB:5, and EDB:5. The second branch that contains B generates the pattern EB:1. EB already exists and its counter is adjusted to become 6. The COFI-tree of item E can be removed at this time and another tree can be generated and tested to produce all the frequent patterns related to its root node.

[Figure 4. Steps needed to generate the frequent patterns related to item E; the resulting frequent patterns are ED:5, EB:6 and EDB:5.]

The same process is executed to generate the remaining frequent patterns. The D-COFI-tree (Figure 5) is created after the E-COFI-tree. Mining this tree generates the following frequent patterns: DBA:5, DA:5, and DB:8. The same process occurs for the remaining trees, which produce AC:6 for the C-COFI-tree and BA:6 for the B-COFI-tree.

[Figure 5. Steps needed to generate the frequent patterns related to item D; the resulting frequent patterns are DBA:5, DB:8 and DA:5.]

The following is our algorithm for building and mining the COFI-trees with pruning.

Algorithm COFI: Creating, pruning and mining COFI-trees
Input: modified FP-Tree, a minimum support threshold σ
Output: Full set of frequent patterns
Method:
1. A = the least frequent item in the header table of the FP-Tree
2. While (there are still frequent items) do
   2.1 Count the frequency of all items that share a path with item (A). The frequencies of all items that share the same path are the same as the frequency of the (A) node.
   2.2 Remove all locally non-frequent items from the frequent list of item (A)
   2.3 Create a root node for the (A)-COFI-tree with both frequency-count and participation-count = 0
       2.3.1 C is the path of locally frequent items on the path from item A to the root
       2.3.2 Items on C form a prefix of the (A)-COFI-tree
       2.3.3 If the prefix is new then set frequency-count = frequency of the (A) node and participation-count = 0 for all nodes in the path
       Else
       2.3.4 Adjust the frequency-count of the already existing part of the path
       2.3.5 Adjust the pointers of the Header list if needed
       2.3.6 Find the next node for item A in the FP-tree and go to 2.3.1
   2.4 MineCOFI-tree (A)
   2.5 Release the (A)-COFI-tree
   2.6 A = next frequent item from the header table
3. Goto 2

Function: MineCOFI-tree (A)
1. nodeA = select next node // Selection of nodes starts with the node of the most locally frequent item, following its chain, then the next less frequent item with its chain, until we reach the least frequent item in the header list of the (A)-COFI-tree
2. While there are still nodes do
   2.1 D = set of nodes from nodeA to the root
   2.2 F = nodeA.frequency-count − nodeA.participation-count
   2.3 Generate all candidate patterns X from the items in D. Patterns that do not contain A are discarded.
   2.4 Patterns in X that do not exist in the A-Candidate List are added to it with frequency = F; otherwise their frequency is incremented by F
   2.5 Increment the participation-count by F for all items in D
   2.6 nodeA = select next node
3. Goto 2
4. Based on the support threshold σ, remove non-frequent patterns from the A-Candidate List.

4 Experimental Studies

To study the COFI-tree mining strategies we have conducted several experiments on a variety of data sizes, comparing our approach with the well-known FP-Growth [11] algorithm written by its original authors. The experiments were conducted on a 2.6 GHz CPU machine with 2 GBytes of memory running the Win2000 operating system. Transactions were generated using the IBM synthetic data generator [4]. We have conducted several types of experiments to test the effect of changing the support, transaction size, dimension, and transaction length. The first set of experiments was run on a transaction database of 500K transactions, a dimension of 10K, and an average transaction length of 12. We varied the support from an absolute value of 500 down to 2, for which the number of generated frequent patterns varied from 15K to 3400K. FP-Growth could not mine the last experiment in this set as it used up all available memory space. In all experiments the COFI-tree approach outperforms the FP-Growth approach.

[Figure 6. Mining a dataset of 500K transactions (dimension = 10K, average of 12 items per transaction), support varying from 0.1% down to 0.0004%: (A) runtime in seconds, (B) total memory requirement in KBytes, (C) memory requirement without the FP-tree, for COFI and FP-Growth.]
The major accomplishment of our approach is in the memory space saved. Our algorithm outperforms FP-Growth by one order of magnitude in terms of memory space requirements. We have also tested the memory space used during the mining process only (i.e., isolating the memory space used to create the FP-Tree by both FP-Growth and our COFI-tree algorithm, which is also FP-Tree based). We found that, here too, the COFI-tree approach outperforms FP-Growth by one order of magnitude: the memory space used by the COFI-trees is one order of magnitude smaller than that used by the conditional trees of FP-Growth during the mining process. Figure 6A presents the time needed to mine 500K transactions using different support levels. Figure 6B depicts the memory needed during the mining process of the previous experiments. Figure 6C illustrates the memory needed by the COFI-trees and conditional trees during the mining process.

Other experiments were conducted to test the effect of changing the dimension, transaction size, and transaction length using the same support, which is 0.05%. Some of these experiments are represented in Figure 7. Figures 7A and 7B represent the time needed during the mining process. Figures 7C and 7D represent the memory space needed during the whole mining process. Figures 7E and 7F represent the memory space needed by the COFI-trees or conditional trees during the mining process. In these experiments we varied the dimension, i.e. the number of distinct items, from 5K to 10K, the average transaction length from 12 to 24 items per transaction, and the number of transactions from 10K to 500K. All these experiments showed that our approach is one order of magnitude better than the FP-Growth approach in terms of memory usage.

We also ran experiments using the public UCI datasets provided on the FIMI workshop website, namely Mushroom, Chess, Connect, Pumsb, T40I10D100K, and T10I4D100K. The COFI algorithm scales relatively well vis-à-vis the support threshold on these datasets. Results are not reported here for lack of space. Our approach showed good results with high support values on all datasets. However, as with other approaches, in cases of low support values, where the number of frequent patterns increases significantly, our approach faces some difficulties. For such cases it is recommended to consider discovering closed itemsets or maximal patterns instead of just frequent itemsets. The sheer number of frequent itemsets becomes overwhelming, and some argue even useless. Closed itemsets and maximal itemsets represent all frequent patterns while eliminating the redundant ones. For illustration, Table 2 compares the CPU time and memory requirements of COFI and FP-Growth on the T10I4D100K dataset.

Table 2. Time and memory scalability with respect to support on the T10I4D100K dataset
Support %   Time (s): COFI / FP-Growth   Memory (KB): COFI / FP-Growth
0.50        1.5 / 3.0                    18 / 173
0.25        1.7 / 5.2                    19 / 285
0.10        2.7 / 12.3                   26 / 289
0.05        14.0 / 20.9                  19 / 403

5 Implementations

The COFI-tree program submitted with this paper is C++ code. The executable runs with 3 parameters: (1) the path to the input file, (2) a positive integer that represents the absolute support, and (3) an optional file name for the output patterns. This code generates ALL frequent patterns from the provided input file.
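For example, an invocation might look like the line below; the executable and file names are placeholders chosen for illustration (the actual binary name depends on the submitted build), and only the three-parameter interface is as described above:

./cofi transactions.dat 500 patterns.out    # input file, absolute support 500, optional output file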
The code scans the database twice. The goal of the first database scan is to find the frequency of each item in the transactional database. These frequencies are stored in a data structure called Candidate-Items. Each entry of Candidate-Items is a structure called ItemsStructure, made of two long integers representing the item and its frequency. All frequent items are then stored in a special data structure called F1-Items. This data structure is sorted in descending order based on the frequency of each item. To access the location of each item we map it to a specific location using a new data structure called FindInHashTable. In brief, since we do not know the number of unique items at runtime, and thus cannot create an array for counting the items, rather than keeping a linked list of items we create blocks of p items. The number p could arbitrarily be 100 or 1000. Indeed, following links in a linked list each time to find and increment a counter could be expensive. Instead, blocks of items are easily indexed. In the worst case, we lose the space of p − 1 unused items.

The second scan starts by eliminating all non-frequent items from each transaction read, and then sorting the transaction based on the frequency of each frequent item. This is done in the Sort-Transaction method. The FP-tree is built from the sub-transactions made of the frequent items. The FP-tree data structure is a tree in which a node may have any number of children. The structure

struct FPTTree {
    long Element;
    long counter;
    FPTTree* child;
    FPTTree* brother;
    FPTTree* father;
    FPTTree* next;
};

has been used to create each node of this tree, where a link is created between each node and its first child, and the brother link is maintained to create a linked list of all children of the same node. This linked list is kept ordered based on the frequency of each item. The header list is maintained using the structure

struct FrequentStruc {
    long Item;
    long Frequency;
    long COFIFrequency;
    long COFIFrequency1;
    FPTTree* first;
    COFITree* firstCOFI;
};

After building the FP-tree we start building the first COFI-tree by selecting the item with the least frequency from the frequent list. A scan is made of the FP-tree, starting from the linked list of this item, to find the frequency of the other items with respect to this item. After that, the COFI-tree is created based on only the locally frequent items. Finally, frequent patterns are generated and stored in the FrequentTree data structure. All nodes that have a support greater than or equal to the given support represent a frequent pattern. The COFI-tree and the FrequentTree are then removed from memory and the next COFI-tree is created, until all COFI-trees have been mined.

One interesting implementation improvement is that the participation counter was also added to the header table of the COFI-tree; this counter accumulates the participation of the item in all patterns already discovered in the current COFI-tree. The difference between the participation counter in a node and the one in the header is that the counter in the node counts the participation of the node's item in all paths where that node appears, while the new counter in the COFI-tree header counts the participation of the item globally in the tree. This trick does not compromise the effectiveness and usefulness of the participation counting. One main advantage of this counter is that it allows looking ahead to see whether all nodes of a specific item have already been traversed, reducing unneeded scans of the COFI-tree.
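As a rough illustration of the block-based counting described at the beginning of this section, the sketch below allocates counters in fixed-size blocks and locates an item's slot through a hash map. Type and member names other than the role played by ItemsStructure are hypothetical; this is a sketch of the idea, not the submitted code.

#include <cstddef>
#include <unordered_map>
#include <utility>
#include <vector>

// Counters are stored in blocks of P entries so they can be indexed directly,
// instead of walking a linked list for every increment.
class BlockedItemCounter {
    static constexpr std::size_t P = 1000;               // block size p (e.g. 100 or 1000)
    struct Slot { long item; long frequency; };          // plays the role of ItemsStructure
    std::vector<std::vector<Slot>> blocks;
    std::unordered_map<long, std::pair<std::size_t, std::size_t>> where; // item -> (block, offset)
public:
    void count(long item) {
        auto it = where.find(item);
        if (it == where.end()) {                          // first occurrence of this item
            if (blocks.empty() || blocks.back().size() == P)
                blocks.emplace_back();                    // open a new block of p slots
            blocks.back().push_back({item, 0});
            it = where.emplace(item, std::make_pair(blocks.size() - 1,
                                                    blocks.back().size() - 1)).first;
        }
        ++blocks[it->second.first][it->second.second].frequency;  // direct indexed access
    }
    long frequency(long item) const {
        auto it = where.find(item);
        return it == where.end() ? 0 : blocks[it->second.first][it->second.second].frequency;
    }
};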
6 Conclusion and future work

The COFI algorithm we propose in this paper, based on our COFI-tree structure, is one order of magnitude better than the FP-Growth algorithm in terms of memory usage, and sometimes also in terms of speed. This algorithm achieves these results thanks to: (1) the non-recursive technique used during the mining process, in which a full set of frequent patterns can be generated with a simple traversal of the COFI-tree; and (2) the pruning method that is used to remove all locally non-frequent items, leaving the COFI-tree with only locally frequent items. The major advantage of our algorithm COFI over FP-Growth is that it needs a significantly smaller memory footprint, and thus can mine larger transactional databases with less main memory available. The fundamental difference is that COFI tries to find a compromise between a fully pattern-growth approach, which FP-Growth adopts, and the total candidacy generation approach that Apriori is known for. COFI grows targeted patterns but performs a reduced and focused generation of candidates during the mining. This avoids the recursion that FP-Growth uses, which is notorious for blowing the stack with large datasets. We have developed algorithms for closed itemset mining and maximal itemset mining based on our COFI-tree approach. However, their efficient implementations were not ready by the deadline of this workshop. These efficient algorithms and experimental results will be compared to existing algorithms such as CHARM [17], MAFIA [6] and CLOSET+ [15], and will be reported in the future.

7 Acknowledgments

This research is partially supported by a Research Grant from NSERC, Canada.

References

[1] R. Agarwal, C. Aggarwal, and V. Prasad. A tree projection algorithm for generation of frequent itemsets. Journal of Parallel and Distributed Computing, 2000. [2] R. Agrawal, T. Imielinski, and A. Swami. Mining association rules between sets of items in large databases. In Proc. 1993 ACM-SIGMOD Int. Conf. Management of Data, pages 207–216, Washington, D.C., May 1993. [3] R. Agrawal and R. Srikant. Fast algorithms for mining association rules. In Proc. 1994 Int. Conf. Very Large Data Bases, pages 487–499, Santiago, Chile, September 1994. [4] IBM Almaden. Quest synthetic data generation code. http://www.almaden.ibm.com/cs/quest/syndata.html. [5] M.-L. Antonie and O. R. Zaïane. Text document categorization by term association. In IEEE International Conference on Data Mining, pages 19–26, December 2002. [6] D. Burdick, M. Calimlim, and J. Gehrke. MAFIA: A maximal frequent itemset algorithm for transactional databases. In Proceedings of ICDE'01, April 2001. [7] M. El-Hajj and O. R. Zaïane. Inverted matrix: Efficient discovery of frequent items in large datasets in the context of interactive mining. In Proc. 2003 Int'l Conf. on Data Mining and Knowledge Discovery (ACM SIGKDD), August 2003. [8] M. El-Hajj and O. R. Zaïane. Non recursive generation of frequent k-itemsets from frequent pattern tree representations. In Proc. of 5th International Conference on Data Warehousing and Knowledge Discovery (DaWaK'2003), September 2003. [9] M. El-Hajj and O. R. Zaïane. Parallel association rule mining with minimum inter-processor communication. In Fifth International Workshop on Parallel and Distributed Databases (PaDD'2003), in conjunction with the 14th Int'l Conf. on Database and Expert Systems Applications (DEXA 2003), September 2003. [10] J. Han and M. Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann, San Francisco, CA, 2001. [11] J. Han, J. Pei, and Y. Yin. Mining frequent patterns without candidate generation. In ACM-SIGMOD, Dallas, 2000. [12] J. Hipp, U. Güntzer, and G. Nakhaeizadeh.
Algorithms for association rule mining - a general survey and comparison. ACM SIGKDD Explorations, 2(1):58–64, June 2000. [13] J. Liu, Y. Pan, K. Wang, and J. Han. Mining frequent item sets by opportunistic projection. In Eighth ACM SIGKDD International Conf. on Knowledge Discovery and Data Mining, pages 229–238, Edmonton, Alberta, August 2002. [14] J. Pei, J. Han, H. Lu, S. Nishio, S. Tang, and D. Yang. H-mine: Hyper-structure mining of frequent patterns in large databases. In ICDM, pages 441–448, 2001. [15] J. Wang, J. Han, and J. Pei. CLOSET+: Searching for the best strategies for mining frequent closed itemsets. In 9th ACM SIGKDD International Conf. on Knowledge Discovery and Data Mining, July 2003. [16] O. R. Zaïane, J. Han, and H. Zhu. Mining recurrent items in multimedia with progressive resolution refinement. In Int. Conf. on Data Engineering (ICDE'2000), pages 461–470, San Diego, CA, February 2000. [17] M. Zaki and C.-J. Hsiao. CHARM: An efficient algorithm for closed itemset mining. In 2nd SIAM International Conference on Data Mining, April 2002.

[Figure 7. Mining datasets of different sizes (D = dimension, L = average number of items per transaction, support = 0.05%, 10K to 500K transactions): (A)-(B) mining time, (C)-(D) total memory usage, (E)-(F) memory used by the COFI-trees or conditional trees, for D=5K/L=12 and D=10K/L=24, comparing COFI and FP-Growth.]

Mining Frequent Itemsets using Patricia Tries

Andrea Pietracaprina and Dario Zandolin
Department of Information Engineering, University of Padova
andrea.pietracaprina@unipd.it, dzandol@tin.it

This research was supported in part by MIUR of Italy under project "ALINWEB: Algorithmics for Internet and the Web".

Abstract

We present a depth-first algorithm, PatriciaMine, that discovers all frequent itemsets in a dataset, for a given support threshold. The algorithm is main-memory based and employs a Patricia trie to represent the dataset, which is space efficient for both dense and sparse datasets, whereas alternative representations were adopted by previous algorithms for these two cases. A number of optimizations have been introduced in the implementation of the algorithm. The paper reports several experimental results on real and artificial datasets, which assess the effectiveness of the implementation and show the better performance attained by PatriciaMine with respect to other prominent algorithms.

1. Introduction

In this work, we focus on the problem of finding all frequent itemsets in a dataset D of transactions over a set of items I, that is, all itemsets X ⊆ I contained in a number of transactions greater than or equal to a certain given threshold [2]. Several algorithms proposed in the literature to discover all frequent itemsets follow a depth-first approach, considering one item at a time and generating (recursively) all frequent itemsets which contain that item before proceeding to the next item. A prominent member of this class of algorithms is FP-Growth, proposed in [7].
It represents the dataset D through a standard trie (FP-tree) and, for each frequent itemset X, it materializes a projection DX of the dataset on the transactions containing X, which is used to recursively discover all frequent supersets Y ⊃ X. This approach is very effective for dense datasets, where the trie achieves high compression, but it becomes space inefficient when the dataset is sparse, and it incurs high costs due to the frequent projections. Improved variants of FP-Growth have appeared in the literature, which avoid physical projections of the dataset (TopDown FP-Growth [14]), or employ two alternative array-based and trie-based structures to cope, respectively, with sparse and dense datasets, switching adaptively from one to the other (H-mine [12]). The most successful ideas developed in these works have been gathered and further refined in OpportuneProject [9], which opportunistically selects the best strategy based on the characteristics of the dataset.

In this paper, we present an algorithm, PatriciaMine, which further improves upon the aforementioned algorithms that stemmed from FP-Growth. Our main contribution is twofold:

• We use a compressed (Patricia) trie to store the dataset, which provides a space-efficient representation for both sparse and dense datasets, without resorting to two alternative structures, namely array-based and trie-based, as was suggested in [12, 9]. Indeed, by featuring a smaller number of nodes than the standard trie, the Patricia trie exhibits lower space requirements, especially in the case of sparse datasets, where it becomes comparable to the natural array-based representation, and it reduces the amount of bookkeeping operations. Both theoretical and experimental evidence of these facts is given in the paper. (A node-layout sketch illustrating the idea is given after this list.)

• A number of optimizations have been introduced in the implementation of PatriciaMine. In particular, a heuristic has been employed to limit the number of physical projections of the dataset during the course of execution, with the intent to avoid the time and space overhead incurred by projection when it is not beneficial. Moreover, novel mechanisms have been developed for directly generating groups of itemsets supported by the same subset of transactions, and for visiting the trie without traversing individual nodes multiple times. The effectiveness of these optimizations is discussed in the paper.
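To make the first contribution concrete, the fragment below contrasts a standard trie node, which stores a single item, with a Patricia trie node, which collapses a chain of single-child nodes into one node storing the whole item sequence. This is an illustrative C++ sketch only (the actual PatriciaMine code is written in C and its real node layout is not shown here); all names are hypothetical.

#include <cstdint>
#include <vector>

// Standard trie (FP-tree style): one item per node.
struct TrieNode {
    int32_t item;                      // single item label
    uint32_t count;                    // support of the path ending here
    std::vector<TrieNode*> children;
};

// Patricia trie: a maximal chain of single-child nodes is stored as one node
// holding the whole item sequence, so long, rarely shared suffixes (the common
// case in sparse datasets) need far fewer nodes and far fewer pointer hops.
struct PatriciaNode {
    std::vector<int32_t> items;        // the collapsed sequence of items
    uint32_t count;                    // support of the path ending here
    std::vector<PatriciaNode*> children;
};

When a branch of the trie degenerates into a chain, the Patricia representation stores it in a single node, which is why its size becomes comparable to an array-based representation on sparse data.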
Sections 4 and 5 describe the most relevant features of the algorithm implementation, while the experimental results are reported and discussed in Section 6.

[Figure 1. Sample dataset of six transactions, TIDs 1–6 (items in bold are frequent for min_sup = 3)]

2. Preliminaries

Let I be a set of items, and D a set of transactions, where each transaction t ∈ D consists of a distinct identifier tid and a subset of items t_set ⊆ I. For an itemset X ⊆ I, its support in D, denoted by supp_D(X), is defined as the number of transactions t ∈ D such that X ⊆ t_set. Given an absolute support threshold min_sup, with 0 < min_sup ≤ |D|, an itemset X ⊆ I is frequent w.r.t. D and min_sup if supp_D(X) ≥ min_sup. With a slight abuse of notation, we call an item i ∈ I frequent if {i} is frequent, and refer to supp_D({i}) as the support of i. (When clear from the context, we will refer to frequent items or itemsets, omitting D and min_sup.) We study the problem of determining the set of all frequent itemsets for given D and min_sup, which we denote by F(D, min_sup). For an itemset X ⊆ I, we denote by D_X the subset of D projected on those transactions that contain X.

Let I′ = {i_1, i_2, . . .} ⊆ I denote the subset of frequent items, ordered by increasing support, and assume that the items in each frequent itemset are ordered accordingly. As observed in [1, 9], the set F(D, min_sup) can be conveniently represented through a standard trie [8], called Frequent ItemSet Tree (FIST), whose nodes are in one-to-one correspondence with the frequent itemsets. Specifically, each node v is labelled with an item i and a support value σ_v, so that the itemset associated with v is given by the sequence of items labelling the path from the root to v, and has support σ_v. The root is associated with the empty itemset and is labelled with (·, |D|). The children of every node are arranged right-to-left consistently with the ordering of their labelling items.

[Figure 2. FIST for the sample dataset with min_sup = 3]

A sample dataset and the corresponding FIST for min_sup = 3 are shown in Figures 1 and 2. Notice that a different initial ordering of the items in I′ would produce a different FIST. Most of the algorithms that compute F(D, min_sup) perform either a breadth-first or a depth-first exploration of some FIST. In particular, our algorithm performs a depth-first exploration of the FIST defined above.
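To make the definitions above concrete, the following minimal C sketch (an illustration of ours, not code from PatriciaMine) computes supp_D(X) for a single itemset X over a plain array-of-transactions representation and checks it against min_sup; the struct and function names, and the assumption that items within transactions and itemsets are stored in increasing order, are ours.

    #include <stdbool.h>
    #include <stddef.h>

    /* A transaction: a sorted array of distinct item identifiers. */
    typedef struct {
        const int *items;   /* items in increasing order          */
        size_t     len;     /* number of items in the transaction */
    } transaction_t;

    /* Returns true if the sorted itemset x (of length xlen) is contained in t. */
    static bool contains(const transaction_t *t, const int *x, size_t xlen)
    {
        size_t i = 0, j = 0;
        while (i < t->len && j < xlen) {
            if (t->items[i] < x[j])        i++;            /* skip smaller item   */
            else if (t->items[i] == x[j])  { i++; j++; }   /* matched one item    */
            else                           return false;   /* x[j] missing from t */
        }
        return j == xlen;
    }

    /* supp_D(X): the number of transactions of D that contain every item of X. */
    static size_t support(const transaction_t *D, size_t num_trans,
                          const int *x, size_t xlen)
    {
        size_t count = 0;
        for (size_t i = 0; i < num_trans; i++)
            if (contains(&D[i], x, xlen))
                count++;
        return count;
    }

    /* X is frequent w.r.t. D and the absolute threshold min_sup
       iff supp_D(X) >= min_sup. */
    static bool is_frequent(const transaction_t *D, size_t num_trans,
                            const int *x, size_t xlen, size_t min_sup)
    {
        return support(D, num_trans, x, xlen) >= min_sup;
    }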
2.1. Datasets used in the experiments

The experiments reported in this paper have been conducted on several real and artificially generated datasets, frequently used in previous works. We briefly describe them below and refer the reader to [4, 16] for more details (see also Table 1).

Pos: from Blue Martini Software Inc.; contains several years' worth of point-of-sale data from an electronics retailer.
WebView1, WebView2: from Blue Martini Software Inc.; contain several months of clickstream data from e-commerce web sites.
Pumsb, Pumsb*: derived by [4] from census data.
Mushroom: contains characteristics of various species of mushrooms.
Connect-4, Chess: contain data relative to the respective games.
IBM-Artificial: a class of artificial datasets obtained using the generator developed in [3]. A dataset in this class is denoted through the parameters used by the generator, namely as Dx.Ty.Iw.Lu.Nz, where x is the number of transactions, y the average transaction size, w the average size of maximal potentially large itemsets, u the number of maximal potentially large itemsets, and z the number of items.

Datasets from Blue Martini Software Inc. and (usually) the artificial ones are regarded as sparse, while the other ones are regarded as dense.

3. The main strategy

The main strategy adopted by PatriciaMine is described by the pseudocode in Figure 3 and is based on a depth-first exploration of the FIST, similar to the one employed by the algorithms in [7, 12, 14, 9]. However, it must be remarked that while previous algorithms were expressed in a recursive fashion, PatriciaMine follows an iterative exploration strategy, which avoids the burden of managing recursion.

  1.  Determine I′ and D′;
  2.  Create IL and link it to D′;
      X ← ∅; h ← 0; ℓ ← 0;
      while (ℓ < |IL|) do
  3.    if (IL[ℓ].count < min_sup) then ℓ ← ℓ + 1;
        else
  4.      if ((h > 0) AND (IL[ℓ].item = X[h−1]))
  5.        then ℓ ← ℓ + 1; h ← h − 1;
          else
  6.          X[h] ← IL[ℓ].item;
  7.          h ← h + 1;
  8.          Generate itemset X;
  9.          for i ← ℓ − 1 downto 0 do
                make IL[i].ptr point to the head of t-list(i, D′_X);
                IL[i].count ← support of IL[i].item in D′_X;
              ℓ ← 0;

      Figure 3. Main Strategy

A first scan of the dataset D is performed to determine the set I′ of frequent items, and a pruned instance D′ of the original dataset where non-frequent items and empty transactions are removed (Step 1). Then, an Item List (IL) vector is created (Step 2), where each entry IL[ℓ] consists of three fields: IL[ℓ].item, IL[ℓ].count, and IL[ℓ].ptr, which store, respectively, a distinct item of I′, its support, and a pointer. The entries are sorted by decreasing value of the support field, hence the most frequent items are positioned at the top of the IL. The IL is linked to D′ as follows. For each entry IL[ℓ], the pointer IL[ℓ].ptr points to a list that threads together all occurrences of IL[ℓ].item in D′. We call such a list the threaded list for IL[ℓ].item with respect to D′, and denote it by t-list(ℓ, D′). The initial IL for the sample dataset and the t-lists built on a natural representation of the dataset are shown in Figure 4. (The actual data structure used to represent D′ will be discussed in the next section.)

[Figure 4. Initial IL and t-lists for the sample dataset]
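The IL just described can be pictured as a plain array of three-field records. The sketch below is our own illustration of those fields (item, count, ptr), with a forward-declared trie node type standing in for whatever structure the t-lists thread through; field and type names are ours.

    #include <stddef.h>
    #include <stdlib.h>

    struct trie_node;                 /* node of the dataset representation (Section 4) */

    /* One Item List entry: a distinct frequent item, its support in the
       current projection D'_X, and the head of t-list(l, D'_X), which threads
       together all occurrences of the item in that projection. */
    typedef struct {
        int               item;       /* item identifier             */
        size_t            count;      /* support of the item in D'_X */
        struct trie_node *ptr;        /* head of the item's t-list   */
    } il_entry_t;

    /* qsort comparator: decreasing count, so that after
       qsort(il, n, sizeof(il_entry_t), il_cmp_desc)
       the most frequent items sit at the top of the IL. */
    static int il_cmp_desc(const void *a, const void *b)
    {
        const il_entry_t *x = a, *y = b;
        if (x->count != y->count)
            return (x->count < y->count) ? 1 : -1;
        return 0;
    }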
Then, a depth-first exploration of the FIST is started, visiting the children of each node by decreasing support order (i.e., left-to-right with respect to Figure 2). This exploration is performed by the while-loop in the pseudocode. A vector X and an integer h are used to store, respectively, the itemset associated with the last visited node of the FIST and its length (initially, X is empty and h = 0, meaning that the root has just been visited).

Let us consider the beginning of a generic iteration of the while-loop and let v be the last visited node of the FIST, associated with itemset X = (a_1, a_2, . . . , a_h), where a_h is the item labelling v and, for j < h, a_j is the item labelling the ancestor w_j of v at distance h−j from it. For 1 ≤ j ≤ h, let ℓ_j be the IL index such that IL[ℓ_j].item = a_j, and note that ℓ_h < ℓ_{h−1} < · · · < ℓ_1; also denote by X_j the prefix (a_1, a_2, . . . , a_j) of X, which is the itemset associated with w_j (clearly, X = X_h). The following invariant holds at the beginning of the iteration. Let ℓ′ be an arbitrary index of the IL, and suppose that ℓ_{j+1} < ℓ′ ≤ ℓ_j, for some 0 ≤ j ≤ h, setting for convenience ℓ_0 = |IL| − 1 and ℓ_{h+1} = −1. Then, IL[ℓ′].count stores the support of item IL[ℓ′].item in D′_{X_j}, and IL[ℓ′].ptr points to t-list(ℓ′, D′_{X_j}), which threads together all occurrences of IL[ℓ′].item in D′_{X_j} (we let X_0 = ∅ and D′_{X_0} = D′).

During the current iteration and, possibly, a number of subsequent iterations, the node u which is either the first child of v, if any, or the first unvisited child of one of v's ancestors is identified (Steps 3–5). If no such node is found, the algorithm terminates. It is easily seen that the item labelling u is the first item IL[ℓ].item found scanning the IL from the top such that IL[ℓ].count ≥ min_sup and ℓ ≠ ℓ_j for every 1 ≤ j ≤ h. If node u is found, its corresponding itemset is generated (Steps 6–8). (Note that if u is the child of an ancestor w of v, then before Step 6 is executed X[0 . . . h − 1] correctly stores the itemset associated with w.) Then, the first ℓ entries of the IL are updated so as to enforce the invariant for the next iteration (for-loop of Step 9).

[Figure 5. IL and t-lists after visiting (L,4)]

Figure 5 shows the IL and t-lists for the sample dataset at the end of the while-loop iteration where node u = (L,4) is visited and itemset X = (L) is generated. Observe that while the entries for items G and H (respectively, IL[5] and IL[6]) are relative to the entire dataset, all other entries are relative to D′_X. The correctness of the whole strategy is easily established by noting that the invariant stated before holds with h = 0 at the beginning of the while-loop, i.e., at the end of the visit of the root of the FIST.

4. Representing the dataset as a Patricia trie

Crucial to the efficiency of the main strategy presented in the previous section is the choice of the data structure employed to represent the dataset D′. Some previous works represented the dataset D′ through a standard trie, called FP-tree, built on the set of transactions, with items sorted by decreasing support [7, 14]. The advantage of using the trie is substantial for dense datasets because of the compression achieved by merging common prefixes, but in the worst case, when the dataset is highly sparse, the number of nodes may be close to the size N of the original dataset (i.e., the sum of all transaction lengths). Since each node of the trie stores an item, a count value, which indicates the number of transactions sharing the prefix found along the path from the node to the root, plus other information needed for navigating the trie (e.g., pointers to the children and/or to the father), the overall space taken by the trie may turn out to be αN, where α is a constant greater than 1.

For these reasons, it has been suggested in [12, 9] that sparse datasets, for which the trie becomes space inefficient, be stored in a straightforward fashion as arrays of transactions. However, these works also encourage switching to the trie representation during the course of execution, for portions of the dataset which are estimated to be sufficiently dense. However, an effective heuristic to decide when to switch from one structure to the other is hard to find and may be costly to implement. Moreover, even if a good heuristic were found, the overhead incurred in the data movement may reduce the advantages brought by the compression gained. To avoid the need for two alternative data structures to attain space efficiency, our algorithm resorts to a compressed trie, better known as a Patricia trie [8].
The Patricia trie for a dataset D′ is a modification of the standard trie: namely, each maximal chain of nodes v_1 → v_2 → · · · → v_k, where all v_i's have the same count value c and (except for v_k) exactly one child, is coalesced into a single node that inherits count value c and v_k's children, and stores the sequence of items previously stored in the v_i's. (A Patricia trie representation of a transaction dataset has recently been adopted in [6] in a dynamic setting where the dataset evolves with time, and on-line queries on frequencies of individual itemsets are supported.) The standard and Patricia tries for the sample dataset are compared in Figures 6 and 7, respectively. As the figures show, a Patricia trie may still retain some single-child nodes; however, these nodes identify boundaries of transactions that are prefixes of other transactions.

[Figure 6. Standard trie for the sample dataset]
[Figure 7. Patricia trie for the sample dataset]

The following theorem provides an upper bound on the overall size of the Patricia trie.

Theorem 1 A dataset D′ consisting of M transactions with aggregate size N can be represented through a Patricia trie of size at most N + O(M).

Proof. Consider the Patricia trie described before. The trie has fewer than 2M nodes, since each node which has either zero or one child accounts for (one or more) distinct transactions, and, by standard properties of trees, all other nodes number at most one less than the number of leaves. The theorem follows by noting that the total number of items stored at the nodes is at most N. □

It is important to remark that even for sparse datasets, which exhibit a moderate sharing of prefixes among transactions, the total number of items stored in the trie may turn out to be much less than N, and if the number of transactions is M ≪ N, as is often the case, the Patricia trie becomes very space efficient. To provide empirical evidence of this fact, Table 1 compares the space requirements of the representations based on arrays, the standard trie, and the Patricia trie, for the datasets introduced before, at some fixed support thresholds. For each dataset the table reports: the number of transactions, the average transaction size (AvgTS), the chosen support threshold (in percentage), and the sizes in bytes of the various representations (data are relative to datasets pruned of non-frequent items). An item is assumed to fit in one word (4 bytes). For the array-based representation we considered an overhead of 1 word for each transaction, while for the standard and Patricia tries we considered an overhead per node of 4 and 5 words, respectively, which are needed to store the count, the pointer to the father, and other information used by our algorithm (the extra word in each Patricia trie node is used to store the number of items at the node).

Dataset                  Transactions  AvgTS  min_sup %       Array         Trie     Patricia
Chess                           3,196  35.53         20     467,060      678,560      250,992
Connect-4                      67,557  31.79         60   8,861,312       69,060       55,212
Mushroom                        8,124  22.90          1     776,864      532,720      380,004
Pumsb                          49,046  33.48         60   6,765,568      711,800      349,180
Pumsb*                         49,046  37.26         20   7,506,220    5,399,120    2,177,044
T10.I4.D100k.N1k.L2k          100,000  10.10      0.002   4,440,908   14,294,760    5,129,212
T40.I10.D100k.N1k.L2k         100,000  39.54       0.25  16,217,064   71,134,380   16,935,176
T30.I16.D400k.N1k.L2k         397,487  29.30        0.5  48,175,824  163,079,980   41,023,616
POS                           515,597   6.51       0.01  15,497,908   32,395,740   13,993,508
WebView1                       59,601   2.48      0.054     831,156    1,110,960      618,292
WebView2                       77,512   4.62      0.004   1,742,516    4,547,380    1,998,316

Table 1. Space requirements (in bytes) of the array-based, standard trie, and Patricia trie representations

The data reported in the table show the substantial compression achieved by the Patricia trie with respect to the standard trie, especially in the case of sparse datasets. Also, the space required by the Patricia trie is comparable to, and often much less than, that of the simple array-based representation. In the few cases where the former is larger (T10.I4.D100k.N1k.L2k, T40.I10.D100k.N1k.L2k, and WebView2), the difference between the two is rather small (and can be further reduced through a more compact representation of the Patricia trie nodes). Furthermore, it must be observed that in the execution of the algorithm additional space is required to store the threaded lists connected to the IL. Initially, this space is proportional to the overall number of items appearing in the dataset representation, which is smaller for the Patricia trie due to the sharing of prefixes among transactions.
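The per-node bookkeeping assumed in Table 1 (count, pointer to the father, number of items, plus whatever the mining phase needs) can be visualized with a struct along the following lines; this is our own sketch of a plausible layout, not the authors' actual one, and the single t-list link shown is a simplification, since a node storing several items may in general have to appear in more than one threaded list.

    #include <stddef.h>

    /* One node of the Patricia trie: a maximal single-child chain of the
       standard trie, all with the same count, coalesced into one node that
       stores the whole item sequence of the chain. */
    typedef struct patricia_node {
        int                   *items;        /* coalesced sequence of items            */
        size_t                 num_items;    /* length of that sequence                */
        size_t                 count;        /* transactions sharing this prefix       */
        struct patricia_node  *parent;       /* used by the upward traversals          */
                                             /* performed while mining                 */
        struct patricia_node  *tlist_next;   /* simplified threading link for the      */
                                             /* t-lists of Section 3                   */
        struct patricia_node **children;     /* needed only while building the trie;   */
        size_t                 num_children; /* PatriciaMine reaches children through  */
                                             /* per-node hash tables freed afterwards  */
    } patricia_node_t;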
Construction of the Patricia trie

Although the Patricia trie provides a space-efficient data structure for representing D′, its actual construction may be rather costly, thus influencing the overall performance of the algorithm, especially if, as will be discussed later, the dataset is projected a number of times during the course of the algorithm. A natural construction strategy starts from an initially empty trie and inserts one transaction at a time into it. To insert a transaction t, the current trie is traversed downwards along the path that corresponds to the prefix shared by t with previously inserted transactions, suitably updating the count at each node, until either t is entirely covered, or a point in t is reached where the shared prefix ends. In the latter case, the remaining suffix is stored into a new node added as a child of the last node visited.

In order to efficiently search for the correct child of a node v during the downward traversal of the trie, we employ a hash table whose buckets store pointers to the children of v based on the first items they contain. (A similar idea was employed by the Apriori algorithm [3] in the hash tree.) The number of buckets in the hash table is chosen as a function of the number of children of the node, in order to strike a good trade-off between the space taken by the table and the search time. Moreover, since during the mining of the itemsets the trie is only traversed upwards, the space occupied by the hash tables can be freed after the trie is built.
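The construction strategy just described can be sketched as follows. For brevity the sketch inserts into a standard one-item-per-node trie and finds children by linear search; PatriciaMine additionally coalesces single-child chains into one node (which requires splitting a node when a new transaction diverges in the middle of its item sequence) and locates children through the per-node hash tables discussed above. The code is our own illustration, with hypothetical names, and omits error handling.

    #include <stdlib.h>

    typedef struct tnode {
        int            item;          /* single item (no coalescing here)       */
        size_t         count;         /* transactions whose prefix reaches here */
        struct tnode  *parent;
        struct tnode **children;
        size_t         num_children;
    } tnode_t;

    static tnode_t *new_node(int item, tnode_t *parent)
    {
        tnode_t *n = calloc(1, sizeof *n);
        n->item = item;
        n->parent = parent;
        return n;
    }

    /* Linear child lookup; the paper uses a hash table keyed on the first
       item of each child instead. */
    static tnode_t *find_child(tnode_t *v, int item)
    {
        for (size_t i = 0; i < v->num_children; i++)
            if (v->children[i]->item == item)
                return v->children[i];
        return NULL;
    }

    static tnode_t *add_child(tnode_t *v, int item)
    {
        v->children = realloc(v->children,
                              (v->num_children + 1) * sizeof *v->children);
        tnode_t *c = new_node(item, v);
        v->children[v->num_children++] = c;
        return c;
    }

    /* Insert one pruned transaction t (frequent items only, sorted by
       decreasing support) into the trie rooted at root: walk down the prefix
       shared with previously inserted transactions, updating counts, then
       append new nodes for the remaining suffix. */
    static void insert_transaction(tnode_t *root, const int *t, size_t len)
    {
        tnode_t *v = root;
        root->count++;                      /* the root counts |D'| */
        for (size_t i = 0; i < len; i++) {
            tnode_t *c = find_child(v, t[i]);
            if (c == NULL)
                c = add_child(v, t[i]);     /* shared prefix ends: start the suffix */
            c->count++;
            v = c;
        }
    }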
5. Optimizations

A number of optimizations have been introduced and tested in the implementation of the main strategy described in Section 3. In the following subsections, we will always make reference to a generic iteration of the while-loop of Figure 3 where a new frequent itemset X is generated in Step 8 after adding, in Step 6, item IL[ℓ].item. Also, we define as locally frequent items those items IL[j].item, with j < ℓ, whose support in D′_X is at least min_sup.

5.1. Projection of the dataset

After frequent itemset X has been generated, the discovery of all frequent supersets Y ⊃ X could proceed either on a physical projection of the dataset (i.e., a materialization of D′_X) and on a new IL, both restricted to the locally frequent items, or on the original dataset D′, with D′_X identified by means of the updated t-lists in the IL (in this case, a new IL or the original one can be used). The first approach, which was followed in FP-Growth [7], is effective if the new IL and D′_X shrink considerably. On the other hand, in the second approach, employed in Top-Down FP-Growth [14], no time and space overheads are incurred for building the projected datasets and maintaining in memory all of the projected datasets along a path of the FIST.

Ideally, one should implement a hybrid strategy allowing for physical projections only when they are beneficial. This was attempted in OpportuneProject [9], where physical projections are always performed when the dataset is represented as an array of transactions (and if sufficient memory is available), while they are inhibited when the dataset is represented through a trie, unless sufficient compression can be attained. However, in this latter case, no precise heuristic is provided to decide when physical projection must take place. In fact, the compression rate is rather hard to estimate without doing the actual projection, hence incurring high costs.

In our implementation, we experimented with several heuristics for limiting the number of projections. Although no heuristic was found superior to all others in every experiment, a rather simple heuristic exhibited very good performance in most cases: namely, to allow for physical projection only at the top s levels of the FIST and when the locally frequent items number at least k (in the experiments, s = 3 and k = 10 seemed to work fairly well). The rationale behind this heuristic is that the cost of projection is justified if the mining of the projected dataset goes on for long enough to take full advantage of the compression it achieves. Moreover, the heuristic limits the memory blowup by requiring at most s projected datasets to coexist in memory. Experimental results regarding the effectiveness of the heuristic will be presented and discussed in Section 6.1.

5.2. Immediate generation of subtrees of the FIST

Suppose that at the end of the for-loop every locally frequent item IL[j].item, with j < ℓ, has support IL[j].count = IL[ℓ].count = c in D′_X. Let Z denote the set of the locally frequent items. Then, for every Z′ ⊆ Z we have that X ∪ Z′ is frequent with support c. Therefore, we can immediately generate all of these itemsets and set ℓ = ℓ + 1, rather than resetting ℓ = 0 after the for-loop. (This optimization is inspired by the concept of closed frequent itemset [11], in the sense that only X ∪ Z is closed and would be generated when mining this type of itemsets.) Viewed on the FIST, this is equivalent to generating all nodes in the subtree rooted at the node associated with X, without actually exploring such a subtree. A similar optimization was incorporated in previous implementations, but limited to the case when the t-list(ℓ, D′_X) pointed to by IL[ℓ].ptr consists of a single node. Our condition is more general and encompasses also cases when t-list(ℓ, D′_X) has more than one node.
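A sketch of this shortcut (our own illustration, with a hypothetical emit() callback standing in for whatever the implementation does with a discovered itemset): once every locally frequent item has the same support c as X, all supersets X ∪ Z′ can be produced directly by enumerating the subsets of Z with a bitmask, without exploring the corresponding subtree of the FIST.

    #include <stddef.h>

    /* Hypothetical sink for a discovered frequent itemset and its support. */
    void emit(const int *itemset, size_t len, size_t support);

    /* X: the itemset just generated in Step 8, of length xlen and support c.
       Z: the locally frequent items, each of which also has support c in D'_X.
       Emits X ∪ Z' for every non-empty subset Z' of Z (X itself was already
       generated in Step 8).  Assumes zlen < 32 and xlen + zlen <= 256. */
    static void emit_subtree(const int *X, size_t xlen,
                             const int *Z, size_t zlen,
                             size_t c)
    {
        int buf[256];                        /* scratch itemset */
        for (size_t i = 0; i < xlen; i++)
            buf[i] = X[i];

        for (unsigned long mask = 1; mask < (1UL << zlen); mask++) {
            size_t len = xlen;
            for (size_t j = 0; j < zlen; j++)
                if (mask & (1UL << j))
                    buf[len++] = Z[j];       /* include Z[j] in this superset */
            emit(buf, len, c);               /* every such superset has support c */
        }
    }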
5.3. Implementation of the for-loop

Another important issue concerns the implementation of the for-loop (Step 9), which contributes a large fraction of the overall running time. By the invariant previously stated, we have that, before entering the for-loop, IL[ℓ].ptr points to the head of t-list(ℓ, D′_X), that is, it threads together all of the occurrences of IL[ℓ].item in nodes of the trie corresponding to transactions in D′_X. Moreover, the algorithm must ensure that the count of each such node is relative to D′_X and not to the entire dataset. Let T_X denote the portion of the trie whose leaves are threaded together by t-list(ℓ, D′_X).

The for-loop determines t-list(j, D′_X) for every 0 ≤ j < ℓ − 1, and updates IL[j].count to reflect the actual support of IL[j].item in D′_X. To do so, one could simply take each occurrence of IL[ℓ].item threaded by t-list(ℓ, D′_X) and walk up the trie, suitably updating the count of each node encountered, and the count and t-list of each item stored at the node. This is, essentially, the strategy implemented by Top-Down FP-Growth [14] and OpportuneProject (under the trie representation) [9]. However, it has the drawback of traversing every node v ∈ T_X multiple times, once for each leaf in v's subtree. It is not difficult to show an example where, with this approach, the number of node traversals is quadratic in the size of T_X.

In our implementation, we adopted an alternative strategy that, rather than traversing each individual leaf-root path in T_X, performs a global traversal from the leaves to the root guided by the entries of the IL which are being updated. In this fashion, each node in T_X is traversed only once. We refer to this strategy as the item-guided traversal. Specifically, the item-guided traversal starts by walking through the nodes threaded together in t-list(ℓ, D′_X). For each such node v, the count and t-list of each item IL[j].item stored in v, with j < ℓ, are updated, and v is inserted in t-list(j, D′_X) marked as visited. Also, the count and t-list of the last item, say IL[j′].item, stored in v's father u are updated, and u is inserted in t-list(j′, D′_X) marked as unvisited. After all nodes in t-list(ℓ, D′_X) have been dealt with, the largest index j < ℓ is found such that t-list(j, D′_X) contains some unvisited nodes (which can be conveniently positioned at the front of the list). Then, the item-guided traversal is iterated walking through the unvisited nodes in t-list(j, D′_X). It terminates when no threaded list is found that contains unvisited nodes (i.e., the top of the IL is reached). The following theorem is easily proved.

Theorem 2 The item-guided traversal correctly visits all nodes in T_X. Moreover, each such node with k direct children is touched k times and fully traversed exactly once.

6. Experimental results

This section presents the results of several experiments we performed on the datasets described in Section 2.1. Specifically, in Subsection 6.1 we assess the effectiveness of our implementation, while in Subsections 6.2 and 6.3 we compare the performance of PatriciaMine with that of other prominent algorithms. The experiments reported in the first two subsections have been conducted on an IBM RS/6000 SP multiprocessor, using a single 375 MHz POWER3-II processor, with 4 GB main memory and two 9.1 GB SCSI disks, under the AIX 4.3.3 operating system.
On this platform, running times as well as other relevant quantities (e.g., cache and TLB hits/misses) have been measured with hardware counters, accessed through the HPM performance monitor [5]. Since for OpportuneProject only the object code for a Windows platform was made available to us by the authors, the experiments in Subsection 6.3 have instead been performed on a 1.7 GHz Pentium IV PC, with 256 MB RAM and a 100 GB hard disk, under Windows 2000 Pro.

6.1. Effectiveness of the heuristic for conditional projection

A first set of experiments was run to verify whether allowing for physical projections of the dataset improves performance, and whether the heuristic we implemented to decide when to physically project the dataset is effective. The results of the experiments are reported in Figures 8 and 9 (running times do not include the output of the frequent itemsets). For each dataset, we compared the performance of PatriciaMine using the heuristic (line "WithProjection") with the performance of a version of PatriciaMine where physical projection is inhibited (line "WithoutProjection"), on four different values of support, indicated in percentage. It is seen that the heuristic yields performance improvements, often very substantial, at low support values (e.g., see Connect-4, Pumsb*, WebView1/2, T30.I16.D400k.N1k.L2k, and T40.I10.D100k.N1k.L2k), while it often has no effect or incurs a slight slowdown at higher supports. This can be explained by the fact that at high supports the FIST is shallow and the projection overhead cannot be easily hidden by the subsequent computation. Note that the case of Pos is anomalous. For this dataset the heuristic, and in fact all of the heuristics we tested, slowed down the execution, hence suggesting that physical projection is never beneficial. This case, however, will be further investigated.

We also tested the speed-up achieved by immediately generating all supersets of a certain frequent itemset X when the locally frequent items have the same support as X. In particular, we observed that the novelty introduced in our implementation, that is, considering also those cases when the threading list t-list(ℓ, D′_X) consists of more than one node, yielded a noticeable performance improvement (e.g., a factor 1.4 speed-up was achieved on WebView1 with support 0.054%, and a factor 1.6 speed-up was achieved on WebView2 with support 0.004%).

We finally compared the effectiveness of the implementation of the for-loop of Figure 3 based on the novel item-guided traversal with respect to the straightforward one. Although the item-guided traversal is provably superior in an asymptotic worst-case sense (e.g., see Theorem 2 and the discussion in Section 5.3), the experiments provided mixed results. For all dense datasets and for Pos, the item-guided traversal turned out to be faster than the straightforward one by up to a factor 1.5 (e.g., for Mushroom with support 5%), while for sparse datasets it was actually slower by a factor of at most 1.2. This can be partly explained by noting that if the tree to be traversed is skinny (as is probably the case for the sparse datasets, except for Pos), the item-guided traversal cannot provide a substantial improvement, while it suffers a slight overhead for the scan of the IL. Moreover, for some sparse datasets, we observed that while the item-guided traversal performs a smaller number of instructions, it exhibits less locality (e.g., it incurs more TLB misses), which causes the higher running time.
We conjecture that a refined implementation could make the item-guided traversal competitive even for sparse datasets.

[Figure 8. Comparison between PatriciaMine with and without projection on Chess, Mushroom, Pumsb, Pumsb*, Connect-4, and Pos (running time in seconds vs. support)]

[Figure 9. Comparison between PatriciaMine with and without projection on WebView1, WebView2, and some artificial datasets (running time in seconds vs. support)]

6.2. Comparison with other algorithms

In this subsection, we compare PatriciaMine with other prominent algorithms whose source code was made available to us: namely FP-Growth [7], which has been mentioned before, DCI [10], and Eclat [15]. DCI (Direct Count & Intersect) performs a breadth-first exploration of the FIST, generating a set of candidate itemsets for each level, computing their support, and then determining the frequent ones. It employs two alternative representations for the dataset, a horizontal and a vertical one, and, respectively, a count-based and an intersection-based method to compute the supports, switching adaptively from one to the other based on the characteristics of the dataset. Eclat, instead, is based on a depth-first exploration strategy (like FP-Growth and PatriciaMine). It employs a vertical representation of the dataset, which stores with each item the list of transaction IDs (TID-list) where it occurs, and determines an itemset's support through TID-list intersections. The counting mechanism was subsequently improved in dEclat [16] by using diffsets, that is, differences between TID-lists, in order to avoid managing very long TID-lists.
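As a point of reference for the vertical approach, the following is a generic sketch of a TID-list intersection (ours, not taken from the Eclat code we used): the length of the intersection of the TID-lists of two itemsets is the support of their union. dEclat's diffsets replace the lists with differences to keep them short; that refinement is not shown here.

    #include <stddef.h>

    /* Intersect two sorted TID-lists.  The return value (the length of the
       intersection) is the support of the combined itemset; out may be NULL
       if only the support is needed. */
    static size_t tidlist_intersect(const unsigned *a, size_t alen,
                                    const unsigned *b, size_t blen,
                                    unsigned *out)
    {
        size_t i = 0, j = 0, k = 0;
        while (i < alen && j < blen) {
            if (a[i] < b[j])        i++;
            else if (a[i] > b[j])   j++;
            else {                                /* common transaction ID */
                if (out) out[k] = a[i];
                k++; i++; j++;
            }
        }
        return k;
    }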
For FP-Growth and Eclat, we used the source code developed by Goethals (available at http://www.cs.helsinki.fi/u/goethals), while for DCI we obtained the source code directly from the authors. The implementation of Eclat we employed includes the use of diffsets. The experimental results are reported in Figures 10 and 11. For each dataset, a graph shows the running times achieved by the algorithms on four support values, indicated in percentages. (Here we included the output time, since for DCI the writing of the frequent itemsets to file is functional to the algorithm's operation.) It is easily seen that the performance of PatriciaMine is significantly superior to that of Eclat and FP-Growth on all datasets and supports. We also observed that Eclat features higher locality than FP-Growth, exhibiting in some cases a better running time, though performing a larger number of instructions.

Compared to DCI, PatriciaMine is consistently and often substantially faster at low values of support, while at higher supports, where execution time is in the order of a few seconds, the two algorithms exhibit similar performance and sometimes PatriciaMine is slightly slower, probably due to the trie construction overhead. However, it must be remarked that small differences between DCI and PatriciaMine at low execution times could also be due to the different format required of the initial dataset, and the different input/output functions employed by the two algorithms.

[Figure 10. Comparison of PatriciaMine, DCI, Eclat and FP-Growth on Chess, Mushroom, Pumsb, Pumsb*, Connect-4, and Pos (running time in seconds vs. support)]

[Figure 11. Comparison of PatriciaMine, DCI, Eclat and FP-Growth on WebView1, WebView2, and some artificial datasets (running time in seconds vs. support)]

6.3. Comparison with OpportuneProject

Particularly relevant for our work is the comparison between PatriciaMine and OpportuneProject [9], which, to the best of our knowledge, represents the latest and most advanced algorithm in the family stemming from FP-Growth. For lack of space, we postpone a detailed and critical discussion of the strengths and weaknesses of the two algorithms to the full version of the paper.

Figures 12 and 13 report the performance exhibited by PatriciaMine and OpportuneProject on the Pentium/Windows platform for a number of datasets and supports. It can be seen that the performance of PatriciaMine is consistently superior, by up to one order of magnitude (e.g., on Pumsb*). The only exceptions are Pos (see the graph labelled "Pos with projection") and the artificial dataset T30.I16.D400k.N1k.L2k. For Pos, we have already observed that our heuristic for limiting the number of physical projections does not improve the running time. In fact, it is interesting to note that by inhibiting projections, PatriciaMine becomes faster than OpportuneProject (see the graph labelled "Pos without projection"). This suggests that a better heuristic could eliminate this anomalous case. As for T30.I16.D400k.N1k.L2k, some measurements we performed revealed that the time taken by the initialization of the Patricia trie accounts for a significant fraction of the running time at high support thresholds, and such an initial overhead cannot be hidden by the subsequent mining activity. However, at lower support thresholds, where the computation of the frequent itemsets dominates over the trie construction, PatriciaMine becomes faster than OpportuneProject.

[Figure 12. Comparison of PatriciaMine and OpportuneProject on Chess, Mushroom, Pumsb, Pumsb*, Connect-4, and T30I16D400k (running time in seconds vs. support)]

[Figure 13. Comparison of PatriciaMine and OpportuneProject on Pos (with and without projection), WebView1, and WebView2 (running time in seconds vs. support)]
Finally, we report that on WebView1 with absolute support 32 (about 0.054%), OpportuneProject ran out of memory, while PatriciaMine successfully completed the execution.

References

[1] R. Agrawal, C. Aggarwal, and V. Prasad. A tree projection algorithm for generation of frequent itemsets. Journal of Parallel and Distributed Computing, 61(3):350–371, 2001.
[2] R. Agrawal, T. Imielinski, and A. Swami. Mining association rules between sets of items in large databases. In Proc. of the ACM SIGMOD Intl. Conference on Management of Data, pages 207–216, 1993.
[3] R. Agrawal and R. Srikant. Fast algorithms for mining association rules. In Proc. of the 20th Very Large Data Base Conference, pages 487–499, 1994.
[4] R. Bayardo. Efficiently mining long patterns from databases. In Proc. of the ACM SIGMOD Intl. Conference on Management of Data, pages 85–93, 1998.
[5] L. DeRose. Hardware Performance Monitor (HPM) toolkit, version 2.3.1. Technical report, Advanced Computer Technology Center, Nov. 2001.
[6] A. Hafez, J. Deogun, and V. Raghavan. The item-set tree: A data structure for data mining. In Proc. of the 1st Int. Conference on Data Warehousing and Knowledge Discovery, LNCS 1676, pages 183–192, 1999.
[7] J. Han, J. Pei, and Y. Yin. Mining frequent patterns without candidate generation. In Proc. of the ACM SIGMOD Intl. Conference on Management of Data, pages 1–12, 2000.
[8] D. Knuth. The Art of Computer Programming, volume 3: Sorting and Searching. Addison Wesley, Reading, MA, 1973.
[9] J. Liu, Y. Pan, K. Wang, and J. Han. Mining frequent item sets by opportunistic projection. In Proc. of the 8th ACM SIGKDD Intl. Conference on Knowledge Discovery and Data Mining, pages 229–238, July 2002.
[10] S. Orlando, P. Palmerini, R. Perego, and F. Silvestri. Adaptive resource-aware mining of frequent sets. In Proc. of the IEEE Intl. Conference on Data Mining, pages 338–345, Dec. 2002.
[11] N. Pasquier, Y. Bastide, R. Taouil, and L. Lakhal. Discovering frequent closed itemsets for association rules. In Proc. of the 7th Int. Conference on Database Theory, pages 398–416, Jan. 1999.
[12] J. Pei, J. Han, H. Lu, S. Nishio, S. Tang, and D. Yang. H-mine: Hyper-structure mining of frequent patterns in large databases. In Proc. of the IEEE Intl. Conference on Data Mining, pages 441–448, 2001.
[13] A. Savasere, E. Omiecinski, and S. Navathe. An efficient algorithm for mining association rules in large databases. In Proc. of the 21st Very Large Data Base Conference, pages 432–444, Sept. 1995.
[14] K. Wang, L. Tang, J. Han, and J. Liu. Top Down FP-Growth for association rule mining. In Proc. of the 6th Pacific-Asia Conf. on Advances in Knowledge Discovery and Data Mining, LNCS 2336, pages 334–340, May 2002.
[15] M. Zaki. Scalable algorithms for association mining. IEEE Trans. on Knowledge and Data Engineering, 12(3):372–390, May–June 2000.
[16] M. Zaki and K. Gouda. Fast vertical mining using diffsets. In Proc. of the 9th ACM SIGKDD Intl. Conference on Knowledge Discovery and Data Mining, Aug. 2003.