Abstract
Mining frequent itemsets is a popular method for finding associated items in databases. For this method, support, the co-occurrence frequency of the items which form an association, is used as the primary indicator of the association's significance. A single, user-specified support threshold is used to decide whether associations should be further investigated. Support has some known problems: it handles rare items poorly, favors shorter itemsets, and sometimes produces misleading associations.
In this paper we develop a novel model-based frequency constraint as an alternative to a single, user-specified minimum support. The constraint utilizes knowledge of the process generating transaction data by applying a simple stochastic mixture model (the NB model) which accounts for the typically highly skewed item frequency distribution of transaction data. A user-specified precision threshold is used together with the model to find local frequency thresholds for groups of itemsets. Based on the constraint we develop the notion of NB-frequent itemsets and adapt a mining algorithm to find all NB-frequent itemsets in a database. In experiments with publicly available transaction databases we show that the new constraint provides improvements over a single minimum support threshold and that the precision threshold is more robust and easier for the user to set and interpret.
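To make the baseline concrete, the following is a minimal sketch of support counting with a single global minimum support threshold, the approach the abstract contrasts against. It is not the NB-frequent method of the paper. The function names (`support`, `frequent_itemsets`), the toy transactions, and the threshold of 0.4 are all assumptions chosen for illustration, and the brute-force enumeration is only meant for tiny inputs.

```python
from itertools import combinations


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    items = frozenset(itemset)
    return sum(items <= t for t in transactions) / len(transactions)


def frequent_itemsets(transactions, min_support, max_size=2):
    """Brute-force search for itemsets meeting one global support threshold.

    Illustrative only: real miners prune the search space instead of
    enumerating every candidate.
    """
    all_items = sorted(set().union(*transactions))
    result = {}
    for size in range(1, max_size + 1):
        for candidate in combinations(all_items, size):
            s = support(candidate, transactions)
            if s >= min_support:
                result[candidate] = s
    return result


# Hypothetical toy database: each transaction is a set of items.
transactions = [
    frozenset({"bread", "milk"}),
    frozenset({"bread", "milk", "butter"}),
    frozenset({"bread", "milk"}),
    frozenset({"bread"}),
    frozenset({"caviar", "champagne"}),
]

frequent = frequent_itemsets(transactions, min_support=0.4)
for itemset, s in frequent.items():
    print(itemset, s)
```

In this toy data, caviar and champagne always occur together, yet their joint support of 0.2 falls below the threshold, so the pair is discarded. This is one way the rare item problem mentioned above can show up under a single global threshold.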







Notes
Although the artificial data sets in this paper and in Zheng et al. (2001) were produced using the same generator (available at http://www.almaden.ibm.com/software/quest/Resources/), there are minimal variations due to differences in the initialization of the random number generator.
We used a machine with two Intel Xeon processors (2.4 GHz) running Linux (Debian Sarge). The algorithm was implemented in Java and compiled using the GNU ahead-of-time compiler gcj version 3.3.5. CPU time was recorded using the time command, and we report the sum of user and system time.
Available at http://fuzzy.cs.uni-magdeburg.de/~borgelt/software.html
References
Agarwal, R.C., Aggarwal, C.C., and Prasad, V.V.V. 2000. Depth first generation of long patterns. In Proceedings of the 6th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 108–118.
Agrawal, R., Imielinski, T., and Swami, A. 1993. Mining association rules between sets of items in large databases. In Proceedings of the ACM SIGMOD International Conference on Management of Data, Washington, DC, pp. 207–216.
Agrawal, R. and Srikant, R. 1994. Fast algorithms for mining association rules in large databases. In Proceedings of the 20th International Conference on Very Large Data Bases, J.B. Bocca, M. Jarke, C. Zaniolo (Eds.) Santiago, Chile, pp. 487–499.
Borgelt, C. 2003. Efficient implementations of Apriori and Eclat. In Proceedings of the IEEE ICDM Workshop on Frequent Itemset Mining Implementations, B. Goethals, M.J. Zaki (Eds.) Melbourne, FL, USA.
Brin, S., Motwani, R., Ullman, J.D., and Tsur, S. 1997. Dynamic itemset counting and implication rules for market basket data. In Proceedings of the ACM SIGMOD International Conference on Management of Data, Tucson, Arizona, USA, pp. 255–264.
Creighton, C. and Hanash, S. 2003. Mining gene expression databases for association rules. Bioinformatics, 19(1):79–86.
Dempster, A.P., Laird, N.M., and Rubin, D.B. 1977. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B (Methodological), 39:1–38.
DuMouchel, W. and Pregibon, D. 2001. Empirical Bayes screening for multi-item associations. In Proceedings of the 7th ACM SIGKDD International Conference on Knowledge Discovery in Databases and Data Mining, F. Provost, R. Srikant (Eds.) ACM Press, pp. 67–76.
Geyer-Schulz, A., Hahsler, M., and Jahn, M. 2002. A customer purchase incidence model applied to recommender systems. In WEBKDD 2001—Mining Log Data Across All Customer Touch Points, Third International Workshop, San Francisco, CA, USA, August 26, 2001, Revised Papers, Lecture Notes in Computer Science LNAI 2356, R. Kohavi, B. Masand, M. Spiliopoulou, J. Srivastava (Eds.) Springer-Verlag, pp. 25–47.
Geyer-Schulz, A., Hahsler, M., Neumann, A., and Thede, A. 2003. Behavior-based recommender systems as value-added services for scientific libraries. In Statistical Data Mining & Knowledge Discovery, H. Bozdogan (Ed.) Chapman & Hall/CRC, pp. 433–454.
Han, J., Pei, J., Yin, Y., and Mao, R. 2004. Mining frequent patterns without candidate generation. Data Mining and Knowledge Discovery, 8:53–87.
Johnson, N.L., Kotz, S., and Kemp, A.W. 1993. Univariate Discrete Distributions, 2nd edn. New York: John Wiley & Sons.
Kohavi, R., Brodley, C., Frasca, B., Mason, L., and Zheng, Z. 2000. KDD-Cup 2000 organizers’ report: Peeling the onion. SIGKDD Explorations, 2(2):86–98.
Kohavi, R. and Provost, F. 1998. Glossary of terms. Machine Learning, 30(2–3):271–274.
Liu, B., Hsu, W., and Ma, Y. 1999. Mining association rules with multiple minimum supports. In Proceedings of the 5th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 337–341.
Luo, J. and Bridges, S. 2000. Mining fuzzy association rules and fuzzy frequency episodes for intrusion detection. International Journal of Intelligent Systems, 15(8):687–703.
Mannila, H., Toivonen, H., and Verkamo, A.I. 1994. Efficient algorithms for discovering association rules. In AAAI Workshop on Knowledge Discovery in Databases, U.M. Fayyad, R. Uthurusamy (Eds.) Seattle, Washington: AAAI Press, pp. 181–192.
Omiecinski, E.R. 2003. Alternative interest measures for mining associations in databases. IEEE Transactions on Knowledge and Data Engineering, 15(1):57–69.
Pei, J., Han, J., and Lakshmanan, L.V. 2001. Mining frequent itemsets with convertible constraints. In Proceedings of the 17th International Conference on Data Engineering, April 02–06, 2001, Heidelberg, Germany, pp. 433–442.
Provost, F. and Fawcett, T. 1997. Analysis and visualization of classifier performance: Comparison under imprecise class and cost distributions. In Proceedings of the 3rd International Conference on Knowledge Discovery and Data Mining, D. Heckerman, H. Mannila, D. Pregibon (Eds.) Newport Beach, CA: AAAI Press, pp. 43–48.
Seno, M. and Karypis, G. 2001. LPMiner: An algorithm for finding frequent itemsets using length decreasing support constraint. In Proceedings of the 2001 IEEE International Conference on Data Mining, 29 November–2 December 2001, N. Cercone, T.Y. Lin, X. Wu (Eds.) San Jose, California, USA: IEEE Computer Society, pp. 505–512.
Silverstein, C., Brin, S., and Motwani, R. 1998. Beyond market baskets: Generalizing association rules to dependence rules. Data Mining and Knowledge Discovery, 2:39–68.
Srivastava, J., Cooley, R., Deshpande, M., and Tan, P.-N. 2000. Web usage mining: Discovery and applications of usage patterns from web data. SIGKDD Explorations, 1(2):12–23.
Xiong, H., Tan, P.-N., and Kumar, V. 2003. Mining strong affinity association patterns in data sets with skewed support distribution. In Proceedings of the IEEE International Conference on Data Mining, November 19–22, 2003, B. Goethals, M.J. Zaki (Eds.) Melbourne, Florida, pp. 387–394.
Zheng, Z., Kohavi, R., and Mason, L. 2001. Real world performance of association rule algorithms. In Proceedings of the 7th ACM SIGKDD International Conference on Knowledge Discovery in Databases and Data Mining, F. Provost, R. Srikant (Eds.) ACM Press, pp. 401–406.
Cite this article
Hahsler, M. A Model-Based Frequency Constraint for Mining Associations from Transaction Data. Data Min Knowl Disc 13, 137–166 (2006). https://doi.org/10.1007/s10618-005-0026-2
DOI: https://doi.org/10.1007/s10618-005-0026-2