A Model-Based Frequency Constraint for Mining Associations from Transaction Data

Hahsler, Michael

doi:10.1007/s10618-005-0026-2


Published: 12 May 2006

Data Mining and Knowledge Discovery, Volume 13, pages 137–166 (2006)

Michael Hahsler¹


Abstract

Mining frequent itemsets is a popular method for finding associated items in databases. For this method, support, the co-occurrence frequency of the items which form an association, is used as the primary indicator of the association's significance. A single, user-specified support threshold is used to decide if associations should be further investigated. Support has some known problems with rare items, favors shorter itemsets, and sometimes produces misleading associations.
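As a rough illustration of the support measure described above (not the algorithm developed in the paper), the following minimal Python sketch counts relative support by brute force and applies a single minimum support threshold. The function names, the `max_size` parameter, and the toy transactions are all hypothetical and chosen only for this example.

```python
from collections import Counter
from itertools import combinations


def itemset_supports(transactions, max_size=2):
    """Return the relative support of every itemset with up to max_size items."""
    counts = Counter()
    for transaction in transactions:
        # Sort the distinct items so each itemset has one canonical tuple form.
        items = sorted(set(transaction))
        for size in range(1, max_size + 1):
            for itemset in combinations(items, size):
                counts[itemset] += 1
    n = len(transactions)
    return {itemset: count / n for itemset, count in counts.items()}


def frequent_itemsets(transactions, min_support, max_size=2):
    """Keep only itemsets whose support reaches the single global threshold."""
    supports = itemset_supports(transactions, max_size)
    return {s: v for s, v in supports.items() if v >= min_support}


# Hypothetical toy data: caviar and champagne always occur together but rarely.
transactions = [
    ["bread", "milk"],
    ["bread", "butter", "milk"],
    ["bread", "butter"],
    ["milk"],
    ["caviar", "champagne"],
]
frequent = frequent_itemsets(transactions, min_support=0.4)
```

In this toy data the pair (caviar, champagne) co-occurs in every transaction that contains either item, yet its support of 0.2 falls below the threshold and it is discarded. This is the kind of rare-item problem the abstract refers to.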

In this paper we develop a novel model-based frequency constraint as an alternative to a single, user-specified minimum support. The constraint utilizes knowledge of the process generating transaction data by applying a simple stochastic mixture model (the NB model) which accommodates the typically highly skewed item frequency distribution of transaction data. A user-specified precision threshold is used together with the model to find local frequency thresholds for groups of itemsets. Based on the constraint, we develop the notion of NB-frequent itemsets and adapt a mining algorithm to find all NB-frequent itemsets in a database. In experiments with publicly available transaction databases, we show that the new constraint provides improvements over a single minimum support threshold and that the precision threshold is more robust and easier for the user to set and interpret.


Notes

Although the artificial data sets in this paper and in Zheng et al. (2001) were produced using the same generator (available at http://www.almaden.ibm.com/software/quest/Resources/), there are minimal variations due to differences in the initialization of the random number generator.
We used a machine with two Intel Xeon processors (2.4 GHz) running Linux (Debian Sarge). The algorithm was implemented in Java and compiled using the GNU ahead-of-time compiler gcj version 3.3.5. CPU time was recorded using the time command, and we report the sum of user and system time.
Available at http://fuzzy.cs.uni-magdeburg.de/~borgelt/software.html

References

Agarwal, R.C., Aggarwal, C.C., and Prasad, V.V.V. 2000. Depth first generation of long patterns. In Proceedings of the 6th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 108–118.
Agrawal, R., Imielinski, T., and Swami, A. 1993. Mining association rules between sets of items in large databases. In Proceedings of the ACM SIGMOD International Conference on Management of Data, Washington, DC, pp. 207–216.
Agrawal, R. and Srikant, R. 1994. Fast algorithms for mining association rules in large databases. In Proceedings of the 20th International Conference on Very Large Data Bases, J. B. Bocca, M. Jarke, C. Zaniolo (Eds.) Santiago, Chile, pp. 487–499.
Borgelt, C. 2003. Efficient implementations of apriori and eclat. In Proceedings of the IEEE ICDM Workshop on Frequent Itemset Mining Implementations, B. Goethals, M.J. Zaki (Eds.) Melbourne, FL, USA.
Brin, S., Motwani, R., Ullman, J.D., and Tsur, S. 1997. Dynamic itemset counting and implication rules for market basket data. In Proceedings of the ACM SIGMOD International Conference on Management of Data, Tucson, Arizona, USA, pp. 255–264.
Creighton, C. and Hanash, S. 2003. Mining gene expression databases for association rules. Bioinformatics, 19(1):79–86.
Dempster, A.P., Laird, N.M., and Rubin, D.B. 1977. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B (Methodological), 39:1–38.
DuMouchel, W. and Pregibon, D. 2001. Empirical Bayes screening for multiitem associations. In Proceedings of the 7th ACM SIGKDD International Conference on Knowledge Discovery in Databases and Data Mining, F. Provost, R. Srikant (Eds.) ACM Press, pp. 67–76.
Geyer-Schulz, A., Hahsler, M., and Jahn, M. 2002. A customer purchase incidence model applied to recommender systems. In WEBKDD 2001—Mining Log Data Across All Customer Touch Points, Third International Workshop, San Francisco, CA, USA, August 26, 2001, Revised Papers, Lecture Notes in Computer Science LNAI 2356, R. Kohavi, B. Masand, M. Spiliopoulou, J. Srivastava (Eds.) Springer-Verlag, pp. 25–47.
Geyer-Schulz, A., Hahsler, M., Neumann, A., and Thede, A. 2003. Behavior-based recommender systems as value-added services for scientific libraries. In Statistical Data Mining & Knowledge Discovery, H. Bozdogan (Ed.) Chapman & Hall/CRC, pp. 433–454.
Han, J., Pei, J., Yin, Y., and Mao, R. 2004. Mining frequent patterns without candidate generation. Data Mining and Knowledge Discovery, 8:53–87.
Johnson, N.L., Kotz, S., and Kemp, A.W. 1993. Univariate Discrete Distributions, 2nd edn. New York: John Wiley & Sons.
Kohavi, R., Brodley, C., Frasca, B., Mason, L., and Zheng, Z. 2000. KDD-Cup 2000 organizers’ report: Peeling the onion. SIGKDD Explorations, 2(2):86–98.
Kohavi, R. and Provost, F. 1998. Glossary of terms. Machine Learning, 30(2–3):271–274.
Liu, B., Hsu, W., and Ma, Y. 1999. Mining association rules with multiple minimum supports. In Proceedings of the 5th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 337–341.
Luo, J. and Bridges, S. 2000. Mining fuzzy association rules and fuzzy frequency episodes for intrusion detection. International Journal of Intelligent Systems, 15(8):687–703.
Mannila, H., Toivonen, H., and Verkamo, A.I. 1994. Efficient algorithms for discovering association rules. In AAAI Workshop on Knowledge Discovery in Databases, U.M. Fayyad, R. Uthurusamy (Eds.) Seattle, Washington: AAAI Press, pp. 181–192.
Omiecinski, E.R. 2003. Alternative interest measures for mining associations in databases. IEEE Transactions on Knowledge and Data Engineering, 15(1):57–69.
Pei, J., Han, J., and Lakshmanan, L.V. 2001. Mining frequent itemsets with convertible constraints. In Proceedings of the 17th International Conference on Data Engineering, April 02–06, 2001, Heidelberg, Germany, pp. 433–442.
Provost, F. and Fawcett, T. 1997. Analysis and visualization of classifier performance: Comparison under imprecise class and cost distributions. In Proceedings of the 3rd International Conference on Knowledge Discovery and Data Mining, D. Heckerman, H. Mannila, D. Pregibon (Eds.) Newport Beach, CA: AAAI Press, pp. 43–48.
Seno, M. and Karypis, G. 2001. Lpminer: An algorithm for finding frequent itemsets using length decreasing support constraint. In Proceedings of the 2001 IEEE International Conference on Data Mining, 29 November–2 December 2001, N. Cercone, T.Y. Lin, X. Wu (Eds.) San Jose, California, USA: IEEE Computer Society, pp. 505–512.
Silverstein, C., Brin, S., and Motwani, R. 1998. Beyond market baskets: Generalizing association rules to dependence rules. Data Mining and Knowledge Discovery, 2:39–68.
Srivastava, J., Cooley, R., Deshpande, M., and Tan, P.-N. 2000. Web usage mining: Discovery and applications of usage patterns from web data. SIGKDD Explorations, 1(2):12–23.
Xiong, H., Tan, P.-N., and Kumar, V. 2003. Mining strong affinity association patterns in data sets with skewed support distribution. In Proceedings of the IEEE International Conference on Data Mining, November 19–22, 2003, B. Goethals, M.J. Zaki (Eds.) Melbourne, Florida, pp. 387–394.
Zheng, Z., Kohavi, R., and Mason, L. 2001. Real world performance of association rule algorithms. In Proceedings of the 7th ACM SIGKDD International Conference on Knowledge Discovery in Databases and Data Mining, F. Provost, R. Srikant (Eds.) ACM Press, pp. 401–406.


Author information

Authors and Affiliations

Vienna University of Economics and Business Administration, Vienna, Austria
Michael Hahsler


Corresponding author

Correspondence to Michael Hahsler.


About this article

Cite this article

Hahsler, M. A Model-Based Frequency Constraint for Mining Associations from Transaction Data. Data Min Knowl Disc 13, 137–166 (2006). https://doi.org/10.1007/s10618-005-0026-2


Received: 19 April 2005
Accepted: 20 October 2005
Published: 12 May 2006
Issue Date: September 2006
DOI: https://doi.org/10.1007/s10618-005-0026-2
