A Model-Based Frequency Constraint for Mining Associations from Transaction Data

Data Mining and Knowledge Discovery

Abstract

Mining frequent itemsets is a popular method for finding associated items in databases. For this method, support, the co-occurrence frequency of the items which form an association, is used as the primary indicator of the association's significance. A single, user-specified support threshold is used to decide whether associations should be further investigated. Support has some known problems: it handles rare items poorly, favors shorter itemsets, and sometimes produces misleading associations.
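
As a rough illustration of the support measure and the single minimum support threshold described above, the following Python sketch counts itemsets by brute force over a toy database. The function names, the toy transactions, and the threshold of 0.5 are assumptions made for this example only; they are not taken from the paper, and the enumeration is not an efficient mining algorithm.

```python
from collections import Counter
from itertools import combinations


def itemset_supports(transactions, max_len=2):
    """Return the relative support of every itemset with up to max_len items."""
    counts = Counter()
    for t in transactions:
        items = sorted(set(t))  # ignore duplicates, fix a canonical item order
        for size in range(1, max_len + 1):
            for itemset in combinations(items, size):
                counts[itemset] += 1
    n = len(transactions)
    return {itemset: c / n for itemset, c in counts.items()}


def frequent_itemsets(transactions, min_support, max_len=2):
    """Keep only itemsets whose support reaches the single global threshold."""
    supports = itemset_supports(transactions, max_len)
    return {s: v for s, v in supports.items() if v >= min_support}


# Hypothetical toy database: four market baskets.
transactions = [
    ["bread", "milk"],
    ["bread", "butter", "milk"],
    ["bread", "butter"],
    ["caviar", "champagne"],
]

frequent = frequent_itemsets(transactions, min_support=0.5)
for itemset, support in sorted(frequent.items()):
    print(itemset, support)
```

In this toy data, caviar and champagne always occur together, yet the pair is discarded because both items are rare. This is the kind of rare-item problem that a single global threshold can cause.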

In this paper, we develop a novel model-based frequency constraint as an alternative to a single, user-specified minimum support. The constraint utilizes knowledge of the process generating transaction data by applying a simple stochastic mixture model (the NB model) which allows for the typically highly skewed item frequency distribution of transaction data. A user-specified precision threshold is used together with the model to find local frequency thresholds for groups of itemsets. Based on the constraint, we develop the notion of NB-frequent itemsets and adapt a mining algorithm to find all NB-frequent itemsets in a database. In experiments with publicly available transaction databases, we show that the new constraint provides improvements over a single minimum support threshold and that the precision threshold is more robust and easier for the user to set and interpret.
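
The abstract names the NB model without giving its form. As a hedged sketch, the code below assumes the standard negative binomial distribution that arises when Poisson occurrence counts are mixed over gamma-distributed rates, with shape `k` and scale `a`. The parameterization, the function name `nb_pmf`, and the values `k = 0.5` and `a = 20` are assumptions for illustration, not values from the paper. It shows only why such a model can describe a highly skewed item frequency distribution; it does not implement the NB-frequent constraint itself.

```python
import math


def nb_pmf(r, k, a):
    """P(R = r) for a gamma-Poisson mixture (negative binomial).

    Assumed parameterization: Poisson rates follow a gamma distribution
    with shape k and scale a, so the mean count is k * a.
    """
    log_p = (
        math.lgamma(k + r) - math.lgamma(r + 1) - math.lgamma(k)
        - k * math.log1p(a)                      # (1 + a) ** -k
        + r * (math.log(a) - math.log1p(a))      # (a / (1 + a)) ** r
    )
    return math.exp(log_p)


# Hypothetical parameters: a small shape value gives a strongly skewed distribution.
k, a = 0.5, 20.0
pmf = [nb_pmf(r, k, a) for r in range(5000)]

total = sum(pmf)                                  # should be close to 1
mean = sum(r * p for r, p in enumerate(pmf))      # should be close to k * a

print(f"P(R = 0) = {pmf[0]:.3f}, P(R = 10) = {pmf[10]:.3f}, mean = {mean:.2f}")
```

With a shape parameter below 1, most of the probability mass sits at very low counts while the long tail keeps the mean much higher, which is the skew the abstract refers to.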

Notes

  1. Although the artificial data sets in this paper and in Zheng et al. (2001) were produced using the same generator (available at http://www.almaden.ibm.com/software/quest/Resources/), there are minimal variations due to differences in how the random number generator was initialized.

  2. We used a machine with two Intel Xeon processors (2.4 GHz) running Linux (Debian Sarge). The algorithm was implemented in Java and compiled using the GNU ahead-of-time compiler gcj version 3.3.5. CPU time was recorded using the time command, and we report the sum of user and system time.

  3. Available at http://fuzzy.cs.uni-magdeburg.de/~borgelt/software.html

References

  • Agarwal, R.C., Aggarwal, C.C., and Prasad, V.V.V. 2000. Depth first generation of long patterns. In Proceedings of the 6th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 108–118.

  • Agrawal, R., Imielinski, T., and Swami, A. 1993. Mining association rules between sets of items in large databases. In Proceedings of the ACM SIGMOD International Conference on Management of Data, Washington, DC, pp. 207–216.

  • Agrawal, R. and Srikant, R. 1994. Fast algorithms for mining association rules in large databases. In Proceedings of the 20th International Conference on Very Large Data Bases, J. B. Bocca, M. Jarke, C. Zaniolo (Eds.) Santiago, Chile, pp. 487–499.

  • Borgelt, C. 2003. Efficient implementations of apriori and eclat. In Proceedings of the IEEE ICDM Workshop on Frequent Itemset Mining Implementations, B. Goethals, M.J. Zaki (Eds.) Melbourne, FL, USA.

  • Brin, S., Motwani, R., Ullman, J.D., and Tsur, S. 1997. Dynamic itemset counting and implication rules for market basket data. In Proceedings of the ACM SIGMOD International Conference on Management of Data, Tucson, Arizona, USA, pp. 255–264.

  • Creighton, C. and Hanash, S. 2003. Mining gene expression databases for association rules. Bioinformatics, 19(1):79–86.

  • Dempster, A.P., Laird, N.M., and Rubin, D.B. 1977. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B (Methodological), 39:1–38.

  • DuMouchel, W. and Pregibon, D. 2001. Empirical Bayes screening for multi-item associations. In Proceedings of the 7th ACM SIGKDD International Conference on Knowledge Discovery in Databases and Data Mining, F. Provost, R. Srikant (Eds.) ACM Press, pp. 67–76.

  • Geyer-Schulz, A., Hahsler, M., and Jahn, M. 2002. A customer purchase incidence model applied to recommender systems. In WEBKDD 2001—Mining Log Data Across All Customer Touch Points, Third International Workshop, San Francisco, CA, USA, August 26, 2001, Revised Papers, Lecture Notes in Computer Science LNAI 2356, R. Kohavi, B. Masand, M. Spiliopoulou, J. Srivastava (Eds.) Springer-Verlag, pp. 25–47.

  • Geyer-Schulz, A., Hahsler, M., Neumann, A., and Thede, A. 2003. Behavior-based recommender systems as value-added services for scientific libraries. In Statistical Data Mining & Knowledge Discovery, H. Bozdogan (Ed.) Chapman & Hall/CRC, pp. 433–454.

  • Han, J., Pei, J., Yin, Y., and Mao, R. 2004. Mining frequent patterns without candidate generation. Data Mining and Knowledge Discovery, 8:53–87.

  • Johnson, N.L., Kotz, S., and Kemp, A.W. 1993. Univariate Discrete Distributions, 2nd edn. New York: John Wiley & Sons.

  • Kohavi, R., Brodley, C., Frasca, B., Mason, L., and Zheng, Z. 2000. KDD-Cup 2000 organizers’ report: Peeling the onion. SIGKDD Explorations, 2(2):86–98.

  • Kohavi, R. and Provost, F. 1998. Glossary of terms. Machine Learning, 30(2–3):271–274.

  • Liu, B., Hsu, W., and Ma, Y. 1999. Mining association rules with multiple minimum supports. In Proceedings of the 5th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 337–341.

  • Luo, J. and Bridges, S. 2000. Mining fuzzy association rules and fuzzy frequency episodes for intrusion detection. International Journal of Intelligent Systems, 15(8):687–703.

  • Mannila, H., Toivonen, H., and Verkamo, A.I. 1994. Efficient algorithms for discovering association rules. In AAAI Workshop on Knowledge Discovery in Databases, U.M. Fayyad, R. Uthurusamy (Eds.) Seattle, Washington: AAAI Press, pp. 181–192.

  • Omiecinski, E.R. 2003. Alternative interest measures for mining associations in databases. IEEE Transactions on Knowledge and Data Engineering, 15(1):57–69.

  • Pei, J., Han, J., and Lakshmanan, L.V. 2001. Mining frequent itemsets with convertible constraints. In Proceedings of the 17th International Conference on Data Engineering, April 02–06, 2001, Heidelberg, Germany, pp. 433–442.

  • Provost, F. and Fawcett, T. 1997. Analysis and visualization of classifier performance: Comparison under imprecise class and cost distributions. In Proceedings of the 3rd International Conference on Knowledge Discovery and Data Mining, D. Heckerman, H. Mannila, D. Pregibon (Eds.) Newport Beach, CA: AAAI Press, pp. 43–48.

  • Seno, M. and Karypis, G. 2001. Lpminer: An algorithm for finding frequent itemsets using length decreasing support constraint. In Proceedings of the 2001 IEEE International Conference on Data Mining, 29 November–2 December 2001, N. Cercone, T.Y. Lin, X. Wu (Eds.) San Jose, California, USA: IEEE Computer Society, pp. 505–512.

  • Silverstein, C., Brin, S., and Motwani, R. 1998. Beyond market baskets: Generalizing association rules to dependence rules. Data Mining and Knowledge Discovery, 2:39–68.

  • Srivastava, J., Cooley, R., Deshpande, M., and Tan, P.-N. 2000. Web usage mining: Discovery and applications of usage patterns from web data. SIGKDD Explorations, 1(2):12–23.

  • Xiong, H., Tan, P.-N., and Kumar, V. 2003. Mining strong affinity association patterns in data sets with skewed support distribution. In Proceedings of the IEEE International Conference on Data Mining, November 19–22, 2003, B. Goethals, M.J. Zaki (Eds.) Melbourne, Florida, pp. 387–394.

  • Zheng, Z., Kohavi, R., and Mason, L. 2001. Real world performance of association rule algorithms. In Proceedings of the 7th ACM SIGKDD International Conference on Knowledge Discovery in Databases and Data Mining, F. Provost, R. Srikant (Eds.) ACM Press, pp. 401–406.

Correspondence to Michael Hahsler.

Hahsler, M. A Model-Based Frequency Constraint for Mining Associations from Transaction Data. Data Min Knowl Disc 13, 137–166 (2006). https://doi.org/10.1007/s10618-005-0026-2

  • DOI: https://doi.org/10.1007/s10618-005-0026-2