Abstract
Association rules are among the most important concepts in data mining. Rules of the form \(X \rightarrow Y\) are simple to understand and simple to act upon, yet can model important local dependencies in data. The problem, however, is that there are so many of them. Both traditional and state-of-the-art frameworks typically yield millions of rules, rather than identifying a small set of rules that captures the most important dependencies in the data. In this paper, we define the problem of association rule mining in terms of the Minimum Description Length principle. That is, we identify the best set of rules as the one that most succinctly describes the data. We show that the resulting optimization problem does not lend itself to exact search, and hence propose Grab, a greedy heuristic that efficiently discovers good sets of noise-resistant rules directly from data. Through extensive experiments we show that, unlike the state of the art, Grab reliably recovers the ground truth. On real-world data we show that it finds reasonable numbers of rules that, upon close inspection, give clear insight into the local distribution of the data.
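To make the idea of "most succinctly describes the data" concrete, the following is a minimal sketch of a two-part description length score for a single rule \(X \rightarrow Y\). It is an illustration only and not the encoding used by Grab: the functions `entropy_bits` and `rule_gain_bits`, the model cost of \(\log_2\) of the number of items per item in the rule, and the toy data are all assumptions made for this example. A rule is worth keeping when the bits it saves on the data exceed the bits needed to state it.

```python
from math import log2

def entropy_bits(ones: int, total: int) -> float:
    """Bits needed for `total` binary outcomes, `ones` of them positive,
    under an optimal code for the empirical frequency."""
    if total == 0 or ones == 0 or ones == total:
        return 0.0
    p = ones / total
    return -total * (p * log2(p) + (1 - p) * log2(1 - p))

def rule_gain_bits(rows, head, tail, n_items):
    """Hypothetical score: bits saved by encoding `tail` conditioned on
    whether `head` holds, minus an assumed cost for stating the rule."""
    covered = [r for r in rows if head <= r]      # rows where X holds
    rest = [r for r in rows if not head <= r]     # rows where X does not hold
    holds = lambda r: tail <= r                   # does Y hold in row r?
    baseline = entropy_bits(sum(map(holds, rows)), len(rows))
    with_rule = (entropy_bits(sum(map(holds, covered)), len(covered))
                 + entropy_bits(sum(map(holds, rest)), len(rest)))
    # Assumed model cost: log2(n_items) bits per item named in the rule.
    model_cost = (len(head) + len(tail)) * log2(n_items)
    return baseline - with_rule - model_cost

# Toy data (an assumption): b always co-occurs with a, d is independent of a.
rows = ([{"a", "b", "d"}] * 4 + [{"a", "b"}] * 4
        + [{"c", "d"}] * 4 + [{"c"}] * 4)

print(rule_gain_bits(rows, {"a"}, {"b"}, n_items=4))  # 12.0: rule compresses
print(rule_gain_bits(rows, {"a"}, {"d"}, n_items=4))  # -4.0: rule costs more than it saves
```

Under this simplified score, the dependent rule a → b pays for itself while the spurious rule a → d does not, which is the selection behaviour an MDL objective is meant to produce.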
Notes
- 1.
- 2.
- 3. No relation to the first author.
Copyright information
© 2020 Springer Nature Switzerland AG
About this paper
Cite this paper
Fischer, J., Vreeken, J. (2020). Sets of Robust Rules, and How to Find Them. In: Brefeld, U., Fromont, E., Hotho, A., Knobbe, A., Maathuis, M., Robardet, C. (eds) Machine Learning and Knowledge Discovery in Databases. ECML PKDD 2019. Lecture Notes in Computer Science, vol 11906. Springer, Cham. https://doi.org/10.1007/978-3-030-46150-8_3
DOI: https://doi.org/10.1007/978-3-030-46150-8_3
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-46149-2
Online ISBN: 978-3-030-46150-8
eBook Packages: Computer Science, Computer Science (R0)