Abstract
Pattern discovery techniques, such as association rule discovery, explore large search spaces of potential patterns to find those that satisfy some user-specified constraints. Because they consider so many patterns, they suffer from an extreme risk of type-1 error, that is, of finding patterns that satisfy the constraints on the sample data by chance alone. This paper proposes techniques to overcome this problem by applying well-established statistical practices. These allow the user to enforce a strict upper limit on the risk of experimentwise error. Empirical studies demonstrate that standard pattern discovery techniques can discover numerous spurious patterns when applied to random data and, when applied to real-world data, produce large numbers of patterns that are rejected when subjected to sound statistical evaluation. The studies also reveal that a number of pragmatic choices about how such tests are performed can greatly affect their power.
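The scale of the problem described above can be illustrated with a little arithmetic. The sketch below is not taken from the paper: the abstract does not name the statistical practices it applies, so the Bonferroni adjustment and Holm's step-down procedure are assumed here as representative examples of well-established multiple-testing corrections. The function names are hypothetical, and the error-rate formula assumes independent tests of true null hypotheses.

```python
from typing import List, Sequence


def familywise_error(alpha: float, m: int) -> float:
    """Chance of at least one false discovery among m independent tests of
    true null hypotheses, each run at significance level alpha (assumption:
    independence, which real pattern spaces rarely satisfy exactly)."""
    return 1.0 - (1.0 - alpha) ** m


def bonferroni_threshold(alpha: float, m: int) -> float:
    """Per-test level that keeps the experimentwise error at or below alpha."""
    return alpha / m


def holm_reject(p_values: Sequence[float], alpha: float) -> List[bool]:
    """Holm's step-down procedure (assumed here as one possible correction).

    Returns one flag per input p-value, True where the null is rejected.
    """
    m = len(p_values)
    # Visit hypotheses from the smallest p-value to the largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, i in enumerate(order):
        # The k-th smallest p-value is compared against alpha / (m - k + 1).
        if p_values[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break  # Once one test fails, all larger p-values are retained.
    return reject


if __name__ == "__main__":
    print(familywise_error(0.05, 100))      # uncorrected risk over 100 tests
    print(bonferroni_threshold(0.05, 100))  # adjusted per-test level
    print(holm_reject([0.01, 0.02, 0.03], 0.05))
```

With 100 candidate patterns each tested at the 0.05 level, the uncorrected chance of at least one spurious discovery exceeds 0.99 under these assumptions, whereas testing each at 0.05/100 keeps it below 0.05. Holm's procedure gives the same guarantee while rejecting at least as many hypotheses as the plain Bonferroni adjustment.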
Additional information
Editor: Johannes Fürnkranz.
An erratum to this article can be found at http://dx.doi.org/10.1007/s10994-008-5045-y
Cite this article
Webb, G.I. Discovering Significant Patterns. Mach Learn 68, 1–33 (2007). https://doi.org/10.1007/s10994-007-5006-x
DOI: https://doi.org/10.1007/s10994-007-5006-x