Discovering Significant Patterns

Webb, Geoffrey I.

doi:10.1007/s10994-007-5006-x

Published: 14 April 2007

Machine Learning, Volume 68, pages 1–33 (2007)

Geoffrey I. Webb¹

A Publisher's Erratum to this article was published on 02 February 2008

Abstract

Pattern discovery techniques, such as association rule discovery, explore large search spaces of potential patterns to find those that satisfy some user-specified constraints. Because of the large number of patterns considered, they suffer from an extreme risk of type-1 error, that is, of finding patterns that appear to satisfy the constraints on the sample data due to chance alone. This paper proposes techniques to overcome this problem by applying well-established statistical practices. These allow the user to enforce a strict upper limit on the risk of experimentwise error. Empirical studies demonstrate that standard pattern discovery techniques can discover numerous spurious patterns when applied to random data and, when applied to real-world data, produce large numbers of patterns that are rejected when subjected to sound statistical evaluation. They also reveal that a number of pragmatic choices about how such tests are performed can greatly affect their power.
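
To make the idea of an upper limit on experimentwise error concrete, the following minimal sketch shows how the risk of at least one false discovery grows with the number of tests, and how two standard corrections bound it. The abstract does not name the specific procedures used in the paper, so the choice of the Bonferroni and Holm adjustments here is an assumption made for illustration. The function names and the p-values are hypothetical, and the risk formula assumes independent tests.

```python
from typing import List


def experimentwise_risk(alpha: float, m: int) -> float:
    """Chance of at least one type-1 error across m independent tests,
    each run at significance level alpha (independence is an assumption)."""
    return 1.0 - (1.0 - alpha) ** m


def bonferroni(p_values: List[float], alpha: float = 0.05) -> List[bool]:
    """Accept a pattern only if its p-value is at most alpha / m."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def holm(p_values: List[float], alpha: float = 0.05) -> List[bool]:
    """Holm's step-down procedure: visit p-values in ascending order,
    compare the k-th smallest (k = 0, 1, ...) with alpha / (m - k),
    and stop at the first p-value that fails the comparison."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    accepted = [False] * m
    for rank, i in enumerate(order):
        if p_values[i] <= alpha / (m - rank):
            accepted[i] = True
        else:
            break
    return accepted


# Hypothetical p-values for five candidate patterns (illustrative only).
p_values = [0.001, 0.011, 0.02, 0.04, 0.30]

unadjusted = [p <= 0.05 for p in p_values]
bonferroni_result = bonferroni(p_values)
holm_result = holm(p_values)
risk_five_tests = experimentwise_risk(0.05, 5)

print("unadjusted:", unadjusted)
print("bonferroni:", bonferroni_result)
print("holm:      ", holm_result)
print("risk of >= 1 false discovery over 5 tests: %.4f" % risk_five_tests)
```

With these made-up inputs, the unadjusted test accepts four patterns, Bonferroni accepts one, and Holm accepts two. Both corrections keep the experimentwise error rate at or below the chosen level, and Holm never accepts fewer patterns than Bonferroni, which is one example of how the choice of procedure can affect power.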

Author information

Authors and Affiliations

Faculty of Information Technology, Monash University, PO Box 75, Clayton, Vic., 3800, Australia
Geoffrey I. Webb

Corresponding author

Correspondence to Geoffrey I. Webb.

Additional information

Editor: Johannes Fürnkranz.

An erratum to this article can be found at http://dx.doi.org/10.1007/s10994-008-5045-y

About this article

Cite this article

Webb, G.I. Discovering Significant Patterns. Mach Learn 68, 1–33 (2007). https://doi.org/10.1007/s10994-007-5006-x

Received: 23 May 2005
Revised: 13 February 2007
Accepted: 14 February 2007
Published: 14 April 2007
Issue Date: July 2007
DOI: https://doi.org/10.1007/s10994-007-5006-x
