Article

An iterative hypothesis-testing strategy for pattern discovery

Authors: Richard J. Bolton, Niall M. Adams

KDD '03: Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining

Pages 49 - 58

https://doi.org/10.1145/956750.956760

Published: 24 August 2003

Abstract

Pattern discovery has emerged as a direct result of increased data storage and analytic capabilities available to the data analyst. Without a massive amount of data, we do not have the evidence to support the discovery of the local deterministic structures that we call patterns. As such, pattern discovery is one of the few areas of data mining that cannot be considered simply as a 'scaling-up' of current statistical methodology to analyze large data sets. However, the philosophies of hypothesis testing and modeling in traditional statistics do lend themselves to forming a framework for pattern discovery, and we can also draw from ideas relating to outlier discovery and residual analysis to discover patterns. We illustrate an iterative strategy in a statistical framework by way of its application to one simulated and two real data sets.

Published In

KDD '03: Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining

August 2003, 736 pages

ISBN: 1581137370

DOI: 10.1145/956750

Conference Chair: Lise Getoor (University of Maryland, College Park)

General Chair: Ted Senator (DARPA)

Program Chairs: Pedro Domingos (University of Washington), Christos Faloutsos (Carnegie Mellon University)

Copyright © 2003 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

KDD03: The Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining

August 24 - 27, 2003

Washington, D.C.

Acceptance Rates

KDD '03 Paper Acceptance Rate 46 of 298 submissions, 15%;

Overall Acceptance Rate 1,133 of 8,635 submissions, 13%
