Finding the most interesting patterns in a database quickly by using sequential sampling

Published: 01 March 2003

Abstract

Many discovery problems, e.g., subgroup or association rule discovery, can naturally be cast as n-best hypotheses problems where the goal is to find the n hypotheses from a given hypothesis space that score best according to a certain utility function. We present a sampling algorithm that solves this problem by issuing a small number of database queries while guaranteeing precise bounds on the confidence and quality of solutions. Known sampling approaches have treated single hypothesis selection problems, assuming that the utility is the average (over the examples) of some function, which is not the case for many frequently used utility functions. We show that our algorithm works for all utilities that can be estimated with bounded error. We provide these error bounds and resulting worst-case sample bounds for some of the most frequently used utilities, and prove that there is no sampling algorithm for a popular class of utility functions that cannot be estimated with bounded error. The algorithm is sequential in the sense that it starts to return (or discard) hypotheses that already seem to be particularly good (or bad) after a few examples. Thus, the algorithm is almost always faster than its worst-case bounds.
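The idea of sequential sampling for the n-best problem can be illustrated with a minimal Python sketch. It is not the algorithm from the paper: it assumes the simplest case mentioned in the abstract, a utility that is the average over examples of a function with values in [0, 1], and it uses a standard two-sided Hoeffding bound with a union bound over hypotheses and sample sizes. The names `hoeffding_radius`, `n_best_sequential`, and `draw_example` are hypothetical, and the sketch only decides when to stop; it does not return or discard individual hypotheses early.

```python
import math
import random


def hoeffding_radius(m, k, delta):
    """Two-sided Hoeffding confidence radius after m examples.

    Assumes per-example values in [0, 1]. The failure probability delta is
    split over k hypotheses and over all sample sizes m = 1, 2, ... using
    the weights 1 / (m * (m + 1)), which sum to 1.
    """
    return math.sqrt(math.log(2.0 * k * m * (m + 1) / delta) / (2.0 * m))


def n_best_sequential(hypotheses, draw_example, n, epsilon, delta,
                      max_examples=1_000_000):
    """Return (indices of the n empirically best hypotheses, examples used).

    hypotheses   -- list of functions mapping an example to a value in [0, 1]
    draw_example -- zero-argument function returning one sampled example
    Stops as soon as, with probability at least 1 - delta, no rejected
    hypothesis beats a returned one by more than epsilon in true utility.
    """
    k = len(hypotheses)
    if not 0 < n < k:
        raise ValueError("need 0 < n < number of hypotheses")
    sums = [0.0] * k
    order = list(range(k))
    used = 0
    for m in range(1, max_examples + 1):
        x = draw_example()
        for i, h in enumerate(hypotheses):
            sums[i] += h(x)
        used = m
        # Rank hypotheses by empirical mean utility, best first.
        order = sorted(range(k), key=lambda i: sums[i] / m, reverse=True)
        gap = (sums[order[n - 1]] - sums[order[n]]) / m
        # If every estimate is within the radius of its true value, a gap of
        # at least 2 * radius - epsilon makes the selection epsilon-optimal.
        if gap + epsilon >= 2.0 * hoeffding_radius(m, k, delta):
            break
    return order[:n], used


if __name__ == "__main__":
    rng = random.Random(0)
    rates = [0.9, 0.5, 0.1]  # made-up true utilities for the demo

    def draw():
        # One example is a tuple of 0/1 outcomes, one per hypothesis.
        return tuple(1 if rng.random() < r else 0 for r in rates)

    hyps = [lambda x, i=i: x[i] for i in range(len(rates))]
    print(n_best_sequential(hyps, draw, n=1, epsilon=0.1, delta=0.05))
```

When the empirical gap between the n-th and (n+1)-th hypothesis is large, the stopping test fires long before the worst-case sample size (the point where twice the radius drops below epsilon), which is the behaviour the abstract describes as being faster than the worst-case bounds.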

Published In

The Journal of Machine Learning Research, Volume 3
3/1/2003
1437 pages
ISSN: 1532-4435
EISSN: 1533-7928

Publisher

JMLR.org

Cited By

  • (2021) "What makes my queries slow?" Proceedings of the 36th IEEE/ACM International Conference on Automated Software Engineering, pages 642-652. DOI: 10.1109/ASE51524.2021.9678915. Online publication date: 15-Nov-2021
  • (2020) MiSoSouP. ACM Transactions on Knowledge Discovery from Data, 14:5, pages 1-31. DOI: 10.1145/3385653. Online publication date: 21-Jun-2020
  • (2018) MiSoSouP. Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 2130-2139. DOI: 10.1145/3219819.3219989. Online publication date: 19-Jul-2018
  • (2017) Efficient frequent itemsets mining through sampling and information granulation. Engineering Applications of Artificial Intelligence, 65:C, pages 119-136. DOI: 10.1016/j.engappai.2017.07.016. Online publication date: 1-Oct-2017
  • (2016) Ensembles of Interesting Subgroups for Discovering High Potential Employees. Proceedings, Part II, of the 20th Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining - Volume 9652, pages 208-220. DOI: 10.1007/978-3-319-31750-2_17. Online publication date: 19-Apr-2016
  • (2015) Mining Frequent Itemsets through Progressive Sampling with Rademacher Averages. Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1005-1014. DOI: 10.1145/2783258.2783265. Online publication date: 10-Aug-2015
  • (2014) Efficient Discovery of Association Rules and Frequent Itemsets through Sampling with Tight Performance Guarantees. ACM Transactions on Knowledge Discovery from Data, 8:4, pages 1-32. DOI: 10.1145/2629586. Online publication date: 29-Aug-2014
  • (2012) Active comparison of prediction models. Proceedings of the 25th International Conference on Neural Information Processing Systems - Volume 2, pages 1754-1762. DOI: 10.5555/2999325.2999331. Online publication date: 3-Dec-2012
  • (2012) A Sequential Sampling Framework for Spectral k-Means Based on Efficient Bootstrap Accuracy Estimations. ACM Transactions on Knowledge Discovery from Data, 6:2, pages 1-37. DOI: 10.1145/2297456.2297457. Online publication date: 1-Jul-2012
  • (2011) Direct local pattern sampling by efficient two-step random procedures. Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 582-590. DOI: 10.1145/2020408.2020500. Online publication date: 21-Aug-2011
