Finding the most interesting patterns in a database quickly by using sequential sampling

Authors: Tobias Scheffer, Stefan Wrobel

The Journal of Machine Learning Research, Volume 3

Pages 833 - 862

Published: 01 March 2003

Abstract

Many discovery problems, e.g., subgroup or association rule discovery, can naturally be cast as n-best hypotheses problems where the goal is to find the n hypotheses from a given hypothesis space that score best according to a certain utility function. We present a sampling algorithm that solves this problem by issuing a small number of database queries while guaranteeing precise bounds on the confidence and quality of solutions. Known sampling approaches have treated single hypothesis selection problems, assuming that the utility is the average (over the examples) of some function, which is not the case for many frequently used utility functions. We show that our algorithm works for all utilities that can be estimated with bounded error. We provide these error bounds and resulting worst-case sample bounds for some of the most frequently used utilities, and prove that there is no sampling algorithm for a popular class of utility functions that cannot be estimated with bounded error. The algorithm is sequential in the sense that it starts to return (or discard) hypotheses that already seem to be particularly good (or bad) after a few examples. Thus, the algorithm is almost always faster than its worst-case bounds.
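The sequential idea in the abstract can be illustrated with a minimal sketch. It is not the authors' algorithm: it assumes the simplest case the abstract mentions, a utility that is the average over examples of a function with values in [0, 1], and uses a Hoeffding confidence radius with a union bound over hypotheses and sampling rounds. The names `sequential_n_best`, `hoeffding_radius`, `utilities`, `draw_example`, and `batch` are hypothetical and chosen only for this illustration.

```python
import math
import random


def hoeffding_radius(m, delta):
    """Two-sided Hoeffding radius for the mean of m samples in [0, 1]."""
    return math.sqrt(math.log(2.0 / delta) / (2.0 * m))


def sequential_n_best(utilities, draw_example, n, eps, delta, batch=100):
    """Sketch: pick n hypotheses with (approximately) the highest mean utility.

    utilities: dict mapping a hypothesis name to a function example -> [0, 1].
    Assumes 1 <= n <= len(utilities). Hypotheses are accepted or discarded
    as soon as their confidence intervals separate them from the rest.
    """
    k = len(utilities)
    sums = dict.fromkeys(utilities, 0.0)
    active, accepted = set(utilities), []
    m = step = 0
    while True:
        slots = n - len(accepted)          # places still to be filled
        if len(active) <= slots:
            return accepted + sorted(active)
        for _ in range(batch):             # draw one more batch of examples
            x = draw_example()
            for h in active:
                sums[h] += utilities[h](x)
        m += batch
        step += 1
        # Split delta over k hypotheses and all rounds (sum 1/(t(t+1)) = 1).
        rad = hoeffding_radius(m, delta / (k * step * (step + 1)))
        ranked = sorted(active, key=lambda h: sums[h] / m, reverse=True)
        if rad <= eps / 2:                 # estimates are precise enough
            return accepted + ranked[:slots]
        inside = sums[ranked[slots - 1]] / m   # weakest current candidate
        outside = sums[ranked[slots]] / m      # strongest current outsider
        good = [h for h in ranked if sums[h] / m - rad >= outside + rad]
        bad = [h for h in ranked if sums[h] / m + rad <= inside - rad]
        accepted += good
        active -= set(good) | set(bad)


# Toy usage: four hypothetical hypotheses with Bernoulli-valued utilities.
rng = random.Random(0)
probs = {"h0": 0.9, "h1": 0.8, "h2": 0.3, "h3": 0.2}
utilities = {h: (lambda x, h=h: x[h]) for h in probs}
draw = lambda: {h: float(rng.random() < p) for h, p in probs.items()}
best = sequential_n_best(utilities, draw, n=2, eps=0.1, delta=0.05)
```

In the toy run the two strong hypotheses are far from the two weak ones, so the loop can stop long before the radius shrinks to `eps / 2`. This mirrors the abstract's remark that sequential sampling tends to finish earlier than its worst-case bound. Utilities that are not plain averages would need a different confidence bound than the Hoeffding radius assumed here.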


Published In

The Journal of Machine Learning Research, Volume 3 (1437 pages)

ISSN: 1532-4435

EISSN: 1533-7928

Publisher: JMLR.org
