Mining top-K frequent itemsets through progressive sampling

Authors:

Andrea Pietracaprina,

Matteo Riondato,

Fabio Vandin

Data Mining and Knowledge Discovery, Volume 21, Issue 2

Pages 310 - 326

https://doi.org/10.1007/s10618-010-0185-7

Published: 01 September 2010

Abstract

We study the use of sampling for efficiently mining the top-K frequent itemsets of cardinality at most w. To this end, we define an approximation to the top-K frequent itemsets to be a family of itemsets which includes (resp., excludes) all very frequent (resp., very infrequent) itemsets, together with an estimate of these itemsets' frequencies with a bounded error. Our first result is an upper bound on the sample size which guarantees that the top-K frequent itemsets mined from a random sample of that size approximate the actual top-K frequent itemsets, with probability larger than a specified value. We show that the upper bound is asymptotically tight when w is constant. Our main algorithmic contribution is a progressive sampling approach, combined with suitable stopping conditions, which on appropriate inputs is able to extract approximate top-K frequent itemsets from samples whose sizes are smaller than the general upper bound. In order to test the stopping conditions, this approach maintains the frequency of all itemsets encountered, which is practical only for small w. However, we show how this problem can be mitigated by using a variation of Bloom filters. A number of experiments conducted on both synthetic and real benchmark datasets show that using samples substantially smaller than the original dataset (i.e., of size defined by the upper bound or reached through the progressive sampling approach) makes it possible to approximate the actual top-K frequent itemsets with accuracy much higher than what is analytically proved.
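
The progressive sampling idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' algorithm: the abstract does not state the stopping conditions or the sample-size bound, so the sketch assumes a simple rule built from Hoeffding's inequality and a union bound over all itemsets of cardinality at most w and over all rounds. The function names, the doubling schedule, and the parameters start and max_rounds are hypothetical, and the item universe is read from the full dataset only to keep the example short.

```python
import math
import random
from collections import Counter
from itertools import combinations


def count_itemsets(sample, w):
    """Count every itemset of cardinality 1..w that occurs in the sample."""
    counts = Counter()
    for transaction in sample:
        items = sorted(set(transaction))
        for size in range(1, min(w, len(items)) + 1):
            counts.update(combinations(items, size))
    return counts


def half_width(n, num_candidates, num_rounds, delta):
    """Hoeffding deviation h such that, with probability at least 1 - delta,
    every candidate itemset's sample frequency is within h of its true
    frequency in every round (union bound over itemsets and rounds)."""
    return math.sqrt(math.log(2 * num_candidates * num_rounds / delta) / (2 * n))


def progressive_top_k(dataset, k, w, delta=0.1, start=100, max_rounds=8, seed=0):
    """Illustrative progressive sampling loop (assumed stopping rule).

    Returns (top, n, h, stopped): the top-k itemsets with their sample
    frequencies, the final sample size, the deviation bound at that size,
    and whether the stopping rule fired before the round budget ran out.
    """
    rng = random.Random(seed)
    # Assumption: the number of distinct items is known up front. Here it is
    # computed by scanning the dataset, which a real system would avoid.
    universe = {item for transaction in dataset for item in transaction}
    num_candidates = sum(math.comb(len(universe), s) for s in range(1, w + 1))

    n = start
    for round_index in range(max_rounds):
        sample = rng.choices(dataset, k=n)  # uniform sampling with replacement
        ranked = count_itemsets(sample, w).most_common(k + 1)
        h = half_width(n, num_candidates, max_rounds, delta)

        kth = ranked[k - 1][1] / n if len(ranked) >= k else 0.0
        runner_up = ranked[k][1] / n if len(ranked) > k else 0.0
        # Stop when the K-th and (K+1)-th sample frequencies are separated by
        # more than 2h: if all deviations are at most h, the sample top-K
        # then coincides with the true top-K.
        stopped = len(ranked) >= k and kth - runner_up > 2 * h

        if stopped or round_index == max_rounds - 1:
            top = [(itemset, count / n) for itemset, count in ranked[:k]]
            return top, n, h, stopped
        n *= 2  # otherwise grow the sample and try again


if __name__ == "__main__":
    data = [["a", "b"]] * 500 + [["a"]] * 400 + [["c"]] * 100
    top, n, h, stopped = progressive_top_k(data, k=1, w=2)
    print(top, n, round(h, 3), stopped)
```

The sketch keeps an exact counter for every itemset it encounters, which mirrors the abstract's remark that this is practical only for small w; the Bloom-filter variation the authors mention is not attempted here.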

Index Terms

Mining top-K frequent itemsets through progressive sampling
1. Information systems
  1. Information systems applications
    1. Data mining

Index terms have been assigned to the content through auto-classification.

Published In

Data Mining and Knowledge Discovery Volume 21, Issue 2

September 2010

123 pages

ISSN:1384-5810

Copyright © 2010 The Author(s).

Publisher

Kluwer Academic Publishers

United States
