Mining top-K frequent itemsets through progressive sampling

Published: 01 September 2010

Abstract

We study the use of sampling for efficiently mining the top-K frequent itemsets of cardinality at most w. To this end, we define an approximation to the top-K frequent itemsets to be a family of itemsets which includes (resp., excludes) all very frequent (resp., very infrequent) itemsets, together with an estimate of these itemsets' frequencies with a bounded error. Our first result is an upper bound on the sample size which guarantees that the top-K frequent itemsets mined from a random sample of that size approximate the actual top-K frequent itemsets, with probability larger than a specified value. We show that the upper bound is asymptotically tight when w is constant. Our main algorithmic contribution is a progressive sampling approach, combined with suitable stopping conditions, which on appropriate inputs is able to extract approximate top-K frequent itemsets from samples whose sizes are smaller than the general upper bound. In order to test the stopping conditions, this approach maintains the frequency of all itemsets encountered, which is practical only for small w. However, we show how this problem can be mitigated by using a variation of Bloom filters. A number of experiments conducted on both synthetic and real benchmark datasets show that using samples substantially smaller than the original dataset (i.e., of size defined by the upper bound or reached through the progressive sampling approach) makes it possible to approximate the actual top-K frequent itemsets with accuracy much higher than what is analytically proved.
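
The basic idea of mining the top-K itemsets from a random sample, rather than from the full dataset, can be illustrated with a minimal Python sketch. This is not the authors' algorithm: the function names (`itemset_frequencies`, `top_k`, `sampled_top_k`) are hypothetical, the sample size is simply passed in by the caller, and neither the sample-size bound nor the stopping conditions nor the Bloom-filter variant are reproduced, since the abstract does not state them. The sketch assumes transactions are lists of comparable items and that sampling is uniform with replacement.

```python
import random
from collections import Counter
from itertools import combinations


def itemset_frequencies(transactions, w):
    """Return the relative frequency of every itemset of cardinality <= w."""
    counts = Counter()
    for t in transactions:
        items = sorted(set(t))  # canonical order, duplicates removed
        for size in range(1, w + 1):
            for itemset in combinations(items, size):
                counts[itemset] += 1
    n = len(transactions)
    return {itemset: c / n for itemset, c in counts.items()}


def top_k(freqs, k):
    """Return the k most frequent itemsets as (itemset, frequency) pairs."""
    # Ties are broken by the itemset itself so the output is deterministic.
    return sorted(freqs.items(), key=lambda kv: (-kv[1], kv[0]))[:k]


def sampled_top_k(transactions, k, w, sample_size, rng):
    """Mine the top-k itemsets from a uniform sample drawn with replacement.

    Assumption: sample_size is chosen by the caller; no theoretical bound
    from the paper is applied here.
    """
    sample = [rng.choice(transactions) for _ in range(sample_size)]
    return top_k(itemset_frequencies(sample, w), k)
```

Calling `top_k(itemset_frequencies(data, w), K)` on the full dataset and `sampled_top_k(data, K, w, s, random.Random(seed))` on a sample of size `s` allows the two rankings and their frequency estimates to be compared directly. Note that `itemset_frequencies` stores a counter for every itemset it encounters, which is the cost the abstract describes as practical only for small w.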

    Published In

    Data Mining and Knowledge Discovery  Volume 21, Issue 2
    September 2010
    123 pages

    Publisher

    Kluwer Academic Publishers

    United States

    Author Tags

    1. Bloom filters
    2. Frequent itemsets mining
    3. Progressive sampling
    4. Sampling
    5. Top-K frequent itemsets

    Qualifiers

    • Article
