research-article

MCRapper: Monte-Carlo Rademacher Averages for Poset Families and Approximate Pattern Mining

Published: 30 July 2022
  • Abstract

    “I’m an MC still as honest” – Eminem, Rap God
    We present MCRapper, an algorithm for the efficient computation of Monte-Carlo Empirical Rademacher Averages (MCERA) for families of functions exhibiting poset (e.g., lattice) structure, such as those that arise in many pattern mining tasks. The MCERA allows us to compute upper bounds to the maximum deviation of sample means from their expectations, so it can be used to find both (1) statistically significant functions (i.e., patterns) when the available data is seen as a sample from an unknown distribution, and (2) approximations of collections of high-expectation functions (e.g., frequent patterns) when the available data is a small sample from a large dataset. This flexibility is a major advantage of MCRapper over previously proposed solutions, each of which could achieve only one of the two. MCRapper uses upper bounds to the discrepancy of the functions to efficiently explore and prune the search space, a technique borrowed from pattern mining itself. To show the practical use of MCRapper, we employ it to develop TFP-R, an algorithm for the task of True Frequent Pattern (TFP) mining. TFP-R computes approximations of the negative and positive borders of the collection of patterns of interest, which allow effective pruning of the pattern space and the computation of strong bounds to the supremum deviation. TFP-R gives guarantees on the probability of including any false positives (precision) and exhibits higher statistical power (recall) than existing methods offering the same guarantees. We evaluate MCRapper and TFP-R and show that they outperform the state of the art for their respective tasks.
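
    To make the quantity in the abstract concrete, the following is a minimal sketch of the n-trials MCERA under its standard definition: draw n independent vectors of m Rademacher (±1) variables, take for each vector the supremum over the family of the signed sample mean, and average the n suprema. It enumerates the family by brute force, so it is not MCRapper itself, which avoids exhaustive enumeration by pruning the poset with discrepancy bounds. The names `mcera`, `itemset_indicators`, and the toy transactions are hypothetical and chosen only for illustration.

```python
import itertools
import random


def mcera(dataset, family, n_trials, rng):
    """Brute-force n-trials MCERA of a finite `family` on `dataset`.

    Averages, over `n_trials` draws of Rademacher signs, the largest
    signed sample mean attained by any function in the family.
    """
    m = len(dataset)
    total = 0.0
    for _ in range(n_trials):
        # One vector of m independent Rademacher variables.
        sigma = [rng.choice((-1, 1)) for _ in range(m)]
        # Supremum over the (finite) family of the signed sample mean.
        total += max(
            sum(s * f(x) for s, x in zip(sigma, dataset)) / m
            for f in family
        )
    return total / n_trials


def itemset_indicators(items, max_size):
    """Indicator functions of all itemsets with 1..max_size items."""
    family = []
    for k in range(1, max_size + 1):
        for combo in itertools.combinations(sorted(items), k):
            pattern = frozenset(combo)
            # Bind `pattern` as a default argument so each lambda keeps its own.
            family.append(lambda t, p=pattern: 1.0 if p <= t else 0.0)
    return family


# Toy sample of m = 6 transactions (hypothetical data).
transactions = [
    frozenset(t)
    for t in ({"a", "b"}, {"a", "c"}, {"a", "b", "c"}, {"b"}, {"a"}, {"b", "c"})
]
family = itemset_indicators({"a", "b", "c"}, max_size=2)
estimate = mcera(transactions, family, n_trials=100, rng=random.Random(0))
print(f"MCERA estimate over {len(family)} itemsets: {estimate:.4f}")
```

    The cost of this sketch grows with the size of the family times n times m, which is what makes exhaustive evaluation impractical for real pattern languages and motivates a pruned exploration of the search space.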

    Cited By

    • (2023) Mining Significant Utility Discriminative Patterns in Quantitative Databases. Mathematics 11:4 (950). DOI: 10.3390/math11040950. Online publication date: 13-Feb-2023
    • (2023) SILVAN: Estimating Betweenness Centralities with Progressive Sampling and Non-uniform Rademacher Bounds. ACM Transactions on Knowledge Discovery from Data 18:3 (1-55). DOI: 10.1145/3628601. Online publication date: 9-Dec-2023
    • (2023) Estimation and update of betweenness centrality with progressive algorithm and shortest paths approximation. Scientific Reports 13:1. DOI: 10.1038/s41598-023-44392-0. Online publication date: 10-Oct-2023

    Published In

    ACM Transactions on Knowledge Discovery from Data  Volume 16, Issue 6
    December 2022
    631 pages
    ISSN:1556-4681
    EISSN:1556-472X
    DOI:10.1145/3543989

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 30 July 2022
    Online AM: 25 April 2022
    Accepted: 01 April 2022
    Revised: 01 December 2021
    Received: 01 August 2021
    Published in TKDD Volume 16, Issue 6

    Author Tags

    1. Approximation algorithms
    2. frequent patterns
    3. itemsets
    4. sampling
    5. significant patterns
    6. statistical testing
    7. statistical learning theory
    8. subgroup discovery

    Qualifiers

    • Research-article
    • Refereed

    Funding Sources

    • National Science Foundation NSF
    • DARPA/ARFL
    • Italian Ministry of Education, University and Research (MIUR)
    • SID 2020: RATED-X

