MCRapper: Monte-Carlo Rademacher Averages for Poset Families and Approximate Pattern Mining

Authors:

Leonardo Pellegrina,

Matteo Riondato

ACM Transactions on Knowledge Discovery from Data (TKDD), Volume 16, Issue 6

Article No.: 124, Pages 1 - 29

https://doi.org/10.1145/3532187

Published: 30 July 2022

Abstract

“I’m an MC still as honest” – Eminem, Rap God

We present MCRapper, an algorithm for efficient computation of Monte-Carlo Empirical Rademacher Averages (MCERA) for families of functions exhibiting poset (e.g., lattice) structure, such as those that arise in many pattern mining tasks. The MCERA allows us to compute upper bounds on the maximum deviation of sample means from their expectations, so it can be used to find both (1) statistically significant functions (i.e., patterns) when the available data is seen as a sample from an unknown distribution, and (2) approximations of collections of high-expectation functions (e.g., frequent patterns) when the available data is a small sample from a large dataset. This flexibility is a significant advantage of MCRapper over previously proposed solutions, which could achieve only one of the two. MCRapper uses upper bounds on the discrepancy of the functions to efficiently explore and prune the search space, a technique borrowed from pattern mining itself. To show the practical use of MCRapper, we employ it to develop an algorithm, TFP-R, for the task of True Frequent Pattern (TFP) mining. TFP-R computes approximations of the negative and positive borders of the collection of patterns of interest, which allow effective pruning of the pattern space and the computation of strong bounds on the supremum deviation. TFP-R gives guarantees on the probability of including any false positives (precision) and exhibits higher statistical power (recall) than existing methods offering the same guarantees. We evaluate MCRapper and TFP-R and show that they outperform the state of the art for their respective tasks.
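To make the quantity in the abstract concrete, the following is a minimal sketch of a brute-force n-trial MCERA computation for a small family of itemset indicator functions. It assumes the standard definition of the MCERA (the average, over n independent vectors of Rademacher signs, of the supremum over the family of the sign-weighted sample mean). It is not the MCRapper algorithm: it enumerates every function and uses none of the poset-based pruning the abstract describes. The names `mcera` and `itemset_indicators` and the toy transactions are hypothetical and chosen only for illustration.

```python
import random
from itertools import combinations


def mcera(functions, sample, n_trials, rng):
    """Brute-force n-trial Monte-Carlo Empirical Rademacher Average.

    Assumed definition: (1/n) * sum_j max_f (1/m) * sum_i sigma[j][i] * f(s_i),
    where each sigma[j][i] is drawn uniformly from {-1, +1}.
    """
    m = len(sample)
    # Evaluate every function on every sample point once.
    values = [[f(s) for s in sample] for f in functions]
    total = 0.0
    for _ in range(n_trials):
        # One vector of independent Rademacher signs per trial.
        sigma = [rng.choice((-1, 1)) for _ in range(m)]
        # Supremum over the (finite) family of the signed sample mean.
        total += max(
            sum(sg * v for sg, v in zip(sigma, row)) / m for row in values
        )
    return total / n_trials


def itemset_indicators(items, max_size):
    """Hypothetical helper: one 0/1 indicator function per itemset up to max_size."""
    family = []
    for k in range(1, max_size + 1):
        for combo in combinations(sorted(items), k):
            pattern = frozenset(combo)
            # The indicator is 1.0 when the transaction contains the whole itemset.
            family.append(lambda t, p=pattern: 1.0 if p <= t else 0.0)
    return family


# Toy data (illustrative only): eight transactions over the items a, b, c.
transactions = [frozenset(t) for t in ("abc", "ab", "ac", "bc", "a", "abc", "b", "c")]
family = itemset_indicators("abc", max_size=2)
estimate = mcera(family, transactions, n_trials=100, rng=random.Random(0))
print(f"naive MCERA estimate: {estimate:.4f}")
```

The cost of this sketch grows with the size of the family, which is exponential in the number of items for itemsets. That cost is what motivates exploring the pattern poset with upper bounds and pruning, as the abstract describes.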

Cited By

Tang H, Wang J, Wang L (2023). Mining Significant Utility Discriminative Patterns in Quantitative Databases. Mathematics 11:4 (950). DOI: 10.3390/math11040950. Online publication date: 13-Feb-2023.
https://doi.org/10.3390/math11040950
Pellegrina L, Vandin F (2023). SILVAN: Estimating Betweenness Centralities with Progressive Sampling and Non-uniform Rademacher Bounds. ACM Transactions on Knowledge Discovery from Data 18:3 (1-55). DOI: 10.1145/3628601. Online publication date: 9-Dec-2023.
https://dl.acm.org/doi/10.1145/3628601
Xiang N, Wang Q, You M (2023). Estimation and update of betweenness centrality with progressive algorithm and shortest paths approximation. Scientific Reports 13:1. DOI: 10.1038/s41598-023-44392-0. Online publication date: 10-Oct-2023.
https://doi.org/10.1038/s41598-023-44392-0

Information

Published In

ACM Transactions on Knowledge Discovery from Data Volume 16, Issue 6

December 2022

631 pages

ISSN: 1556-4681

EISSN: 1556-472X

DOI: 10.1145/3543989

Editor:
Charu Aggarwal
IBM T. J. Watson Research, USA

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 30 July 2022

Online AM: 25 April 2022

Accepted: 01 April 2022

Revised: 01 December 2021

Received: 01 August 2021

Published in TKDD Volume 16, Issue 6

Qualifiers

Research-article
Refereed

Funding Sources

National Science Foundation NSF
DARPA/ARFL
Italian Ministry of Education, University and Research (MIUR)
SID 2020: RATED-X

Bibliometrics

Total Citations: 3
Total Downloads: 185
Downloads (Last 12 months): 52
Downloads (Last 6 weeks): 1

Reflects downloads up to 27 Jul 2024
