DOI: 10.1145/3534678.3539398
research-article

Discovering Significant Patterns under Sequential False Discovery Control

Published: 14 August 2022

Abstract

We are interested in discovering patterns in data whose empirical frequency differs significantly from what is expected. To avoid spurious results while retaining high statistical power, we propose to sequentially control for false discoveries during the search. To avoid redundancy, we propose to update our expectations whenever we discover a significant pattern. To efficiently explore the exponentially large search space, we employ an easy-to-compute upper bound on significance and propose an effective search strategy for sets of significant patterns. Through an extensive set of experiments on synthetic data, we show that our method, Spass, recovers the ground truth reliably, efficiently, and without redundancy. On real-world data, we show that it works well on both single and multiple classes and on low- and high-dimensional data, and, through case studies, that it discovers meaningful results.
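To make the idea of testing patterns one after another concrete, the following is a minimal sketch and not the Spass procedure itself. It assumes an exact two-sided binomial test of each pattern's observed count against a fixed expected probability, and it swaps in a simple alpha-spending rule (test i gets level alpha * 6 / (pi^2 * i^2), so the levels sum to alpha) in place of the sequential control used in the paper. The function names, the candidate patterns, and the counts are hypothetical, and the expectations are kept fixed here, whereas the abstract describes updating them after every discovery.

```python
import math


def binomial_two_sided_p(k, n, p):
    """Exact two-sided binomial p-value.

    Sums the probabilities of all outcomes that are no more likely
    than the observed count k under Binomial(n, p).
    """
    pmf = [math.comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(n + 1)]
    threshold = pmf[k] * (1 + 1e-9)  # small tolerance for float ties
    return min(1.0, sum(q for q in pmf if q <= threshold))


def alpha_spending_level(i, alpha=0.05):
    """Significance level for the i-th test (1-indexed).

    Assumption for illustration: a fixed spending sequence whose
    levels sum to alpha over infinitely many tests.
    """
    return alpha * 6.0 / (math.pi**2 * i**2)


def sequential_tests(candidates, n, alpha=0.05):
    """Test candidates in order; return the names that are significant.

    candidates: iterable of (name, observed_count, expected_probability).
    n: number of rows in the (hypothetical) dataset.
    """
    discoveries = []
    for i, (name, k, p) in enumerate(candidates, start=1):
        if binomial_two_sided_p(k, n, p) <= alpha_spending_level(i, alpha):
            discoveries.append(name)
    return discoveries


# Hypothetical example: 100 rows, two candidate patterns, each expected
# to occur in 10% of rows under the current model.
found = sequential_tests([("ab", 30, 0.1), ("cd", 11, 0.1)], n=100)
print(found)
```

In this toy input the pattern "ab" occurs three times as often as expected and is reported, while "cd" is close to its expectation and is not. A fixed spending sequence becomes very strict for late tests, which is one reason more adaptive sequential procedures are of interest.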




Published In

KDD '22: Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining
August 2022
5033 pages
ISBN:9781450393850
DOI:10.1145/3534678

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. binomial test
  2. false discovery rate
  3. family-wise error rate
  4. maximum entropy distribution
  5. multiple hypothesis testing
  6. pattern mining
  7. sequential hypothesis testing

Qualifiers

  • Research-article

Conference

KDD '22
Acceptance Rates

Overall Acceptance Rate 1,133 of 8,635 submissions, 13%
