
Discovering Significant Patterns under Sequential False Discovery Control

Authors: Sebastian Dalleiger, Jilles Vreeken

KDD '22: Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining

Pages 263 - 272

https://doi.org/10.1145/3534678.3539398

Published: 14 August 2022

Abstract

We are interested in discovering those patterns in data whose empirical frequency differs significantly from what is expected. To avoid spurious results, yet achieve high statistical power, we propose to sequentially control for false discoveries during the search. To avoid redundancy, we propose to update our expectations whenever we discover a significant pattern. To efficiently consider the exponentially sized search space, we employ an easy-to-compute upper bound on significance, and propose an effective search strategy for sets of significant patterns. Through an extensive set of experiments on synthetic data, we show that our method, Spass, recovers the ground truth reliably, efficiently, and without redundancy. On real-world data, we show that it works well on both single-class and multi-class data, on low- and high-dimensional data, and, through case studies, that it discovers meaningful results.
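To make the idea of sequential false discovery control more concrete, the following is a minimal sketch of a generic LOND-style online testing rule, in which the significance level available for the next test grows with the number of discoveries made so far. It is an illustration only and is not claimed to be the procedure used by Spass; the function name `lond`, the budget sequence `beta_i`, and the example p-values are assumptions made for this sketch.

```python
import math


def lond(p_values, alpha=0.05):
    """Hypothetical LOND-style online testing sketch (not the Spass procedure).

    Hypotheses arrive one at a time. The i-th test gets the level
    beta_i * (discoveries_so_far + 1), where the beta_i sum to alpha.
    """
    discoveries = 0
    decisions = []
    for i, p in enumerate(p_values, start=1):
        # Assumed budget sequence: alpha * 6 / (pi^2 * i^2), which sums to alpha.
        beta_i = alpha * 6.0 / (math.pi ** 2 * i ** 2)
        # Each earlier discovery raises the level available to this test.
        level = beta_i * (discoveries + 1)
        reject = p <= level
        decisions.append(reject)
        if reject:
            discoveries += 1
    return decisions


# Illustrative p-values only; the third test passes because of the earlier discovery.
print(lond([1e-4, 0.5, 0.004, 0.2]))
```

In this sketch the third hypothesis (p = 0.004) is rejected only because the first one was, which is the sense in which testing during the search can retain power compared with a fixed, conservative correction.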




Index Terms

Discovering Significant Patterns under Sequential False Discovery Control
1. Information systems
  1. Information systems applications
    1. Data mining
2. Mathematics of computing
  1. Information theory
  2. Probability and statistics
    1. Probabilistic inference problems
    2. Statistical paradigms
      1. Exploratory data analysis


Published In


KDD '22: Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining

August 2022

5033 pages

ISBN:9781450393850

DOI:10.1145/3534678

General Chairs: Aidong Zhang (University of Virginia), Huzefa Rangwala (Amazon/George Mason University)

Copyright © 2022 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States


Conference

KDD '22

KDD '22: The 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining

August 14 - 18, 2022

Washington DC, USA

Acceptance Rates

Overall Acceptance Rate 1,133 of 8,635 submissions, 13%
