research-article

ROhAN: Row-order agnostic null models for statistically-sound knowledge discovery

Authors:

Maryam Abuissa,

Matteo Riondato

Data Mining and Knowledge Discovery, Volume 37, Issue 4

Pages 1692 - 1718

https://doi.org/10.1007/s10618-023-00938-4

Published: 06 May 2023

Abstract

We introduce a novel class of null models for the statistical validation of results obtained from binary transactional and sequence datasets. Our null models are Row-Order Agnostic (ROA), i.e., do not consider the order of rows in the observed dataset to be fixed, in stark contrast with previous null models, which are Row-Order Enforcing (ROE). We present ROhAN, an algorithmic framework for efficiently sampling datasets from ROA models according to user-specified distributions, which is a necessary step for the resampling-based statistical hypothesis tests employed to validate the results. ROhAN uses Metropolis-Hastings or rejection sampling to build on top of existing or future ROE sampling procedures. Our experimental evaluation shows that ROA models are very different from ROE ones, impacting the statistical validation, and that ROhAN is efficient, mixes fast, and scales well as the dataset grows.
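
The resampling-based hypothesis test mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' ROhAN implementation: `sample_null` and `statistic` are hypothetical callables introduced here for illustration, and the toy sampler `shuffle_columns` only permutes each column independently (preserving column sums), which is a stand-in and not one of the ROA or ROE null models from the paper. The p-value is the usual add-one Monte Carlo estimate.

```python
import random
from typing import Callable, List, Sequence

Dataset = List[List[int]]  # binary matrix: rows are transactions, columns are items


def support(data: Dataset, itemset: Sequence[int]) -> int:
    """Number of rows that contain every item (column index) in `itemset`."""
    return sum(all(row[j] == 1 for j in itemset) for row in data)


def shuffle_columns(data: Dataset, rng: random.Random) -> Dataset:
    """Toy null sampler (assumption, not from the paper): permute each
    column independently, which preserves column sums but nothing else."""
    columns = [list(col) for col in zip(*data)]
    for col in columns:
        rng.shuffle(col)
    return [list(row) for row in zip(*columns)]


def empirical_p_value(
    observed: Dataset,
    statistic: Callable[[Dataset], float],
    sample_null: Callable[[Dataset, random.Random], Dataset],
    num_samples: int = 999,
    seed: int = 0,
) -> float:
    """Add-one Monte Carlo p-value:
    (1 + #{null statistic >= observed statistic}) / (1 + num_samples)."""
    rng = random.Random(seed)
    observed_stat = statistic(observed)
    at_least_as_extreme = sum(
        statistic(sample_null(observed, rng)) >= observed_stat
        for _ in range(num_samples)
    )
    return (1 + at_least_as_extreme) / (1 + num_samples)


if __name__ == "__main__":
    data = [[1, 1, 0], [1, 1, 0], [1, 1, 1], [0, 0, 1], [0, 0, 0], [0, 0, 1]]
    p = empirical_p_value(data, lambda d: support(d, [0, 1]), shuffle_columns)
    print(f"empirical p-value for itemset {{0, 1}}: {p:.3f}")
```

Swapping `shuffle_columns` for a sampler from a different null model changes only the `sample_null` argument; the test itself is unchanged, which is why the choice of null model can affect which results are deemed significant.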

Cited By

Preti G, De Francisci Morales G, Riondato M (2024) Alice and the Caterpillar: A more descriptive null model for assessing data mining results. Knowledge and Information Systems 66(3):1917–1954. doi:10.1007/s10115-023-02001-6. Online publication date: 1-Mar-2024
https://dl.acm.org/doi/10.1007/s10115-023-02001-6

Published In

Data Mining and Knowledge Discovery Volume 37, Issue 4

Jul 2023

392 pages

ISSN:1384-5810

© The Author(s), under exclusive licence to Springer Science+Business Media LLC, part of Springer Nature 2023. Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

Publisher

Kluwer Academic Publishers

United States

Publication History

Published: 06 May 2023

Accepted: 06 April 2023

Received: 28 November 2022

Qualifiers

Research-article

Funding Sources

Division of Information and Intelligent Systems
