Efficient Discovery of Significant Patterns
with Few-Shot Resampling

Leonardo Pellegrina, Dept. of Information Engineering, University of Padova, Via Gradenigo 6b, 35129 Padova, Italy, leonardo.pellegrina@unipd.it
Fabio Vandin, Dept. of Information Engineering, University of Padova, Via Gradenigo 6b, 35129 Padova, Italy, fabio.vandin@unipd.it

Abstract.

Significant pattern mining is a fundamental task in mining transactional data: it requires identifying patterns significantly associated with the value of a given feature, the target. In several applications, such as biomedicine, market basket analysis, and social networks, the goal is to discover patterns whose association with the target is defined with respect to an underlying population, or process, of which the dataset represents only a collection of observations, or samples. A natural way to capture the association of a pattern with the target is to consider its statistical significance, assessing its deviation from the (null) hypothesis of independence between the pattern and the target. While several algorithms have been proposed to find statistically significant patterns, it remains a computationally demanding task, and for complex patterns such as subgroups, no efficient solution exists.

We present FSR, an efficient algorithm to identify statistically significant patterns with rigorous guarantees on the probability of false discoveries. FSR builds on a novel general framework for mining significant patterns that captures some of the most commonly considered patterns, including itemsets, sequential patterns, and subgroups. FSR uses a small number of resampled datasets, obtained by assigning i.i.d. labels to each transaction, to rigorously bound the supremum deviation of a quality statistic measuring the significance of patterns. FSR builds on novel tight bounds on the supremum deviation that require mining only a small number of resampled datasets, while remaining highly effective in discovering significant patterns. As a test case, we consider significant subgroup mining, and our evaluation on several real datasets shows that FSR is effective in discovering significant subgroups, while requiring a small number of resampled datasets.

PVLDB Reference Format:
Efficient Discovery of Significant Patterns with Few-Shot Resampling. PVLDB, 17(10): XXX-XXX, 2024.
doi:XX.XX/XXX.XX This work is licensed under the Creative Commons BY-NC-ND 4.0 International License. Visit https://creativecommons.org/licenses/by-nc-nd/4.0/ to view a copy of this license. For any use beyond those covered by this license, obtain permission by emailing info@vldb.org. Copyright is held by the owner/author(s). Publication rights licensed to the VLDB Endowment.
Proceedings of the VLDB Endowment, Vol. 17, No. 10 ISSN 2150-8097.

PVLDB Artifact Availability:
The source code, data, and/or other artifacts have been made available at https://github.com/VandinLab/FSR

1. Introduction

Pattern mining is a fundamental task in data mining that, in its most common definition (Han et al., 2007), requires finding patterns that occur more often than a given frequency threshold in a database of transactions. Pattern mining finds applications in several areas such as market basket analysis (Agrawal et al., 1993), graph databases (Aggarwal et al., 2010; Al Hasan and Zaki, 2009; Chen et al., 2009), and the analysis of spatial and temporal data (Cao et al., 2019; Ceccarello and Gamper, 2022; Ho et al., 2022).

Significant pattern mining (Pellegrina et al., 2019a; Hämäläinen and Webb, 2019) is an extension of pattern mining that, in its most general formulation, requires discovering patterns with a significant association with a binary label. The dataset is a collection of elements, where each element comprises the values of features, which may be categorical, binary, or continuous, and the value of the binary label of interest, also called the target. This formulation captures various types of patterns, such as itemsets, when all features are binary, or subgroups (Atzmueller, 2015), with more general features. This task finds applications in a wide range of domains, such as market basket analysis, medicine, and molecular biology, where finding reliable associations is paramount.

Significance is usually assessed using the statistical hypothesis testing framework. In this framework, one defines a measure of quality for patterns, and assumes the null hypothesis of no association between a pattern and the target label. The significant patterns are then the ones whose quality deviates significantly from the null distribution, that is, the distribution of the quality under the null hypothesis. The deviation from the null distribution is usually measured by a $p$-value, that is, the probability, under the null distribution, that the pattern has quality at least as large as the one observed in the dataset.

A major complication in the use of the statistical hypothesis testing framework in data mining is given by the huge number of candidate patterns that are considered, resulting in a multiple hypothesis testing problem. With a huge number of candidate patterns, some non-significant patterns display a substantial deviation from the null distribution just by chance. Therefore, it is critical to account for testing multiple hypotheses when mining significant patterns, in order to avoid reporting a large number of spurious discoveries. Several methods have been proposed to deal with multiple hypothesis testing (Benjamini and Hochberg, 1995; Bonferroni, 1936; Westfall and Young, 1993). While these methods provide various guarantees, the one most commonly considered is the Family-Wise Error Rate (FWER), which is the probability of reporting in output one or more false discoveries.

Current approaches for significant pattern mining with guarantees on the FWER belong to one of two classes. The first class is given by approaches that assess the significance of each single pattern (e.g., through a $p$-value or related quantities), and then perform an analytical correction to account for multiple hypothesis testing (Webb, 2006, 2007, 2008). A widely used procedure in this class is the Bonferroni correction (Bonferroni, 1936), which computes a corrected $p$-value by multiplying the $p$-value $p_{\mathcal{P}}$ of a pattern $\mathcal{P}$ by the number $h$ of candidate hypotheses. If patterns with corrected $p$-value $h\times p_{\mathcal{P}}$ below a threshold $\alpha$ are flagged as significant, the FWER of the output is guaranteed to be $\leq\alpha$. While these approaches and their improved versions (Terada et al., 2013; Minato et al., 2014) are fairly efficient, thanks to the use of analytical derivations, they suffer from low statistical power, that is, they often fail to identify significant patterns, because the multiple hypothesis corrections must provide guarantees for every situation, independently of the observed data.
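The Bonferroni procedure described above can be sketched in a few lines. This is a minimal illustration, not code from any of the cited methods: the function name `bonferroni_significant` and the $p$-values in the example are hypothetical.

```python
def bonferroni_significant(p_values, alpha=0.05):
    """Return {pattern: corrected p-value} for patterns flagged as significant.

    p_values maps each candidate pattern to its (uncorrected) p-value.
    The corrected p-value is h * p, with h the number of candidate hypotheses.
    """
    h = len(p_values)  # number of candidate hypotheses
    return {
        pattern: min(1.0, h * p)
        for pattern, p in p_values.items()
        if h * p <= alpha
    }


# Hypothetical p-values for three candidate patterns (illustrative only).
p_values = {"A": 0.001, "B": 0.02, "AB": 0.4}
print(bonferroni_significant(p_values))  # only pattern "A" survives the correction
```

Pattern "B" has an uncorrected $p$-value below $0.05$, but it is no longer flagged once multiplied by $h=3$; with the very large $h$ typical of pattern mining, this loss of power becomes severe.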

The second class is given by approaches that use the dataset to estimate the overall distribution of the patterns’ statistics (or corresponding measures, such as the $p$ -values) under the null hypothesis (Llinares-López et al., 2015; Pellegrina and Vandin, 2020; Terada et al., 2015). The distribution is estimated using permuted versions of the data, obtained by keeping the features in each element of the dataset fixed and randomly permuting the target labels among elements. For example, the Westfall-Young (WY) method (Westfall and Young, 1993) uses permuted datasets to estimate the quantiles of the smallest $p$ -value under the null hypotheses (or, equivalently, largest qualities or test statistics), and uses such estimate to derive a corrected threshold to flag patterns as significant. These approaches usually improve the statistical power for detecting significant patterns compared to approaches in the first class, but are often computationally demanding, since they need to mine a large number of permuted datasets to obtain good estimates of the overall distribution of the patterns’ statistics. While this can be achieved fairly efficiently for simple patterns such as itemsets (Llinares-López et al., 2015; Pellegrina and Vandin, 2020), the overall approach is impractical for more complex patterns such as subgroups, for which mining even a single dataset is extremely time-consuming.
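A max-statistic variant of this permutation scheme can be sketched as follows. It is a simplified illustration under several assumptions: patterns are itemsets given as an explicit list of `frozenset` objects, the test statistic is an empirical leverage-style quality rather than a $p$-value, and the names `max_quality` and `wy_threshold` are hypothetical. Practical methods avoid the brute-force enumeration used here.

```python
import math
import random


def max_quality(features, labels, language):
    """Largest empirical quality over all patterns (itemsets) in the language."""
    m = len(labels)
    mu = sum(labels) / m
    return max(
        sum(l - mu for s, l in zip(features, labels) if P <= s) / m
        for P in language
    )


def wy_threshold(dataset, language, alpha=0.05, n_perm=1000, seed=0):
    """Estimate the (1 - alpha)-quantile of the maximum statistic under permutations."""
    rng = random.Random(seed)
    features = [s for s, _ in dataset]
    labels = [l for _, l in dataset]  # local copy, shuffled in place below
    maxima = []
    for _ in range(n_perm):
        rng.shuffle(labels)  # features stay fixed, target labels are permuted
        maxima.append(max_quality(features, labels, language))
    maxima.sort()
    k = max(0, math.ceil((1 - alpha) * n_perm) - 1)
    return maxima[k]
```

Patterns whose statistic on the original dataset strictly exceeds the returned threshold would be flagged as significant. Every permutation requires a full pass over the pattern language, which is the cost that becomes prohibitive for complex patterns.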

An additional limitation of permutational approaches is that they focus on conditional testing (Fisher, 1922). In conditional testing one assumes that the variables of interest, in our case the frequency of patterns and the fraction of elements with target label $1$ , are the same in every dataset from the null distribution. In contrast, in unconditional testing (Barnard, 1945) one assumes that the variables of interest are the realization of corresponding random variables. Conditional testing and unconditional testing capture different assumptions regarding how data is generated and collected, that is, whether the variables of interest would be the same in different repetitions of the experiment. The choice between the two types of testing depends on the specific scenarios. However, in practice conditional tests are often used for computational reasons, since unconditional tests are much more demanding from the computational standpoint due to the need to account for uncertainties in the observed quantities. In fact, while for simple patterns such as itemsets (Pellegrina et al., 2019b) significant pattern mining procedures with (partial) unconditional testing have been designed, for more complex patterns such as subgroups no unconditional testing procedure is available.

1.1. Contributions

This work focuses on the efficient discovery of significant patterns. Our contributions are four-fold. First, we propose the first general framework to discover significant patterns that can be used for both conditional and unconditional testing. Our framework is based on a natural definition of a pattern’s quality that captures its significance, and applies to any type of pattern for which the appearance of the pattern in an element of the dataset is well defined. Such patterns include widely used ones such as itemsets, subgroups, sequential patterns, and subgraphs. Second, we propose FSR, an algorithm for the efficient discovery of a rigorous approximation of significant patterns while controlling the Family-Wise Error Rate, which is the probability of reporting even a single false discovery. FSR uses a few-shot resampling approach, that is, it mines a small number of resampled datasets, obtained by keeping the features of each element of the dataset fixed and assigning i.i.d. values to the target. Moreover, FSR can leverage any existing algorithm for mining the patterns of interest. Third, we provide novel tight theoretical results relating the distribution of patterns’ qualities under the conditional and the unconditional distributions, and relating the estimated maximum deviation of patterns’ qualities in resampled datasets with their (unknown) true quality in the corresponding distribution. These results are crucial in making our approach practical, since they imply that mining a small number of resampled datasets is enough to identify significant patterns. Fourth, we consider significant subgroup mining as a test case in our extensive empirical evaluation. We use our algorithm FSR to derive the first approach for mining significant subgroups with unconditional testing, and an approach for the conditional testing scenario which is much more efficient than permutation testing while maintaining an extremely high power. For the most challenging datasets, FSR is the only approach that makes it possible to mine significant subgroups within reasonable time. More importantly, we remark that the considered test case of subgroups is representative of other settings. In fact, we expect the substantial improvements obtained by FSR to transfer to other cases (i.e., other pattern types), given the generality of our framework and the characteristics of permutational approaches, which are shared by all significant pattern mining tasks.

Due to space constraints, we defer some of the proofs and additional results to the Appendix.

2. Related Works

Our work focuses on efficiently mining significant patterns. We now discuss the previous works most related to our contributions. We refer the reader to recent comprehensive reviews and tutorials for an overview of commonly used techniques for mining significant patterns (Hämäläinen and Webb, 2019; Pellegrina et al., 2019a).

Several approaches (Webb, 2006, 2007, 2008) have used general methods for multiple hypotheses testing, such as Bonferroni (Bonferroni, 1936) and Holm methods (Holm, 1979), within significant pattern mining. Such methods result in low statistical power, since they correct the $p$ -value, or a measure of the significance of a pattern, by the number of candidate hypotheses (i.e., the number of patterns), which is extremely large in data mining applications. LAMP (Terada et al., 2013; Minato et al., 2014) is a recently introduced method that partially addresses such issue by selecting a subset of patterns for testing, while discarding the patterns with no chance of being significant. Such approach leads to identifying an improved (i.e., smaller) correction factor, resulting in higher statistical power while controlling the FWER. However, since each selected pattern is still tested as a distinct hypothesis, LAMP still leads to reduced statistical power (Llinares-López et al., 2015; Terada et al., 2015). Our algorithm FSR instead uses resampled datasets, taking into account dependencies among patterns, to achieve high statistical power.

Several permutation-based methods have been proposed to identify patterns significantly associated with a binary target label. The methods of (Llinares-López et al., 2015; Pellegrina and Vandin, 2020; Terada et al., 2015) use the Westfall-Young (WY) permutation test (Westfall and Young, 1993). The WY permutation procedure requires estimating the $\delta$-quantile of the distribution of the minimum (over all patterns) $p$-value (or, equivalently, of the distribution of the maximum deviation, over all patterns, of the measure of significance). The use of such quantiles makes WY permutation testing more powerful than LAMP, but also more computationally expensive, since the estimation of such quantiles requires mining a large number of permuted datasets. For this reason, such methods may require impractical computational resources. Our algorithm FSR instead requires mining only a small number of resampled datasets, leading to an efficient approach even for complex patterns such as subgroups. In addition, permutation-based methods only consider the conditional distribution, where patterns’ frequencies and the fraction of elements with target label equal to $1$ are assumed fixed in all permuted datasets. Our approach instead works for both the conditional and the unconditional distribution.

The identification of significant patterns with unconditional testing has received scant attention. This is due, in part, to the higher computational cost required for assessing the significance of even a single hypothesis in the unconditional setting, for example using Barnard’s test (Barnard, 1945). As far as we know, the only approach for mining significant patterns in a partially unconditional setting is SPuManTE (Pellegrina et al., 2019b). In the partially unconditional setting considered by SPuManTE only the frequencies of the patterns are not fixed, while the target is fixed by design. In contrast our algorithm FSR considers a fully unconditional setting, where the fraction of elements with target label equal to $1$ is not fixed either. Moreover, SPuManTE leverages specialized techniques for mining significant itemsets that cannot be easily generalized to other pattern types, while our approach applies directly to several pattern languages, including itemsets, sequential patterns, and subgroups.

Other works, orthogonal to ours, consider improving the diversity or limiting redundancy of the output (Van Leeuwen and Knobbe, 2012; Kalofolias et al., 2017; Dalleiger and Vreeken, 2022).

3. Preliminaries

We consider a dataset $\mathcal{D}$ as a collection of $m$ transactions $\mathcal{D}=\left\{(s_{1},\ell_{1}),\dots,(s_{m},\ell_{m})\right\}$, where each transaction $(s,\ell)\in\mathcal{D}$ is composed of a set $s$ of $d$ features, either binary, categorical, or continuous, and a binary target variable $\ell\in\{0,1\}$. More generally, we assume that $s$ belongs to a domain $\mathcal{X}$. By defining the multisets $\mathcal{A}=\left\{s_{1},\dots,s_{m}\right\}$ and $\mathcal{T}=\left\{\ell_{1},\dots,\ell_{m}\right\}$, a dataset $\mathcal{D}$ is also represented by the pair $\mathcal{D}=(\mathcal{A},\mathcal{T})$. We assume to have a language $\mathcal{L}$ containing the patterns of potential interest. This scenario captures widely used pattern mining tasks, such as: itemset mining, where all features correspond to (binary) items and the language $\mathcal{L}$ corresponds to the set of all itemsets, i.e., (non-empty) subsets of items; and subgroup mining, where the language $\mathcal{L}$ contains all subgroups, i.e., the conjunctions with at most $z$ conditions over features from $\mathcal{A}$, where each condition is either an equality, on a categorical feature, or an interval, on a continuous feature.

Given a transaction $(s,\ell)$, we use the notation $\mathcal{P}\in s$ to say that the set $s$ of features supports pattern $\mathcal{P}$, where the meaning of $\mathcal{P}\in s$ depends on the specific data mining task. For example, for itemset mining, $\mathcal{P}\in s$ means that the pattern $\mathcal{P}$ is contained in the set $s$, while for subgroup mining it means that the conditions defined by $\mathcal{P}$ are all satisfied by the features of $s$. We define the set $\mathsf{C}_{\mathcal{P}}(\mathcal{D})$ of transactions in the dataset $\mathcal{D}$ that support a pattern $\mathcal{P}$ as $\mathsf{C}_{\mathcal{P}}(\mathcal{D})=\left\{(s,\ell)\in\mathcal{D}:\mathcal{P}\in s\right\}$. The frequency $\mathsf{f}_{\mathcal{P}}(\mathcal{D})$ of a pattern $\mathcal{P}$ in the dataset $\mathcal{D}$ is the fraction of transactions of $\mathcal{D}$ that support $\mathcal{P}$: $\mathsf{f}_{\mathcal{P}}(\mathcal{D})=\frac{1}{m}\sum_{i=1}^{m}\mathds{1}\left[\mathcal{P}\in s_{i}\right]=\frac{|\mathsf{C}_{\mathcal{P}}(\mathcal{D})|}{m}$.

Finally, let $\mu(\mathcal{D})$ denote the average value of the target $\ell$ for transactions in the dataset $\mathcal{D}$ : $\mu(\mathcal{D})=\frac{1}{|\mathcal{D}|}\sum_{(s,\ell)\in\mathcal{D}}\ell$ .
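For the itemset case, these definitions translate directly into code. The following is a minimal sketch: the dataset is assumed to be a list of `(frozenset, label)` pairs, and the helper names `supports`, `frequency`, and `mean_target`, as well as the toy data, are hypothetical.

```python
def supports(pattern, s):
    """Itemset case: s supports the pattern when the pattern is a subset of s."""
    return pattern <= s


def frequency(pattern, dataset):
    """Fraction of transactions of the dataset that support the pattern."""
    return sum(supports(pattern, s) for s, _ in dataset) / len(dataset)


def mean_target(dataset):
    """Average value of the binary target over the dataset."""
    return sum(label for _, label in dataset) / len(dataset)


# Toy dataset of m = 4 transactions (illustrative values only).
toy = [
    (frozenset({"a", "b"}), 1),
    (frozenset({"a"}), 1),
    (frozenset({"b"}), 0),
    (frozenset({"a", "b", "c"}), 0),
]
print(frequency(frozenset({"a", "b"}), toy))  # 0.5
print(mean_target(toy))                       # 0.5
```

For other pattern types, only `supports` would change (for subgroups, it would check that every condition of the conjunction holds on the features of $s$).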

3.1. Significant Patterns

Our goal is to find significant patterns, where a pattern $\mathcal{P}$ is significant if the presence of $\mathcal{P}$ in a transaction is associated with the target variable of the transaction being $1$. In particular, we consider the significance of a pattern in the framework of statistical significance, assuming that the transactions $\left\{(s_{1},\ell_{1}),\dots,(s_{m},\ell_{m})\right\}$ constituting the dataset $\mathcal{D}$ are samples from an unknown distribution $\gamma$. A pattern $\mathcal{P}$ is associated with the target variable if the probability of the event “$\mathcal{P}\in s$ and $\ell=1$” is higher than the corresponding probability when the event “$\mathcal{P}\in s$” and the event “$\ell=1$” are independent. Formally, this corresponds to considering the following null hypothesis for a pattern $\mathcal{P}$:

\Pr_{(s,\ell)\sim\gamma}\left(\mathcal{P}\in s\wedge\ell=1\right)=\Pr_{(s,\ell)\sim\gamma}\left(\mathcal{P}\in s\right)\Pr_{(s,\ell)\sim\gamma}\left(\ell=1\right).

Since we are interested in patterns with a significant association with the target variable being $1$ , we are interested in finding patterns for which the following alternative hypothesis holds:

\Pr_{(s,\ell)\sim\gamma}\left(\mathcal{P}\in s\wedge\ell=1\right)>\Pr_{(s,\ell)\sim\gamma}\left(\mathcal{P}\in s\right)\Pr_{(s,\ell)\sim\gamma}\left(\ell=1\right).

To this end, we define the quality of a pattern $\mathcal{P}$ (the quality $\mathsf{q}_{\mathcal{P}}$ is often called leverage for general patterns (Hämäläinen and Webb, 2019), but we use the term quality given its relation to the $1$-quality commonly employed to find interesting subgroups) as

\mathsf{q}_{\mathcal{P}}=\Pr_{(s,\ell)\sim\gamma}\left(\mathcal{P}\in s\wedge\ell=1\right)-\Pr_{(s,\ell)\sim\gamma}\left(\mathcal{P}\in s\right)\Pr_{(s,\ell)\sim\gamma}\left(\ell=1\right).

Note that the alternative hypothesis is equivalent to $\mathsf{q}_{\mathcal{P}}>0$ , and the null hypothesis is equivalent to $\mathsf{q}_{\mathcal{P}}=0$ . Therefore, finding significant patterns is equivalent to finding patterns with quality $\mathsf{q}_{\mathcal{P}}>0$ .

For example, consider the study of the association between the characteristics of the users of an online social network and users’ interests in a given topic. In this case, each user is a transaction, users’ characteristics are the features, and being interested or not in the topic defines the target variable. Significant patterns in this example are associations between users’ characteristics and users’ interests that are significantly stronger than expected under the null hypothesis of independence between characteristics and interests. For example, if the probability that a user has a given binary feature $f$ is $0.3$ , and the probability that a user is interested in the topic is $0.5$ , then under the null hypothesis of independence we have that the probability that a user has feature $f$ and is interested in the topic is $0.15$ . Therefore, the feature $f$ is significantly associated with the topic of interest if the actual probability that a user has feature $f$ and is interested in the topic is $>0.15$ .
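To make the arithmetic of this example concrete, suppose, purely as an illustrative assumption, that the actual probability that a user has feature $f$ and is interested in the topic is $0.2$. The quality of $f$ then evaluates to:

```latex
\mathsf{q}_{f}
  = \Pr\left(f\in s\wedge\ell=1\right)-\Pr\left(f\in s\right)\Pr\left(\ell=1\right)
  = 0.2 - 0.3\times 0.5
  = 0.05 > 0 .
```

A joint probability of exactly $0.15$ would instead give $\mathsf{q}_{f}=0$, which is the null hypothesis of independence.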

Task Definition. Given a dataset $\mathcal{D}$ and the corresponding pattern language $\mathcal{L}$ , our goal is to identify significant patterns, that is, patterns $\mathcal{P}\in\mathcal{L}$ with quality $\mathsf{q}_{\mathcal{P}}>0$ . Since the distribution $\gamma$ is unknown and we have access only to the dataset $\mathcal{D}$ comprising transactions sampled from $\gamma$ , we cannot hope to discover all significant patterns without errors. As a consequence, we must resort to approximations. In particular, define the subset $\mathcal{L}^{\star}$ of the language $\mathcal{L}$ of patterns for which the null hypothesis holds:

\mathcal{L}^{\star}=\left\{\mathcal{P}\in\mathcal{L}:\mathsf{q}_{\mathcal{P}}=0\right\}.

Our task is then to produce a subset $O\subseteq\mathcal{L}$ of all patterns in the language with Family-Wise Error Rate (FWER) below a user-defined threshold $\delta$ , such that the probability that $O$ contains at least one element from $\mathcal{L}^{\star}$ is at most $\delta$ :

\Pr_{\mathcal{D}}\left(O\cap\mathcal{L}^{\star}\neq\emptyset\right)\leq\delta. \tag{1}

Note that (1) implies that $O$ is a false-discovery-free approximation, a notion that has previously been considered for other pattern mining tasks (Riondato and Vandin, 2020; Santoro et al., 2020).

Since $\mathsf{q}_{\mathcal{P}}$ depends on the unknown distribution $\gamma$, we define a statistic ${\bar{\mathsf{q}}_{\mathcal{P}}(\mathcal{D})}$ that corresponds to an estimate of $\mathsf{q}_{\mathcal{P}}$ from data and that we will use in our algorithm FSR to identify significant patterns. For each pattern $\mathcal{P}$, we define the functions $f_{\mathcal{P}}$ and $g_{\mathcal{P}}$ as follows: each $f_{\mathcal{P}}:\mathcal{X}\rightarrow\{0,1\}$ is defined as $f_{\mathcal{P}}(s)=\mathds{1}\left[\mathcal{P}\in s\right]$, such that $f_{\mathcal{P}}(s)=1$ if $\mathcal{P}\in s$, and $f_{\mathcal{P}}(s)=0$ otherwise; each $g_{\mathcal{P}}:\mathcal{X}\times\{0,1\}\rightarrow[-\mu(\mathcal{D}),1-\mu(\mathcal{D})]$ is defined as $g_{\mathcal{P}}(s,\ell)=f_{\mathcal{P}}(s)(\ell-\mu(\mathcal{D}))$. Then, the estimate ${\bar{\mathsf{q}}_{\mathcal{P}}(\mathcal{D})}$ of $\mathsf{q}_{\mathcal{P}}$ for pattern $\mathcal{P}$ from the dataset $\mathcal{D}$ is

{\bar{\mathsf{q}}_{\mathcal{P}}(\mathcal{D})}=\frac{1}{m}\sum_{i=1}^{m}g_{\mathcal{P}}(s_{i},\ell_{i}).

Interestingly, ${\bar{\mathsf{q}}_{\mathcal{P}}(\mathcal{D})}$ corresponds to the $1$ -quality commonly used in subgroup mining (Atzmueller, 2015) (see Section A.1), even if the way it is used in our algorithm FSR (see Section 4) is different from its usual application, due to our focus on statistically significant patterns.
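The estimate can be computed in a single pass over the data. The sketch below assumes itemset patterns and a dataset stored as a list of `(frozenset, label)` pairs; the names `empirical_quality` and `quality_via_means` and the toy data are hypothetical. The second function checks the equivalent form obtained by expanding the sum, namely the frequency of the pattern times the difference between the average target among supporting transactions and the overall average target.

```python
import math


def empirical_quality(pattern, dataset):
    """(1/m) * sum_i f_P(s_i) * (l_i - mu(D)) for an itemset pattern."""
    m = len(dataset)
    mu = sum(label for _, label in dataset) / m
    return sum(label - mu for s, label in dataset if pattern <= s) / m


def quality_via_means(pattern, dataset):
    """Same quantity written as f_P(D) * (mean target in support - mu(D))."""
    m = len(dataset)
    mu = sum(label for _, label in dataset) / m
    support = [label for s, label in dataset if pattern <= s]
    if not support:
        return 0.0
    return (len(support) / m) * (sum(support) / len(support) - mu)


# Toy dataset (illustrative values only).
toy = [
    (frozenset({"a", "b"}), 1),
    (frozenset({"a"}), 1),
    (frozenset({"b"}), 0),
    (frozenset({"a", "b", "c"}), 0),
]
q = empirical_quality(frozenset({"a"}), toy)
print(q)  # 0.125
print(math.isclose(q, quality_via_means(frozenset({"a"}), toy)))  # True
```

Item `a` appears in three of the four transactions, two of which have target $1$, so its empirical quality is positive; item `b` has the opposite behavior and obtains a negative value.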

Intuitively, we expect significant patterns to have a sufficiently high empirical quality ${\bar{\mathsf{q}}_{\mathcal{P}}(\mathcal{D})}$ measured on the data $\mathcal{D}$, since ${\bar{\mathsf{q}}_{\mathcal{P}}(\mathcal{D})}$ estimates the (true) quality $\mathsf{q}_{\mathcal{P}}$ of $\mathcal{P}$. To take into account the multiple hypothesis testing issue described above, we need to identify a threshold $\varepsilon$ such that reporting in output all patterns with empirical quality $\geq\varepsilon$ has bounded FWER. Note that $\varepsilon$ should be as small as possible in order to have high statistical power (i.e., to report the largest set of results with guarantees on false discoveries). A critical quantity we study to address this issue is the supremum deviation of the empirical qualities of non-significant patterns, defined as

\sup_{\mathcal{P}\in\mathcal{L}^{\star}}\left\{{\bar{\mathsf{q}}_{\mathcal{P}}(\mathcal{D})}\right\}. \tag{2}

We address this challenge with FSR: we derive novel analytical bounds on the concentration of the supremum deviation (2), and build on these tools an efficient few-shot resampling algorithm to sharply estimate it. FSR achieves high statistical power while scaling to large datasets and complex languages. Our approach applies to both conditional and unconditional testing, both of great interest in data mining.

3.2. Conditional and Unconditional Testing

When assessing the statistical significance of a pattern $\mathcal{P}$ , one has to choose between conditional and unconditional tests. A conditional test assumes that the data-generating process represented by the unknown distribution $\gamma$ only produces datasets with $m$ transactions in which both the frequency $\mathsf{f}_{\mathcal{P}}(\mathcal{D})$ of pattern $\mathcal{P}$ and the fraction $\mu(\mathcal{D})$ of transactions with target value $1$ are the same as in the observed dataset; that is, it conditions on the observed variables of interest. In contrast, unconditional tests assume that $\mathsf{f}_{\mathcal{P}}(\mathcal{D})$ and $\mu(\mathcal{D})$ are the realization of corresponding random variables. Unconditional tests therefore assess the association between a pattern and class labels considering also scenarios (i.e., datasets) where all frequencies of the patterns and/or the average target value may differ from what is observed in the data. Equivalently, conditional tests and unconditional tests are based on different assumptions regarding how data is generated and collected, namely, whether the variables of interest would be the same in a different repetition of the experiment (conditional tests) or not (unconditional tests).

Consider for example the scenario of online social networks described in Section 3.1, and assume for simplicity that we are interested in associations between the single features and the target. If the data is collected so that the total number of transactions for each value of the target is fixed and the fraction of transactions with given values of the features is fixed as well, then conditional testing is more appropriate. If instead one collects the data without constraints on the features/target values (e.g., simply collecting as many transactions as possible), then unconditional testing is more appropriate.

Conditional tests and unconditional tests are therefore both valid and of interest for data mining applications, and the choice between the two classes depends on the specific scenario, even if in practice conditional tests are usually preferred for computational reasons, since unconditional tests need to take into account more uncertainties in the observed quantities. In what follows, we introduce a general algorithm to identify significant patterns for both conditional testing and unconditional testing. Our algorithm is extremely efficient in both cases due to the use of few-shot resampling.

4. FSR Algorithm

We now describe our algorithm FSR (Algorithm 1) to find significant patterns for both conditional testing and unconditional testing. We first present the general approach, that is common to both testing scenarios, and then present the details for conditional testing in Section 4.1, and the details for unconditional testing in Section 4.2.

In a nutshell, FSR identifies significant patterns from a dataset $\mathcal{D}=\left\{(s_{1},\ell_{1}),\dots,(s_{m},\ell_{m})\right\}$ using a few-shot resampling approach to compute rigorous probabilistic bounds to the deviation of the estimated qualities ${\bar{\mathsf{q}}_{\mathcal{P}}(\mathcal{D})}$ of the patterns, under the null hypothesis of no association between the patterns and the target label. It then reports in output all patterns with estimated quality ${\bar{\mathsf{q}}_{\mathcal{P}}(\mathcal{D})}$ above such deviation.

FSR considers a collection $\mathcal{R}^{\star}=\{\mathcal{D}^{\star}_{1},\dots,\mathcal{D}^{\star}_{c}\}$ of $c\geq 1$ i.i.d. resampled datasets, each obtained by resampling the target labels of $\mathcal{D}$ while maintaining the same features of the transactions of $\mathcal{D}$ . Each resampled dataset is $\mathcal{D}^{\star}_{j}=\left\{(s_{1},\xi_{1,j}),\dots,(s_{m},\xi_{m,j})\right\}$ , where $\xi_{i,j}$ are i.i.d. random variables with

\displaystyle\xi_{i,j}\sim Bern(p),\forall i\in[1,m],\forall j\in[1,c],

and $Bern(p)$ is the Bernoulli random variable with parameter $p$ (i.e., it is $1$ with probability $p$ , and $0$ otherwise).
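The resampling step can be sketched as follows. This is a minimal illustration using only the Python standard library; the function name `resample_target`, its signature, and the `seed` parameter are assumptions made for the example and are not taken from the FSR implementation.

```python
import random


def resample_target(features, p, c, seed=None):
    """Build c resampled datasets: same features, i.i.d. Bernoulli(p) targets.

    features is the list [s_1, ..., s_m]; the result is a list of c datasets,
    each a list of (s_i, xi_ij) pairs.
    """
    rng = random.Random(seed)
    return [
        [(s, 1 if rng.random() < p else 0) for s in features]
        for _ in range(c)
    ]


# Example: c = 3 resampled versions of a dataset with m = 4 transactions.
features = [frozenset({"a", "b"}), frozenset({"a"}), frozenset({"b"}), frozenset({"c"})]
for resampled in resample_target(features, p=0.5, c=3, seed=42):
    print([label for _, label in resampled])
```

Unlike a permutation of the observed labels, the number of transactions with target $1$ is not fixed here and varies from one resampled dataset to the next.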

We now describe the general approach followed by FSR (Algorithm 1). FSR starts by computing an upper bound $\varepsilon_{T}$ to the deviation between the average target value $\mu(\mathcal{D})$ observed in $\mathcal{D}$ and its expected value $\mu=\mathop{\mathbb{E}}_{\mathcal{D}}\left[\mu(\mathcal{D})\right]$ under the null hypothesis (line 1). Note that $\mu$ is the probability that a sample from the unknown distribution $\gamma$ has target label $\ell$ equal to $1$ , that is $\mu=\Pr_{(s,\ell)}\left(\ell=1\right)$ . The computation of $\varepsilon_{T}$ is performed by the procedure boundTarget and it depends on whether one is interested in conditional testing or in unconditional testing. The details of boundTarget for the two settings are described in Section 4.1 and in Section 4.2, respectively. Then FSR uses $\varepsilon_{T}$ to obtain the upper bound $\hat{\mu}$ (line 1) and lower bound $\check{\mu}$ (line 1) to $\mu$ (we trivially assume $0\leq\check{\mu}\leq\hat{\mu}\leq 1$ ). It then uses the procedure resampleTarget to generate $c$ resampled datasets $\mathcal{R}^{\star}$ (line 1) assigning to each transaction in the datasets of $\mathcal{R}^{\star}$ the target label $1$ with probability $p=\hat{\mu}$ (i.e, the upper bound to $\mu$ ). Note that the resampleTarget procedure is the same for both conditional and unconditional testing. The algorithm then computes, from each of the $c$ resampled datasets of $\mathcal{R}^{\star}$ , an estimate of the maximum deviation of the empirical quality ${\bar{\mathsf{q}}_{\mathcal{P}}(\mathcal{D})}$ for non-significant patterns. This is achieved by computing, for every resampled dataset $\mathcal{D}^{\star}_{j}$ , the quantity $\sup_{\mathcal{P}\in\mathcal{L}}\bar{\mathsf{q}}_{\mathcal{P}}(\mathcal{D}^{% \star}_{j},\check{\mu})$ , where $\bar{\mathsf{q}}_{\mathcal{P}}(\mathcal{D}^{\star}_{j},\check{\mu})=\frac{1}{m% }\sum_{i=1}^{m}f_{\mathcal{P}}(s_{i})(\xi_{i,j}-\check{\mu})$ . 
The value $\sup_{\mathcal{P}\in\mathcal{L}}\bar{\mathsf{q}}_{\mathcal{P}}(\mathcal{D}^{\star}_{j},\check{\mu})$ can be interpreted as an empirical estimate of the maximum quality of non-significant patterns, measured from datasets sampled from the null distribution. Note that we use $\bar{\mathsf{q}}_{\mathcal{P}}(\mathcal{D}^{\star}_{j},\check{\mu})$ as $\check{\mu}$ is a lower bound to $\mu$; consequently, $\bar{\mathsf{q}}_{\mathcal{P}}(\mathcal{D}^{\star}_{j},\check{\mu})$ provides a proper upper bound to such maximum deviation. Moreover, it is necessary to consider the supremum over the whole language $\mathcal{L}$, since the set $\mathcal{L}^{\star}$ of true null hypotheses is unknown. We remark that the computation of $\sup_{\mathcal{P}\in\mathcal{L}}\bar{\mathsf{q}}_{\mathcal{P}}(\mathcal{D}^{\star}_{j},\check{\mu})$ can be performed with fast pattern enumeration strategies, similar to the ones leveraged by previous methods for frequent and significant pattern mining; in fact, FSR can be combined with any efficient exploration procedure, such as the ones that explore the search space of the pattern language of interest using pruning bounds, for instance depth-first (Minato et al., 2014; Llinares-López et al., 2015; Terada et al., 2015) or best-first searches (Pietracaprina and Vandin, 2007; Pellegrina and Vandin, 2020; Pellegrina et al., 2022). The algorithm then stores the empirical deviations $\sup_{\mathcal{P}\in\mathcal{L}}\bar{\mathsf{q}}_{\mathcal{P}}(\mathcal{D}^{\star}_{j},\check{\mu})$ in the variables $d_{j}$. Then, the average maximum deviation $\tilde{d}(\mathcal{R}^{\star},\check{\mu})$ over the $c$ resampled datasets is computed (line 1) as the mean of the values $\{d_{j},j\in[1,c]\}$, and it is then used to compute a rigorous upper bound $\varepsilon$ to the supremum deviation $\sup_{\mathcal{P}\in\mathcal{L}^{\star}}\left\{{\bar{\mathsf{q}}_{\mathcal{P}}(\mathcal{D})}\right\}$ (Eq. (2)) under the null hypothesis (line 1) through the procedure boundStatistic. The implementation of this procedure, i.e., the returned value of $\varepsilon$, depends on whether conditional testing or unconditional testing is considered, and it is described in Section 4.1 and Section 4.2, respectively. In both cases, we use advanced concentration bounds (Boucheron et al., 2013; McDiarmid, 1989) that allow us to obtain small values of $\varepsilon$ with a small number $c$ of resampled datasets. Finally, the set of patterns with estimated quality ${\bar{\mathsf{q}}_{\mathcal{P}}(\mathcal{D})}$ at least $\varepsilon+\varepsilon_{T}\mathsf{f}_{\mathcal{P}}(\mathcal{D})$ is reported in output (lines 1-1).

Input: Pattern language $\mathcal{L}$; dataset $\mathcal{D}$ with $m$ transactions; $c\geq 1$; $\delta\in(0,1)$

Output: Set $O\subseteq\mathcal{L}$ of significant patterns with FWER $\leq\delta$

$\varepsilon_{T}\leftarrow$ boundTarget($\mu(\mathcal{D})$, $m$, $\delta$);

$\hat{\mu}\leftarrow\mu(\mathcal{D})+\varepsilon_{T}$;

$\check{\mu}\leftarrow\mu(\mathcal{D})-\varepsilon_{T}$;

$\mathcal{R}^{\star}\leftarrow$ resampleTarget($\mathcal{D},c,\hat{\mu}$);

forall $j\in[1,c]$ do $d_{j}\leftarrow\sup_{\mathcal{P}\in\mathcal{L}}\bigl\{\bar{\mathsf{q}}_{\mathcal{P}}(\mathcal{D}^{\star}_{j},\check{\mu})\bigr\}$;

$\tilde{d}(\mathcal{R}^{\star},\check{\mu})\leftarrow\frac{1}{c}\sum_{j=1}^{c}d_{j}$;

$\varepsilon\leftarrow$ boundStatistic($\mathcal{D}$, $\mathcal{L}$, $\tilde{d}(\mathcal{R}^{\star},\check{\mu})$, $\delta$);

$O\leftarrow\left\{\mathcal{P}\in\mathcal{L}:{\bar{\mathsf{q}}_{\mathcal{P}}(\mathcal{D})}\geq\varepsilon+\varepsilon_{T}\mathsf{f}_{\mathcal{P}}(\mathcal{D})\right\}$;

return $O$;

Algorithm 1 FSR
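The steps of Algorithm 1 can be sketched in Python as follows. This is only an illustrative sketch under simplifying assumptions that are not part of the paper: each pattern is given explicitly as a 0/1 indicator vector over the $m$ transactions, the supremum over $\mathcal{L}$ is taken by brute force over that list (a real implementation would use a pruned pattern enumeration), and the names `fsr`, `empirical_quality`, `bound_target`, and `bound_statistic` are hypothetical. The two bounding procedures are passed as callables, so the same skeleton serves both testing settings.

```python
import random
from typing import Callable, List, Sequence


def empirical_quality(indicator: Sequence[int], labels: Sequence[int],
                      center: float) -> float:
    """Return (1/m) * sum_i f_P(s_i) * (label_i - center)."""
    m = len(labels)
    return sum(f * (l - center) for f, l in zip(indicator, labels)) / m


def fsr(indicators: List[Sequence[int]], labels: Sequence[int], c: int,
        delta: float, bound_target: Callable, bound_statistic: Callable,
        rng: random.Random) -> List[int]:
    """Toy version of Algorithm 1; returns indices of reported patterns."""
    m = len(labels)
    mu_d = sum(labels) / m                      # observed average target
    eps_t = bound_target(mu_d, m, delta)        # boundTarget
    mu_hat = min(1.0, mu_d + eps_t)             # upper bound to mu
    mu_check = max(0.0, mu_d - eps_t)           # lower bound to mu
    d = []
    for _ in range(c):                          # resampleTarget + supremum
        xi = [1 if rng.random() < mu_hat else 0 for _ in range(m)]
        d.append(max(empirical_quality(ind, xi, mu_check)
                     for ind in indicators))
    d_tilde = sum(d) / c                        # average maximum deviation
    # Assumed signature; the paper's boundStatistic takes (D, L, d_tilde, delta).
    eps = bound_statistic(indicators, d_tilde, m, c, delta)
    return [idx for idx, ind in enumerate(indicators)
            if empirical_quality(ind, labels, mu_d)
            >= eps + eps_t * sum(ind) / m]
```

Passing the bounds as functions keeps the resampling loop identical for conditional and unconditional testing, mirroring the remark that only boundTarget and boundStatistic differ between the two settings.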

Note that while the computation of $\varepsilon_{T}$ and $\varepsilon$ depends on whether conditional testing or unconditional testing is considered, the overall approach followed by FSR is the same in both cases. In particular, for both cases FSR relies on the resampled datasets $\mathcal{R}^{\star}$ to estimate the maximum deviation, over all patterns $\mathcal{P}\in\mathcal{L}$, of the estimate ${\bar{\mathsf{q}}_{\mathcal{P}}(\mathcal{D})}$ from the (unknown) value $\mathsf{q}_{\mathcal{P}}$ under the null hypothesis. Moreover, the exact same procedure is used to generate the resampled datasets. Note that our approach is similar to permutation approaches for significant pattern mining, which use permuted datasets to estimate the significance of patterns, but with two crucial differences. First, our datasets are obtained by resampling the target values, and not by permuting them, which allows us to obtain rigorous bounds for both conditional and unconditional testing, while permutation approaches can be used for conditional testing only. Second, since our analysis depends on the expectation of the maximum deviation, we can employ advanced concentration bounds for functions of independent random variables around their expected value; this allows us to use a small number $c$ of resampled datasets, as shown by our analysis and experimental evaluation. This is in contrast with permutation approaches (e.g., the ones based on WY permutation testing (Llinares-López et al., 2015; Pellegrina and Vandin, 2020; Terada et al., 2015)) that instead estimate the quantiles of the distribution of the maximum deviation using a large number of permutations (see also Section A.2 in the Appendix for a more detailed comparison).

4.1. FSR for Conditional Testing

We now describe the details of procedures boundTarget and boundStatistic for the version of FSR that uses conditional testing, which we refer to as FSR-C. As a reminder, a conditional test for our problem assumes that the average target value $\mu$ and the pattern frequencies $\mathsf{f}_{\mathcal{P}}(\mathcal{D})$ are fixed, for all $\mathcal{P}\in\mathcal{L}$, to the values observed in the dataset $\mathcal{D}$.

For boundTarget, since the average target value $\mu$ is fixed to the value $\mu(\mathcal{D})$ observed in the dataset $\mathcal{D}$ , the bound $\varepsilon_{T}$ on its deviation from the expectation is $0$ , that is, boundTarget simply returns $0$ . Note that this implies that the output of FSR-C consists of all patterns in $\mathcal{L}$ with ${\bar{\mathsf{q}}_{\mathcal{P}}(\mathcal{D})}\geq\varepsilon$ (line 1).

For boundStatistic, we now show how to compute a rigorous probabilistic bound $\varepsilon$ to the supremum deviation $\sup_{\mathcal{P}\in\mathcal{L}^{\star}}\left\{{\bar{\mathsf{q}}_{\mathcal{P}}(\mathcal{D})}\right\}$ (Eq. (2)) under the null hypothesis. The bound is obtained by computing the average maximum deviation $\tilde{d}(\mathcal{R}^{\star},\check{\mu})$ over $c$ resampled datasets, and then applying advanced concentration results (Boucheron et al., 2013). Note that since $\varepsilon_{T}=0$, in this case $\tilde{d}(\mathcal{R}^{\star},\check{\mu})=\tilde{d}(\mathcal{R}^{\star},\mu(\mathcal{D}))$.

Since FSR-C is based on resampled datasets, where each target label is sampled independently, our estimate $\tilde{d}(\mathcal{R}^{\star},\mu(\mathcal{D}))$ of the expected maximum deviation is not based on the conditional distribution assumed by conditional tests, where the fraction of target labels equal to $1$ is exactly $\mu(\mathcal{D})$ in every dataset (while in our resampled datasets such fraction may vary, and it is equal to $\mu(\mathcal{D})$ only in expectation). We therefore need to relate the supremum deviation of the observed quality of patterns on resampled datasets with the one observed in datasets sampled from the conditional distribution. Interestingly, we show that the resampling and conditional distributions are closely related, in the sense that high probability bounds for the former also apply to the latter.

We first prove a general result (Lemma 1) relating the expectation of monotone functions of permutations of binary vectors, corresponding to the conditional distribution, with the expectation taken w.r.t. independent resamples, corresponding to resampled datasets. For an integer $k$ with $0\leq k\leq m$, define the set $B(k)$ of binary vectors with $k$ entries equal to one as

$$B(k)=\Bigl\{\mathbf{v}\in\{0,1\}^{m},\ \sum_{i=1}^{m}\mathbf{v}_{i}=k\Bigr\},$$

and let $U(B(k))$ be the uniform distribution over the set $B(k)$. Equivalently, a sample from $U(B(k))$ is a uniformly random permutation of a binary vector with $k$ ones. Then, define $I(p)$ as a probability distribution over $\{0,1\}^{m}$, such that each entry $\mathbf{v}_{i}$ of a random vector $\mathbf{v}$ taken from $I(p)$ is an i.i.d. Bernoulli r.v. with $\Pr(\mathbf{v}_{i}=1)=p=1-\Pr(\mathbf{v}_{i}=0)$, for some $p\in[0,1]$.
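To make the difference between the two distributions concrete, the following minimal sketch draws one vector from each. The function names are illustrative and do not come from the paper; only the Python standard library is assumed.

```python
import random
from typing import List


def sample_conditional(m: int, k: int, rng: random.Random) -> List[int]:
    """Draw v ~ U(B(k)): a uniformly random binary vector with exactly k ones."""
    ones = set(rng.sample(range(m), k))
    return [1 if i in ones else 0 for i in range(m)]


def sample_independent(m: int, p: float, rng: random.Random) -> List[int]:
    """Draw v ~ I(p): m i.i.d. Bernoulli(p) entries."""
    return [1 if rng.random() < p else 0 for _ in range(m)]
```

A vector from `sample_conditional` always has exactly `k` ones, whereas the number of ones in a vector from `sample_independent` is Binomial and equals `p * m` only in expectation.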

Lemma 1.

Let $f:\{0,1\}^{m}\rightarrow\mathbb{R}$ be a nonnegative function such that $\mathop{\mathbb{E}}_{\mathbf{v}\sim U(B(k))}\left[f(\mathbf{v})\right]$ is either monotonically increasing or monotonically decreasing in $k$ . It holds

$$\mathop{\mathbb{E}}_{\mathbf{v}\sim U(B(k))}\left[f(\mathbf{v})\right]\leq 2\mathop{\mathbb{E}}_{\mathbf{v}\sim I(k/m)}\left[f(\mathbf{v})\right].$$

To prove Lemma 1, we need the following technical result, providing bounds to the probability that a Binomial random variable exceeds its expectation (Jogdeo and Samuels, 1968).

Lemma 2.

Let $\mu m$ be an integer. Then it holds

$$\Pr_{\mathbf{v}\sim I(\mu)}\left(\sum_{i=1}^{m}\mathbf{v}_{i}>\mu m\right)<\frac{1}{2}<\Pr_{\mathbf{v}\sim I(\mu)}\left(\sum_{i=1}^{m}\mathbf{v}_{i}\geq\mu m\right).$$
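The two inequalities of Lemma 2 can be checked numerically on small instances. The sketch below is illustrative only (the helper names are hypothetical): it evaluates both tail probabilities of a Binomial with $m$ trials and success probability $k/m$ directly from the probability mass function.

```python
from math import comb
from typing import Tuple


def binom_pmf(m: int, p: float, j: int) -> float:
    """Probability that a Binomial(m, p) variable equals j."""
    return comb(m, j) * p ** j * (1 - p) ** (m - j)


def tail_probs(m: int, k: int) -> Tuple[float, float]:
    """Return (Pr[sum > k], Pr[sum >= k]) for v ~ I(k/m), so that mu*m = k."""
    p = k / m
    strict = sum(binom_pmf(m, p, j) for j in range(k + 1, m + 1))
    weak = strict + binom_pmf(m, p, k)
    return strict, weak
```

For instance, `tail_probs(10, 5)` gives a strict tail below one half and a weak tail above one half, as the lemma states.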

We now prove Lemma 1.

Proof of Lemma 1.

We prove the result assuming that $\mathop{\mathbb{E}}_{\mathbf{v}\sim U(B(k))}\left[f(\mathbf{v})\right]$ is monotonically increasing in $k$ , as the other case is analogous. First, we note that

$$\mathop{\mathbb{E}}_{\mathbf{v}\sim U(B(k))}\left[f(\mathbf{v})\right]=\mathop{\mathbb{E}}_{\mathbf{v}\sim I(k/m)}\left[f(\mathbf{v})\mid\sum_{i=1}^{m}\mathbf{v}_{i}=k\right].$$

Therefore, we have

\begin{align*}
\mathop{\mathbb{E}}_{\mathbf{v}\sim I(k/m)}\left[f(\mathbf{v})\right]
&=\sum_{j=0}^{m}\mathop{\mathbb{E}}_{\mathbf{v}\sim I(k/m)}\left[f(\mathbf{v})\mid\sum_{i=1}^{m}\mathbf{v}_{i}=j\right]\Pr_{\mathbf{v}\sim I(k/m)}\left(\sum_{i=1}^{m}\mathbf{v}_{i}=j\right)\\
&\geq\sum_{j=k}^{m}\mathop{\mathbb{E}}_{\mathbf{v}\sim I(k/m)}\left[f(\mathbf{v})\mid\sum_{i=1}^{m}\mathbf{v}_{i}=j\right]\Pr_{\mathbf{v}\sim I(k/m)}\left(\sum_{i=1}^{m}\mathbf{v}_{i}=j\right)\\
&\geq\mathop{\mathbb{E}}_{\mathbf{v}\sim I(k/m)}\left[f(\mathbf{v})\mid\sum_{i=1}^{m}\mathbf{v}_{i}=k\right]\sum_{j=k}^{m}\Pr_{\mathbf{v}\sim I(k/m)}\left(\sum_{i=1}^{m}\mathbf{v}_{i}=j\right)\\
&=\mathop{\mathbb{E}}_{\mathbf{v}\sim U(B(k))}\left[f(\mathbf{v})\right]\Pr_{\mathbf{v}\sim I(k/m)}\left(\sum_{i=1}^{m}\mathbf{v}_{i}\geq k\right)\\
&\geq\mathop{\mathbb{E}}_{\mathbf{v}\sim U(B(k))}\left[f(\mathbf{v})\right]\frac{1}{2},
\end{align*}

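where the first inequality holds since $f$ is nonnegative, the second since $\mathop{\mathbb{E}}_{\mathbf{v}\sim U(B(j))}\left[f(\mathbf{v})\right]$ is nondecreasing in $j$, and the last inequality follows from Lemma 2. ∎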
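Lemma 1 can be verified exactly for small $m$ by enumeration. The sketch below is an illustration, not the authors' code: it computes the conditional expectation by listing all vectors in $B(k)$, and the independent expectation through the decomposition used in the first line of the proof. The helper names and the toy function `f` in the usage note are assumptions.

```python
from itertools import combinations
from math import comb
from typing import Callable, Sequence


def expect_conditional(f: Callable[[Sequence[int]], float],
                       m: int, k: int) -> float:
    """Exact E[f(v)] for v ~ U(B(k)), by enumerating all vectors with k ones."""
    total = 0.0
    for ones in combinations(range(m), k):
        v = [0] * m
        for i in ones:
            v[i] = 1
        total += f(v)
    return total / comb(m, k)


def expect_independent(f: Callable[[Sequence[int]], float],
                       m: int, p: float) -> float:
    """Exact E[f(v)] for v ~ I(p), conditioning on the number of ones."""
    return sum(comb(m, j) * p ** j * (1 - p) ** (m - j)
               * expect_conditional(f, m, j) for j in range(m + 1))
```

As a usage example, take `f` to be the largest fraction of ones covered by any of a few indicator vectors; this function is nonnegative and coordinate-wise nondecreasing, so its expectation under $U(B(k))$ is nondecreasing in $k$ and the lemma applies for every $k$.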

We make use of Lemma 1 to prove the following result.

Theorem 3.

Define the constant $\bar{\mu}=\mu(\mathcal{D})$ and let $k=\bar{\mu}m$ . For any $z\geq 0$ , it holds

\begin{align*}
&\Pr_{\mathbf{v}\sim U(B(k))}\left(\sup_{\mathcal{P}\in\mathcal{L}^{\star}}\frac{1}{m}\sum_{i=1}^{m}f_{\mathcal{P}}(s_{i})(\mathbf{v}_{i}-\bar{\mu})\geq z\right)\\
&\qquad\leq 2\Pr_{\mathbf{v}\sim I(\bar{\mu})}\left(\sup_{\mathcal{P}\in\mathcal{L}^{\star}}\frac{1}{m}\sum_{i=1}^{m}f_{\mathcal{P}}(s_{i})(\mathbf{v}_{i}-\bar{\mu})\geq z\right).
\end{align*}

Proof of Theorem 3.

Define the function $g:\{0,1\}^{m}\rightarrow\{0,1\}$ as

$$g(\mathbf{v})=\mathds{1}\left[\sup_{\mathcal{P}\in\mathcal{L}^{\star}}\frac{1}{m}\sum_{i=1}^{m}f_{\mathcal{P}}(s_{i})(\mathbf{v}_{i}-\bar{\mu})\geq z\right].$$

In order to apply Lemma 1, we prove that $\mathop{\mathbb{E}}_{\mathbf{v}\sim U(B(k))}\left[g(\mathbf{v})\right]$ is nondecreasing in $k$. This is equivalent to showing that

$$\mathop{\mathbb{E}}_{\mathbf{v}\sim U(B(k))}\left[g(\mathbf{v})\right]\leq\mathop{\mathbb{E}}_{\mathbf{v}\sim U(B(k^{\prime}))}\left[g(\mathbf{v})\right],$$

where $k^{\prime}=k+j$, for any pair of integers $k\in[0,m]$ and $j\in[0,m-k]$. To do so, we build a coupling $\pi(k,k^{\prime})$ between the two distributions $U(B(k))$ and $U(B(k^{\prime}))$ as follows. For any $\mathbf{v}$ taken from $U(B(k))$, define $\mathbf{v}^{\prime}$ as a copy of $\mathbf{v}$ (i.e., such that $\mathbf{v}_{i}=\mathbf{v}^{\prime}_{i},\forall i\in[1,m]$) that is modified according to the following procedure: for $j$ times, select uniformly at random an index $i$ such that $\mathbf{v}^{\prime}_{i}=0$, and set $\mathbf{v}^{\prime}_{i}$ to $1$. Denote the pair $\mathbf{v},\mathbf{v}^{\prime}$ sampled according to $\pi(k,k^{\prime})$ as the output of this procedure. It is immediate to observe that $\mathbf{v}^{\prime}\sim U(B(k^{\prime}))$ (i.e., that the marginal distribution of $\mathbf{v}^{\prime}$ is $U(B(k^{\prime}))$), and that $\mathbf{v}_{i}\leq\mathbf{v}^{\prime}_{i},\forall i\in[1,m]$. Since every $f_{\mathcal{P}}(s_{i})$ is nonnegative, this implies that, for any pair of vectors $\mathbf{v},\mathbf{v}^{\prime}\sim\pi(k,k^{\prime})$, it holds

$$\sup_{\mathcal{P}\in\mathcal{L}^{\star}}\frac{1}{m}\sum_{i=1}^{m}f_{\mathcal{P}}(s_{i})(\mathbf{v}_{i}-\bar{\mu})\leq\sup_{\mathcal{P}\in\mathcal{L}^{\star}}\frac{1}{m}\sum_{i=1}^{m}f_{\mathcal{P}}(s_{i})(\mathbf{v}^{\prime}_{i}-\bar{\mu}).$$

A consequence of this fact is

\begin{align*}
\mathop{\mathbb{E}}_{\mathbf{v}^{\prime}\sim U(B(k+j))}\left[g(\mathbf{v}^{\prime})\right]
&=\Pr_{\mathbf{v}^{\prime}\sim U(B(k+j))}\left(\sup_{\mathcal{P}\in\mathcal{L}^{\star}}\frac{1}{m}\sum_{i=1}^{m}f_{\mathcal{P}}(s_{i})(\mathbf{v}^{\prime}_{i}-\bar{\mu})\geq z\right)\\
&=\Pr_{\mathbf{v},\mathbf{v}^{\prime}\sim\pi(k,k^{\prime})}\left(\sup_{\mathcal{P}\in\mathcal{L}^{\star}}\frac{1}{m}\sum_{i=1}^{m}f_{\mathcal{P}}(s_{i})(\mathbf{v}^{\prime}_{i}-\bar{\mu})\geq z\right)\\
&\geq\Pr_{\mathbf{v},\mathbf{v}^{\prime}\sim\pi(k,k^{\prime})}\left(\sup_{\mathcal{P}\in\mathcal{L}^{\star}}\frac{1}{m}\sum_{i=1}^{m}f_{\mathcal{P}}(s_{i})(\mathbf{v}_{i}-\bar{\mu})\geq z\right)\\
&=\Pr_{\mathbf{v}\sim U(B(k))}\left(\sup_{\mathcal{P}\in\mathcal{L}^{\star}}\frac{1}{m}\sum_{i=1}^{m}f_{\mathcal{P}}(s_{i})(\mathbf{v}_{i}-\bar{\mu})\geq z\right)\\
&=\mathop{\mathbb{E}}_{\mathbf{v}\sim U(B(k))}\left[g(\mathbf{v})\right].
\end{align*}

Therefore, $g$ is nonnegative and $\mathop{\mathbb{E}}_{\mathbf{v}\sim U(B(k))}\left[g(\mathbf{v})\right]$ is nondecreasing in $k$ . We apply Lemma 1 to the function $g$ , obtaining the statement. ∎
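The coupling used in the proof of Theorem 3 can be sketched as follows. This is an illustrative sketch with hypothetical names; picking $j$ distinct zero positions at once is equivalent to the sequential procedure described in the proof, and patterns are again assumed to be given as explicit 0/1 indicator vectors.

```python
import random
from typing import List, Sequence


def couple(v: Sequence[int], j: int, rng: random.Random) -> List[int]:
    """Return v' obtained from v by setting j uniformly chosen zeros to one."""
    zeros = [i for i, x in enumerate(v) if x == 0]
    flipped = set(rng.sample(zeros, j))
    return [1 if i in flipped else x for i, x in enumerate(v)]


def sup_deviation(indicators: Sequence[Sequence[int]], v: Sequence[int],
                  mu_bar: float) -> float:
    """sup over patterns of (1/m) * sum_i f_P(s_i) * (v_i - mu_bar)."""
    m = len(v)
    return max(sum(f * (x - mu_bar) for f, x in zip(ind, v)) / m
               for ind in indicators)
```

Because `couple` only turns zeros into ones, the supremum deviation of the coupled vector is never smaller than that of the original vector, which is the inequality the proof relies on.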

We note that Theorem 3 is precisely what we seek: it implies that large supremum deviations that are unlikely in the independent resamples distribution $\mathbf{v}\sim I(\bar{\mu})$ are also unlikely in the conditional distribution $\mathbf{v}\sim U(B(k))$. Using Theorem 3, we prove the following result, which implies strong concentration of the supremum deviation of pattern qualities w.r.t. their expectations, where the expectations are taken w.r.t. independent resamples of the target labels rather than permutations.

Theorem 4.

Define $\bar{\mu}=\mu(\mathcal{D})$ , and

$$\omega=(1-\bar{\mu})\min\Bigl\{\bar{\mu},\sup_{\mathcal{P}\in\mathcal{L}}\frac{1}{m}\sum_{i=1}^{m}f_{\mathcal{P}}(s_{i})\Bigr\}.$$

With probability at least $1-\delta$ over $\mathbf{v}\sim U(B(k))$, where $k=\bar{\mu}m$, it holds

\begin{align*}
&\sup_{\mathcal{P}\in\mathcal{L}^{\star}}\frac{1}{m}\sum_{i=1}^{m}f_{\mathcal{P}}(s_{i})(\mathbf{v}_{i}-\bar{\mu})\\
&\qquad\leq\mathop{\mathbb{E}}_{\mathbf{v}\sim I(\bar{\mu})}\left[\sup_{\mathcal{P}\in\mathcal{L}^{\star}}\frac{1}{m}\sum_{i=1}^{m}f_{\mathcal{P}}(s_{i})(\mathbf{v}_{i}-\bar{\mu})\right]+\sqrt{\frac{2\omega\log\left(\frac{2}{\delta}\right)}{m}}.
\end{align*}

To prove Theorem 4, we need the following technical result regarding the concentration of functions of independent random variables.

Theorem 5 (Theorem 6.7 of (Boucheron et al., 2013)).

Let $g:\mathcal{Y}^{m}\rightarrow\mathbb{R}$ be a function, and let $X=(X_{1},\dots,X_{m})\in\mathcal{Y}^{m}$ be a collection of $m$ independent random variables. Define $\bar{X}^{j}=(X_{1},\dots,X_{j-1},X^{\prime}_{j},X_{j+1},\dots,X_{m})\in\mathcal{Y}^{m}$ as a copy of $X$ in which the $j$-th element $X_{j}$ is replaced by an independent copy $X^{\prime}_{j}$. Assume that, for some constant $q\geq 0$,

$$\sum_{j=1}^{m}\mathop{\mathbb{E}}\left[\left(g(X)-g(\bar{X}^{j})\right)_{+}^{2}\>|\>X\right]\leq q$$

holds almost surely. Then, it holds, for all $t\geq 0$,

$$\Pr\Bigl(g(X)\geq\mathop{\mathbb{E}}_{X}\left[g(X)\right]+t\Bigr)\leq\exp(-t^{2}/(2q)).$$

We use these results to prove Theorem 4.

Proof of Theorem 4.

Define the function $g:\{0,1\}^{m}\rightarrow\mathbb{R}$ as

$$g(\mathbf{v})=\sup_{\mathcal{P}\in\mathcal{L}^{\star}}\frac{1}{m}\sum_{i=1}^{m}f_{\mathcal{P}}(s_{i})(\mathbf{v}_{i}-\bar{\mu}).$$

We first note that, from the result of Theorem 3, it is sufficient to show that, choosing the constants $t$ and $z$ as

$$t=\sqrt{\frac{2\omega\log(\frac{2}{\delta})}{m}},\;\;\;z=\mathop{\mathbb{E}}_{\mathbf{v}\sim I(\bar{\mu})}\left[g(\mathbf{v})\right]+t,\tag{3}$$

it holds

$$\Pr_{\mathbf{v}\sim I(\bar{\mu})}\left(g(\mathbf{v})\geq z\right)\leq\delta/2.$$

To do so, we make use of Theorem 5. Define $\bar{\mathbf{v}}^{j}$ as a copy of $\mathbf{v}$ in which the $j$-th element $\mathbf{v}_{j}$ is replaced by an independent copy $\mathbf{v}^{\prime}_{j}$. We want to bound

$$\sum_{j=1}^{m}\mathop{\mathbb{E}}\left[\left(g(\mathbf{v})-g(\bar{\mathbf{v}}^{j})\right)_{+}^{2}\>|\>\mathbf{v}\right]$$

from above by some constant $q\geq 0$. Define $\mathcal{P}^{\star}$ as one of the elements of $\mathcal{L}^{\star}$ that achieve the supremum in $g(\mathbf{v})$. We observe that

\begin{align*}
\left(g(\mathbf{v})-g(\bar{\mathbf{v}}^{j})\right)_{+}
&\leq\biggl(\frac{1}{m}\sum_{i=1}^{m}f_{\mathcal{P}^{\star}}(s_{i})(\mathbf{v}_{i}-\bar{\mu})-\frac{1}{m}\sum_{i=1}^{m}f_{\mathcal{P}^{\star}}(s_{i})(\bar{\mathbf{v}}^{j}_{i}-\bar{\mu})\biggr)_{+}\\
&=\biggl(\frac{1}{m}f_{\mathcal{P}^{\star}}(s_{j})(\mathbf{v}_{j}-\bar{\mu})-\frac{1}{m}f_{\mathcal{P}^{\star}}(s_{j})(\mathbf{v}^{\prime}_{j}-\bar{\mu})\biggr)_{+}\\
&=\biggl(\frac{1}{m}f_{\mathcal{P}^{\star}}(s_{j})(\mathbf{v}_{j}-\mathbf{v}^{\prime}_{j})\biggr)_{+}.\tag{4}
\end{align*}

We now observe that $\mathbf{v}^{\prime}_{j}=0$ is the only possible value of $\mathbf{v}^{\prime}_{j}$ for which (4) can be positive, and that this value occurs with probability $1-\bar{\mu}$ under $I(\bar{\mu})$. Therefore, we have

\begin{align*}
\sum_{j=1}^{m}\mathop{\mathbb{E}}\left[\left(g(\mathbf{v})-g(\bar{\mathbf{v}}^{j})\right)_{+}^{2}\>|\>\mathbf{v}\right]
&\leq\sum_{j=1}^{m}\mathop{\mathbb{E}}\left[\left(\frac{1}{m}f_{\mathcal{P}^{\star}}(s_{j})(\mathbf{v}_{j}-\mathbf{v}^{\prime}_{j})\right)_{+}^{2}\>|\>\mathbf{v}\right]\\
&=(1-\bar{\mu})\sum_{j=1}^{m}\left(\frac{1}{m}f_{\mathcal{P}^{\star}}(s_{j})\mathbf{v}_{j}\right)^{2}\\
&=\frac{(1-\bar{\mu})}{m}\sum_{j=1}^{m}\left(\frac{1}{m}f_{\mathcal{P}^{\star}}(s_{j})\mathbf{v}_{j}\right)\\
&\leq\frac{(1-\bar{\mu})}{m}\sup_{\mathcal{P}\in\mathcal{L}^{\star}}\sum_{j=1}^{m}\left(\frac{1}{m}f_{\mathcal{P}}(s_{j})\mathbf{v}_{j}\right)\leq\frac{\omega}{m}.
\end{align*}

We apply Theorem 5 to the function $g$ with $q=\omega/m$ , obtaining that

$$\Pr_{\mathbf{v}\sim I(\bar{\mu})}\Bigl(g(\mathbf{v})\geq\mathop{\mathbb{E}}_{\mathbf{v}\sim I(\bar{\mu})}[g(\mathbf{v})]+t\Bigr)\leq\exp(-mt^{2}/(2\omega)).$$

Setting $t$ as in (3), it is immediate to observe that the probability above is $\leq\delta/2$ , obtaining the statement. ∎
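The final substitution can be written out explicitly. The short derivation below is only an illustrative expansion of this step, assuming $\omega>0$ and using the values of $t$ and $z$ from (3):

$$\exp\left(-\frac{mt^{2}}{2\omega}\right)=\exp\left(-\frac{m}{2\omega}\cdot\frac{2\omega\log\left(\frac{2}{\delta}\right)}{m}\right)=\exp\left(-\log\left(\frac{2}{\delta}\right)\right)=\frac{\delta}{2},$$

so that, by Theorem 3, $\Pr_{\mathbf{v}\sim U(B(k))}\left(g(\mathbf{v})\geq z\right)\leq 2\Pr_{\mathbf{v}\sim I(\bar{\mu})}\left(g(\mathbf{v})\geq z\right)\leq\delta$.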

Note that Theorem 4 provides a probabilistic upper bound, holding with probability at least $1-\delta$, to the maximum observed value of the pattern quality $\sup_{\mathcal{P}\in\mathcal{L}^{\star}}\frac{1}{m}\sum_{i=1}^{m}f_{\mathcal{P}}(s_{i})(\mathbf{v}_{i}-\bar{\mu})$ when the conditional distribution $\mathbf{v}\sim U(B(k))$ is considered, in terms of the expectation of the maximum observed value of the pattern quality according to the resampled distribution $\mathbf{v}\sim I(\bar{\mu})$. Then, the following result proves that the estimate $\tilde{d}(\mathcal{R}^{\star},\check{\mu})$ of the expected deviation in the upper bound above is very accurate.

Theorem 6.

With probability at least $1-\delta/4$ over $\mathcal{R}^{\star}$ , it holds

$$\mathop{\mathbb{E}}_{\mathbf{v}\sim I(\bar{\mu})}\left[\sup_{\mathcal{P}\in\mathcal{L}^{\star}}\frac{1}{m}\sum_{i=1}^{m}f_{\mathcal{P}}(s_{i})(\mathbf{v}_{i}-\bar{\mu})\right]\leq\tilde{d}(\mathcal{R}^{\star},\check{\mu})+\sqrt{\frac{\log\bigl(\frac{4}{\delta}\bigr)}{2cm}}.$$

By combining the results above, we prove the guarantees provided by FSR for the task of finding significant patterns when conditional testing is used. In particular, the following Corollary proves that, when boundStatistic returns the value $\varepsilon$ defined below, then the output set $O$ of significant patterns returned by FSR has FWER bounded by the user-defined parameter $\delta$ .

Corollary 7.

Fix $\delta\in(0,1)$ and $c\geq 1$ . Let $O$ be the output of FSR with input parameters $\mathcal{L}$ , $\mathcal{D}$ , $c$ , $\delta$ , and let

$$\varepsilon=\tilde{d}(\mathcal{R}^{\star},\check{\mu})+\sqrt{\frac{2\omega\log\bigl(\frac{4}{\delta}\bigr)}{m}}+\sqrt{\frac{\log\bigl(\frac{4}{\delta}\bigr)}{2cm}}$$

be the value returned by boundStatistic in line 1. Then, the set $O$ has FWER $\leq\delta$ under the conditional null distribution.

Note that the value of $\varepsilon$ returned by boundStatistic requires computing $\sup_{\mathcal{P}\in\mathcal{L}}\bigl\{\bar{\mathsf{q}}_{\mathcal{P}}(\mathcal{D}^{\star}_{j},\check{\mu})\bigr\}$, which is costly, but only needs to be performed on $c$ resampled datasets. As we will show in our experimental evaluation (see Section 5), small values of $c$ suffice. Moreover, the maximum frequency of a pattern in the language $\mathcal{L}$ is required in order to compute $\omega$ (see Theorem 4). Such maximum frequency can be computed very efficiently in most data mining tasks. For example, in itemset mining it corresponds to the frequency of the most frequent item; in subgroup mining, it is equal to $1$ whenever a continuous feature is present and the conditions defining the pattern language $\mathcal{L}$ include inequalities.
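A minimal sketch of how the value $\varepsilon$ of Corollary 7 could be evaluated once $\tilde{d}(\mathcal{R}^{\star},\check{\mu})$ and the maximum pattern frequency are available. The function and argument names (`omega`, `bound_statistic_conditional`, `max_freq`) are illustrative assumptions, not identifiers from the paper.

```python
from math import log, sqrt


def omega(mu_bar: float, max_freq: float) -> float:
    """omega = (1 - mu_bar) * min(mu_bar, maximum pattern frequency)."""
    return (1.0 - mu_bar) * min(mu_bar, max_freq)


def bound_statistic_conditional(d_tilde: float, mu_bar: float,
                                max_freq: float, m: int, c: int,
                                delta: float) -> float:
    """epsilon of Corollary 7 for the conditional setting (FSR-C)."""
    w = omega(mu_bar, max_freq)
    return (d_tilde
            + sqrt(2.0 * w * log(4.0 / delta) / m)
            + sqrt(log(4.0 / delta) / (2.0 * c * m)))
```

The first additive term depends only on $\omega$, $m$, and $\delta$; only the last term shrinks with the number $c$ of resampled datasets, and it does so at rate $1/\sqrt{cm}$.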

Power analysis. The results above show that FSR rigorously controls the probability of false positives (i.e., patterns for which the null hypothesis holds but that are wrongly reported in output as significant). However, they do not provide guarantees on the power of FSR, that is, its ability to report patterns $\mathcal{P}$ with sufficiently high quality $\mathsf{q}_{\mathcal{P}}$. The following result provides guarantees on the power of FSR-C, the version of FSR that uses conditional testing, for the pattern language of subgroups. Our analysis is based on a probabilistic upper bound to $\tilde{d}(\mathcal{R}^{\star},\check{\mu})$, obtained from bounds to the pseudodimension (Li et al., 2001; Pollard, 2012; Shalev-Shwartz and Ben-David, 2014) of subgroups, and an advanced concentration bound for sums of dependent random variables (Dubhashi and Panconesi, 2009), which hold under mild (but necessary) assumptions on the distribution of alternative hypotheses (the set of patterns with $\mathsf{q}_{\mathcal{P}}>0$). More precisely, we assume that the target labels of the transactions that support a pattern $\mathcal{P}$ with $\mathsf{q}_{\mathcal{P}}>0$ are distributed according to a noncentral hypergeometric distribution (Wallenius, 1963), i.e., a biased version of the standard hypergeometric distribution. We provide the proofs and additional details in Section A.5.

Theorem 8.

Fix $\delta\in(0,1)$ and $c,z\geq 1$. Let $O$ be the output of FSR with input parameters $\mathcal{L}$, $\mathcal{D}$, $c$, and $\delta$, where $\mathcal{L}$ is the language of subgroups composed of conjunctions with at most $z$ conditions over $d$ continuous features. Then, with probability at least $1-\delta$, $O$ contains all patterns with quality $\mathsf{q}_{\mathcal{P}}$ satisfying

\begin{align*}
\mathsf{q}_{\mathcal{P}}\geq{}&\sqrt{\frac{2\hat{\omega}z\ln\bigl(\frac{e^{3}dm^{2}}{4z^{3}}\bigr)}{m}}+\sqrt{\frac{\ln\bigl(\frac{2}{\delta}\bigr)}{2cm}}+\frac{z\ln\bigl(\frac{e^{3}dm^{2}}{4z^{3}}\bigr)}{3m}\\
&+\sqrt{\frac{2\omega\log\bigl(\frac{4}{\delta}\bigr)}{m}}+\sqrt{\frac{\log\bigl(\frac{4}{\delta}\bigr)}{2cm}}+\sqrt{\frac{2\hat{\mathsf{f}}(\mathcal{D})z\ln\bigl(\frac{e^{3}dm^{2}}{2z^{3}\delta}\bigr)}{m}},
\end{align*}

where $\hat{\mathsf{f}}(\mathcal{D})=\sup_{\mathcal{P}\in\mathcal{L}}\mathsf{f}_{\mathcal{P}}(\mathcal{D})$ and $\hat{\omega}=\hat{\mu}(1-\check{\mu})\hat{\mathsf{f}}(\mathcal{D})$.

4.2. FSR for Unconditional Testing

We now describe the details of procedures boundTarget and boundStatistic for the version of FSR that uses unconditional testing, which we refer to as FSR-U. As a reminder, in our scenario an unconditional test assumes that the average target value and the pattern frequencies $\mathsf{f}_{\mathcal{P}}(\mathcal{D})$ are observations of random variables, whose expected values are unknown. In particular, here we consider the transactions $\left\{(s_{1},\ell_{1}),\dots,(s_{m},\ell_{m})\right\}$ constituting the dataset $\mathcal{D}$ as i.i.d. samples from an unknown distribution $\gamma$. This scenario is more complex than the conditional testing one, since we need to account for the unknown deviation of all observed values when computing a bound to the maximum pattern quality under the null hypothesis. However, we show that the use of resampled datasets allows us to efficiently take into account such deviations.

For boundTarget, recall that $\mu$ is the probability that a sample from the unknown distribution $\gamma$ has target label $\ell$ equal to $1$, that is, $\mu=\mathop{\mathbb{E}}_{\mathcal{D}}\left[\mu(\mathcal{D})\right]=\Pr_{(s,\ell)\sim\gamma}\left(\ell=1\right)$. Importantly, in this setting $\mu$ is unknown (as $\gamma$ is). However, given that the samples in $\mathcal{D}$ are i.i.d., the following result provides a probabilistic bound $\varepsilon_{T}$ to the deviation of the observed value $\mu(\mathcal{D})$ from its expectation $\mu$.

Lemma 9.

Let $\mathcal{D}$ be a collection of $m$ samples taken i.i.d. from $\gamma$ . For $\delta\in(0,1)$ , it holds with probability $\geq 1-\delta/4$

$$\lvert\mu(\mathcal{D})-\mu\rvert\leq\varepsilon_{T}\doteq\sqrt{\frac{2\min\bigl\{\mu(\mathcal{D}),\frac{1}{4}\bigr\}\ln\left(\frac{8}{\delta}\right)}{m}}+\frac{2\ln\left(\frac{8}{\delta}\right)}{m}.$$

boundTarget returns the value $\varepsilon_{T}$ as defined in Lemma 9, which requires the values $\mu(\mathcal{D})$ , $\delta$ , and $m$ .
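The two variants of boundTarget can be sketched as follows. This is an illustrative sketch with hypothetical function names: the conditional variant returns $0$ as described in Section 4.1, and the unconditional variant evaluates the expression of Lemma 9.

```python
from math import log, sqrt


def bound_target_conditional(mu_d: float, m: int, delta: float) -> float:
    """Conditional testing: mu is fixed to mu(D), so the deviation is 0."""
    return 0.0


def bound_target_unconditional(mu_d: float, m: int, delta: float) -> float:
    """Unconditional testing: epsilon_T as defined in Lemma 9."""
    l = log(8.0 / delta)
    return sqrt(2.0 * min(mu_d, 0.25) * l / m) + 2.0 * l / m
```

For example, with $\mu(\mathcal{D})=0.5$, $m=10^{4}$, and $\delta=0.05$, the unconditional bound evaluates to roughly $0.017$.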

For boundStatistic, the computation of the upper bound $\varepsilon$ to the maximum deviation between the observed pattern qualities and their expectations is more involved than in the conditional case. We show that the quality $\mathsf{q}_{\mathcal{P}}$ of a pattern $\mathcal{P}$ , which depends on several unknown quantities (see Section 3.1), can be sharply estimated from a dataset $\mathcal{D}$ provided that $\mu$ is known, which is not the case for the unconditional distribution. We then show that using $\mu(\mathcal{D})$ in place of $\mu$ provides a good estimate of $\mathsf{q}_{\mathcal{P}}$ , and prove a probabilistic upper bound on the difference between the two estimates.

First, we introduce the function family that we use to analyze the supremum deviation of the empirical quality of patterns from the dataset $\mathcal{D}$ . We define the family of functions $g^{*}_{\mathcal{P}}:\mathcal{X}\times\{0,1\}\rightarrow[-\mu,1-\mu]$ , where $g^{*}_{\mathcal{P}}$ is defined as $g^{*}_{\mathcal{P}}(s,\ell)=f_{\mathcal{P}}(s)(\ell-\mu)$ . Note that $g^{*}_{\mathcal{P}}(s,\ell)=f_{\mathcal{P}}(s)(\ell-\mu)$ corresponds to the function $g_{\mathcal{P}}(s,\ell)$ used by the FSR statistic where the unknown value $\mu$ is used in place of its estimate $\mu(\mathcal{D})$ obtained from dataset $\mathcal{D}$ . Define the estimator $\mathsf{q}_{\mathcal{P}}(\mathcal{D})$ of the quality $\mathsf{q}_{\mathcal{P}}$ of $\mathcal{P}$ from $\mathcal{D}$ as the average of $g^{*}_{\mathcal{P}}$ over $\mathcal{D}$ , that is

$$\mathsf{q}_{\mathcal{P}}(\mathcal{D})=\frac{1}{m}\sum_{i=1}^{m}g^{*}_{\mathcal{P}}(s_{i},\ell_{i}).$$

Note that $\mathop{\mathbb{E}}_{\mathcal{D}}\left[\mathsf{q}_{\mathcal{P}}(\mathcal{D})\right]=\mathsf{q}_{\mathcal{P}}$, as $\mathsf{q}_{\mathcal{P}}(\mathcal{D})$ is an unbiased estimator of $\mathsf{q}_{\mathcal{P}}$. However, $\mathsf{q}_{\mathcal{P}}(\mathcal{D})$ depends on the unknown quantity $\mu$. Even if $\mathsf{q}_{\mathcal{P}}(\mathcal{D})\neq{\bar{\mathsf{q}}_{\mathcal{P}}(\mathcal{D})}$ (since $\mu(\mathcal{D})$ may be $\neq\mu$), FSR-U exploits the fact that $\mu(\mathcal{D})$ is sharply concentrated around $\mu$, as proved in Lemma 9. This implies that the maximum deviation $\sup_{\mathcal{P}\in\mathcal{L}}|\mathsf{q}_{\mathcal{P}}(\mathcal{D})-{\bar{\mathsf{q}}_{\mathcal{P}}(\mathcal{D})}|$ for all patterns $\mathcal{P}\in\mathcal{L}$ can be sharply estimated from $\mathcal{D}$. To this aim, we prove the following result.

Theorem 10.

Let $\mathcal{D}$ be a collection of $m$ samples taken i.i.d. from $\gamma$ . For $\delta\in(0,1)$ , with probability $\geq 1-\delta/4$ it holds

$$\lvert\mathsf{q}_{\mathcal{P}}(\mathcal{D})-{\bar{\mathsf{q}}_{\mathcal{P}}(\mathcal{D})}\rvert\leq\varepsilon_{T}\mathsf{f}_{\mathcal{P}}(\mathcal{D}),\quad\forall\mathcal{P}\in\mathcal{L}.$$

Proof.

Using Lemma 9, $\forall\mathcal{P}\in\mathcal{L}$ it holds with prob. $\geq 1-\delta/4$

\begin{align*}
\lvert\mathsf{q}_{\mathcal{P}}(\mathcal{D})-{\bar{\mathsf{q}}_{\mathcal{P}}(\mathcal{D})}\rvert
&=\left\lvert\frac{1}{m}\sum_{i=1}^{m}g^{*}_{\mathcal{P}}(s_{i},\ell_{i})-\frac{1}{m}\sum_{i=1}^{m}g_{\mathcal{P}}(s_{i},\ell_{i})\right\rvert\\
&=\left\lvert\frac{1}{m}\sum_{i=1}^{m}f_{\mathcal{P}}(s_{i})(\mu(\mathcal{D})-\mu)\right\rvert\leq\varepsilon_{T}\mathsf{f}_{\mathcal{P}}(\mathcal{D}).\qed
\end{align*}

We remark that the definition of $\mathsf{q}_{\mathcal{P}}(\mathcal{D})$ is crucial to the analysis of our algorithm FSR-U, as $\mathsf{q}_{\mathcal{P}}(\mathcal{D})$ is an average of $m$ independent random variables, while ${\bar{\mathsf{q}}_{\mathcal{P}}(\mathcal{D})}$ is not (since it depends on $\mu(\mathcal{D})$ , that is estimated from the observations in the whole dataset).
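The identity behind Theorem 10 is easy to check on a toy example. In the sketch below (illustrative names; the value of $\mu$ is chosen by hand, since it is unknown in practice), the gap between the two estimators equals the pattern frequency times $\mu(\mathcal{D})-\mu$.

```python
from typing import Sequence


def q_star(indicator: Sequence[int], labels: Sequence[int],
           mu: float) -> float:
    """Estimator q_P(D), which centers the labels at the true mu."""
    m = len(labels)
    return sum(f * (l - mu) for f, l in zip(indicator, labels)) / m


def q_bar(indicator: Sequence[int], labels: Sequence[int]) -> float:
    """Empirical quality, which centers the labels at the observed mu(D)."""
    m = len(labels)
    mu_d = sum(labels) / m
    return sum(f * (l - mu_d) for f, l in zip(indicator, labels)) / m
```

With labels `[1, 0, 1, 1, 0, 0, 1, 0]` (so $\mu(\mathcal{D})=0.5$), a pattern supported by the first three transactions, and a hypothetical $\mu=0.4$, the difference `q_star(...) - q_bar(...)` equals $\frac{3}{8}\cdot 0.1=0.0375$.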

Recall the definition of the set $\mathcal{L}^{\star}$ of non-significant patterns: $\mathcal{L}^{\star}=\left\{\mathcal{P}\in\mathcal{L}:\mathsf{q}_{\mathcal{P}}=0\right\}$. To output significant patterns while controlling the FWER below $\delta$, our goal is to bound the supremum deviation $\sup_{\mathcal{P}\in\mathcal{L}^{\star}}\left\{{\bar{\mathsf{q}}_{\mathcal{P}}(\mathcal{D})}\right\}$ (Eq. (2)) below some value $\eta$ with probability at least $1-\delta$, for some $\delta\in(0,1)$, and provide in output all patterns with ${\bar{\mathsf{q}}_{\mathcal{P}}(\mathcal{D})}\geq\eta$. To bound (2), we study the surrogate quantity $\sup_{\mathcal{P}\in\mathcal{L}^{\star}}\left\{\mathsf{q}_{\mathcal{P}}(\mathcal{D})\right\}$, as Theorem 10 guarantees a small bound on $\lvert\mathsf{q}_{\mathcal{P}}(\mathcal{D})-{\bar{\mathsf{q}}_{\mathcal{P}}(\mathcal{D})}\rvert$. Consider the collection $\mathcal{R}^{\star}=\left\{\mathcal{D}^{\star}_{1},\dots,\mathcal{D}^{\star}_{c}\right\}$ of $c\geq 1$ i.i.d. resampled datasets computed by FSR (line 1). We prove the following result.

Theorem 11.

Let $\mathcal{D}$ be a dataset of $m$ samples taken i.i.d. from a distribution $\gamma$, and $\mathcal{R}^{\star}$ a collection of $c\geq 1$ i.i.d. resamples of the target labels of $\mathcal{D}$. For any $\delta\in(0,1)$, define $\nu_{T}\geq\mu(1-\mu)$, $\nu\geq\nu_{T}\sup_{\mathcal{P}\in\mathcal{L}^{\star}}\left\{\mathop{\mathbb{E}}_{\mathcal{D}}\left[\mathsf{f}_{\mathcal{P}}(\mathcal{D})\right]\right\}$, and $\varepsilon$ as

$\displaystyle\tilde{d}(\mathcal{R}^{\star},\check{\mu})=\frac{1}{c}\sum_{j=1}^{c}\sup_{\mathcal{P}\in\mathcal{L}}\bigl\{\bar{\mathsf{q}}_{\mathcal{P}}(\mathcal{D}^{\star}_{j},\check{\mu})\bigr\},$
$\displaystyle\hat{r}=\tilde{d}(\mathcal{R}^{\star},\check{\mu})+\sqrt{\frac{\ln\bigl(\frac{4}{\delta}\bigr)}{2cm}},$
$\displaystyle\hat{d}=\hat{r}+\sqrt{\left(\frac{2\nu_{T}\ln\bigl(\frac{4}{\delta}\bigr)}{m}\right)^{2}+\frac{2\hat{r}\ln\bigl(\frac{4}{\delta}\bigr)}{m}}+\frac{2\nu_{T}\ln\bigl(\frac{4}{\delta}\bigr)}{m},$
$\displaystyle\varepsilon\doteq\hat{d}+\sqrt{\frac{2\ln\bigl(\frac{4}{\delta}\bigr)\left(\nu+2\hat{d}\right)}{m}}+\frac{\ln\bigl(\frac{4}{\delta}\bigr)}{3m}.\qquad(5)$

With probability at least $1-\delta$ over the choice of $\mathcal{D}$ and $\mathcal{R}^{\star}$ it holds $\sup_{\mathcal{P}\in\mathcal{L}^{\star}}\left\{\mathsf{q}_{\mathcal{P}}(\mathcal{D})\right\}\leq\varepsilon$.

We use the following simple upper bound for both $\nu_{T}$ and $\nu$ : $\nu_{T}=\nu=\sup_{|x-\mu(\mathcal{D})|\leq\varepsilon_{T}}x(1-x)$ , and note that while slightly more refined bounds are possible, we omit them to improve readability.

Note that Theorem 11 does not require knowledge of the unknown parameter $\mu$, but only of an upper bound $\hat{\mu}$ and of a lower bound $\check{\mu}$. Theorem 11 allows us to implement the procedure boundStatistic for the unconditional setting: boundStatistic computes $\varepsilon$ as in Eq. (5). Analogously to the conditional case, evaluating $\varepsilon$ requires computing $\sup_{\mathcal{P}\in\mathcal{L}}\bigl\{\bar{\mathsf{q}}_{\mathcal{P}}(\mathcal{D}^{\star}_{j},\check{\mu})\bigr\}$, which is costly, but only needs to be done on $c$ resampled datasets, and $c$ is small in practice (as we show in Section 5).
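To make the arithmetic of Eq. (5) concrete, the following is a minimal Python sketch of how $\varepsilon$ could be evaluated once the resampling estimate $\tilde{d}(\mathcal{R}^{\star},\check{\mu})$ is available. The function names `nu_upper_bound` and `deviation_bound` are hypothetical and are not taken from the FSR code base; the value of $\tilde{d}$ is assumed to be computed elsewhere by mining the $c$ resampled datasets, and clipping the interval around $\mu(\mathcal{D})$ to $[0,1]$ is our own assumption.

```python
import math


def nu_upper_bound(mu_data: float, eps_t: float) -> float:
    """Simple bound sup_{|x - mu(D)| <= eps_T} x(1 - x).

    Assumption: the interval is clipped to [0, 1], since mu is a probability.
    """
    lo = max(0.0, mu_data - eps_t)
    hi = min(1.0, mu_data + eps_t)
    if lo <= 0.5 <= hi:
        return 0.25  # x(1 - x) attains its maximum at x = 1/2
    # Otherwise the maximum is at the endpoint closest to 1/2.
    x = lo if abs(lo - 0.5) < abs(hi - 0.5) else hi
    return x * (1.0 - x)


def deviation_bound(d_tilde: float, nu_t: float, nu: float,
                    m: int, c: int, delta: float) -> float:
    """Evaluate epsilon as in Eq. (5), term by term."""
    log_term = math.log(4.0 / delta)
    # r_hat: resampling estimate plus the error of averaging over c resamples
    r_hat = d_tilde + math.sqrt(log_term / (2.0 * c * m))
    # d_hat: bound on the expected supremum deviation
    a = 2.0 * nu_t * log_term / m
    d_hat = r_hat + math.sqrt(a * a + 2.0 * r_hat * log_term / m) + a
    # epsilon: bound on the supremum deviation itself
    return (d_hat
            + math.sqrt(2.0 * log_term * (nu + 2.0 * d_hat) / m)
            + log_term / (3.0 * m))
```

For a fixed $\tilde{d}$, the only term that depends on $c$ is $\sqrt{\ln(4/\delta)/(2cm)}$, which already shrinks with $m$; this is consistent with the observation that a small number of resamples suffices.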

By combining the results in this section, we prove the guarantees provided by FSR for the task of finding significant patterns when unconditional testing is used. In particular, the following corollary proves that, when boundStatistic returns the value $\varepsilon$ defined in Eq. (5), the output set $O$ of significant patterns returned by FSR-U has FWER bounded by the user-defined parameter $\delta$.

Corollary 12.

Fix $\delta\in(0,1)$ and $c\geq 1$. Let $O$ be the output of FSR with input parameters $\mathcal{L}$, $\mathcal{D}$, $c$, $\delta$, and let $\varepsilon$ be as in Eq. (5). Then, the set $O$ has FWER $\leq\delta$ under the unconditional null distribution.

Power analysis. The following result provides guarantees on the power of FSR-U, the version of FSR that uses unconditional testing, building on the probabilistic upper bound to $\tilde{d}(\mathcal{R}^{\star},\check{\mu})$ proved in Section A.5, and on concentration bounds based on the pseudodimension of the language of subgroups.

Theorem 13.

Fix $\delta\in(0,1)$ and $c,z\geq 1$. Let $O$ be the output of FSR with input parameters $\mathcal{L}$, $\mathcal{D}$, $c$, and $\delta$, where $\mathcal{L}$ is the language of subgroups composed of conjunctions with at most $z$ conditions over $d$ continuous features. Define $\hat{\omega}=\hat{\mu}(1-\check{\mu})\sup_{\mathcal{P}\in\mathcal{L}}\mathsf{f}_{\mathcal{P}}(\mathcal{D})$, and let $\varepsilon$ be defined as in Theorem 11, where $\hat{r}$ is replaced by

$\displaystyle\hat{r}=\sqrt{\frac{2\hat{\omega}z\ln\bigl(\frac{e^{3}m^{2}d}{4z^{3}}\bigr)}{m}}+\sqrt{\frac{\ln\bigl(\frac{3}{\delta}\bigr)}{2cm}}+\frac{z\ln\bigl(\frac{e^{3}m^{2}d}{4z^{3}}\bigr)}{3m}+\sqrt{\frac{\ln\bigl(\frac{4}{\delta}\bigr)}{2cm}}.$

Then, define $\mathsf{f}_{\mathcal{P}}=\mathop{\mathbb{E}}_{\mathcal{D}}[\mathsf{f}_{\mathcal{P}}(\mathcal{D})]$. With probability at least $1-\delta$, $O$ contains all patterns $\mathcal{P}$ with quality $\mathsf{q}_{\mathcal{P}}$ satisfying

$\displaystyle\mathsf{q}_{\mathcal{P}}\geq\varepsilon+2\varepsilon_{T}\Biggl(\mathsf{f}_{\mathcal{P}}+\sqrt{\frac{z\ln\bigl(\frac{2ed}{z}\bigr)+\ln\bigl(\frac{3}{\delta}\bigr)}{2m}}\Biggr)+\sqrt{\frac{z\ln\bigl(\frac{2ed}{z}\bigr)+\ln\bigl(\frac{3}{\delta}\bigr)}{2m}}.$
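The two expressions of Theorem 13 can be evaluated numerically to get a feeling for how the detectable quality scales with $m$, $d$, and $z$. The sketch below is only illustrative: the function names `power_r_hat` and `detectable_quality` are hypothetical, and $\varepsilon$ and $\varepsilon_{T}$ are taken as inputs, assumed to be computed as described in the text (with $\hat{r}$ replaced as in the theorem).

```python
import math


def power_r_hat(omega_hat: float, z: int, d: int,
                m: int, c: int, delta: float) -> float:
    """Replacement for r_hat in Theorem 13 (closed form, no resampling)."""
    # Logarithmic factor shared by the first and third terms
    log_lang = math.log(math.e ** 3 * m ** 2 * d / (4.0 * z ** 3))
    return (math.sqrt(2.0 * omega_hat * z * log_lang / m)
            + math.sqrt(math.log(3.0 / delta) / (2.0 * c * m))
            + z * log_lang / (3.0 * m)
            + math.sqrt(math.log(4.0 / delta) / (2.0 * c * m)))


def detectable_quality(eps: float, eps_t: float, f_p: float,
                       z: int, d: int, m: int, delta: float) -> float:
    """Right-hand side of the condition on q_P in Theorem 13."""
    # Concentration term that appears twice in the bound
    conc = math.sqrt((z * math.log(2.0 * math.e * d / z)
                      + math.log(3.0 / delta)) / (2.0 * m))
    return eps + 2.0 * eps_t * (f_p + conc) + conc
```

Under these formulas, every term other than $\varepsilon$ vanishes as $m$ grows, so the smallest quality that is guaranteed to be reported approaches $\varepsilon$ for large datasets.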

5. Experiments

This section presents the results of our experiments. The goal of our experimental evaluation is to assess FSR’s capabilities of discovering significant patterns with high statistical power, analyzing efficiently large real-world datasets with few-shot resampling, in both conditional (FSR-C) and unconditional (FSR-U) settings.

Pattern Language. In our experimental evaluation we focus on the problem of discovering significant subgroups from large real-world datasets with mixed feature types (both categorical and continuous). The language $\mathcal{L}$ is composed of conjunctions of up to $z$ conditions on the features of the data (Atzmueller, 2015), where $z$ is a fixed parameter (see below). These conditions are either equalities (for categorical features), inequalities, or intervals (on continuous features).

Datasets. We tested FSR on $12$ standard benchmark and real-world datasets from UCI (https://archive.ics.uci.edu/) that are used to evaluate subgroup discovery algorithms. The statistics of the datasets are described in Table 1. These datasets cover a wide range of sizes, dimensionalities, and application domains. The column $z$ of Table 1 reports the maximum number of conjunction terms for the subgroups in the language $\mathcal{L}$ for each dataset.

Implementation of FSR. We implemented FSR in Python. The code and the scripts to reproduce all experiments are available online (https://github.com/VandinLab/FSR). To mine subgroups, we make use of a fast depth-first enumeration algorithm included in the library pysubgroup (https://github.com/flemmerich/pysubgroup) (Lemmerich and Becker, 2019).

Baselines. Since our algorithm FSR is the first algorithm that can be used for mining significant patterns with both conditional and unconditional testing, we consider different baselines for conditional testing and for unconditional testing.

For conditional testing, we compare FSR-C with a variant of TopKWY (Pellegrina and Vandin, 2020), the state-of-the-art method for significant pattern mining with conditional testing, based on the WY permutation testing procedure (Westfall and Young, 1993). The original implementation of TopKWY is tailored to significant itemsets and subgraphs only, and does not support subgroups over categorical and continuous features; however, its strategy is fairly simple to adapt to this case. In particular, we extended TopKWY to identify significant subgroups by estimating the distribution of the supremum deviation using permuted datasets (instead of the $p$-values, as done in the original TopKWY implementation (Pellegrina and Vandin, 2020)). That is, our variant of TopKWY considers the same statistic as FSR-C (i.e., the supremum deviation), but instead of generating resamples and taking the average of the supremum deviations as done by FSR-C, it efficiently computes its $\delta$-quantile considering permutations of the labels. In addition, in our variant of TopKWY we use pysubgroup to mine subgroups. Note that since our variant of TopKWY and FSR-C share the same, equally optimized, procedure to explore the search space of the language $\mathcal{L}$, all comparisons in terms of running times are fair. For TopKWY we use $10^{3}$ permutations, a good trade-off between running time and accuracy for estimating the $\delta$-quantile for typical values of $\delta$ (e.g., $0.05$) (Llinares-López et al., 2015).

Regarding unconditional testing, we note that FSR-U is the first method to discover significant patterns in a (fully) unconditional setting. A baseline that may seem reasonable to control the FWER is the standard Bonferroni correction: each pattern $\mathcal{P}$ is flagged as significant if the probability (under the null hypothesis) of observing a quality greater than or equal to the one measured in the data is at most $\delta/|\mathcal{L}|$, where $|\mathcal{L}|$ is the size of the language $\mathcal{L}$ (see also Section 1). Note, however, that this simple method is not useful for subgroups, as $|\mathcal{L}|$ is infinite (e.g., as the number of inequalities over a continuous feature is unbounded). (We remark that, while there exist other correction procedures more powerful than Bonferroni, e.g., the Holm procedure, they all require fixing a finite set of hypotheses, and therefore do not apply to our setting.) Therefore, we compare FSR-U with a novel non-trivial baseline, which we call FSR-U-UB, that resolves this issue. We describe FSR-U-UB at a high level and defer additional details to Section A.6. For FSR-U-UB, we use part of the concentration bounds developed for FSR-U to upper bound the supremum deviation in terms of its expectation (taken w.r.t. the resamples but conditionally on the transactions, see Theorem 11 and Section A.6), but instead of estimating the expected supremum deviation with $c$ resamples (as done by FSR-U), we compute an upper bound to it via a union bound over the (finite) number of distinct subgroups that are observed in the data in at least one transaction, i.e., we consider the finite projection of $\mathcal{L}$ on $\mathcal{D}$, instead of all $|\mathcal{L}|$ possible subgroups. Note that this approach bounds the FWER while being much less conservative than Bonferroni, since it corrects for the size of the projection instead of the size of $\mathcal{L}$.
By comparing FSR-U to FSR-U-UB we directly evaluate the advantage of computing bounds from resampled datasets that consider dependencies among patterns, as done by FSR-U.

Experimental setup. All the experiments were run on a machine with a 2.30 GHz Intel Xeon CPU and 512 GB of RAM, running Ubuntu 20.04. In all experiments we use $\delta=0.05$ (i.e., we control the FWER below $0.05$). We repeated all experiments $10$ times, and report averages $\pm$ standard deviations over the $10$ repetitions.

Table 1. Statistics of the datasets considered in our experiments. $m$ is the number of transactions, $d$ is the number of features (categorical/continuous), $\mu(\mathcal{D})$ is the fraction of transactions with target equal to $1$, and $z$ is the maximum number of conjunction terms in the language $\mathcal{L}$.

$\mathcal{D}$	$m$	$d$	$\mu(\mathcal{D})$	$z$
abalone	4177	1/7	0.663	5
adult	32561	8/6	0.241	5
bank	41188	10/10	0.113	3
brain-cancer	862	22/1	0.421	5
cancer-rna-seq	801	0/20531	0.375	2
covtype	581012	0/54	0.365	3
gisette	7000	0/5000	0.500	2
HIGGS	11000000	0/28	0.529	3
kdd-cup	95370	73/405	0.050	2
mushroom	8124	22/0	0.482	5
SUSY	5000000	0/18	0.457	3
theorem-prover	3059	0/51	0.420	3

Impact of parameters on FSR. In the first set of experiments we evaluate the effect of the number of resamples $c$ on the deviation bound $\varepsilon$ computed by FSR and on its running time. We consider both FSR-C and FSR-U, respectively designed to compute significant patterns with conditional and unconditional testing. To ease the presentation, for this first experiment we focus on $3$ of the datasets we considered; the results for the other datasets are very similar. In Figure 1-(a) we show the deviation bounds computed by FSR-C for different values of $c$, while Figure 1-(b) is analogous for FSR-U. From these plots we conclude that using $c=10$ resamples is sufficient to obtain a small deviation bound, and that using more than $10$ resamples is only marginally beneficial, as all curves flatten. Remarkably, for all datasets (and in particular for the larger dataset adult), and for both methods, even a single resample is enough to compute a meaningful deviation bound. This is in striking contrast with state-of-the-art methods based on permutation testing, which instead require $10^{3}$ to $10^{4}$ permutations of the target labels to properly estimate the $\delta$-quantile. Furthermore, we observe that the deviation bounds computed by FSR-C, in the conditional setting, are smaller than the bounds computed by FSR-U, in the unconditional setting; this confirms that the assumptions made on the process generating the data have a noticeable effect on the bounds, a consequence of properly taking into account the uncertainty of the collected data.

Figures 1-(c)-(d) show the running time of the two methods FSR-C and FSR-U as functions of $c$. Not surprisingly, the running time increases linearly with the number of resamples $c$ for both methods. Therefore, we expect that processing a small number of resampled datasets is extremely advantageous in terms of running time. We also observe that FSR-U is faster than FSR-C for all values of $c$. This is due to the fact that FSR-C, by computing a smaller deviation bound, explores a wider portion of the pattern language $\mathcal{L}$. In any case, using at most $10$ resamples is always feasible (since both algorithms terminate after at most $17$ minutes for these datasets). From these observations, we fix $c$ to $10$ for FSR-C and FSR-U in all our experiments.

Evaluation of FSR-C. In this experiment we report the performance of FSR-C in identifying significant patterns with conditional testing, comparing it with a variant of TopKWY, the state-of-the-art method to identify significant patterns with permutation testing. In Figure 2 we show the deviation bounds (a), the running times (b), and the number of reported results (c)-(d) for both algorithms. From Figure 2-(a), we observe that the deviation bounds computed by FSR-C are larger than the ones computed by TopKWY. This is not surprising, as the deviation bound returned by TopKWY, which is based on the WY procedure, is a very accurate estimate of the optimal corrected threshold to bound the FWER (since the WY estimator converges asymptotically to it (Meinshausen et al., 2011)); on the other hand, FSR-C computes an upper bound to such quantity with stronger probabilistic guarantees (i.e., the bound does not only converge asymptotically and in expectation, but rather holds in finite samples with high probability). While the deviation bounds computed by TopKWY are smaller than those of FSR-C, from Figure 2-(b) we observe a significant gap in terms of running time between the two methods. In fact, TopKWY requires two orders of magnitude more time than FSR-C to compute its deviation bound; this is mainly due to the fact that TopKWY has to process two orders of magnitude more permutations of the target label than FSR-C. Note that using $10^{4}$ permutations in TopKWY (instead of $10^{3}$, to have a more accurate estimation of the $\delta$-quantile, as often done in practice) would result in an even more substantial gap.

We now compare the two methods in terms of the number of reported significant patterns. To do so, we follow a typical subgroup discovery analysis, based on the task of identifying the $k$ most “interesting” subgroups in terms of quality: first, we mine the set of top-$k$ subgroups with highest quality; then, we count the number of them that are flagged as significant by the two methods. We use $k\in\{5\cdot 10^{3},10^{4}\}$. Figures 2-(c)-(d) show the number of patterns that are reported in output by both methods. We observe that, for $k=5\cdot 10^{3}$, for $6$ of the $12$ datasets we considered, both algorithms output all the top-$k$ patterns, i.e., they have the same output; for such datasets TopKWY reports all top-$k$ patterns also for $k=10^{4}$, while FSR-C outputs the same set of results for all but two datasets, for which it reports more than $70\%$ of them. For the remaining $6$ datasets, which consist of the $4$ largest datasets (in terms of the number $m$ of transactions) and the $2$ datasets with the highest number of features, TopKWY could not complete in reasonable time (i.e., we stopped it after $10$ days), while FSR-C completed the analysis and returned a large number of significant results (i.e., either all the top-$k$ patterns or more than $600$ of them for the kdd-cup dataset). For the most challenging datasets (HIGGS, with $11$ million transactions, and cancer-rna-seq, with $>10^{4}$ continuous features), FSR-C concludes after $5$ days, while TopKWY would require (approximately) $1.5$ years of computation. Therefore, in such cases FSR-C enables the analysis of these challenging instances while identifying many significant patterns.
These results confirm that FSR-C, while providing a slightly more conservative upper bound to the supremum deviation, is still capable of discovering the same (or almost the same) most significant patterns, while being more than two orders of magnitude faster, and reporting many patterns as significant for challenging instances out of reach for the state-of-the-art. We conclude that FSR-C provides an excellent trade-off between the number of patterns identified as significant and the computational requirement of the analysis.
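The counting protocol used in this comparison (mine the top-$k$ subgroups by quality, then count how many exceed the deviation bound) can be sketched in a few lines of Python. This is a simplified illustration, not the code used in our experiments: `count_significant_top_k` is a hypothetical helper, and the list `qualities` stands in for the empirical qualities $\bar{\mathsf{q}}_{\mathcal{P}}(\mathcal{D})$ of the mined subgroups, which in practice come from the subgroup miner.

```python
import heapq
from typing import Iterable


def count_significant_top_k(qualities: Iterable[float], k: int, eps: float) -> int:
    """Count how many of the k highest-quality patterns reach the bound eps.

    A pattern is flagged as significant when its empirical quality is at
    least the deviation bound computed by the method under evaluation.
    """
    top_k = heapq.nlargest(k, qualities)  # top-k patterns by quality
    return sum(1 for q in top_k if q >= eps)


# Toy usage: the same top-k list evaluated against two different bounds,
# a smaller one and a more conservative one (hypothetical values).
toy_qualities = [0.30, 0.10, 0.25, 0.05]
n_small_bound = count_significant_top_k(toy_qualities, k=3, eps=0.08)
n_large_bound = count_significant_top_k(toy_qualities, k=3, eps=0.20)
```

Since both methods are evaluated on the same top-$k$ list, a larger deviation bound can only reduce the count, which is why a slightly more conservative bound may still yield the same output when the top-$k$ qualities are well above it.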

Evaluation of FSR-U. We now evaluate the performance of FSR-U in discovering significant patterns with unconditional testing. We show the results in Figure 3. We compare FSR-U with the baseline FSR-U-UB (described above) in terms of deviation bounds (a), running time (b), and number of results (c)-(d). From Figure 3-(a) we observe that FSR-U computes deviation bounds that are always smaller than those of FSR-U-UB. We conclude that processing the resamples of the target label provides much more accurate deviation bounds than more standard techniques (i.e., a Bonferroni correction). This is a consequence of taking into account the dependencies among patterns when upper bounding the expected supremum deviation. In terms of running time, FSR-U always concludes in reasonable time (similarly to FSR-C), while FSR-U-UB is faster (since it does not consider any resamples). On the other hand, in many cases the baseline FSR-U-UB terminates quickly but without reporting anything in output: Figures 3-(c)-(d) show that for $5$ datasets it does not report any significant pattern, while for the other datasets it outputs a significantly smaller number of them (e.g., for adult, FSR-U finds almost an order of magnitude more results than FSR-U-UB). This experiment shows that FSR-U is a practical and powerful method to discover significant patterns with unconditional testing, and that it significantly improves over more standard techniques such as Bonferroni correction.

Application to Neural Network interpretation. In this final experiment we evaluate a practical application of FSR to the task of Neural Network interpretation (Fischer et al., 2021). More precisely, we consider the MNIST dataset (LeCun and Cortes, 2010) and train a Convolutional Neural Network (CNN), with the goal of identifying correlations between the activation values of neurons and the predicted target. To do so, we evaluate the association of the activation values of neurons in a convolutional filter with a binary target that separates drawings of digits composed of straight lines only ($1$ and $7$) from the other digits. While we defer most details of this experiment to Section A.7, we observed that FSR successfully identifies interpretable activation patterns of several neurons, while requiring orders of magnitude less time than previous methods (as discussed previously).

6. Conclusions

We presented FSR, a novel algorithm to identify statistically significant patterns with rigorous bounds on the FWER. FSR uses a few-shot resampling strategy, which leads to an efficient and practical approach that can be used for both conditional and unconditional testing. Our experimental evaluation shows that FSR is an effective and accurate method for significant pattern discovery, while significantly reducing the computational cost of state-of-the-art multiple comparisons procedures, such as permutation testing, which hardly scale to complex languages, such as subgroups, and to large datasets. While the experiments presented in this work are focused on subgroups, we expect the relative improvements obtained by FSR to directly transfer to other pattern types, given the generality of our framework and of the design of our resampling procedures, which are shared by all types of patterns.

Acknowledgements.

This work was supported by the “National Center for HPC, Big Data, and Quantum Computing”, project CN00000013, and by the PRIN Project n. 2022TS4Y3N - EXPAND: scalable algorithms for EXPloratory Analyses of heterogeneous and dynamic Networked Data, funded by the Italian Ministry of University and Research (MUR).

References

Aggarwal et al. (2010) Charu C Aggarwal, Yao Li, Philip S Yu, and Ruoming Jin. 2010. On dense pattern mining in graph streams. Proceedings of the VLDB Endowment 3, 1-2 (2010), 975–984.
Agrawal et al. (1993) Rakesh Agrawal, Tomasz Imieliński, and Arun Swami. 1993. Mining association rules between sets of items in large databases. In Proceedings of the 1993 ACM SIGMOD international conference on Management of data. 207–216.
Al Hasan and Zaki (2009) Mohammad Al Hasan and Mohammed J Zaki. 2009. Output space sampling for graph patterns. Proceedings of the VLDB Endowment 2, 1 (2009), 730–741.
Atzmueller (2015) Martin Atzmueller. 2015. Subgroup discovery. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 5, 1 (2015), 35–49.
Barnard (1945) GA Barnard. 1945. A new test for 2 $\times$ 2 tables. Nature 156, 3954 (1945).
Benjamini and Hochberg (1995) Yoav Benjamini and Yosef Hochberg. 1995. Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal statistical society: series B (Methodological) 57, 1 (1995), 289–300.
Bonferroni (1936) Carlo Bonferroni. 1936. Teoria statistica delle classi e calcolo delle probabilità. Pubblicazioni del R Istituto Superiore di Scienze Economiche e Commerciali di Firenze 8 (1936), 3–62.
Boucheron et al. (2013) Stéphane Boucheron, Gábor Lugosi, and Pascal Massart. 2013. Concentration inequalities: A nonasymptotic theory of independence. Oxford university press.
Cao et al. (2019) Lei Cao, Yizhou Yan, Samuel Madden, Elke A Rundensteiner, and Mathan Gopalsamy. 2019. Efficient discovery of sequence outlier patterns. Proceedings of the VLDB Endowment 12, 8 (2019), 920–932.
Ceccarello and Gamper (2022) Matteo Ceccarello and Johann Gamper. 2022. Fast and Scalable Mining of Time Series Motifs with Probabilistic Guarantees. Proceedings of the VLDB Endowment 15, 13 (2022), 3841–3853.
Chen et al. (2009) Chen Chen, Cindy X Lin, Matt Fredrikson, Mihai Christodorescu, Xifeng Yan, and Jiawei Han. 2009. Mining graph patterns efficiently via randomized summaries. Proceedings of the VLDB Endowment 2, 1 (2009), 742–753.
Dalleiger and Vreeken (2022) Sebastian Dalleiger and Jilles Vreeken. 2022. Discovering significant patterns under sequential false discovery control. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining.
Dubhashi and Panconesi (2009) Devdatt P Dubhashi and Alessandro Panconesi. 2009. Concentration of measure for the analysis of randomized algorithms. Cambridge University Press.
Fischer et al. (2021) Jonas Fischer, Anna Olah, and Jilles Vreeken. 2021. What’s in the Box? Exploring the Inner Life of Neural Networks with Robust Rules. In Proceedings of the 38th International Conference on Machine Learning (Proceedings of Machine Learning Research), Vol. 139. PMLR, 3352–3362.
Fisher (1922) Ronald A Fisher. 1922. On the interpretation of $\chi^{2}$ from contingency tables, and the calculation of P. Journal of the royal statistical society 85, 1 (1922).
Hämäläinen and Webb (2019) Wilhelmiina Hämäläinen and Geoffrey I Webb. 2019. A tutorial on statistically sound pattern discovery. Data Mining and Knowledge Discovery 33 (2019), 325–377.
Han et al. (2007) Jiawei Han, Hong Cheng, Dong Xin, and Xifeng Yan. 2007. Frequent pattern mining: current status and future directions. Data mining and knowledge discovery 15, 1 (2007), 55–86.
Ho et al. (2022) Nguyen Thi Thao Ho, Torben Bach Pedersen, et al. 2022. Efficient temporal pattern mining in big time series using mutual information. Proceedings of the VLDB Endowment 15, 3 (2022), 673–685.
Holm (1979) Sture Holm. 1979. A simple sequentially rejective multiple test procedure. Scandinavian journal of statistics (1979), 65–70.
Jogdeo and Samuels (1968) Kumar Jogdeo and Stephen M Samuels. 1968. Monotone convergence of binomial probabilities and a generalization of Ramanujan’s equation. The Annals of Mathematical Statistics 39, 4 (1968), 1191–1195.
Kalofolias et al. (2017) Janis Kalofolias, Mario Boley, and Jilles Vreeken. 2017. Efficiently discovering locally exceptional yet globally representative subgroups. In 2017 IEEE International Conference on Data Mining (ICDM). IEEE.
LeCun and Cortes (2010) Yann LeCun and Corinna Cortes. 2010. MNIST handwritten digit database. http://yann.lecun.com/exdb/mnist/
Lemmerich and Becker (2019) Florian Lemmerich and Martin Becker. 2019. pysubgroup: Easy-to-use subgroup discovery in python. In ECML PKDD 2018. Springer, 658–662.
Li et al. (2001) Yi Li, Philip M Long, and Aravind Srinivasan. 2001. Improved bounds on the sample complexity of learning. J. Comput. System Sci. 62, 3 (2001), 516–527.
Llinares-López et al. (2015) Felipe Llinares-López, Mahito Sugiyama, Laetitia Papaxanthos, and Karsten Borgwardt. 2015. Fast and memory-efficient significant pattern mining via permutation testing. In Proceedings of the 21th ACM SIGKDD international conference on knowledge discovery and data mining. 725–734.
Löffler and Phillips (2009) Maarten Löffler and Jeff M Phillips. 2009. Shape fitting on point sets with probability distributions. In Algorithms-ESA 2009: 17th Annual European Symposium, Copenhagen, Denmark, September 7-9, 2009. Proceedings 17. Springer, 313–324.
McDiarmid (1989) Colin McDiarmid. 1989. On the method of bounded differences. Surveys in combinatorics 141, 1 (1989), 148–188.
Meinshausen et al. (2011) Nicolai Meinshausen, Marloes H Maathuis, and Peter Bühlmann. 2011. Asymptotic optimality of the Westfall-Young permutation procedure for multiple testing under dependence. The Annals of Statistics (2011), 3369–3391.
Minato et al. (2014) Shin-ichi Minato, Takeaki Uno, Koji Tsuda, Aika Terada, and Jun Sese. 2014. A fast method of statistical assessment for combinatorial hypotheses based on frequent itemset enumeration. In ECML PKDD 2014. Springer.
Pellegrina et al. (2022) Leonardo Pellegrina, Cyrus Cousins, Fabio Vandin, and Matteo Riondato. 2022. MCRapper: Monte-Carlo Rademacher averages for poset families and approximate pattern mining. ACM Transactions on Knowledge Discovery from Data (TKDD) 16, 6 (2022), 1–29.
Pellegrina et al. (2019a) Leonardo Pellegrina, Matteo Riondato, and Fabio Vandin. 2019a. Hypothesis Testing and Statistically-sound Pattern Mining. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD ’19). ACM, New York, NY, USA, 3215–3216.
Pellegrina et al. (2019b) Leonardo Pellegrina, Matteo Riondato, and Fabio Vandin. 2019b. SPuManTE: Significant Pattern Mining with Unconditional Testing. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD ’19). ACM, New York, NY, USA, 1528–1538.
Pellegrina and Vandin (2020) Leonardo Pellegrina and Fabio Vandin. 2020. Efficient mining of the most significant patterns with permutation testing. Data Mining and Knowledge Discovery 34 (2020), 1201–1234.
Pietracaprina and Vandin (2007) Andrea Pietracaprina and Fabio Vandin. 2007. Efficient incremental mining of top-K frequent closed itemsets. In International Conference on Discovery Science. Springer, 275–280.
Pollard (2012) David Pollard. 2012. Convergence of stochastic processes. Springer Science & Business Media.
Riondato and Vandin (2020) Matteo Riondato and Fabio Vandin. 2020. MiSoSouP: Mining interesting subgroups with sampling and pseudodimension. ACM Transactions on Knowledge Discovery from Data (TKDD) 14, 5 (2020), 1–31.
Santoro et al. (2020) Diego Santoro, Andrea Tonon, and Fabio Vandin. 2020. Mining Sequential Patterns with VC-Dimension and Rademacher Complexity. Algorithms 13, 5 (2020), 123.
Shalev-Shwartz and Ben-David (2014) Shai Shalev-Shwartz and Shai Ben-David. 2014. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press.
Terada et al. (2015) Aika Terada, Hanyoung Kim, and Jun Sese. 2015. High-speed Westfall-Young permutation procedure for genome-wide association studies. In Proceedings of the 6th ACM Conference on Bioinformatics, Computational Biology and Health Informatics. 17–26.
Terada et al. (2013) Aika Terada, Mariko Okada-Hatakeyama, Koji Tsuda, and Jun Sese. 2013. Statistical significance of combinatorial regulations. Proceedings of the National Academy of Sciences 110, 32 (2013), 12996–13001.
Van Leeuwen and Knobbe (2012) Matthijs Van Leeuwen and Arno Knobbe. 2012. Diverse subgroup set discovery. Data Mining and Knowledge Discovery 25 (2012), 208–242.
Wallenius (1963) Kenneth Ted Wallenius. 1963. Biased sampling: the noncentral hypergeometric probability distribution. Technical Report. https://purl.stanford.edu/wh056vj9347
Webb (2006) Geoffrey I Webb. 2006. Discovering significant rules. In Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining. 434–443.
Webb (2007) Geoffrey I Webb. 2007. Discovering significant patterns. Machine learning 68 (2007), 1–33.
Webb (2008) Geoffrey I Webb. 2008. Layered critical values: a powerful direct-adjustment approach to discovering significant patterns. Machine Learning 71 (2008), 307–323.
Westfall and Young (1993) Peter H Westfall and S Stanley Young. 1993. Resampling-based multiple testing: Examples and methods for p-value adjustment. John Wiley & Sons.

Appendix A

A.1. Relation to Quality Measures

Interesting subgroups are often identified using a quality measure, defined by combining the generality and the unusualness of a pattern. The generality $\mathsf{f}_{\mathcal{P}}(\mathcal{D})$ of a pattern $\mathcal{P}$ in the dataset $\mathcal{D}$ is the fraction of samples of $\mathcal{D}$ that support $\mathcal{P}$

$\displaystyle\mathsf{f}_{\mathcal{P}}(\mathcal{D})=\frac{1}{m}\sum_{i=1}^{m}\mathds{1}\left[\mathcal{P}\in s_{i}\right]=\frac{|\mathsf{C}_{\mathcal{P}}(\mathcal{D})|}{m}.$

For any bag of samples $B\subseteq\mathcal{D}$ , define the average target value $\mu(B)$ of samples $t\in B$ as

\displaystyle\mu(B)=\frac{1}{|B|}\sum_{(s,\ell)\in B}\ell.

The unusualness $\mathsf{u}_{\mathcal{P}}(\mathcal{D})$ of the pattern $\mathcal{P}$ on the dataset $\mathcal{D}$ is defined as the difference between the average target value of the samples in $\mathsf{C}_{\mathcal{P}}(\mathcal{D})$ and the average target value in the entire dataset $\mathcal{D}$

$\displaystyle\mathsf{u}_{\mathcal{P}}(\mathcal{D})=\mu(\mathsf{C}_{\mathcal{P}}(\mathcal{D}))-\mu(\mathcal{D}).$

The $\alpha$ -quality ${\bar{\mathsf{q}}^{*}_{\mathcal{P},\alpha}(\mathcal{D})}$ of a pattern $\mathcal{P}$ on a dataset $\mathcal{D}$ is defined as

$\displaystyle\bar{\mathsf{q}}^{*}_{\mathcal{P},\alpha}(\mathcal{D})=\mathsf{f}_{\mathcal{P}}(\mathcal{D})^{\alpha}\mathsf{u}_{\mathcal{P}}(\mathcal{D}).$

Commonly used quality measures are the $1$ -quality, the $\frac{1}{2}$ -quality, and the $2$ -quality.

Given a subgroup $\mathcal{P}$ , its quality $\mathsf{q}_{\mathcal{P}}$ can be written as

$\displaystyle\mathsf{q}_{\mathcal{P}}=\Pr_{(s,\ell)\sim\gamma}\left(\mathcal{P}\in s\wedge\ell=1\right)-\Pr_{(s,\ell)\sim\gamma}\left(\mathcal{P}\in s\right)\Pr_{(s,\ell)\sim\gamma}\left(\ell=1\right)$
$\displaystyle=\Pr_{(s,\ell)\sim\gamma}\left(\mathcal{P}\in s\right)\left(\Pr_{(s,\ell)\sim\gamma}\left(\ell=1\mid\mathcal{P}\in s\right)-\Pr_{(s,\ell)\sim\gamma}\left(\ell=1\right)\right).$

Note that the generality $\mathsf{f}_{\mathcal{P}}(\mathcal{D})$ is the estimate (on dataset $\mathcal{D}$ ) of $\Pr_{(s,\ell)\sim\gamma}\left(\mathcal{P}\in s\right)$ , and that the unusualness $\mathsf{u}_{\mathcal{P}}(\mathcal{D})$ is the estimate (on dataset $\mathcal{D}$ ) of $\Pr_{(s,\ell)\sim\gamma}\left(\ell=1\mid\mathcal{P}\in s\right)-\Pr_{(s,\ell)\sim\gamma}\left(\ell=1\right)$ , where $\Pr_{(s,\ell)\sim\gamma}\left(\ell=1\mid\mathcal{P}\in s\right)$ is estimated by $\mu(\mathsf{C}_{\mathcal{P}}(\mathcal{D}))$ and $\Pr_{(s,\ell)\sim\gamma}\left(\ell=1\right)$ is estimated by $\mu(\mathcal{D})$ . This shows that the $1$ -quality ${\bar{\mathsf{q}}^{*}_{\mathcal{P},1}(\mathcal{D})}=\mathsf{f}_{\mathcal{P}}(\mathcal{D})\mathsf{u}_{\mathcal{P}}(\mathcal{D})$ corresponds to an estimate, obtained from $\mathcal{D}$ , of the quality $\mathsf{q}_{\mathcal{P}}$ of subgroup $\mathcal{P}$ . With this relation in mind, we can consider mining subgroups with high $1$ -quality ${\bar{\mathsf{q}}^{*}_{\mathcal{P},1}(\mathcal{D})}$ as a heuristic for finding significant subgroups, one that ignores the random fluctuations of the estimates $\mathsf{f}_{\mathcal{P}}(\mathcal{D})$ and $\mathsf{u}_{\mathcal{P}}(\mathcal{D})$ .

A.2. Comparison with Permutation Approaches

In this section we provide a more detailed comparison between our few-shot approach and commonly used permutation approaches.

Our few-shot approach leverages our analytical results (e.g., Theorem 6) to obtain high probability bounds on the maximum deviation of patterns’ quality by estimating only the expectation of maximum deviation of patterns’ quality. Estimating such expectation requires a small number $c$ of resampled datasets, such as $c=10$ that we used in our experimental evaluation.

The same approach cannot be used by permutation approaches (e.g., (Llinares-López et al., 2015; Pellegrina and Vandin, 2020; Terada et al., 2015)), since they estimate the $\delta$ -quantile of the distribution of the maximum deviation, that is, the value $q$ that the maximum deviation exceeds with probability $\delta$ , for a (relatively) small value of $\delta$ . Accurately estimating such a quantile requires many more permutations than estimating the expectation (as done by our approach). For example, if only $10$ permutations (e.g., corresponding to the value $c$ used in our experiments) are used to estimate the $\delta$ -quantile, with $\delta=0.05$ , the WY procedure returns the maximum deviation over the $10$ permutations (i.e., the element in position $\lceil\delta\cdot 10\rceil=1$ in the list of deviations, sorted in decreasing order). This implies that, with probability $>\frac{1}{2}$ , the FWER will not be controlled at level $\delta$ (since the probability that the deviation of one permutation is below the $\delta$ -quantile is $0.95$ , and the probability that all $10$ deviations are below the $\delta$ -quantile, so that the returned maximum underestimates it, is $0.95^{10}\approx 0.599>\frac{1}{2}$ ). Moreover, with probability $>\frac{1}{3}$ , the FWER will not be controlled even at level $2\delta=0.1$ (since $0.9^{10}\approx 0.3487>\frac{1}{3}$ ). For this reason, previous works suggest using at least $10^{3}$ permutations for permutation approaches (as we do in our experiments), while $10^{4}$ is the suggested number of permutations to obtain a stable FWER estimate ((Terada et al., 2015; Llinares-López et al., 2015; Pellegrina and Vandin, 2020) all use $10^{4}$ ).
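The two probabilities quoted above, $0.95^{10}$ and $0.9^{10}$ , are instances of $(1-\delta)^{n}$ : the chance that none of $n$ independent permutations lands in a tail of probability mass $\delta$ . The short sketch below reproduces this arithmetic; the function name is hypothetical, and the formula only covers the case in which the threshold is the single most extreme permutation value (i.e., $\lceil\delta n\rceil=1$ ).

```python
def prob_no_permutation_in_tail(num_permutations: int, tail_mass: float) -> float:
    """Probability that none of the independent permutation statistics
    falls in a tail of probability `tail_mass`."""
    return (1.0 - tail_mass) ** num_permutations


p_level_delta = prob_no_permutation_in_tail(10, 0.05)   # 0.95 ** 10
p_level_2delta = prob_no_permutation_in_tail(10, 0.10)  # 0.90 ** 10
```

Both values stay above the thresholds $\frac{1}{2}$ and $\frac{1}{3}$ mentioned in the text, which is why a handful of permutations is not enough for a quantile-based threshold.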

A.3. Proofs of Section 4.1

This section presents additional proofs for the results of Section 4.1.

First, we need the following technical result.

Theorem 1 (McDiarmid’s inequality (McDiarmid, 1989)).

Let $\mathcal{Y}$ be a domain, and let $g:\mathcal{Y}^{m}\rightarrow\mathbb{R}$ be a function such that, for each $i$ , $1\leq i\leq m$ , there is a nonnegative constant $c_{i}$ such that:

\sup_{\begin{subarray}{c}\langle x_{1},\dotsc,x_{m}\rangle\in\mathcal{Y}^{m}\\ x_{i}^{\prime}\in\mathcal{Y}\end{subarray}}\lvert g(x_{1},\dotsc,x_{m})-g(x_{1% },\dotsc,x_{i-1},x^{\prime}_{i},x_{i+1},\dotsc,x_{m})\rvert\leq c_{i}.

Let $x_{1},\dotsc,x_{m}$ be $m$ independent random variables such that $\langle x_{1},\dotsc,x_{m}\rangle\in\mathcal{Y}^{m}$ . Then, for $C=\sum_{i=1}^{m}c_{i}^{2}$ , it holds

\Pr\Bigl{(}\mathop{\mathbb{E}}[g]>g(x_{1},\dotsc,x_{m})+t\Bigr{)}\leq e^{-2t^{2}/C}.

Proof of Theorem 6.

First, we note that in the conditional setting it holds $\hat{\mu}=\check{\mu}=\bar{\mu}$ . Then, from the fact $\mathcal{L}^{\star}\subseteq\mathcal{L}$ , we have

	$\displaystyle\mathop{\mathbb{E}}_{\mathbf{v}\sim I(\bar{\mu})}\left[\sup_{% \mathcal{P}\in\mathcal{L}^{\star}}\frac{1}{m}\sum_{i=1}^{m}f_{\mathcal{P}}(s_{% i})(\mathbf{v}_{i}-\bar{\mu})\right]$
	$\displaystyle\leq\mathop{\mathbb{E}}_{\mathbf{v}\sim I(\bar{\mu})}\left[\sup_{% \mathcal{P}\in\mathcal{L}}\frac{1}{m}\sum_{i=1}^{m}f_{\mathcal{P}}(s_{i})(% \mathbf{v}_{i}-\bar{\mu})\right]=\mathop{\mathbb{E}}_{\mathcal{R}^{\star}}% \bigr{[}\tilde{d}(\mathcal{R}^{\star},\check{\mu})\bigl{]},$

therefore we focus on the concentration of $\tilde{d}(\mathcal{R}^{\star},\check{\mu})$ around its expectation taken w.r.t. $\mathcal{R}^{\star}$ . Our proof is based on McDiarmid’s inequality (Theorem 1). Define the function $g(\mathcal{R}^{\star})=\tilde{d}(\mathcal{R}^{\star},\check{\mu})$ , and note that modifying any $\xi_{i,j}$ , for any pair $i,j$ , changes $g(\mathcal{R}^{\star})$ by at most $1/(cm)$ . Therefore, defining $C=\sum_{i}\sum_{j}(1/(cm))^{2}=1/(cm)$ , from Theorem 1 it holds

\displaystyle\Pr\Biggl{(}\mathop{\mathbb{E}}_{\mathcal{R}^{\star}}\bigl{[}\tilde{d}(\mathcal{R}^{\star},\check{\mu})\bigr{]}>\tilde{d}(\mathcal{R}^{\star},\check{\mu})+\sqrt{\frac{\log\bigl{(}\frac{4}{\delta}\bigr{)}}{2cm}}\Biggr{)}\leq\delta/4.

∎
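To make the quantities in this proof concrete, the sketch below evaluates an estimate of the form $\frac{1}{c}\sum_{j=1}^{c}\sup_{\mathcal{P}}\frac{1}{m}\sum_{i=1}^{m}f_{\mathcal{P}}(s_{i})(\xi_{i,j}-\mu)$ together with the deviation term $\sqrt{\ln(4/\delta)/(2cm)}$ of the last inequality. It assumes a finite, explicitly listed family of 0/1 pattern indicator vectors (`covers`) and i.i.d. Bernoulli resampled labels; all function names are hypothetical, and the exhaustive maximum stands in for the pattern search actually performed by FSR.

```python
import math
import random


def resample_labels(m, c, p, rng):
    """Draw c label vectors of length m with i.i.d. Bernoulli(p) entries."""
    return [[1 if rng.random() < p else 0 for _ in range(m)] for _ in range(c)]


def sup_deviation_estimate(covers, resampled, mu):
    """Average over the resampled label vectors of
    max_P (1/m) * sum_i f_P(s_i) * (xi_i - mu)."""
    m = len(covers[0])
    total = 0.0
    for labels in resampled:
        total += max(
            sum(f * (xi - mu) for f, xi in zip(cover, labels)) / m
            for cover in covers
        )
    return total / len(resampled)


def mcdiarmid_term(m, c, delta):
    """Deviation term sqrt(ln(4/delta) / (2 c m)) from McDiarmid's inequality."""
    return math.sqrt(math.log(4 / delta) / (2 * c * m))


# Toy family of two patterns over m = 4 samples (made up for illustration).
covers = [[1, 1, 0, 0], [0, 0, 1, 1]]
fixed = sup_deviation_estimate(covers, [[1, 1, 0, 0]], mu=0.5)
random_draws = resample_labels(m=4, c=10, p=0.5, rng=random.Random(0))
```

Changing a single resampled label moves one of the $c$ inner maxima by at most $1/m$ , hence the average by at most $1/(cm)$ , which is the bounded-difference constant used in the proof.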

A.4. Proofs of Section 4.2

This section presents the proofs for the results of Section 4.2.

Proof of Lemma 9.

Since $\mu(\mathcal{D})$ is the average of $m$ independent and bounded random variables, Hoeffding’s and Bernstein’s inequalities (Boucheron et al., 2013) yield, respectively, that

	$\displaystyle|\mu-\mu(\mathcal{D})|$	$\displaystyle\leq\sqrt{\frac{\ln\left(\frac{8}{\delta}\right)}{2m}},$
	$\displaystyle|\mu-\mu(\mathcal{D})|$	$\displaystyle\leq\sqrt{\frac{2\mu(\mathcal{D})\ln\left(\frac{8}{\delta}\right)}{m}}+\frac{2\ln\left(\frac{8}{\delta}\right)}{m},$

hold simultaneously with probability $\geq 1-\delta/4$ . The statement follows from the observation that their minimum is $\leq\varepsilon_{T}$ . ∎
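The two right-hand sides in this proof are easy to compare numerically. The sketch below only evaluates them as written above; it does not recompute $\varepsilon_{T}$ , whose definition is given in the main text, and the function names are hypothetical.

```python
import math


def hoeffding_term(m: int, delta: float) -> float:
    """sqrt(ln(8/delta) / (2m))."""
    return math.sqrt(math.log(8 / delta) / (2 * m))


def bernstein_term(m: int, delta: float, mu_data: float) -> float:
    """sqrt(2 mu(D) ln(8/delta) / m) + 2 ln(8/delta) / m."""
    log_term = math.log(8 / delta)
    return math.sqrt(2 * mu_data * log_term / m) + 2 * log_term / m


def smaller_term(m: int, delta: float, mu_data: float) -> float:
    """Minimum of the two deviation terms."""
    return min(hoeffding_term(m, delta), bernstein_term(m, delta, mu_data))
```

For instance, with $m=10^{4}$ and $\delta=0.05$ the Bernstein-type term is the smaller one when $\mu(\mathcal{D})=0.01$ (a rare target), while the Hoeffding-type term is smaller when $\mu(\mathcal{D})=0.5$ .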

Proof of Theorem 11.

Recall that $\mathcal{R}^{\star}=\left\{\mathcal{D}^{\star}_{1},\dots,\mathcal{D}^{\star}_{c}\right\}$ is a collection of $c\geq 1$ i.i.d. resampled datasets, each obtained by resampling the target labels of $\mathcal{D}$ while maintaining the same features of $\mathcal{D}$ . That is, each resampled dataset is $\mathcal{D}^{\star}_{j}=\left\{(s_{1},\xi_{1,j}),\dots,(s_{m},\xi_{m,j})\right\}$ , where

\displaystyle\xi_{i,j}\sim Bern(p),\forall i\in[1,m],\forall j\in[1,c],

and $Bern(p)$ is the Bernoulli distribution with parameter $p$ . In the proof we set $p=\mu=\mathop{\mathbb{E}}_{\mathcal{D}}[\mu(\mathcal{D})]$ . Then, we define the collection of resampled datasets $\hat{\mathcal{R}}^{\star}=\{\hat{\mathcal{D}}^{\star}_{1},\dots,\hat{\mathcal{% D}}^{\star}_{c}\}$ similarly to $\mathcal{R}^{\star}$ , where the parameter $\mu$ is replaced by its upper bound $\hat{\mu}$ (note that the roles of $\mathcal{R}^{\star}$ and $\hat{\mathcal{R}}^{\star}$ are swapped in the statement and in Alg. 1).

Let $Y=\mathop{\mathbb{E}}_{\mathcal{D}}\bigl{[}\sup_{\mathcal{P}\in\mathcal{L}^{% \star}}\{\mathsf{q}_{\mathcal{P}}(\mathcal{D})\}\bigr{]}$ , and observe that it holds $\sup_{\mathcal{P}\in\mathcal{L}^{\star}}Var_{\mathcal{D}}(\mathsf{q}_{\mathcal% {P}}(\mathcal{D}))\leq\nu$ . We apply Bousquet’s inequality (Theorem 12.5 of (Boucheron et al., 2013)) to prove that

\displaystyle\sup_{\mathcal{P}\in\mathcal{L}^{\star}}\{\mathsf{q}_{\mathcal{P}% }(\mathcal{D})\}\leq Y+\sqrt{\frac{2\ln\bigl{(}\frac{4}{\delta}\bigr{)}\left(% \nu+2Y\right)}{m}}+\frac{\ln\bigl{(}\frac{4}{\delta}\bigr{)}}{3m},

with probability $\geq 1-\delta/4$ . We now show how to upper bound $Y$ . First, note that

	$\displaystyle\mathop{\mathbb{E}}_{\mathcal{D}}\bigl{[}\sup_{\mathcal{P}\in% \mathcal{L}^{\star}}\{\mathsf{q}_{\mathcal{P}}(\mathcal{D})\}\bigr{]}$	$\displaystyle=\mathop{\mathbb{E}}_{\mathcal{D}^{\star}_{j}}\bigl{[}\sup_{% \mathcal{P}\in\mathcal{L}^{\star}}\{\mathsf{q}_{\mathcal{P}}(\mathcal{D}^{% \star}_{j})\}\bigr{]}$
		$\displaystyle=\mathop{\mathbb{E}}_{\mathcal{R}^{\star}}\biggl{[}\frac{1}{c}% \sum_{j=1}^{c}\sup_{\mathcal{P}\in\mathcal{L}^{\star}}\{\mathsf{q}_{\mathcal{P% }}(\mathcal{D}^{\star}_{j})\}\biggr{]}$

by definition of $\mathcal{L}^{\star}$ and $\mathcal{R}^{\star}$ . Then, it follows that

\displaystyle\mathop{\mathbb{E}}_{\mathcal{R}^{\star}}\biggl{[}\frac{1}{c}\sum% _{j=1}^{c}\sup_{\mathcal{P}\in\mathcal{L}^{\star}}\{\mathsf{q}_{\mathcal{P}}(% \mathcal{D}^{\star}_{j})\}\biggr{]}\leq\mathop{\mathbb{E}}_{\mathcal{R}^{\star% }}\biggl{[}\frac{1}{c}\sum_{j=1}^{c}\sup_{\mathcal{P}\in\mathcal{L}}\{\mathsf{% q}_{\mathcal{P}}(\mathcal{D}^{\star}_{j})\}\biggr{]},

since $\mathcal{L}^{\star}\subseteq\mathcal{L}$ . We now prove bounds to the concentration of $\mathop{\mathbb{E}}_{\mathcal{R}^{\star}}\bigl{[}\frac{1}{c}\sum_{j=1}^{c}\sup% _{\mathcal{P}\in\mathcal{L}}\{\mathsf{q}_{\mathcal{P}}(\mathcal{D}^{\star}_{j}% )\}\bigr{]}$ w.r.t. the set of features $\mathcal{A}$ . Let $\mathcal{D}^{\star}=(\mathcal{A},\mathcal{T}^{\star})$ be a generic $\mathcal{D}^{\star}\in\mathcal{R}^{\star}$ , with $\mathcal{T}^{\star}=\{\xi_{1},\dots,\xi_{m}\}$ . Define the random variable $Z$ as

\displaystyle Z=\mathop{\mathbb{E}}_{\mathcal{T}^{\star}}\bigl{[}\sup_{% \mathcal{P}\in\mathcal{L}}\{\mathsf{q}_{\mathcal{P}}(\mathcal{D}^{\star})\}% \bigr{]},

where the expectation is w.r.t. $\mathcal{T}^{\star}$ , conditioning on $\mathcal{A}$ . To show the concentration of $Z$ w.r.t. its expectation $\mathop{\mathbb{E}}_{\mathcal{A}}[Z]$ , we prove that $Z$ is a self-bounding function (Boucheron et al., 2013). Define the random variable $Z_{j}$ , for $j\in[1,m]$ , as

\displaystyle Z_{j}=\mathop{\mathbb{E}}_{\mathcal{T}^{\star}}\biggl{[}\sup_{% \mathcal{P}\in\mathcal{L}}\biggl{\{}\frac{1}{m}\sum_{i=1,i\neq j}^{m}f_{% \mathcal{P}}(s_{i})(\xi_{i}-\mu)\biggr{\}}\biggr{]}

First, note that $Z\geq 0$ :

\displaystyle Z=\mathop{\mathbb{E}}_{\mathcal{T}^{\star}}\bigl{[}\sup_{% \mathcal{P}\in\mathcal{L}}\{\mathsf{q}_{\mathcal{P}}(\mathcal{D}^{\star})\}% \bigr{]}\geq\sup_{\mathcal{P}\in\mathcal{L}}\bigl{\{}\mathop{\mathbb{E}}_{% \mathcal{T}^{\star}}\bigl{[}\mathsf{q}_{\mathcal{P}}(\mathcal{D}^{\star})\bigr% {]}\bigr{\}}\geq 0.

Then, we prove that $Z-Z_{j}\geq 0$ . Define $\hat{f}$ as one of the functions that attain the supremum for $Z_{j}$ , for any choice of $\mathcal{T}^{\star}$ . We have

	$\displaystyle Z_{j}-Z$	$\displaystyle\leq Z_{j}-\mathop{\mathbb{E}}_{\mathcal{T}^{\star}}\biggl{[}% \frac{1}{m}\sum_{i=1}^{m}\hat{f}(s_{i})(\xi_{i}-\mu)\biggr{]}$
		$\displaystyle=\mathop{\mathbb{E}}_{\xi_{j}}\biggl{[}-\frac{1}{m}\hat{f}(s_{j})% (\xi_{j}-\mu)\biggr{]}=0.$

We now show that $Z-Z_{j}\leq\mu(1-\mu)/m$ . We have

	$\displaystyle Z$	$\displaystyle\leq\mathop{\mathbb{E}}_{\mathcal{T}^{\star}}\biggl{[}\sup_{% \mathcal{P}\in\mathcal{L}}\biggl{\{}\frac{1}{m}\sum_{i=1,i\neq j}f_{\mathcal{P% }}(s_{i})(\xi_{i}-\mu)\biggr{\}}$
		$\displaystyle\;\;\;\;\;\;\;+\sup_{\mathcal{P}\in\mathcal{L}}\left\{\frac{1}{m}% f_{\mathcal{P}}(s_{j})(\xi_{j}-\mu)\right\}\biggr{]}$
		$\displaystyle\leq Z_{j}+\frac{\mu(1-\mu)}{m}.$

We now prove that $\sum_{j=1}^{m}(Z-Z_{j})\leq Z$ . It holds

	$\displaystyle\sum_{j=1}^{m}Z_{j}$	$\displaystyle\geq\mathop{\mathbb{E}}_{\mathcal{T}^{\star}}\biggl{[}\sup_{% \mathcal{P}\in\mathcal{L}}\biggl{\{}\frac{1}{m}\sum_{j=1}^{m}\sum_{i=1,i\neq j% }^{m}f_{\mathcal{P}}(s_{i})(\xi_{i}-\mu)\biggr{\}}\biggr{]}$
		$\displaystyle=\mathop{\mathbb{E}}_{\mathcal{T}^{\star}}\biggl{[}\sup_{\mathcal% {P}\in\mathcal{L}}\biggl{\{}\frac{m-1}{m}\sum_{i=1}^{m}f_{\mathcal{P}}(s_{i})(% \xi_{i}-\mu)\biggr{\}}\biggr{]}$
		$\displaystyle=(m-1)Z.$

As $Z$ is a self-bounding function, we have that

\displaystyle\Pr\left(\mathop{\mathbb{E}}[Z]-Z\geq q\right)\leq\exp\left(\frac% {-mq^{2}}{2\mu(1-\mu)\mathop{\mathbb{E}}[Z]}\right).

Imposing the r.h.s. $\leq\delta/4$ , solving for $q$ , and finding the fixed point of the inequality we obtain $\hat{d}$ . Let $\tilde{d}(\mathcal{R}^{\star})=\tilde{d}(\mathcal{R}^{\star},\mu(\mathcal{D}))$ . We apply McDiarmid’s inequality (Theorem 1) to show that $\mathop{\mathbb{E}}_{\mathcal{R}^{\star}}[\tilde{d}(\mathcal{R}^{\star})\>|\>\mathcal{A}]\leq\hat{r}$ with probability $\geq 1-\delta/4$ . In fact, define the function $g(\mathcal{R}^{\star})=\tilde{d}(\mathcal{R}^{\star})$ , and note that modifying any $\xi_{i,j}$ , for any pair $i,j$ , changes $g(\mathcal{R}^{\star})$ by at most $1/(cm)$ . Therefore, defining $C=\sum_{i}\sum_{j}(1/(cm))^{2}=1/(cm)$ , the upper bound $\hat{r}$ to $\mathop{\mathbb{E}}_{\mathcal{R}^{\star}}[\tilde{d}(\mathcal{R}^{\star})\>|\>\mathcal{A}]$ holds by Theorem 1. We now need to prove that the upper bound $\hat{d}$ computed using $\hat{\mathcal{R}}^{\star}$ is (probabilistically) not smaller than the one computed using $\mathcal{R}^{\star}$ . This is equivalent to showing that, for all $x$ ,

\displaystyle\Pr_{\hat{\mathcal{R}}^{\star}}\left(\tilde{d}(\hat{\mathcal{R}}^% {\star})>x\right)\geq\Pr_{\mathcal{R}^{\star}}\left(\tilde{d}(\mathcal{R}^{% \star})>x\right).

Equivalently, the probability of underestimating $\mathop{\mathbb{E}}_{\mathcal{R}^{\star}}[\tilde{d}(\mathcal{R}^{\star})]$ using $\tilde{d}(\mathcal{R}^{\star})$ does not increase when using $\tilde{d}(\hat{\mathcal{R}}^{\star})$ . Since the two probabilities are taken w.r.t. two different sample spaces, it is not possible to compare them directly. Therefore, we build an appropriate coupling between the two distributions. Define an $m\times c$ matrix $v$ of $mc$ i.i.d. Bernoulli random variables, such that $\Pr(v_{i,j}=1)=\varepsilon_{T}/(1-\mu)$ for all $i,j$ . (Note that we assume $0<\mu<1$ and $0\leq\check{\mu}\leq\mu\leq\hat{\mu}\leq 1$ , otherwise the statement holds trivially.) We observe that

\displaystyle\Pr_{\hat{\mathcal{R}}^{\star}}\left(\tilde{d}(\hat{\mathcal{R}}^% {\star})>x\right)=\Pr_{\hat{\mathcal{R}}^{\star}}\left(\frac{1}{c}\sum_{j=1}^{% c}\sup_{\mathcal{P}\in\mathcal{L}}\left\{\frac{1}{m}\sum_{i=1}^{m}f_{\mathcal{% P}}(s_{i})\left(\hat{\xi}_{i,j}-\mu\right)\right\}>x\right),

where $\hat{\xi}_{i,j}$ are i.i.d. Bernoulli with $\Pr(\hat{\xi}_{i,j}=1)=\hat{\mu}$ for all $i,j$ . We build the following coupling between the distributions of $\hat{\mathcal{R}}^{\star}$ and $\mathcal{R}^{\star}$ , using the fact that $\hat{\xi}_{i,j}\sim\max\{\xi_{i,j},v_{i,j}\}$ . This allows us to obtain the lower bound stated above:

	$\displaystyle\Pr_{\hat{\mathcal{R}}^{\star}}\left(\frac{1}{c}\sum_{j=1}^{c}% \sup_{\mathcal{P}\in\mathcal{L}}\left\{\frac{1}{m}\sum_{i=1}^{m}f_{\mathcal{P}% }(s_{i})\left(\hat{\xi}_{i,j}-\mu\right)\right\}>x\right)$
	$\displaystyle=\Pr_{\mathcal{R}^{\star},v}\left(\frac{1}{c}\sum_{j=1}^{c}\sup_{% \mathcal{P}\in\mathcal{L}}\left\{\frac{1}{m}\sum_{i=1}^{m}f_{\mathcal{P}}(s_{i% })\left(\max\{\xi_{i,j},v_{i,j}\}-\mu\right)\right\}>x\right)$
	$\displaystyle\geq\Pr_{\mathcal{R}^{\star},v}\left(\frac{1}{c}\sum_{j=1}^{c}% \sup_{\mathcal{P}\in\mathcal{L}}\left\{\frac{1}{m}\sum_{i=1}^{m}f_{\mathcal{P}% }(s_{i})\left(\xi_{i,j}-\mu\right)\right\}>x\right)$
	$\displaystyle=\Pr_{\mathcal{R}^{\star}}\left(\tilde{d}(\mathcal{R}^{\star})>x% \right).$

Then, it is immediate to observe that

\displaystyle\Pr_{\hat{\mathcal{R}}^{\star}}\left(\tilde{d}(\hat{\mathcal{R}}^% {\star},\check{\mu})>x\right)\geq\Pr_{\hat{\mathcal{R}}^{\star}}\left(\tilde{d% }(\hat{\mathcal{R}}^{\star})>x\right),

since $\mu$ is replaced by its lower bound $\check{\mu}$ in the definition of $\tilde{d}(\hat{\mathcal{R}}^{\star})$ , as $\tilde{d}(\hat{\mathcal{R}}^{\star},\check{\mu})\geq\tilde{d}(\hat{\mathcal{R}}^{\star})$ for all $\hat{\mathcal{R}}^{\star}$ . The statement follows by observing that all other quantities in the definition of $\varepsilon$ are constants independent of $\mu$ , and from a union bound over the $3$ concentration bounds considered in the proof and the event $\text{``}|\mu-\mu(\mathcal{D})|\leq\varepsilon_{T}\text{''}$ , each of which holds with probability $\geq 1-\delta/4$ . ∎
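The coupling used in the proof can be checked with a small sketch. With $\xi\sim Bern(\mu)$ and an independent $v\sim Bern(\varepsilon_{T}/(1-\mu))$ , the variable $\max\{\xi,v\}$ is Bernoulli with parameter $1-(1-\mu)(1-\varepsilon_{T}/(1-\mu))=\mu+\varepsilon_{T}$ and is never smaller than $\xi$ . The code below is an illustration under these assumptions only; the function names and the numeric values are hypothetical.

```python
import random


def coupled_success_prob(mu: float, eps: float) -> float:
    """Pr(max(xi, v) = 1) for independent xi ~ Bern(mu), v ~ Bern(eps / (1 - mu))."""
    return 1.0 - (1.0 - mu) * (1.0 - eps / (1.0 - mu))


def coupled_labels(mu: float, eps: float, n: int, rng: random.Random):
    """Return n pairs (xi, xi_hat) with xi_hat = max(xi, v) drawn jointly."""
    p_v = eps / (1.0 - mu)
    pairs = []
    for _ in range(n):
        xi = 1 if rng.random() < mu else 0
        v = 1 if rng.random() < p_v else 0
        pairs.append((xi, max(xi, v)))
    return pairs


pairs = coupled_labels(mu=0.3, eps=0.1, n=1000, rng=random.Random(1))
```

Because every coupled label satisfies $\hat{\xi}_{i,j}\geq\xi_{i,j}$ and the indicators $f_{\mathcal{P}}(s_{i})$ are nonnegative, each inner sum, and therefore each supremum, can only grow, which is the inequality step in the chain above.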

A.5. Power Analysis

In this section we prove the results on the power of FSR stated in Sections 4.1 and 4.2.

We first provide a probabilistic upper bound to $\tilde{d}(\mathcal{R}^{\star},\check{\mu})$ , the estimate of the supremum deviation of false discoveries computed by Algorithm 1 (in line 1). This result can be applied to general languages; we then show how to apply it to the language of subgroups. We define $N_{\mathcal{L}}(\mathcal{D})$ as the number of distinct projections of the language $\mathcal{L}$ on the dataset $\mathcal{D}$ :

\displaystyle N_{\mathcal{L}}(\mathcal{D})=\lvert\left\{\{i:\mathcal{P}\in s_{% i}\},\mathcal{P}\in\mathcal{L}\right\}\rvert.

Note that, differently from $|\mathcal{L}|$ , $N_{\mathcal{L}}(\mathcal{D})$ is always a finite value (a trivial upper bound is $N_{\mathcal{L}}(\mathcal{D})\leq 2^{m}$ ).
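For a finite list of candidate patterns, the number of distinct projections can be counted by collecting the index sets of the supporting samples. The sketch below assumes each pattern is given by its 0/1 indicator vector over the $m$ samples; the function name is hypothetical.

```python
def num_distinct_projections(covers):
    """Count distinct sets {i : P in s_i} among the given indicator vectors."""
    projections = {
        frozenset(i for i, f in enumerate(cover) if f) for cover in covers
    }
    return len(projections)


# Three patterns over m = 3 samples; the first two have the same projection.
example_covers = [[1, 0, 1], [1, 0, 1], [0, 1, 0]]
n_proj = num_distinct_projections(example_covers)
```

Here `n_proj` is 2 even though three patterns are listed, which is the reason $N_{\mathcal{L}}(\mathcal{D})$ can be much smaller than $|\mathcal{L}|$ .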

Theorem 2.

Let $\lambda\in(0,1)$ , and $\hat{\omega}=\hat{\mu}(1-\check{\mu})\sup_{\mathcal{P}\in\mathcal{L}}\mathsf{f% }_{\mathcal{P}}(\mathcal{D})$ . The value of $\tilde{d}(\mathcal{R}^{\star},\check{\mu})$ computed by FSR in line 1 of Algorithm 1 is

\displaystyle\tilde{d}(\mathcal{R}^{\star},\check{\mu})\leq\sqrt{\frac{2\hat{% \omega}\ln(N_{\mathcal{L}}(\mathcal{D}))}{m}}+\sqrt{\frac{\ln(\frac{1}{\lambda% })}{2cm}}+\frac{\ln(N_{\mathcal{L}}(\mathcal{D}))}{3m}

with probability at least $1-\lambda$ .

Proof.

To obtain the statement, we first prove an upper bound to the expectation $\mathop{\mathbb{E}}_{\mathbf{v}\sim I(\hat{\mu})}[\tilde{d}(\mathcal{R}^{\star% },\check{\mu})]$ , that is

\displaystyle\mathop{\mathbb{E}}_{\mathbf{v}\sim I(\hat{\mu})}[\tilde{d}(% \mathcal{R}^{\star},\check{\mu})]\leq\sqrt{\frac{2\hat{\omega}\ln(N_{\mathcal{% L}}(\mathcal{D}))}{m}}+\frac{\ln(N_{\mathcal{L}}(\mathcal{D}))}{3m},

and then conclude with a concentration bound for $\tilde{d}(\mathcal{R}^{\star},\check{\mu})$ w.r.t. its expectation $\mathop{\mathbb{E}}_{\mathbf{v}\sim I(\hat{\mu})}[\tilde{d}(\mathcal{R}^{\star},\check{\mu})]$ , using derivations analogous to those in the proof of Theorem 6. For any $\mathcal{P}\in\mathcal{L}$ , define $X_{\mathcal{P}}=\frac{1}{m}\sum_{i=1}^{m}f_{\mathcal{P}}(s_{i})(\mathbf{v}_{i}-\check{\mu})$ . Consider two patterns $\mathcal{P}_{1},\mathcal{P}_{2}$ such that the projection of $\mathcal{P}_{1}$ on $\mathcal{D}$ is equal to the projection of $\mathcal{P}_{2}$ on $\mathcal{D}$ , i.e., it holds $\{i:\mathcal{P}_{1}\in s_{i}\}=\{i:\mathcal{P}_{2}\in s_{i}\}$ . This implies that $\sup_{i\in\{1,2\}}X_{\mathcal{P}_{i}}=X_{\mathcal{P}_{1}}$ . Therefore, we can rewrite the supremum within $\mathop{\mathbb{E}}_{\mathbf{v}\sim I(\hat{\mu})}[\tilde{d}(\mathcal{R}^{\star},\check{\mu})]$ over the set of patterns with distinct projections, recalling that the number of such distinct projections is $N_{\mathcal{L}}(\mathcal{D})$ . Now, for any $\mathcal{P}\in\mathcal{L}$ , Bernstein’s inequality (Theorem 2.10 in (Boucheron et al., 2013)) implies that $X_{\mathcal{P}}$ is a sub-gamma random variable (Section 2.4 (Boucheron et al., 2013)), such that $X_{\mathcal{P}}\in\Gamma_{+}(u,b)$ with $u=\hat{\omega}/m$ and $b=1/(3m)$ , since it is an average of i.i.d. random variables $f_{\mathcal{P}}(s_{i})(\mathbf{v}_{i}-\check{\mu})$ that are bounded in the interval $[-\check{\mu},1-\check{\mu}]$ and have variance $\leq\hat{\omega}$ . Consequently, we apply a maximal inequality (Corollary 2.6 in (Boucheron et al., 2013)) to upper bound the expected maximum of sub-gamma random variables, obtaining that $\mathop{\mathbb{E}}[\max_{\mathcal{P}}X_{\mathcal{P}}]\leq\sqrt{2u\ln(N_{\mathcal{L}}(\mathcal{D}))}+b\ln(N_{\mathcal{L}}(\mathcal{D}))$ . Note that the upper bound to the expectation given above holds. 
The statement follows from the application of Theorem 1 to $\tilde{d}(\mathcal{R}^{\star},\check{\mu})$ , following the same steps of the proof of Theorem 6. ∎

To upper bound $N_{\mathcal{L}}(\mathcal{D})$ for the language of subgroups, we prove the following. Note that we focus on the case of subgroups with continuous features, since every categorical feature $f_{c}$ can be converted to a discrete one $f_{d}$ (assigning a random order to the distinct elements), and observing that each equality condition $\text{``}f_{c}=a\text{''}$ is equivalent to the interval $\text{``}f_{d}\in[a-x,a+x]\text{''}$ for some $x>0$ . Therefore, the language of subgroups over continuous features contains the language over datasets with both continuous and categorical features, thus all results for the former apply to the latter. Note the reverse direction is not true in general.

Lemma 3.

Let $\mathcal{L}$ be the language of subgroups composed by conjunctions with at most $z$ conditions over $d$ continuous features. Then, it holds $N_{\mathcal{L}}(\mathcal{D})\leq\left(\frac{e^{3}dm^{2}}{4z^{3}}\right)^{z}$ .

Proof.

Consider a dataset $\mathcal{D}$ with $d$ continuous features, and let $[1,d]$ be the indices of these features. Consider any $A\subseteq[1,d]$ with $|A|=v\leq z$ , and define the language $\mathcal{L}_{A}\subseteq\mathcal{L}$ as all subgroups with $v$ conjunction terms involving conditions on the features of $A$ , such as inequalities or intervals. Equivalently, we can see the projection of $\mathcal{L}_{A}$ over the transactions of $\mathcal{D}$ as the class of axis-aligned rectangles in $\mathbb{R}^{v}$ . It is known that the VC-dimension of this class is $2v$ (Problem 6.5 in (Shalev-Shwartz and Ben-David, 2014)). Therefore, from Sauer-Shelah-Perles’ Lemma (Lemma 6.10 in (Shalev-Shwartz and Ben-David, 2014)), the number $N_{\mathcal{L}_{A}}(\mathcal{D})$ of distinct projections of $\mathcal{L}_{A}$ on $\mathcal{D}$ is $N_{\mathcal{L}_{A}}(\mathcal{D})\leq\sum_{i=0}^{2v}\binom{m}{i}\leq\left(\frac{em}{2v}\right)^{2v}$ . From a union bound,

\displaystyle N_{\mathcal{L}}(\mathcal{D})\leq\sum_{A}N_{\mathcal{L}_{A}}(% \mathcal{D})\leq\sum_{v=1}^{z}\binom{d}{v}\left(\frac{em}{2v}\right)^{2v}\leq% \left(\frac{em}{2z}\right)^{2z}\left(\frac{ed}{z}\right)^{z},

obtaining the statement. ∎
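As a sanity check of Lemma 3 in the simplest case $d=z=1$ , the distinct projections of interval conditions on a single continuous feature can be enumerated by brute force and compared with the bound $\frac{e^{3}m^{2}}{4}$ . The sketch below is only an illustration: the function names and the toy feature values are hypothetical, and the enumeration uses closed intervals with endpoints at data values (plus the empty projection), which is sufficient on a single feature.

```python
import math


def lemma3_bound(m: int, d: int, z: int) -> float:
    """(e^3 d m^2 / (4 z^3))^z, the upper bound on N_L(D) from Lemma 3."""
    return (math.e ** 3 * d * m ** 2 / (4 * z ** 3)) ** z


def interval_projections(values):
    """Distinct index sets selected by closed intervals [lo, hi] on one feature."""
    points = sorted(set(values))
    projections = {frozenset()}  # an interval containing no data point
    for lo in points:
        for hi in points:
            if lo <= hi:
                projections.add(
                    frozenset(i for i, x in enumerate(values) if lo <= x <= hi)
                )
    return projections


feature = [0.1, 0.4, 0.2, 0.9, 0.7]          # m = 5 distinct values
count = len(interval_projections(feature))   # 15 nonempty ranges + empty set
bound = lemma3_bound(m=len(feature), d=1, z=1)
```

With $m=5$ distinct values there are $\frac{m(m+1)}{2}=15$ nonempty projections, well below the bound, which is loose for small instances but grows only polynomially in $m$ and $d$ for fixed $z$ .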

Combining Lemma 3 and Theorem 2, we obtain the following Corollary.

Corollary 4.

Let $\mathcal{L}$ be the language of subgroups composed by conjunctions with at most $z$ conditions over $d$ continuous features. Then the value $\tilde{d}(\mathcal{R}^{\star},\check{\mu})$ computed by FSR in line 1 of Algorithm 1 is

\displaystyle\tilde{d}(\mathcal{R}^{\star},\check{\mu})\leq\sqrt{\frac{2\hat{% \omega}z\ln(\frac{e^{3}dm^{2}}{4z^{3}})}{m}}+\sqrt{\frac{\ln(\frac{1}{\lambda}% )}{2cm}}+\frac{z\ln(\frac{e^{3}dm^{2}}{4z^{3}})}{3m}

with probability at least $1-\lambda$ .
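The bound of Corollary 4 is explicit, so its dependence on $m$ and $c$ can be inspected numerically. The sketch below simply evaluates the right-hand side as stated; the function name and the example parameter values are hypothetical choices made for illustration.

```python
import math


def corollary4_bound(m: int, d: int, z: int, c: int,
                     omega_hat: float, lam: float) -> float:
    """Right-hand side of Corollary 4 for subgroups with at most z conditions."""
    log_n = z * math.log(math.e ** 3 * d * m ** 2 / (4 * z ** 3))
    return (
        math.sqrt(2 * omega_hat * log_n / m)
        + math.sqrt(math.log(1 / lam) / (2 * c * m))
        + log_n / (3 * m)
    )


small_sample = corollary4_bound(m=10**3, d=10, z=2, c=10, omega_hat=0.25, lam=0.05)
large_sample = corollary4_bound(m=10**5, d=10, z=2, c=10, omega_hat=0.25, lam=0.05)
```

The first term dominates for typical values, and the number $c$ of resampled datasets only enters the second term, so the bound improves with $c$ but is driven mainly by $m$ .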

A.5.1. Power analysis of FSR-C

To prove Theorem 8, regarding the power of FSR-C, we first describe the model we assume for the distribution of the alternative hypotheses, i.e., the set of patterns with $\mathsf{q}_{\mathcal{P}}>0$ . We assume that the quality of patterns correlated with the target follows Wallenius’ noncentral hypergeometric distribution (Wallenius, 1963), a generalization of the hypergeometric distribution that allows modeling a biased random sampling of a contingency table with fixed marginals. We define the model $W_{n}(\{m_{i},w_{i}\})$ that describes the distribution of a sequence of binary random variables $\ell_{1},\dots,\ell_{n}$ for the weighted sampling of $n$ elements from a set of $m_{1}$ items with label $1$ and $m_{0}$ items with label $0$ ; the parameters $w_{0}$ and $w_{1}$ , with $1\leq w_{0}<w_{1}$ , are respectively the weights of items with label $0$ and $1$ . The fact $w_{1}>w_{0}$ expresses the bias toward sampling items with label $1$ . The first element $\ell_{1}$ is sampled according to the weighted proportion of the items, such that

\displaystyle\Pr\left(\ell_{1}=1\right)=\frac{m_{1}w_{1}}{m_{1}w_{1}+m_{0}w_{0% }}.

The second element $\ell_{2}$ is taken according to the weighted proportion of the remaining items, therefore depending on the outcome of the first choice. In general, we obtain that

\displaystyle\Pr\left(\ell_{i}=1\right)=\frac{m_{1}^{i}w_{1}}{m_{1}^{i}w_{1}+m% _{0}^{i}w_{0}},

where $m_{1}^{i}=m_{1}-\sum_{j=1}^{i-1}\ell_{j}$ and $m_{0}^{i}=m_{0}-\sum_{j=1}^{i-1}(1-\ell_{j})$ are, respectively, the number of remaining items with label $1$ and label $0$ , i.e., the ones that are not sampled in previous steps. We may observe that, as the bias goes to $0$ (i.e., $w_{1}\rightarrow w_{0}$ ), this distribution converges to the (standard) hypergeometric distribution. We remark that a direct calculation of the mean and of the probability mass function of the noncentral hypergeometric distribution above, and therefore the computation of exact tail bounds, is extremely unwieldy. Moreover, the random variables $\{\ell_{i},i\in[1,n]\}$ are clearly not independent, therefore standard concentration results (e.g., Chernoff-Hoeffding bounds) do not apply directly.
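The sequential sampling scheme described above is straightforward to simulate, which can help build intuition even though closed-form tail bounds are unwieldy. The following is a minimal sketch of one draw from the model, assuming $n\leq m_{1}+m_{0}$ ; the function name and the parameter values are hypothetical.

```python
import random


def sample_biased_labels(n, m1, m0, w1, w0, rng):
    """Draw n labels without replacement from m1 ones and m0 zeros,
    picking label 1 with probability proportional to (remaining ones) * w1."""
    remaining1, remaining0 = m1, m0
    labels = []
    for _ in range(n):
        p1 = remaining1 * w1 / (remaining1 * w1 + remaining0 * w0)
        label = 1 if rng.random() < p1 else 0
        labels.append(label)
        if label == 1:
            remaining1 -= 1
        else:
            remaining0 -= 1
    return labels


rng = random.Random(0)
all_items = sample_biased_labels(n=10, m1=3, m0=7, w1=2.0, w0=1.0, rng=rng)
partial = sample_biased_labels(n=5, m1=3, m0=7, w1=2.0, w0=1.0, rng=rng)
```

When all $m_{1}+m_{0}$ items are drawn, the number of ones is exactly $m_{1}$ regardless of the weights; the bias only affects how early the items with label $1$ tend to appear, and hence the distribution of the sum for $n<m_{1}+m_{0}$ .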

To overcome these issues, we leverage an advanced concentration bound for martingales, which also applies to random variables that are not necessarily independent (Dubhashi and Panconesi, 2009). We use the following version of the method of bounded differences (McDiarmid, 1989). For a given set of random variables $X_{1},\dots,X_{n}$ , we use $\mathbf{X}_{i}$ to denote the set $\{X_{j},j\in[1,i]\}$ .

Definition 5.

A function $f$ satisfies the Averaged Lipschitz Condition (ALC) with parameters $c_{i},i\in[n]$ , with respect to the random variables $X_{1},\dots,X_{n}$ if for any $a_{i},a^{\prime}_{i}$ ,

\displaystyle\lvert\mathop{\mathbb{E}}\left[f\mid\mathbf{X}_{i-1},X_{i}=a_{i}% \right]-\mathop{\mathbb{E}}\left[f\mid\mathbf{X}_{i-1},X_{i}=a^{\prime}_{i}% \right]\rvert\leq c_{i},

for $1\leq i\leq n$ .

The following result establishes concentration bounds for functions that satisfy the ALC.

Theorem 6 (Corollary 5.1 (Dubhashi and Panconesi, 2009)).

Let $f$ satisfy the ALC with parameters $c_{i},i\in[n]$ , with respect to the random variables $X_{1},\dots,X_{n}$ , and let $C=\sum_{i=1}^{n}c_{i}^{2}$ . Then it holds

\displaystyle\Pr\left(f>\mathop{\mathbb{E}}[f]+t\right),\Pr\left(f<\mathop{% \mathbb{E}}[f]-t\right)\leq\exp(-2t^{2}/C).

We now prove that the sum $\sum_{i=1}^{n}\ell_{i}$ is a function that satisfies the ALC, and is thus sharply concentrated around its expectation.

Theorem 7.

Let $\{\ell_{i},i\in[1,n]\}$ be a set of random variables distributed according to $W_{n}(\{m_{i},w_{i}\})$ . Denote $X_{i}=\ell_{i}$ and the function $f=\sum_{i=1}^{n}X_{i}$ . Then, it holds

\displaystyle\Pr\left(f>\mathop{\mathbb{E}}[f]+t\right),\Pr\left(f<\mathop{% \mathbb{E}}[f]-t\right)\leq\exp(-2t^{2}/n).

Proof.

We prove that $f$ satisfies the ALC. For any $i\in[1,n]$ ,

	$\displaystyle\mathop{\mathbb{E}}\Bigl{[}\sum_{j=1}^{n}X_{j}\mid\mathbf{X}_{i-1% },X_{i}=1\Bigr{]}-\mathop{\mathbb{E}}\Bigl{[}\sum_{j=1}^{n}X_{j}\mid\mathbf{X}% _{i-1},X_{i}=0\Bigr{]}$
	$\displaystyle=1+\mathop{\mathbb{E}}\Bigl{[}\sum_{j=i+1}^{n}X_{j}\mid\mathbf{X}% _{i-1},X_{i}=1\Bigr{]}-\mathop{\mathbb{E}}\Bigl{[}\sum_{j=i+1}^{n}X_{j}\mid% \mathbf{X}_{i-1},X_{i}=0\Bigr{]}$
	$\displaystyle\leq 1+\mathop{\mathbb{E}}\Bigl{[}\sum_{j=i+1}^{n}X_{j}\mid% \mathbf{X}_{i-1},X_{i}=0\Bigr{]}-\mathop{\mathbb{E}}\Bigl{[}\sum_{j=i+1}^{n}X_% {j}\mid\mathbf{X}_{i-1},X_{i}=0\Bigr{]}=1.$

We then prove the other direction:

	$\displaystyle\mathop{\mathbb{E}}\Bigl{[}\sum_{j=1}^{n}X_{j}\mid\mathbf{X}_{i-1% },X_{i}=0\Bigr{]}-\mathop{\mathbb{E}}\Bigl{[}\sum_{j=1}^{n}X_{j}\mid\mathbf{X}% _{i-1},X_{i}=1\Bigr{]}$
	$\displaystyle=\mathop{\mathbb{E}}\Bigl{[}\sum_{j=i+1}^{n}X_{j}\mid\mathbf{X}_{% i-1},X_{i}=0\Bigr{]}-1-\mathop{\mathbb{E}}\Bigl{[}\sum_{j=i+1}^{n}X_{j}\mid% \mathbf{X}_{i-1},X_{i}=1\Bigr{]}$
	$\displaystyle\leq 1+\mathop{\mathbb{E}}\Bigl{[}\sum_{j=i+1}^{n}X_{j}\mid% \mathbf{X}_{i-1},X_{i}=1\Bigr{]}-1-\mathop{\mathbb{E}}\Bigl{[}\sum_{j=i+1}^{n}% X_{j}\mid\mathbf{X}_{i-1},X_{i}=1\Bigr{]}$
	$\displaystyle=0.$

We conclude that $f$ satisfies the ALC with $c_{i}=1$ , for all $1\leq i\leq n$ , and the concentration bounds follow from Theorem 6. ∎

Using the result above, we prove a probabilistic lower bound to the observed quality ${\bar{\mathsf{q}}_{\mathcal{P}}(\mathcal{D})}$ of a pattern $\mathcal{P}$ . For any $\mathcal{P}\in\mathcal{L}$ , let $n=m\mathsf{f}_{\mathcal{P}}(\mathcal{D})$ be the number of transactions of $\mathcal{D}$ where $\mathcal{P}$ is supported, and assume w.l.o.g. that $\{i:\mathcal{P}\in s_{i},i\in[1,m]\}=[1,n]$ . We define $X_{i}=\ell_{i}$ for all $i\in[1,n]$ , where $\ell_{i}$ follows the biased sampling distribution described above.

Proposition 8.

It holds

\displaystyle\Pr\left({\bar{\mathsf{q}}_{\mathcal{P}}(\mathcal{D})}\leq\mathsf% {q}_{\mathcal{P}}-t\right)\leq\exp\left(\frac{-2mt^{2}}{\mathsf{f}_{\mathcal{P% }}(\mathcal{D})}\right).

Proof.

Define the function $g=\frac{1}{m}\sum_{i=1}^{n}(\ell_{i}-\mu(\mathcal{D}))$ . Considering the function $f$ in the statement of Theorem 7, and recalling that $n=m\mathsf{f}_{\mathcal{P}}(\mathcal{D})$ , we observe that $g=f/m-\mathsf{f}_{\mathcal{P}}(\mathcal{D})\mu(\mathcal{D})$ . From these observations, we apply Theorem 7 to the function $f=m(g+\mathsf{f}_{\mathcal{P}}(\mathcal{D})\mu(\mathcal{D}))$ , obtaining the statement after simple manipulations of the r.h.s. ∎

We now extend Proposition 8 to obtain a bound valid simultaneously for all patterns of the language $\mathcal{L}$ .

Proposition 9.

Define $\hat{\mathsf{f}}(\mathcal{D})=\sup_{\mathcal{P}}\mathsf{f}_{\mathcal{P}}(% \mathcal{D})$ . With probability $\geq 1-\delta$ , it holds

\displaystyle{\bar{\mathsf{q}}_{\mathcal{P}}(\mathcal{D})}\geq\mathsf{q}_{% \mathcal{P}}-\sqrt{\frac{2\hat{\mathsf{f}}(\mathcal{D})\ln(N_{\mathcal{L}}(% \mathcal{D})/\delta)}{m}},\forall\mathcal{P}\in\mathcal{L}.

Proof.

Define the event

\displaystyle E_{\mathcal{P}}=\text{``}{\bar{\mathsf{q}}_{\mathcal{P}}(% \mathcal{D})}\leq\mathsf{q}_{\mathcal{P}}-\sqrt{\frac{2\hat{\mathsf{f}}(% \mathcal{D})\ln(N_{\mathcal{L}}(\mathcal{D})/\delta)}{m}}\text{''}.

The statement holds if $\Pr(\cup_{\mathcal{P}}E_{\mathcal{P}})\leq\delta$ . Now, consider two patterns $\mathcal{P}_{1},\mathcal{P}_{2}$ that are supported by the same set of transactions: $\{i:\mathcal{P}_{1}\in s_{i}\}=\{i:\mathcal{P}_{2}\in s_{i}\}$ . This implies that ${\bar{\mathsf{q}}_{\mathcal{P}_{1}}(\mathcal{D})}={\bar{\mathsf{q}}_{\mathcal{P}_{2}}(\mathcal{D})}$ for all possible labels, but also that $\mathsf{q}_{\mathcal{P}_{1}}=\mathsf{q}_{\mathcal{P}_{2}}$ , since the two qualities are functions of the same set of labels. From these observations, it holds $E_{\mathcal{P}_{1}}\cup E_{\mathcal{P}_{2}}=E_{\mathcal{P}_{1}}$ . Therefore, the union over all patterns can be replaced by the union over the ones with distinct projections, whose number is at most $N_{\mathcal{L}}(\mathcal{D})$ . From a union bound over these events, and using the tail bound of Proposition 8 with the fact $\mathsf{f}_{\mathcal{P}}(\mathcal{D})\leq\hat{\mathsf{f}}(\mathcal{D})$ , we have

\displaystyle\Pr\left(\cup_{\mathcal{P}}E_{\mathcal{P}}\right)\leq N_{\mathcal% {L}}(\mathcal{D})\exp\bigl{(}-2mt^{2}/\hat{\mathsf{f}}(\mathcal{D})\bigr{)}.

Imposing the r.h.s. of the inequality to be $\leq\delta$ , and solving for $t$ , we obtain the statement. ∎
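For completeness, the last step can be spelled out. Setting the right-hand side of the union bound equal to $\delta$ and solving for $t>0$ gives the value below; the comparison with the deviation term appearing in the statement of Proposition 9 is our own remark, and it only uses the fact that $\frac{1}{2}\leq 2$ .

```latex
N_{\mathcal{L}}(\mathcal{D})\exp\bigl(-2mt^{2}/\hat{\mathsf{f}}(\mathcal{D})\bigr)=\delta
\iff
t=\sqrt{\frac{\hat{\mathsf{f}}(\mathcal{D})\ln(N_{\mathcal{L}}(\mathcal{D})/\delta)}{2m}}
\leq\sqrt{\frac{2\hat{\mathsf{f}}(\mathcal{D})\ln(N_{\mathcal{L}}(\mathcal{D})/\delta)}{m}}.
```

Since the deviation term in the statement is at least as large as the exact solution, each event $E_{\mathcal{P}}$ defined with it is contained in the corresponding event for the exact solution, and the union bound still yields probability at most $\delta$ .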

Finally, we prove the power guarantees for FSR-C.

Proof of Theorem 8.

To obtain the statement, we show that the condition $\mathsf{q}_{\mathcal{P}}\geq\xi$ , for some $\xi>0$ sufficiently large, implies that $\mathcal{P}$ is reported in output with probability $\geq 1-\delta$ . In fact, we have that the implication $\mathsf{q}_{\mathcal{P}}\geq\xi\implies{\bar{\mathsf{q}}_{\mathcal{P}}(\mathcal{D})}\geq\xi-\varepsilon_{1}$ holds with probability $\geq 1-\delta/2$ for all $\mathcal{P}\in\mathcal{L}$ , where $\varepsilon_{1}=\sqrt{\frac{2\hat{\mathsf{f}}(\mathcal{D})z\ln(\frac{e^{3}m^{2}d}{2z^{3}\delta})}{m}}$ is obtained from Proposition 9 (replacing $\delta$ by $\delta/2$ , and using the upper bound to $N_{\mathcal{L}}(\mathcal{D})$ from Lemma 3). Consequently, we seek a sufficient condition to ensure that $\mathcal{P}$ is reported in output; this condition is ${\bar{\mathsf{q}}_{\mathcal{P}}(\mathcal{D})}\geq\xi-\varepsilon_{1}\geq\varepsilon$ , which is implied by $\xi\geq\varepsilon_{1}+\varepsilon$ , where $\varepsilon$ is as defined in Corollary 7. Combining this inequality with the upper bound to $\tilde{d}(\mathcal{R}^{\star},\check{\mu})$ of Theorem 2 (setting $\lambda=\delta/2$ ), we obtain the lower bound to $\mathsf{q}_{\mathcal{P}}$ in the statement, which holds with probability $\geq 1-\delta$ from a union bound. ∎

A.5.2. Power analysis of FSR-U

The following is a restatement of Theorem 5 of (Li et al., 2001), that provides uniform convergence bounds for families of functions with bounded pseudodimension (Pollard, 2012; Shalev-Shwartz and Ben-David, 2014).

Theorem 10.

Let $\mathcal{F}$ be a family of functions from a domain $\mathcal{X}$ to $[a,b]\subset\mathbb{R}$ with pseudodimension $P(\mathcal{F})\leq r$ . Let $\mathcal{S}=\{s_{1},\dots,s_{m}\}$ be a random sample of size $m$ taken i.i.d. from a distribution $\gamma$ . It holds

\displaystyle\Pr\left(\sup_{f\in\mathcal{F}}\left\lvert\mathop{\mathbb{E}}_{\mathcal{S}}\left[\frac{1}{m}\sum_{i=1}^{m}f(s_{i})\right]-\frac{1}{m}\sum_{i=1}^{m}f(s_{i})\right\rvert>t\right)\leq\exp\left(r-\frac{t^{2}m}{\tilde{c}(b-a)^{2}}\right),

where $\tilde{c}$ is an absolute constant.

We note that the absolute constant $\tilde{c}$ in the Theorem above is estimated to be at most $0.5$ (Löffler and Phillips, 2009).
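As a rough numerical illustration of Theorem 10, the sketch below evaluates the right-hand side of the bound and inverts it to obtain the deviation $t$ achieved at a given confidence level. The function names are hypothetical, and the default $\tilde{c}=0.5$ is an assumption that simply adopts the estimate quoted above.

```python
import math

def uniform_tail(t, m, r, a=0.0, b=1.0, c_tilde=0.5):
    """Right-hand side of Theorem 10: exp(r - t^2 m / (c_tilde (b - a)^2))."""
    return math.exp(r - t * t * m / (c_tilde * (b - a) ** 2))

def uniform_deviation(delta, m, r, a=0.0, b=1.0, c_tilde=0.5):
    """Smallest t for which the right-hand side above is at most delta."""
    return (b - a) * math.sqrt(c_tilde * (r + math.log(1.0 / delta)) / m)

# Illustrative values: pseudodimension at most 4, 10000 samples, delta = 0.05.
t_star = uniform_deviation(delta=0.05, m=10_000, r=4)
```

For functions with values in $[0,1]$ and $\tilde{c}=0.5$ , the tail reduces to $\exp(r-2t^{2}m)$ .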

Using the result above, we prove the following bounds to the supports and qualities for all subgroups.

Proposition 11.

Let $\mathcal{L}$ be the language of subgroups composed of conjunctions with at most $z$ conditions over $d$ continuous features, and define $\mathsf{f}_{\mathcal{P}}=\mathop{\mathbb{E}}_{\mathcal{D}}[\mathsf{f}_{\mathcal{P}}(\mathcal{D})]$ . With probability $\geq 1-\delta$ w.r.t. $\mathcal{D}$ it holds, for all $\mathcal{P}\in\mathcal{L}$ ,

	$\displaystyle\mathsf{f}_{\mathcal{P}}(\mathcal{D})$	$\displaystyle\leq\mathsf{f}_{\mathcal{P}}+\sqrt{\frac{2z+\ln\bigl{(}\sum_{i=1}^{z}\binom{d}{i}\bigr{)}+\ln\bigl{(}\frac{2}{\delta}\bigr{)}}{2m}},$
	$\displaystyle\mathsf{q}_{\mathcal{P}}(\mathcal{D})$	$\displaystyle\geq\mathsf{q}_{\mathcal{P}}-\sqrt{\frac{2z+\ln\bigl{(}\sum_{i=1}^{z}\binom{d}{i}\bigr{)}+\ln\bigl{(}\frac{2}{\delta}\bigr{)}}{2m}}.$

Proof.

We first prove the bound for $\mathsf{f}_{\mathcal{P}}(\mathcal{D})$ . For any $\mathcal{P}\in\mathcal{L}_{A}$ , recall the functions $f_{\mathcal{P}}(s)=\mathds{1}\left[\mathcal{P}\in s\right]$ , and let $\mathcal{F}_{A}=\{f_{\mathcal{P}}:\mathcal{P}\in\mathcal{L}_{A}\}$ be the family of such functions. It is immediate to observe that the average value of $f_{\mathcal{P}}$ over $\mathcal{D}$ is equal to $\mathsf{f}_{\mathcal{P}}(\mathcal{D})$ . As discussed previously, $\mathcal{F}_{A}$ is equivalent to the class of axis-aligned rectangles in $\mathbb{R}^{v}$ , with VC-dimension $2v$ (Problem 6.5 in (Shalev-Shwartz and Ben-David, 2014)). We apply Theorem 10 to the particular case of the binary function family $\mathcal{F}_{A}$ , obtaining that

\displaystyle\Pr\Bigl{(}\sup_{f_{\mathcal{P}}\in\mathcal{F}_{A}}\left\lvert\mathsf{f}_{\mathcal{P}}-\mathsf{f}_{\mathcal{P}}(\mathcal{D})\right\rvert>t\Bigr{)}\leq\exp(2v-2t^{2}m).

From a union bound over all subsets $A\subseteq[1,d]$ with cardinality $\leq z$ , we have

	$\displaystyle\Pr\Bigl{(}\sup_{A}\sup_{f_{\mathcal{P}}\in\mathcal{F}_{A}}\left\lvert\mathsf{f}_{\mathcal{P}}-\mathsf{f}_{\mathcal{P}}(\mathcal{D})\right\rvert>t\Bigr{)}$
	$\displaystyle\leq\sum_{A}\Pr\Bigl{(}\sup_{f_{\mathcal{P}}\in\mathcal{F}_{A}}\left\lvert\mathsf{f}_{\mathcal{P}}-\mathsf{f}_{\mathcal{P}}(\mathcal{D})\right\rvert>t\Bigr{)}\leq\sum_{i=1}^{z}\binom{d}{i}\exp(2i-2t^{2}m).$

Bounding $\exp(2i)\leq\exp(2z)$ for all $i\leq z$ , setting the r.h.s. to be $\leq\delta/2$ , and solving for $t$ , we obtain the first bound.

We now focus on the bound for $\mathsf{q}_{\mathcal{P}}(\mathcal{D})$ . For any $\mathcal{P}\in\mathcal{L}_{A}$ , recall the functions $g_{\mathcal{P}}(s,\ell)=\mathds{1}\left[\mathcal{P}\in s\right](\ell-\mu)$ , and let $\mathcal{G}_{A}=\{g_{\mathcal{P}}:\mathcal{P}\in\mathcal{L}_{A}\}$ be the family of such functions. It is immediate to observe that the average value of $g_{\mathcal{P}}$ over $\mathcal{D}$ is equal to $\mathsf{q}_{\mathcal{P}}(\mathcal{D})$ . By using Lemma 3.6 of (Riondato and Vandin, 2020), we obtain that the pseudodimension of $\mathcal{G}_{A}$ is equal to the VC-dimension of $\mathcal{F}_{A}$ , which is $2v$ . Therefore, we apply Theorem 10 to $\mathcal{G}_{A}$ , having

\displaystyle\Pr\Bigl{(}\sup_{g_{\mathcal{P}}\in\mathcal{G}_{A}}\left\lvert\mathsf{q}_{\mathcal{P}}-\mathsf{q}_{\mathcal{P}}(\mathcal{D})\right\rvert>t\Bigr{)}\leq\exp(2v-2t^{2}m).

We complete the proof by following the same steps as for the frequencies. ∎

We are now ready to prove Theorem 13.

Proof of Theorem 13.

From the guarantees of Corollary 4, the value of $\varepsilon$ defined in the statement is an upper bound to the value of $\varepsilon$ computed by FSR-U in Algorithm 1 with probability $\geq 1-\delta/3$ (since $\lambda=\delta/3$ in the definition of $\hat{r}$ ). We now prove a condition on the quality $\mathsf{q}_{\mathcal{P}}$ for a pattern to be reported in output. From Proposition 11 (replacing $\delta/2$ by $\delta/3$ ), a pattern with $\mathsf{q}_{\mathcal{P}}\geq\xi$ has, with probability $\geq 1-\delta/3$ , its estimate $\mathsf{q}_{\mathcal{P}}(\mathcal{D})\geq\xi-\varepsilon_{1}$ , where

\displaystyle\varepsilon_{1}=\sqrt{\frac{2z+\ln\bigl{(}\sum_{i=1}^{z}\binom{d}{i}\bigr{)}+\ln\bigl{(}\frac{3}{\delta}\bigr{)}}{2m}}.

Moreover, from the relation between $\mathsf{q}_{\mathcal{P}}(\mathcal{D})$ and ${\bar{\mathsf{q}}_{\mathcal{P}}(\mathcal{D})}$ , it holds ${\bar{\mathsf{q}}_{\mathcal{P}}(\mathcal{D})}\geq\xi-\varepsilon_{1}-\varepsilon_{T}\mathsf{f}_{\mathcal{P}}(\mathcal{D})$ . Consequently, $\mathcal{P}$ is reported in output if ${\bar{\mathsf{q}}_{\mathcal{P}}(\mathcal{D})}\geq\xi-\varepsilon_{1}-\varepsilon_{T}\mathsf{f}_{\mathcal{P}}(\mathcal{D})\geq\varepsilon+\varepsilon_{T}\mathsf{f}_{\mathcal{P}}(\mathcal{D})$ , that is, if $\xi\geq\varepsilon_{1}+2\varepsilon_{T}\mathsf{f}_{\mathcal{P}}(\mathcal{D})+\varepsilon$ . Using the upper bound to $\mathsf{f}_{\mathcal{P}}(\mathcal{D})$ proved in Proposition 11 (replacing $\delta/2$ by $\delta/3$ ), the condition $\mathsf{q}_{\mathcal{P}}\geq\xi\geq\varepsilon_{1}+2\varepsilon_{T}(\mathsf{f}_{\mathcal{P}}+\varepsilon_{1})+\varepsilon$ is sufficient to guarantee that, with probability $\geq 1-\delta$ , $\mathcal{P}$ is reported in output, obtaining the statement. ∎
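The sufficient condition derived in this proof can be evaluated numerically. The sketch below computes $\varepsilon_{1}$ as displayed above and the resulting threshold $\varepsilon_{1}+2\varepsilon_{T}(\mathsf{f}_{\mathcal{P}}+\varepsilon_{1})+\varepsilon$ . The function names are hypothetical, and the values passed for $\varepsilon$ , $\varepsilon_{T}$ , and $\mathsf{f}_{\mathcal{P}}$ are placeholders: in practice they come from the theorem statement and from the run of the algorithm, which are not reproduced here.

```python
import math

def epsilon_1(m, d, z, delta):
    """Deviation term of Proposition 11 with delta/2 replaced by delta/3."""
    n_subsets = sum(math.comb(d, i) for i in range(1, z + 1))
    return math.sqrt(
        (2 * z + math.log(n_subsets) + math.log(3 / delta)) / (2 * m)
    )

def sufficient_quality(f_p, eps, eps_t, m, d, z, delta):
    """Value eps_1 + 2 eps_T (f_P + eps_1) + eps from the proof of Theorem 13."""
    e1 = epsilon_1(m, d, z, delta)
    return e1 + 2 * eps_t * (f_p + e1) + eps

# Placeholder inputs, chosen only for illustration.
xi_min = sufficient_quality(
    f_p=0.2, eps=0.01, eps_t=0.005, m=10_000, d=676, z=1, delta=0.05
)
```

Since $\varepsilon_{1}$ shrinks as $1/\sqrt{m}$ , the threshold on the quality decreases as the number of samples grows, for fixed $\varepsilon$ and $\varepsilon_{T}$ .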

A.6. Baseline methods

In this Section we provide additional details on the baseline algorithm FSR-U-UB considered in our experimental evaluation. As anticipated in Section 5, it is not possible to apply standard statistical techniques (e.g., Bonferroni correction) to bound the FWER when the number $|\mathcal{L}|$ of tested hypotheses is unbounded. Therefore, to overcome this issue we proceed as follows. Recall that $N_{\mathcal{L}}(\mathcal{D})$ is the number of distinct projections of the language $\mathcal{L}$ on the dataset $\mathcal{D}$ (defined in Section A.5). The key idea for FSR-U-UB is to apply a Bonferroni correction on the $N_{\mathcal{L}}(\mathcal{D})$ distinct tested hypotheses only. We do so with the following result, which defines the value of $\varepsilon$ used by FSR-U-UB in Algorithm 1 to bound the FWER. Note that FSR-U-UB does not use any resamples, but uses $N_{\mathcal{L}}(\mathcal{D})$ to obtain an analytical value for $\varepsilon$ instead.

Theorem 12.

Let $\mathcal{D}$ be a dataset of $m$ samples taken i.i.d. from a distribution $\gamma$ . For any $\delta\in(0,1)$ , let $\nu_{T}\geq\mu(1-\mu)$ , $\nu\geq\nu_{T}\sup_{\mathcal{P}\in\mathcal{L}^{\star}}\left\{\mathop{\mathbb{E}}_{\mathcal{D}}\left[\mathsf{f}_{\mathcal{P}}(\mathcal{D})\right]\right\}$ , $\hat{N}\geq N_{\mathcal{L}}(\mathcal{D})$ , and define $\hat{r}$ , $\hat{d}$ , and $\varepsilon$ as

	$\displaystyle\hat{r}=\sqrt{\frac{\ln(\frac{4\hat{N}}{\delta})}{2m}}$
	$\displaystyle\hat{d}=\hat{r}+\sqrt{\left(\frac{2\nu_{T}\ln\bigl{(}\frac{4}{\delta}\bigr{)}}{m}\right)^{2}+\frac{2\hat{r}\ln\bigl{(}\frac{4}{\delta}\bigr{)}}{m}}+\frac{2\nu_{T}\ln\bigl{(}\frac{4}{\delta}\bigr{)}}{m}$
	$\displaystyle\varepsilon\doteq\hat{d}+\sqrt{\frac{2\ln\bigl{(}\frac{4}{\delta}\bigr{)}\left(\nu+2\hat{d}\right)}{m}}+\frac{\ln\bigl{(}\frac{4}{\delta}\bigr{)}}{3m}.$

With probability at least $1-\delta$ over $\mathcal{D}$ it holds

\displaystyle\sup_{\mathcal{P}\in\mathcal{L}^{\star}}\left\{\mathsf{q}_{\mathcal{P}}(\mathcal{D})\right\}\leq\varepsilon.

Proof.

We follow steps similar to those in the proof of Theorem 11; however, the key difference is in the upper bound $\hat{r}$ to $\mathop{\mathbb{E}}_{\mathcal{R}^{\star}}[\tilde{d}(\mathcal{R}^{\star})\>|\>\mathcal{A}]$ . We have, using Hoeffding’s inequality and a union bound, that for any $t\geq 0$

\displaystyle\Pr_{\mathcal{R}^{\star}}\left(\tilde{d}(\mathcal{R}^{\star})\geq t\right)\leq N_{\mathcal{L}}(\mathcal{D})\exp\left(-2mt^{2}\right)\leq\hat{N}\exp\left(-2mt^{2}\right).

This implies that, with probability $\geq 1-\delta/4$ over $\mathcal{R}^{\star}$ , it holds

\displaystyle\tilde{d}(\mathcal{R}^{\star})\leq\sqrt{\frac{\ln(\frac{4\hat{N}}{\delta})}{2m}}.

Taking the expectation w.r.t. $\mathcal{R}^{\star}$ , conditionally on $\mathcal{A}$ , proves that $\mathop{\mathbb{E}}_{\mathcal{R}^{\star}}[\tilde{d}(\mathcal{R}^{\star})\>|\>\mathcal{A}]\leq\hat{r}$ and gives the statement. ∎
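The analytical value of $\varepsilon$ in Theorem 12 can be computed directly from $m$ , $\hat{N}$ , $\nu_{T}$ , $\nu$ , and $\delta$ . The following minimal Python sketch transcribes the three formulas of the statement; the function name and the numeric inputs are illustrative assumptions. In the example, $\nu_{T}=0.25$ is always valid since $\mu(1-\mu)\leq 1/4$ , and $\nu=\nu_{T}$ is valid since frequencies are at most $1$ ; the value of $\hat{N}$ is hypothetical.

```python
import math

def fsr_u_ub_epsilon(m, n_hat, nu_t, nu, delta):
    """Return (r_hat, d_hat, eps) as defined in the statement of Theorem 12."""
    log4 = math.log(4 / delta)
    # r_hat: Hoeffding's inequality plus a union bound over n_hat projections.
    r_hat = math.sqrt(math.log(4 * n_hat / delta) / (2 * m))
    # Term 2 nu_T ln(4/delta) / m, which appears twice in d_hat.
    b = 2 * nu_t * log4 / m
    d_hat = r_hat + math.sqrt(b ** 2 + 2 * r_hat * log4 / m) + b
    eps = d_hat + math.sqrt(2 * log4 * (nu + 2 * d_hat) / m) + log4 / (3 * m)
    return r_hat, d_hat, eps

# Hypothetical inputs: 10000 samples and at most one million distinct projections.
r_hat, d_hat, eps = fsr_u_ub_epsilon(
    m=10_000, n_hat=10 ** 6, nu_t=0.25, nu=0.25, delta=0.05
)
```

Because $\hat{N}$ enters only through $\ln(4\hat{N}/\delta)$ , a loose upper bound on the number of distinct projections increases $\varepsilon$ only mildly.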

A.7. Details on CNN interpretation

In this section we provide all details regarding the application of FSR to interpret the activation of neurons in a CNN. We considered the MNIST handwritten digit dataset, and trained a CNN to predict the digit contained in each image. Following the experimental setup of (Fischer et al., 2021), we employ a simple CNN composed of: two convolutional layers, with respectively $10$ and $40$ filters and $3\times 3$ kernels; each convolutional layer is followed by $2\times 2$ maxpooling and dropout of rate $0.25$ ; a fully connected layer with $64$ nodes and ReLU activations; a dropout layer with rate $0.5$ ; the output layer of size $10$ with softmax activations. We trained the network over the $60000$ images of the training set, using SGD based on categorical cross entropy loss, with learning rate $0.01$ , momentum $0.9$ , $12$ epochs, and batch size of $128$ , obtaining an accuracy of $0.987$ on the $10000$ holdout instances. Then, analogously to (Fischer et al., 2021), we computed the activations of neurons in the first filter of the first convolutional layer on the $10000$ testing instances, obtaining $d=676$ continuous features for $m=10000$ transactions. As binary target, we use the value $1$ for all digits containing straight lines only (the digits $1$ and $7$ ), and $0$ for all other digits. Doing so, we obtain a fraction $\mu(\mathcal{D})=0.216$ of samples with label $1$ . We ran FSR-C on this dataset, using $c=10$ resamples and $z=1$ , to identify significant subgroups with FWER $\leq 0.05$ . For each neuron, we identify the most significant subgroup containing a condition over the corresponding continuous feature $x$ . We focus on conditions of the type $x\geq t$ and $x<t$ , where $t$ is any real-valued threshold; the condition $x\geq t$ denotes that the neuron $x$ is activated, with values at least $t$ , while $x<t$ means that $x$ is inactive, with a value below $t$ .
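To make the search space for $z=1$ concrete, the sketch below enumerates the conditions $x\geq t$ and $x<t$ for a single feature and returns the one with the largest empirical quality, computed as the average of $\mathds{1}\left[\mathcal{P}\in s\right](\ell-\mu)$ over the dataset. This is only an illustration of the conditions being ranked, not FSR: it applies no significance threshold and uses no resampling. The function name and the toy data are hypothetical, and centering the labels at the empirical fraction of label $1$ is an assumption made for the example.

```python
def best_threshold_condition(values, labels):
    """Return (quality, op, t) maximizing the empirical quality.

    Candidates are the conditions x >= t and x < t, with t ranging over the
    observed feature values; ties are broken in favor of the first candidate.
    """
    m = len(values)
    mu = sum(labels) / m  # empirical fraction of label 1 (assumed centering)
    best = None
    for t in sorted(set(values)):
        for op in (">=", "<"):
            covered = [(x >= t) if op == ">=" else (x < t) for x in values]
            q = sum(l - mu for c, l in zip(covered, labels) if c) / m
            if best is None or q > best[0]:
                best = (q, op, t)
    return best

# Hypothetical activations of one neuron and binary target labels.
activations = [0.0, 0.1, 0.2, 0.7, 0.8, 0.9]
target = [0, 0, 0, 1, 1, 0]
quality, op, threshold = best_threshold_condition(activations, target)
```

On this toy input the selected condition is $x\geq 0.7$ , which covers both samples with label $1$ and one sample with label $0$ . Restricting $t$ to observed values is enough here because any other threshold selects the same transactions as one of the candidates.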

In Figure 4 we show the results of this experiment. Figures 4-(a) and (b) show the average activation values of all neurons with label $1$ (corresponding to the digits $1$ and $7$ ) and label $0$ (all other digits). Then, in Figure 4-(c) we show the neurons with activation conditions significantly associated to the target label $1$ : neurons that are significantly activated, i.e., with conditions $x\geq t$ , are shown in red, while neurons that are significantly inactive, i.e., with conditions $x<t$ , are shown in blue. The color intensity (i.e., toward red or blue) depends on the threshold $t$ : for activated neurons it saturates at the maximum activation value, while for inactive neurons at the minimum. Interestingly, we observe that the neurons in this filter are significantly activated in the presence of a central vertical stroke (red pixels), which specifically characterizes the digits $1$ and $7$ (Figure 4-(a)). We remark, however, that this observation is not immediate from the average activations alone, and may be lost when the activation values are binarized, since FSR identifies subgroups that precisely separate the two classes of digits by using fine-grained threshold values. On the contrary, neurons that are significantly inactive (blue pixels) are the ones surrounding this vertical area, forming two rounded shapes, which are typical of the other digits (Figure 4-(b)). Interestingly, these blue regions seem to form the negative of the digits $1$ and $7$ .

From this experiment we conclude that FSR can be effectively applied to identify complex patterns, such as the ones for the interpretation of the activations within a neural layer.