Efficient Discovery of Significant Patterns
with Few-Shot Resampling

Leonardo Pellegrina, Dept. of Information Engineering, University of Padova, Via Gradenigo 6b, 35129 Padova, Italy, leonardo.pellegrina@unipd.it

Fabio Vandin, Dept. of Information Engineering, University of Padova, Via Gradenigo 6b, 35129 Padova, Italy, fabio.vandin@unipd.it
Abstract.

Significant pattern mining is a fundamental task in mining transactional data: it requires identifying patterns significantly associated with the value of a given feature, the target. In several applications, such as biomedicine, market basket analysis, and social networks, the goal is to discover patterns whose association with the target is defined with respect to an underlying population, or process, of which the dataset represents only a collection of observations, or samples. A natural way to capture the association of a pattern with the target is to consider its statistical significance, assessing its deviation from the (null) hypothesis of independence between the pattern and the target. While several algorithms have been proposed to find statistically significant patterns, it remains a computationally demanding task, and for complex patterns such as subgroups, no efficient solution exists.

We present FSR, an efficient algorithm to identify statistically significant patterns with rigorous guarantees on the probability of false discoveries. FSR builds on a novel general framework for mining significant patterns that captures some of the most commonly considered patterns, including itemsets, sequential patterns, and subgroups. FSR uses a small number of resampled datasets, obtained by assigning i.i.d. labels to each transaction, to rigorously bound the supremum deviation of a quality statistic measuring the significance of patterns. FSR builds on novel tight bounds on the supremum deviation that require mining only a small number of resampled datasets, while remaining highly effective in discovering significant patterns. As a test case, we consider significant subgroup mining, and our evaluation on several real datasets shows that FSR is effective in discovering significant subgroups, while requiring a small number of resampled datasets.

PVLDB Reference Format:
Efficient Discovery of Significant Patterns with Few-Shot Resampling. PVLDB, 17(10): XXX-XXX, 2024.
doi:XX.XX/XXX.XX This work is licensed under the Creative Commons BY-NC-ND 4.0 International License. Visit https://creativecommons.org/licenses/by-nc-nd/4.0/ to view a copy of this license. For any use beyond those covered by this license, obtain permission by emailing info@vldb.org. Copyright is held by the owner/author(s). Publication rights licensed to the VLDB Endowment.
Proceedings of the VLDB Endowment, Vol. 17, No. 10 ISSN 2150-8097.

PVLDB Artifact Availability:
The source code, data, and/or other artifacts have been made available at https://github.com/VandinLab/FSR

1. Introduction

Pattern mining is a fundamental task in data mining that, in its most common definition (Han et al., 2007), requires finding patterns that occur more often than a given frequency threshold in a database of transactions. Pattern mining finds applications in several areas such as market basket analysis (Agrawal et al., 1993), graph databases (Aggarwal et al., 2010; Al Hasan and Zaki, 2009; Chen et al., 2009), and the analysis of spatial and temporal data (Cao et al., 2019; Ceccarello and Gamper, 2022; Ho et al., 2022).

Significant pattern mining (Pellegrina et al., 2019a; Hämäläinen and Webb, 2019) is an extension of pattern mining that, in its most general formulation, requires discovering patterns with a significant association with a binary label. The dataset consists of a collection of elements, where each element comprises the values of features, which may be categorical, binary, or continuous, and the value of the binary label of interest, also called the target. This formulation captures various types of patterns, such as itemsets, when all features are binary, or subgroups (Atzmueller, 2015), with more general features. This task finds applications in a wide range of domains, such as market basket analysis, medicine, and molecular biology, where finding reliable associations is paramount.

Significance is usually assessed using the statistical hypothesis testing framework. In this framework one defines a measure of quality for patterns and assumes the null hypothesis of no association between a pattern and the target label. The significant patterns are then the ones whose quality deviates significantly from the null distribution, that is, the distribution of the quality under the null hypothesis. The deviation from the null distribution is usually measured by a $p$-value, that is, the probability, under the null distribution, that the pattern has quality at least as large as the one observed in the dataset.

A major complication in the use of the statistical hypothesis testing framework in data mining is given by the huge number of candidate patterns that are considered, resulting in a multiple hypothesis testing problem. With a huge number of candidate patterns, some non-significant patterns display a substantial deviation from the null distribution just by chance. Therefore, it is critical to account for testing multiple hypotheses when mining significant patterns, in order to avoid reporting a large number of spurious discoveries. Several methods have been proposed to deal with multiple hypothesis testing (Benjamini and Hochberg, 1995; Bonferroni, 1936; Westfall and Young, 1993). While these methods provide various guarantees, the one most commonly considered is the Family-Wise Error Rate (FWER), which is the probability of reporting in output one or more false discoveries.

Current approaches for significant pattern mining with guarantees on the FWER belong to one of two classes. The first class is given by approaches that assess the significance of each single pattern (e.g., through a $p$-value or related quantities), and then perform an analytical correction to account for multiple hypothesis testing (Webb, 2006, 2007, 2008). A widely used procedure in this class is the Bonferroni correction (Bonferroni, 1936), which computes a corrected $p$-value by multiplying the $p$-value $p_{\mathcal{P}}$ of a pattern $\mathcal{P}$ by the number $h$ of candidate hypotheses. If patterns with corrected $p$-value $h \times p_{\mathcal{P}}$ below a threshold $\alpha$ are flagged as significant, the FWER of the output is guaranteed to be $\leq \alpha$. While these approaches and their improved versions (Terada et al., 2013; Minato et al., 2014) are fairly efficient, thanks to the use of analytical derivations, they suffer from low statistical power, that is, they often fail to identify significant patterns, because the multiple hypothesis corrections must provide guarantees for every situation, independently of the observed data.
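The Bonferroni correction described above can be sketched in a few lines. This is a minimal illustration, not code from any of the cited works: the function name `bonferroni_significant` and the toy $p$-values are hypothetical, and we flag a pattern when its corrected $p$-value is at most $\alpha$.

```python
def bonferroni_significant(p_values, alpha):
    """Return the patterns whose Bonferroni-corrected p-value is at most alpha.

    p_values maps each candidate pattern to its (uncorrected) p-value;
    both the name and the input format are assumptions of this sketch.
    """
    h = len(p_values)  # number of candidate hypotheses
    significant = {}
    for pattern, p in p_values.items():
        corrected = min(1.0, h * p)  # corrected p-value h * p, capped at 1
        if corrected <= alpha:
            significant[pattern] = corrected
    return significant


# Hypothetical p-values for four candidate patterns.
p_values = {"A": 0.001, "B": 0.02, "C": 0.2, "D": 0.04}
result = bonferroni_significant(p_values, alpha=0.05)
# Only "A" survives: B and D have p-values below 0.05 before the correction,
# but 4 * 0.02 and 4 * 0.04 both exceed 0.05.
```

The example also shows why power degrades as $h$ grows: with thousands of candidate patterns, only extremely small $p$-values survive the multiplication by $h$.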

The second class is given by approaches that use the dataset to estimate the overall distribution of the patterns’ statistics (or corresponding measures, such as the $p$-values) under the null hypothesis (Llinares-López et al., 2015; Pellegrina and Vandin, 2020; Terada et al., 2015). The distribution is estimated using permuted versions of the data, obtained by keeping the features in each element of the dataset fixed and randomly permuting the target labels among elements. For example, the Westfall-Young (WY) method (Westfall and Young, 1993) uses permuted datasets to estimate the quantiles of the smallest $p$-value under the null hypotheses (or, equivalently, largest qualities or test statistics), and uses such estimate to derive a corrected threshold to flag patterns as significant. These approaches usually improve the statistical power for detecting significant patterns compared to approaches in the first class, but are often computationally demanding, since they need to mine a large number of permuted datasets to obtain good estimates of the overall distribution of the patterns’ statistics. While this can be achieved fairly efficiently for simple patterns such as itemsets (Llinares-López et al., 2015; Pellegrina and Vandin, 2020), the overall approach is impractical for more complex patterns such as subgroups, for which mining even a single dataset is extremely time-consuming.
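To make the permutation idea concrete, the following is a simplified sketch in the spirit of the WY method. It makes several assumptions that are not taken from the cited works: the candidate patterns are enumerated up front (real implementations mine each permuted dataset), the test statistic is a leverage-style quantity rather than a $p$-value, and the names `leverage` and `wy_threshold` are hypothetical.

```python
import math
import random


def leverage(support, labels):
    """Mean over transactions of support_i * (label_i - mean label)."""
    m = len(labels)
    mu = sum(labels) / m
    return sum(s * (l - mu) for s, l in zip(support, labels)) / m


def wy_threshold(supports, labels, delta, n_perm, seed=0):
    """Estimate the (1 - delta)-quantile of the maximum statistic over patterns.

    Features stay fixed; only the target labels are permuted.
    """
    rng = random.Random(seed)
    permuted = list(labels)
    maxima = []
    for _ in range(n_perm):
        rng.shuffle(permuted)  # one permuted dataset
        maxima.append(max(leverage(sup, permuted) for sup in supports.values()))
    maxima.sort()
    k = min(n_perm, max(1, math.ceil((1 - delta) * n_perm)))
    return maxima[k - 1]


# Toy data: pattern "a" coincides with the target, pattern "b" is unrelated.
labels = [1] * 20 + [0] * 20
supports = {"a": list(labels), "b": [1, 0] * 20}
threshold = wy_threshold(supports, labels, delta=0.05, n_perm=200)
flagged = [p for p, sup in supports.items() if leverage(sup, labels) > threshold]
```

Even in this toy setting the cost structure is visible: every one of the `n_perm` permutations requires evaluating all patterns, which is what becomes prohibitive when a single mining pass is already expensive.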

An additional limitation of permutational approaches is that they focus on conditional testing (Fisher, 1922). In conditional testing one assumes that the variables of interest, in our case the frequency of patterns and the fraction of elements with target label $1$, are the same in every dataset from the null distribution. In contrast, in unconditional testing (Barnard, 1945) one assumes that the variables of interest are the realization of corresponding random variables. Conditional testing and unconditional testing capture different assumptions regarding how data is generated and collected, that is, whether the variables of interest would be the same in different repetitions of the experiment. The choice between the two types of testing depends on the specific scenario. However, in practice conditional tests are often used for computational reasons, since unconditional tests are much more demanding from the computational standpoint due to the need to account for uncertainties in the observed quantities. In fact, while for simple patterns such as itemsets (Pellegrina et al., 2019b) significant pattern mining procedures with (partial) unconditional testing have been designed, for more complex patterns such as subgroups no unconditional testing procedure is available.
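The difference between the two settings can be seen on the target labels alone. The sketch below contrasts permuting the labels, which keeps the fraction of labels equal to $1$ fixed, with drawing labels independently. The i.i.d. draw is assumed here to be Bernoulli with success probability equal to the observed fraction of $1$s; this choice is an illustrative assumption, as are the function names.

```python
import random


def permute_labels(labels, rng):
    """Conditional-style null: shuffle labels, so the number of 1s never changes."""
    permuted = list(labels)
    rng.shuffle(permuted)
    return permuted


def resample_labels_iid(labels, rng):
    """Unconditional-style null (assumed Bernoulli): the number of 1s is random."""
    p = sum(labels) / len(labels)
    return [1 if rng.random() < p else 0 for _ in labels]


rng = random.Random(7)
labels = [1] * 30 + [0] * 70
# Number of 1-labels observed across 100 datasets of each kind.
perm_counts = {sum(permute_labels(labels, rng)) for _ in range(100)}
iid_counts = {sum(resample_labels_iid(labels, rng)) for _ in range(100)}
```

Under permutation every dataset has exactly 30 labels equal to $1$, whereas the i.i.d. draws produce a spread of counts around 30; the latter reflects the extra uncertainty that unconditional tests have to account for.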

1.1. Contributions

This work focuses on the efficient discovery of significant patterns. Our contributions are four-fold. First, we propose the first general framework to discover significant patterns that can be used for both conditional and unconditional testing. Our framework is based on a natural definition of a pattern’s quality that captures its significance, and applies to any type of pattern for which the appearance of the pattern in an element of the dataset is well defined. Such patterns include widely used patterns such as itemsets, subgroups, sequential patterns, and subgraphs. Second, we propose FSR, an algorithm for the efficient discovery of a rigorous approximation of significant patterns while controlling the Family-Wise Error Rate, which is the probability of reporting even a single false discovery. FSR uses a few-shot resampling approach, that is, it mines a small number of resampled datasets, obtained by keeping the features of each element of the dataset fixed and assigning i.i.d. values to the target. Moreover, FSR can leverage any existing algorithm for mining the patterns of interest. Third, we provide novel tight theoretical results relating the distribution of patterns’ qualities under the conditional and the unconditional distributions, and relating the estimated maximum deviation of patterns’ qualities in resampled datasets with their (unknown) true quality in the corresponding distribution. These results are crucial in making our approach practical, since they imply that mining a small number of resampled datasets is enough to identify significant patterns. Fourth, we consider significant subgroup mining as a test case in our extensive empirical evaluation. We use our algorithm FSR to derive the first approach for mining significant subgroups with unconditional testing, and an approach for the conditional testing scenario which is much more efficient than permutation testing while maintaining an extremely high power. For the most challenging datasets, FSR is the only approach that makes it possible to mine significant subgroups within reasonable time. More importantly, we remark that the considered test case of subgroups is well representative of other settings. In fact, we expect the substantial improvements obtained by FSR to transfer to other cases (i.e., other pattern types), given the generality of our framework and the characteristics of our permutational approaches, which are shared by all significant pattern mining tasks.

Due to space constraints, we defer some of the proofs and additional results to the Appendix.

2. Related Works

Our work focuses on efficiently mining significant patterns. We now discuss the previous works most related to our contributions. We refer the reader to recent comprehensive reviews and tutorials for an overview of commonly used techniques for mining significant patterns (Hämäläinen and Webb, 2019; Pellegrina et al., 2019a).

Several approaches (Webb, 2006, 2007, 2008) have used general methods for multiple hypothesis testing, such as the Bonferroni (Bonferroni, 1936) and Holm (Holm, 1979) methods, within significant pattern mining. Such methods result in low statistical power, since they correct the $p$-value, or a measure of the significance of a pattern, by the number of candidate hypotheses (i.e., the number of patterns), which is extremely large in data mining applications. LAMP (Terada et al., 2013; Minato et al., 2014) is a recently introduced method that partially addresses this issue by selecting a subset of patterns for testing, while discarding the patterns with no chance of being significant. This approach identifies an improved (i.e., smaller) correction factor, resulting in higher statistical power while controlling the FWER. However, since each selected pattern is still tested as a distinct hypothesis, LAMP still leads to reduced statistical power (Llinares-López et al., 2015; Terada et al., 2015). Our algorithm FSR instead uses resampled datasets, taking into account dependencies among patterns, to achieve high statistical power.

Several permutation-based methods have been proposed to identify patterns significantly associated with a binary target label. The methods of (Llinares-López et al., 2015; Pellegrina and Vandin, 2020; Terada et al., 2015) use the Westfall-Young (WY) permutation test (Westfall and Young, 1993). The WY permutation procedure requires estimating the $\delta$-quantile of the distribution of the minimum (over all patterns) $p$-value (or, equivalently, of the distribution of the maximum deviation, over all patterns, of the measure of significance). The use of such quantiles makes WY permutation testing more powerful than LAMP, but also more computationally expensive, since the estimation of such quantiles requires mining a large number of permuted datasets. For these reasons, all such methods may need impractical resources. Our algorithm FSR instead requires mining only a small number of resampled datasets, leading to an efficient approach even for complex patterns such as subgroups. In addition, permutation-based methods only consider the conditional distribution, where patterns’ frequencies and the fraction of elements with target label equal to $1$ are assumed fixed in all permuted datasets. Our approach instead works for both the conditional and the unconditional distribution.

The identification of significant patterns with unconditional testing has received scant attention. This is due, in part, to the higher computational cost required for assessing the significance of even a single hypothesis in the unconditional setting, for example using Barnard’s test (Barnard, 1945). As far as we know, the only approach for mining significant patterns in a partially unconditional setting is SPuManTE (Pellegrina et al., 2019b). In the partially unconditional setting considered by SPuManTE only the frequencies of the patterns are not fixed, while the target is fixed by design. In contrast, our algorithm FSR considers a fully unconditional setting, where the fraction of elements with target label equal to $1$ is not fixed either. Moreover, SPuManTE leverages specialized techniques for mining significant itemsets that cannot be easily generalized to other pattern types, while our approach applies directly to several pattern languages, including itemsets, sequential patterns, and subgroups.

Other works, orthogonal to ours, consider improving the diversity or limiting redundancy of the output (Van Leeuwen and Knobbe, 2012; Kalofolias et al., 2017; Dalleiger and Vreeken, 2022).

3. Preliminaries

We consider a dataset $\mathcal{D}$ as a collection of $m$ transactions $\mathcal{D} = \left\{(s_1, \ell_1), \dots, (s_m, \ell_m)\right\}$, where each transaction $(s, \ell) \in \mathcal{D}$ is composed of a set $s$ of $d$ features, either binary, categorical, or continuous, and a binary target variable $\ell \in \{0, 1\}$. More generally, we assume that $s$ belongs to a domain $\mathcal{X}$. By defining the multisets $\mathcal{A} = \left\{s_1, \dots, s_m\right\}$ and $\mathcal{T} = \left\{\ell_1, \dots, \ell_m\right\}$, a dataset $\mathcal{D}$ is also represented by the pair $\mathcal{D} = (\mathcal{A}, \mathcal{T})$. We assume to have a language $\mathcal{L}$ containing the patterns of potential interest. This scenario captures widely used pattern mining tasks, such as: itemset mining, where all features correspond to (binary) items and the language $\mathcal{L}$ corresponds to the set of all itemsets, i.e., (non-empty) subsets of items; subgroup mining, where the language $\mathcal{L}$ contains all subgroups, i.e., the sets of conjunctions with at most $z$ conditions over features from $\mathcal{A}$, where each condition is either an equality, on a categorical feature, or an interval, on a continuous feature.

Given a transaction $(s, \ell)$, we use the notation $\mathcal{P} \in s$ to say that the set $s$ of features supports pattern $\mathcal{P}$, where the meaning of $\mathcal{P} \in s$ depends on the specific data mining task. For example, for itemset mining, $\mathcal{P} \in s$ means that the pattern $\mathcal{P}$ is contained in the set $s$, while for subgroup mining it means that the conditions defined by $\mathcal{P}$ are all satisfied by the features of $s$. We define the set $\mathsf{C}_{\mathcal{P}}(\mathcal{D})$ of transactions in the dataset $\mathcal{D}$ that support a pattern $\mathcal{P}$ as $\mathsf{C}_{\mathcal{P}}(\mathcal{D}) = \left\{(s, \ell) \in \mathcal{D} : \mathcal{P} \in s\right\}$. The frequency $\mathsf{f}_{\mathcal{P}}(\mathcal{D})$ of a pattern $\mathcal{P}$ in the dataset $\mathcal{D}$ is the fraction of transactions of $\mathcal{D}$ that support $\mathcal{P}$: $\mathsf{f}_{\mathcal{P}}(\mathcal{D}) = \frac{1}{m} \sum_{i=1}^{m} \mathds{1}\left[\mathcal{P} \in s_i\right] = \frac{|\mathsf{C}_{\mathcal{P}}(\mathcal{D})|}{m}$.

Finally, let $\mu(\mathcal{D})$ denote the average value of the target $\ell$ for transactions in the dataset $\mathcal{D}$: $\mu(\mathcal{D}) = \frac{1}{|\mathcal{D}|} \sum_{(s, \ell) \in \mathcal{D}} \ell$.
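As a small illustration of these definitions for the itemset case, the sketch below computes the supporting transactions, the frequency, and the target average on a toy dataset. The data and the function names (`supports`, `cover`, `frequency`, `target_mean`) are hypothetical and chosen only for this example.

```python
def supports(pattern, s):
    """Itemset semantics: s supports the pattern when it contains all its items."""
    return pattern <= s


def cover(pattern, dataset):
    """Transactions (s, label) of the dataset that support the pattern."""
    return [(s, label) for s, label in dataset if supports(pattern, s)]


def frequency(pattern, dataset):
    """Fraction of transactions that support the pattern."""
    return len(cover(pattern, dataset)) / len(dataset)


def target_mean(dataset):
    """Average value of the binary target over the dataset."""
    return sum(label for _, label in dataset) / len(dataset)


# Toy dataset of m = 4 transactions (set of items, binary target).
dataset = [
    (frozenset({"a", "b"}), 1),
    (frozenset({"a"}), 1),
    (frozenset({"b", "c"}), 0),
    (frozenset({"a", "b", "c"}), 0),
]
pattern = frozenset({"a", "b"})
freq = frequency(pattern, dataset)   # 2 of 4 transactions contain {a, b}
mu = target_mean(dataset)            # 2 of 4 targets equal 1
```

For subgroup mining only `supports` would change, replacing set containment with a check that all conditions of the conjunction hold.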

3.1. Significant Patterns

Our goal is to find significant patterns, where a pattern $\mathcal{P}$ is significant if the presence of $\mathcal{P}$ in a transaction is associated with the target variable of the transaction being $1$. In particular, we consider the significance of a pattern in the framework of statistical significance, assuming that the transactions $\left\{(s_1, \ell_1), \dots, (s_m, \ell_m)\right\}$ constituting the dataset $\mathcal{D}$ are samples from an unknown distribution $\gamma$. A pattern $\mathcal{P}$ is associated with the target variable if the probability of the event “$\mathcal{P} \in s$ and $\ell = 1$” is higher than the corresponding probability when the event “$\mathcal{P} \in s$” and the event “$\ell = 1$” are independent. Formally, this corresponds to considering the following null hypothesis for a pattern $\mathcal{P}$:

\[ \Pr_{(s,\ell)\sim\gamma}\left(\mathcal{P} \in s \wedge \ell = 1\right) = \Pr_{(s,\ell)\sim\gamma}\left(\mathcal{P} \in s\right) \Pr_{(s,\ell)\sim\gamma}\left(\ell = 1\right). \]

Since we are interested in patterns with a significant association with the target variable being $1$, we are interested in finding patterns for which the following alternative hypothesis holds:

\[ \Pr_{(s,\ell)\sim\gamma}\left(\mathcal{P} \in s \wedge \ell = 1\right) > \Pr_{(s,\ell)\sim\gamma}\left(\mathcal{P} \in s\right) \Pr_{(s,\ell)\sim\gamma}\left(\ell = 1\right). \]

To this end, we define the quality of a pattern $\mathcal{P}$ (the quality $\mathsf{q}_{\mathcal{P}}$ is often called leverage for general patterns (Hämäläinen and Webb, 2019), but we use the term quality given its relation to the $1$-quality commonly employed to find interesting subgroups) as

\[ \mathsf{q}_{\mathcal{P}} = \Pr_{(s,\ell)\sim\gamma}\left(\mathcal{P} \in s \wedge \ell = 1\right) - \Pr_{(s,\ell)\sim\gamma}\left(\mathcal{P} \in s\right) \Pr_{(s,\ell)\sim\gamma}\left(\ell = 1\right). \]

Note that the alternative hypothesis is equivalent to $\mathsf{q}_{\mathcal{P}} > 0$, and the null hypothesis is equivalent to $\mathsf{q}_{\mathcal{P}} = 0$. Therefore, finding significant patterns is equivalent to finding patterns with quality $\mathsf{q}_{\mathcal{P}} > 0$.

For example, consider the study of the association between the characteristics of the users of an online social network and users’ interests in a given topic. In this case, each user is a transaction, users’ characteristics are the features, and being interested or not in the topic defines the target variable. Significant patterns in this example are associations between users’ characteristics and users’ interests that are significantly stronger than expected under the null hypothesis of independence between characteristics and interests. For example, if the probability that a user has a given binary feature $f$ is $0.3$, and the probability that a user is interested in the topic is $0.5$, then under the null hypothesis of independence we have that the probability that a user has feature $f$ and is interested in the topic is $0.15$. Therefore, the feature $f$ is significantly associated with the topic of interest if the actual probability that a user has feature $f$ and is interested in the topic is $> 0.15$.
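In terms of the quality defined above, and assuming purely for illustration that the actual probability of a user having feature $f$ and being interested in the topic is $0.2$ (a hypothetical value, not taken from any dataset), the example reads:

```latex
\[
\mathsf{q}_{f} = \Pr\left(f \in s \wedge \ell = 1\right) - \Pr\left(f \in s\right)\Pr\left(\ell = 1\right)
= 0.2 - 0.3 \times 0.5 = 0.05 > 0 .
\]
```

With a joint probability of exactly $0.15$ the quality would be $0$, which is the null hypothesis for $f$.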

Task Definition. Given a dataset $\mathcal{D}$ and the corresponding pattern language $\mathcal{L}$, our goal is to identify significant patterns, that is, patterns $\mathcal{P} \in \mathcal{L}$ with quality $\mathsf{q}_{\mathcal{P}} > 0$. Since the distribution $\gamma$ is unknown and we have access only to the dataset $\mathcal{D}$ comprising transactions sampled from $\gamma$, we cannot hope to discover all significant patterns without errors. As a consequence, we must resort to approximations. In particular, define the subset $\mathcal{L}^{\star}$ of the language $\mathcal{L}$ of patterns for which the null hypothesis holds:

\[ \mathcal{L}^{\star} = \left\{\mathcal{P} \in \mathcal{L} : \mathsf{q}_{\mathcal{P}} = 0\right\}. \]

Our task is then to produce a subset $O \subseteq \mathcal{L}$ of all patterns in the language with Family-Wise Error Rate (FWER) below a user-defined threshold $\delta$, such that the probability that $O$ contains at least one element from $\mathcal{L}^{\star}$ is at most $\delta$:

\[ \Pr_{\mathcal{D}}\left(O \cap \mathcal{L}^{\star} \neq \emptyset\right) \leq \delta. \tag{1} \]

Note that (1) implies that $O$ is a false discovery free approximation, a type of approximation that has previously been considered for other pattern mining tasks (Riondato and Vandin, 2020; Santoro et al., 2020).

Since $\mathsf{q}_{\mathcal{P}}$ depends on the unknown distribution $\gamma$, we define a statistic $\bar{\mathsf{q}}_{\mathcal{P}}(\mathcal{D})$ that corresponds to an estimate of $\mathsf{q}_{\mathcal{P}}$ from data and that we will use in our algorithm FSR to identify significant patterns. For each pattern $\mathcal{P}$, we define the functions $f_{\mathcal{P}}$ and $g_{\mathcal{P}}$ as follows: each $f_{\mathcal{P}} : \mathcal{X} \rightarrow \{0, 1\}$ is defined as $f_{\mathcal{P}}(s) = \mathds{1}\left[\mathcal{P} \in s\right]$, such that $f_{\mathcal{P}}(s) = 1$ if $\mathcal{P} \in s$, and $f_{\mathcal{P}}(s) = 0$ otherwise; each $g_{\mathcal{P}} : \mathcal{X} \times \{0, 1\} \rightarrow [-\mu(\mathcal{D}), 1 - \mu(\mathcal{D})]$ is defined as $g_{\mathcal{P}}(s, \ell) = f_{\mathcal{P}}(s)(\ell - \mu(\mathcal{D}))$. Then, the estimate $\bar{\mathsf{q}}_{\mathcal{P}}(\mathcal{D})$ of $\mathsf{q}_{\mathcal{P}}$ for pattern $\mathcal{P}$ from the dataset $\mathcal{D}$ is

\[ \bar{\mathsf{q}}_{\mathcal{P}}(\mathcal{D}) = \frac{1}{m} \sum_{i=1}^{m} g_{\mathcal{P}}(s_i, \ell_i). \]

Interestingly, $\bar{\mathsf{q}}_{\mathcal{P}}(\mathcal{D})$ corresponds to the $1$-quality commonly used in subgroup mining (Atzmueller, 2015) (see Section A.1), even if the way it is used in our algorithm FSR (see Section 4) is different from its usual application, due to our focus on statistically significant patterns.
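As an illustration, the estimate can be computed directly from its definition. The sketch below assumes itemset patterns, with each transaction represented as a Python frozenset and $\mathcal{P}\in s$ read as set inclusion; the function names and the toy dataset are hypothetical and not part of FSR.

```python
from typing import FrozenSet, List, Tuple

Transaction = FrozenSet[str]
Dataset = List[Tuple[Transaction, int]]  # pairs (s_i, l_i) with l_i in {0, 1}


def f_pattern(pattern: FrozenSet[str], s: Transaction) -> int:
    """Indicator f_P(s): 1 if the pattern occurs in transaction s, else 0."""
    return 1 if pattern <= s else 0


def empirical_quality(pattern: FrozenSet[str], dataset: Dataset) -> float:
    """Estimate of q_P from D: (1/m) * sum_i f_P(s_i) * (l_i - mu(D))."""
    m = len(dataset)
    mu = sum(label for _, label in dataset) / m  # mu(D): fraction of labels equal to 1
    return sum(f_pattern(pattern, s) * (label - mu) for s, label in dataset) / m


# Toy dataset, made up for illustration; here mu(D) = 0.5.
toy = [
    (frozenset({"a", "b"}), 1),
    (frozenset({"a"}), 1),
    (frozenset({"b"}), 0),
    (frozenset({"c"}), 0),
]
print(empirical_quality(frozenset({"a"}), toy))  # 0.25
print(empirical_quality(frozenset({"c"}), toy))  # -0.125
```

A pattern that occurs mostly in transactions with label $1$ gets a positive value, while one that occurs mostly in transactions with label $0$ gets a negative value.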

Intuitively, we expect significant patterns to have a sufficiently high empirical quality $\bar{\mathsf{q}}_{\mathcal{P}}(\mathcal{D})$ measured on the data $\mathcal{D}$, since $\bar{\mathsf{q}}_{\mathcal{P}}(\mathcal{D})$ estimates the (true) quality $\mathsf{q}_{\mathcal{P}}$ of $\mathcal{P}$. To take into account the multiple hypothesis testing issue described above, we need to identify a threshold $\varepsilon$ such that reporting in output all patterns with quality $\geq\varepsilon$ has bounded FWER. Note that $\varepsilon$ should be as small as possible in order to have high statistical power (i.e., to report the largest set of results with guarantees on false discoveries). A critical quantity we study to address this issue is the supremum deviation of the empirical qualities of non-significant patterns, defined as

\[ \sup_{\mathcal{P}\in\mathcal{L}^{\star}}\left\{\bar{\mathsf{q}}_{\mathcal{P}}(\mathcal{D})\right\}. \tag{2} \]

We address this challenge with FSR: we derive novel analytical bounds on the concentration of the supremum deviation (2), and build on these tools an efficient few-shot resampling algorithm to sharply estimate it. FSR achieves high statistical power while scaling to large datasets and complex languages. Our approach applies to both conditional and unconditional testing, both of great interest in data mining.

3.2. Conditional and Unconditional Testing

When assessing the statistical significance of a pattern $\mathcal{P}$, one has to choose between conditional and unconditional tests. A conditional test assumes that the data-generating process represented by the unknown distribution $\gamma$ only produces datasets with $m$ transactions in which both the frequency $\mathsf{f}_{\mathcal{P}}(\mathcal{D})$ of pattern $\mathcal{P}$ and the fraction $\mu(\mathcal{D})$ of transactions with target value $1$ are the same as in the observed dataset; that is, it conditions on the observed variables of interest. In contrast, unconditional tests assume that $\mathsf{f}_{\mathcal{P}}(\mathcal{D})$ and $\mu(\mathcal{D})$ are realizations of corresponding random variables. Unconditional tests therefore assess the association between a pattern and class labels considering also scenarios (i.e., datasets) where the frequencies of the patterns and/or the average target value may differ from what is observed in the data. In other words, conditional tests and unconditional tests are based on different assumptions regarding how data is generated and collected, namely, whether the variables of interest would be the same in a different repetition of the experiment (conditional tests) or not (unconditional tests).

Consider for example the scenario of online social networks described in Section 3.1, and assume for simplicity that we are interested in associations between the single features and the target. If the data is collected so that the total number of transactions for each value of the target is fixed and the fraction of transactions with given values of the features is fixed as well, then conditional testing is more appropriate. If instead one collects the data without constraints on the features/target values (e.g., simply collecting as many transactions as possible), then unconditional testing is more appropriate.

Conditional tests and unconditional tests are therefore both valid and of interest for data mining applications, and the choice between the two classes depends on the specific scenario, even if in practice conditional tests are usually preferred for computational reasons, since unconditional tests need to take into account more uncertainties in the observed quantities. In what follows, we introduce a general algorithm to identify significant patterns for both conditional testing and unconditional testing. Our algorithm is extremely efficient in both cases due to the use of few-shot resampling.

4. FSR Algorithm

We now describe our algorithm FSR (Algorithm 1) to find significant patterns for both conditional testing and unconditional testing. We first present the general approach, which is common to both testing scenarios, and then present the details for conditional testing in Section 4.1 and for unconditional testing in Section 4.2.

In a nutshell, FSR identifies significant patterns from a dataset $\mathcal{D}=\left\{(s_{1},\ell_{1}),\dots,(s_{m},\ell_{m})\right\}$ using a few-shot resampling approach to compute rigorous probabilistic bounds to the deviation of the estimated qualities $\bar{\mathsf{q}}_{\mathcal{P}}(\mathcal{D})$ of the patterns, under the null hypothesis of no association between the patterns and the target label. It then reports in output all patterns with estimated quality $\bar{\mathsf{q}}_{\mathcal{P}}(\mathcal{D})$ above such deviation.

FSR considers a collection $\mathcal{R}^{\star}=\{\mathcal{D}^{\star}_{1},\dots,\mathcal{D}^{\star}_{c}\}$ of $c\geq 1$ i.i.d. resampled datasets, each obtained by resampling the target labels of $\mathcal{D}$ while maintaining the same features of the transactions of $\mathcal{D}$. Each resampled dataset is $\mathcal{D}^{\star}_{j}=\left\{(s_{1},\xi_{1,j}),\dots,(s_{m},\xi_{m,j})\right\}$, where $\xi_{i,j}$ are i.i.d. random variables with

\[ \xi_{i,j}\sim Bern(p),\quad\forall i\in[1,m],\ \forall j\in[1,c], \]

and $Bern(p)$ is the Bernoulli random variable with parameter $p$ (i.e., it is $1$ with probability $p$, and $0$ otherwise).
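The resampling step can be sketched in a few lines. This is a minimal illustration, assuming a dataset stored as a list of (transaction, label) pairs; the name `resample_target` mirrors the procedure named in the text, but the signature and the `seed` parameter are our own assumptions.

```python
import random
from typing import Hashable, List, Optional, Tuple

Dataset = List[Tuple[Hashable, int]]


def resample_target(dataset: Dataset, c: int, p: float,
                    seed: Optional[int] = None) -> List[Dataset]:
    """Return c datasets with the same transactions and i.i.d. Bernoulli(p) labels."""
    rng = random.Random(seed)
    resampled = []
    for _ in range(c):
        # Keep each s_i and replace its label with a fresh draw xi_{i,j}.
        resampled.append([(s, 1 if rng.random() < p else 0) for s, _ in dataset])
    return resampled


# Hypothetical toy data: three transactions identified by name only.
data = [("s1", 1), ("s2", 0), ("s3", 1)]
for d_star in resample_target(data, c=2, p=0.5, seed=0):
    print(d_star)
```

Only the labels change between resampled datasets; the transactions, and therefore the pattern frequencies, are those of the original dataset.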

We now describe the general approach followed by FSR (Algorithm 1). FSR starts by computing an upper bound $\varepsilon_{T}$ to the deviation between the average target value $\mu(\mathcal{D})$ observed in $\mathcal{D}$ and its expected value $\mu=\mathop{\mathbb{E}}_{\mathcal{D}}\left[\mu(\mathcal{D})\right]$ under the null hypothesis (line 1). Note that $\mu$ is the probability that a sample from the unknown distribution $\gamma$ has target label $\ell$ equal to $1$, that is $\mu=\Pr_{(s,\ell)}\left(\ell=1\right)$. The computation of $\varepsilon_{T}$ is performed by the procedure boundTarget and it depends on whether one is interested in conditional testing or in unconditional testing. The details of boundTarget for the two settings are described in Section 4.1 and in Section 4.2, respectively. Then FSR uses $\varepsilon_{T}$ to obtain the upper bound $\hat{\mu}$ (line 2) and lower bound $\check{\mu}$ (line 3) to $\mu$ (we trivially assume $0\leq\check{\mu}\leq\hat{\mu}\leq 1$).
It then uses the procedure resampleTarget to generate $c$ resampled datasets $\mathcal{R}^{\star}$ (line 4), assigning to each transaction in the datasets of $\mathcal{R}^{\star}$ the target label $1$ with probability $p=\hat{\mu}$ (i.e., the upper bound to $\mu$). Note that the resampleTarget procedure is the same for both conditional and unconditional testing. The algorithm then computes, from each of the $c$ resampled datasets of $\mathcal{R}^{\star}$, an estimate of the maximum deviation of the empirical quality $\bar{\mathsf{q}}_{\mathcal{P}}(\mathcal{D})$ for non-significant patterns. This is achieved by computing, for every resampled dataset $\mathcal{D}^{\star}_{j}$, the quantity $\sup_{\mathcal{P}\in\mathcal{L}}\bar{\mathsf{q}}_{\mathcal{P}}(\mathcal{D}^{\star}_{j},\check{\mu})$, where $\bar{\mathsf{q}}_{\mathcal{P}}(\mathcal{D}^{\star}_{j},\check{\mu})=\frac{1}{m}\sum_{i=1}^{m}f_{\mathcal{P}}(s_{i})(\xi_{i,j}-\check{\mu})$. The value $\sup_{\mathcal{P}\in\mathcal{L}}\bar{\mathsf{q}}_{\mathcal{P}}(\mathcal{D}^{\star}_{j},\check{\mu})$ can be interpreted as an empirical estimate of the maximum quality of non-significant patterns, measured from datasets sampled from the null distribution.
Note that we use $\bar{\mathsf{q}}_{\mathcal{P}}(\mathcal{D}^{\star}_{j},\check{\mu})$ as $\check{\mu}$ is a lower bound to $\mu$; consequently, $\bar{\mathsf{q}}_{\mathcal{P}}(\mathcal{D}^{\star}_{j},\check{\mu})$ provides a proper upper bound to such maximum deviation. Note also that it is necessary to consider the supremum over the language $\mathcal{L}$, since the set of true null hypotheses $\mathcal{L}^{\star}$ is unknown.
We remark that the computation of $\sup_{\mathcal{P}\in\mathcal{L}}\bar{\mathsf{q}}_{\mathcal{P}}(\mathcal{D}^{\star}_{j},\check{\mu})$ can be performed with fast pattern enumeration strategies, similar to the ones leveraged by previous methods for frequent and significant pattern mining; in fact, FSR can be combined with any efficient exploration procedure, such as the ones that explore the search space of the pattern language of interest using pruning bounds, for instance depth-first (Minato et al., 2014; Llinares-López et al., 2015; Terada et al., 2015) or best-first searches (Pietracaprina and Vandin, 2007; Pellegrina and Vandin, 2020; Pellegrina et al., 2022). The algorithm then stores the empirical deviations $\sup_{\mathcal{P}\in\mathcal{L}}\bar{\mathsf{q}}_{\mathcal{P}}(\mathcal{D}^{\star}_{j},\check{\mu})$ in the variables $d_{j}$.
Then, the average maximum deviation $\tilde{d}(\mathcal{R}^{\star},\check{\mu})$ over the $c$ resampled datasets is computed (line 6) as the mean of the values $\{d_{j},j\in[1,c]\}$, and it is then used to compute a rigorous upper bound $\varepsilon$ to the supremum deviation $\sup_{\mathcal{P}\in\mathcal{L}^{\star}}\left\{\bar{\mathsf{q}}_{\mathcal{P}}(\mathcal{D})\right\}$ (Eq. (2)) under the null hypothesis (line 7) through the procedure boundStatistic. The implementation of this procedure, i.e., the returned value of $\varepsilon$, depends on whether conditional testing or unconditional testing is considered, and it is described in Section 4.1 and Section 4.2, respectively. In both cases, we use advanced concentration bounds (Boucheron et al., 2013; McDiarmid, 1989) that allow us to obtain small values of $\varepsilon$ with a small number $c$ of resampled datasets. Finally, the set of patterns with estimated quality $\bar{\mathsf{q}}_{\mathcal{P}}(\mathcal{D})$ at least $\varepsilon+\varepsilon_{T}\mathsf{f}_{\mathcal{P}}(\mathcal{D})$ is reported in output (lines 8-9).

Input: Pattern language $\mathcal{L}$; dataset $\mathcal{D}$ of $m$ transactions; $c\geq 1$; $\delta\in(0,1)$.
Output: Set $O\subseteq\mathcal{L}$ of significant patterns with FWER $\leq\delta$.
1 $\varepsilon_{T}\leftarrow$ boundTarget($\mu(\mathcal{D})$, $m$, $\delta$);
2 $\hat{\mu}\leftarrow\mu(\mathcal{D})+\varepsilon_{T}$;
3 $\check{\mu}\leftarrow\mu(\mathcal{D})-\varepsilon_{T}$;
4 $\mathcal{R}^{\star}\leftarrow$ resampleTarget($\mathcal{D}$, $c$, $\hat{\mu}$);
5 forall $j\in[1,c]$ do $d_{j}\leftarrow\sup_{\mathcal{P}\in\mathcal{L}}\bigl\{\bar{\mathsf{q}}_{\mathcal{P}}(\mathcal{D}^{\star}_{j},\check{\mu})\bigr\}$;
6 $\tilde{d}(\mathcal{R}^{\star},\check{\mu})\leftarrow\frac{1}{c}\sum_{j=1}^{c}d_{j}$;
7 $\varepsilon\leftarrow$ boundStatistic($\mathcal{D}$, $\mathcal{L}$, $\tilde{d}(\mathcal{R}^{\star},\check{\mu})$, $\delta$);
8 $O\leftarrow\left\{\mathcal{P}\in\mathcal{L}:\bar{\mathsf{q}}_{\mathcal{P}}(\mathcal{D})\geq\varepsilon+\varepsilon_{T}\mathsf{f}_{\mathcal{P}}(\mathcal{D})\right\}$;
9 return $O$;
Algorithm 1 FSR
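The control flow of Algorithm 1 can be sketched as follows. This is an illustrative sketch under several assumptions of ours, not the authors' implementation: patterns are itemsets, the language is enumerated by brute force up to a hypothetical `max_size` (a stand-in for the efficient search strategies mentioned above), and `bound_target` and `bound_statistic` are passed in as callables because their definitions are given in Sections 4.1 and 4.2. Clipping $\hat{\mu}$ and $\check{\mu}$ to $[0,1]$ is our reading of the assumption $0\leq\check{\mu}\leq\hat{\mu}\leq 1$.

```python
import random
from itertools import combinations


def quality(pattern, dataset, mu):
    """(1/m) * sum_i f_P(s_i) * (label_i - mu)."""
    return sum(label - mu for s, label in dataset if pattern <= s) / len(dataset)


def frequency(pattern, dataset):
    """Fraction of transactions in which the pattern occurs."""
    return sum(1 for s, _ in dataset if pattern <= s) / len(dataset)


def enumerate_itemsets(dataset, max_size):
    """Brute-force language: all itemsets with at most max_size items (assumption)."""
    items = sorted(set().union(*(s for s, _ in dataset)))
    for size in range(1, max_size + 1):
        for combo in combinations(items, size):
            yield frozenset(combo)


def fsr(dataset, c, delta, bound_target, bound_statistic, max_size=2, seed=0):
    rng = random.Random(seed)
    m = len(dataset)
    mu_d = sum(label for _, label in dataset) / m
    language = list(enumerate_itemsets(dataset, max_size))

    eps_t = bound_target(mu_d, m, delta)                      # line 1
    mu_hat = min(1.0, mu_d + eps_t)                           # line 2
    mu_check = max(0.0, mu_d - eps_t)                         # line 3
    resampled = [[(s, 1 if rng.random() < mu_hat else 0) for s, _ in dataset]
                 for _ in range(c)]                           # line 4
    d = [max(quality(p, d_star, mu_check) for p in language)
         for d_star in resampled]                             # line 5
    d_tilde = sum(d) / c                                      # line 6
    eps = bound_statistic(dataset, language, d_tilde, delta)  # line 7
    return {p for p in language                               # lines 8-9
            if quality(p, dataset, mu_d) >= eps + eps_t * frequency(p, dataset)}


# Toy run of the conditional variant: Section 4.1 states that boundTarget returns 0.
# The bound_statistic below is a HYPOTHETICAL placeholder (d_tilde plus an arbitrary
# slack). It is NOT the bound derived in the paper and gives no FWER guarantee.
toy = [(frozenset("ab"), 1), (frozenset("a"), 1), (frozenset("ac"), 1),
       (frozenset("b"), 0), (frozenset("c"), 0), (frozenset("bc"), 0)]
found = fsr(toy, c=5, delta=0.05,
            bound_target=lambda mu, m, delta: 0.0,
            bound_statistic=lambda D, L, d_tilde, delta: d_tilde + 0.05)
print(found)
```

Passing the two bounding procedures as arguments keeps the skeleton identical for conditional and unconditional testing, which matches the structure described in the text.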

Note that while the computation of $\varepsilon_{T}$ and $\varepsilon$ depends on whether conditional testing or unconditional testing is considered, the overall approach followed by FSR is the same in both cases. In particular, for both cases FSR relies on the resampled datasets $\mathcal{R}^{\star}$ to estimate the maximum deviation, over all patterns $\mathcal{P}\in\mathcal{L}$, of the estimate $\bar{\mathsf{q}}_{\mathcal{P}}(\mathcal{D})$ from the (unknown) value $\mathsf{q}_{\mathcal{P}}$ under the null hypothesis. Moreover, the exact same procedure is used to generate the resampled datasets. Note that our approach is similar to permutation approaches for significant pattern mining, which use permuted datasets to estimate the significance of patterns, but with two crucial differences. First, our datasets are obtained by resampling the target values, and not by permuting them, which allows us to obtain rigorous bounds for both conditional and unconditional testing, while permutation approaches can be used for conditional testing only. Second, since our analysis depends on the expectation of the maximum deviation, we can employ advanced bounds on the concentration of the expected value of functions of independent random variables; this allows us to use a small number $c$ of resampled datasets, as shown by our analysis and experimental evaluation.
This is in contrast with permutation approaches (e.g., the ones based on WY permutation testing (Llinares-López et al., 2015; Pellegrina and Vandin, 2020; Terada et al., 2015)) that instead estimate the quantiles of the distribution of the maximum deviation using a large number of permutations (see also Section A.2 in the Appendix for a more detailed comparison).
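The first difference can be made concrete with a small experiment. The sketch below, with hypothetical helper names and a made-up label vector, contrasts permuting the observed labels (the number of ones never changes) with resampling them as i.i.d. Bernoulli draws (the number of ones is random and matches the observed one only in expectation).

```python
import random
from typing import List


def permute_labels(labels: List[int], rng: random.Random) -> List[int]:
    """Permutation: shuffle the observed labels, so the number of ones is unchanged."""
    shuffled = list(labels)
    rng.shuffle(shuffled)
    return shuffled


def resample_labels(m: int, p: float, rng: random.Random) -> List[int]:
    """Resampling: draw m i.i.d. Bernoulli(p) labels; the number of ones is random."""
    return [1 if rng.random() < p else 0 for _ in range(m)]


rng = random.Random(0)
labels = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]  # made-up labels with mu(D) = 0.3
p = sum(labels) / len(labels)

perm_counts = {sum(permute_labels(labels, rng)) for _ in range(1000)}
resa_counts = {sum(resample_labels(len(labels), p, rng)) for _ in range(1000)}
print(perm_counts)         # always {3}
print(sorted(resa_counts)) # typically several distinct counts
```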

4.1. FSR for Conditional Testing

We now describe the details of procedures boundTarget and boundStatistic for the version of FSR that uses conditional testing, which we refer to as FSR-C. As a reminder, a conditional test for our problem assumes that the average target value $\mu$ and pattern frequencies $\mathsf{f}_{\mathcal{P}}(\mathcal{D})$ are fixed, for all $\mathcal{P}\in\mathcal{L}$, to the values observed in the dataset $\mathcal{D}$.

For boundTarget, since the average target value $\mu$ is fixed to the value $\mu(\mathcal{D})$ observed in the dataset $\mathcal{D}$, the bound $\varepsilon_{T}$ on its deviation from the expectation is $0$, that is, boundTarget simply returns $0$. Note that this implies that the output of FSR-C consists of all patterns in $\mathcal{L}$ with $\bar{\mathsf{q}}_{\mathcal{P}}(\mathcal{D})\geq\varepsilon$ (line 8).

For boundStatistic, we now show how to compute a rigorous probabilistic bound $\varepsilon$ to the supremum deviation $\sup_{\mathcal{P}\in\mathcal{L}^{\star}}\left\{\bar{\mathsf{q}}_{\mathcal{P}}(\mathcal{D})\right\}$ (Eq. (2)) under the null hypothesis. The bound is obtained by computing the average maximum deviation $\tilde{d}(\mathcal{R}^{\star},\check{\mu})$ over $c$ resampled datasets, and then applying advanced concentration results (Boucheron et al., 2013). Note that since $\varepsilon_{T}=0$, in this case $\tilde{d}(\mathcal{R}^{\star},\check{\mu})=\tilde{d}(\mathcal{R}^{\star},\mu(\mathcal{D}))$.

Since FSR-C is based on resampled datasets, where each target label is sampled independently, our estimate $\tilde{d}(\mathcal{R}^{\star},\mu(\mathcal{D}))$ of the expected maximum deviation is not based on the conditional distribution assumed by conditional tests, where the fraction of target labels equal to $1$ is exactly $\mu(\mathcal{D})$ in every dataset (while in our resampled datasets such fraction may vary, and it is equal to $\mu(\mathcal{D})$ only in expectation). We therefore need to relate the supremum deviation of the observed quality of patterns on resampled datasets with the one observed in datasets sampled from the conditional distribution. Interestingly, we show that the resampling and conditional distributions are closely related, in the sense that high probability bounds for the former also apply to the latter.

We first prove a general result (Lemma 1) relating the expectation of monotone functions of permutations of binary vectors, corresponding to the conditional distribution, with the expectation taken w.r.t. independent resamples, corresponding to resampled datasets. For an integer $k$ with $0\leq k\leq m$, define the set $B(k)$ of binary vectors with $k$ entries equal to one as

\[ B(k)=\Bigl\{\mathbf{v}\in\{0,1\}^{m},\ \sum_{i=1}^{m}\mathbf{v}_{i}=k\Bigr\}, \]

and let $U(B(k))$ be the uniform distribution over the set $B(k)$. Equivalently, sampling from $U(B(k))$ corresponds to drawing a uniformly random permutation of a binary vector with $k$ ones. Then, define $I(p)$ as a probability distribution over $\{0,1\}^{m}$, such that each entry $\mathbf{v}_{i}$ of a random vector $\mathbf{v}$ taken from $I(p)$ is an i.i.d. Bernoulli r.v. with $\Pr(\mathbf{v}_{i}=1)=p=1-\Pr(\mathbf{v}_{i}=0)$, for some $p\in[0,1]$.

Lemma 1.

Let $f:\{0,1\}^{m}\rightarrow\mathbb{R}$ be a nonnegative function such that $\mathop{\mathbb{E}}_{\mathbf{v}\sim U(B(k))}\left[f(\mathbf{v})\right]$ is either monotonically increasing or monotonically decreasing in $k$. It holds

\[ \mathop{\mathbb{E}}_{\mathbf{v}\sim U(B(k))}\left[f(\mathbf{v})\right]\leq 2\mathop{\mathbb{E}}_{\mathbf{v}\sim I(k/m)}\left[f(\mathbf{v})\right]. \]
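The inequality of Lemma 1 can be checked numerically on a small instance by exhaustive enumeration. The sketch below is only a sanity check under our own assumptions: $m=8$, and $f$ is the maximum number of ones covered by a few hypothetical index sets. Such an $f$ is nonnegative and nondecreasing in every coordinate, which makes its expectation under $U(B(k))$ nondecreasing in $k$ (a uniform vector of $B(k+1)$ can be obtained from a uniform vector of $B(k)$ by flipping a random zero to one). Exact rational arithmetic avoids rounding issues.

```python
from fractions import Fraction
from itertools import product

M = 8
# Hypothetical index sets playing the role of patterns.
PATTERNS = [(0, 1, 2), (2, 3), (4, 5, 6, 7), (1, 5)]


def f(v):
    """Nonnegative and nondecreasing in every coordinate of v."""
    return max(sum(v[i] for i in pat) for pat in PATTERNS)


def expectation_fixed_ones(k):
    """E[f(v)] for v uniform over B(k), by exhaustive enumeration."""
    vectors = [v for v in product((0, 1), repeat=M) if sum(v) == k]
    return Fraction(sum(f(v) for v in vectors), len(vectors))


def expectation_iid(p):
    """E[f(v)] for v ~ I(p), i.e. i.i.d. Bernoulli(p) entries."""
    total = Fraction(0)
    for v in product((0, 1), repeat=M):
        ones = sum(v)
        total += f(v) * p**ones * (1 - p) ** (M - ones)
    return total


for k in range(M + 1):
    lhs = expectation_fixed_ones(k)
    rhs = 2 * expectation_iid(Fraction(k, M))
    print(k, float(lhs), float(rhs), lhs <= rhs)
```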

To prove Lemma 1, we need the following technical result, providing bounds to the probability that a Binomial random variable exceeds its expectation (Jogdeo and Samuels, 1968).

Lemma 2.

Let $\mu m$ be an integer. Then it holds

\[ \Pr_{\mathbf{v}\sim I(\mu)}\left(\sum_{i=1}^{m}\mathbf{v}_{i}>\mu m\right)<\frac{1}{2}<\Pr_{\mathbf{v}\sim I(\mu)}\left(\sum_{i=1}^{m}\mathbf{v}_{i}\geq\mu m\right). \]
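Lemma 2 can be verified exactly for small $m$, since the sum of the entries of $\mathbf{v}\sim I(\mu)$ is a Binomial random variable with parameters $m$ and $\mu$. The following sketch, with a hypothetical helper name, computes both tail probabilities with exact rational arithmetic for every $\mu=k/m$.

```python
from fractions import Fraction
from math import comb


def binomial_tail_bounds(m: int, k: int):
    """Return (Pr(X > k), Pr(X >= k)) for X ~ Binomial(m, k/m), computed exactly."""
    p = Fraction(k, m)
    pmf = [comb(m, j) * p**j * (1 - p) ** (m - j) for j in range(m + 1)]
    return sum(pmf[k + 1:]), sum(pmf[k:])


m = 10
for k in range(m + 1):
    above, at_least = binomial_tail_bounds(m, k)
    print(k, float(above), float(at_least), above < Fraction(1, 2) < at_least)
```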

We now prove Lemma 1.

Proof of Lemma 1.

We prove the result assuming that $\mathop{\mathbb{E}}_{\mathbf{v}\sim U(B(k))}\left[f(\mathbf{v})\right]$ is monotonically increasing in $k$, as the other case is analogous. First, we note that

𝔼𝐯U(B(k))[f(𝐯)]=𝔼𝐯I(k/m)[f(𝐯)i=1m𝐯i=k].subscript𝔼similar-to𝐯𝑈𝐵𝑘delimited-[]𝑓𝐯subscript𝔼similar-to𝐯𝐼𝑘𝑚delimited-[]conditional𝑓𝐯superscriptsubscript𝑖1𝑚subscript𝐯𝑖𝑘\displaystyle\mathop{\mathbb{E}}_{\mathbf{v}\sim U(B(k))}\left[f(\mathbf{v})% \right]=\mathop{\mathbb{E}}_{\mathbf{v}\sim I(k/m)}\left[f(\mathbf{v})\mid\sum% _{i=1}^{m}\mathbf{v}_{i}=k\right].blackboard_E start_POSTSUBSCRIPT bold_v ∼ italic_U ( italic_B ( italic_k ) ) end_POSTSUBSCRIPT [ italic_f ( bold_v ) ] = blackboard_E start_POSTSUBSCRIPT bold_v ∼ italic_I ( italic_k / italic_m ) end_POSTSUBSCRIPT [ italic_f ( bold_v ) ∣ ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT bold_v start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = italic_k ] .

Therefore, we have

𝔼𝐯I(k/m)[f(𝐯)]subscript𝔼similar-to𝐯𝐼𝑘𝑚delimited-[]𝑓𝐯\displaystyle\mathop{\mathbb{E}}_{\mathbf{v}\sim I(k/m)}\left[f(\mathbf{v})\right]blackboard_E start_POSTSUBSCRIPT bold_v ∼ italic_I ( italic_k / italic_m ) end_POSTSUBSCRIPT [ italic_f ( bold_v ) ]
=j=0m𝔼𝐯I(k/m)[f(𝐯)i=1m𝐯i=j]Pr𝐯I(k/m)(i=1m𝐯i=j)absentsuperscriptsubscript𝑗0𝑚subscript𝔼similar-to𝐯𝐼𝑘𝑚delimited-[]conditional𝑓𝐯superscriptsubscript𝑖1𝑚subscript𝐯𝑖𝑗subscriptPrsimilar-to𝐯𝐼𝑘𝑚superscriptsubscript𝑖1𝑚subscript𝐯𝑖𝑗\displaystyle=\sum_{j=0}^{m}\mathop{\mathbb{E}}_{\mathbf{v}\sim I(k/m)}\left[f% (\mathbf{v})\mid\sum_{i=1}^{m}\mathbf{v}_{i}=j\right]\Pr_{\mathbf{v}\sim I(k/m% )}\left(\sum_{i=1}^{m}\mathbf{v}_{i}=j\right)= ∑ start_POSTSUBSCRIPT italic_j = 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT blackboard_E start_POSTSUBSCRIPT bold_v ∼ italic_I ( italic_k / italic_m ) end_POSTSUBSCRIPT [ italic_f ( bold_v ) ∣ ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT bold_v start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = italic_j ] roman_Pr start_POSTSUBSCRIPT bold_v ∼ italic_I ( italic_k / italic_m ) end_POSTSUBSCRIPT ( ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT bold_v start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = italic_j )
j=km𝔼𝐯I(k/m)[f(𝐯)i=1m𝐯i=j]Pr𝐯I(k/m)(i=1m𝐯i=j)absentsuperscriptsubscript𝑗𝑘𝑚subscript𝔼similar-to𝐯𝐼𝑘𝑚delimited-[]conditional𝑓𝐯superscriptsubscript𝑖1𝑚subscript𝐯𝑖𝑗subscriptPrsimilar-to𝐯𝐼𝑘𝑚superscriptsubscript𝑖1𝑚subscript𝐯𝑖𝑗\displaystyle\geq\sum_{j=k}^{m}\mathop{\mathbb{E}}_{\mathbf{v}\sim I(k/m)}% \left[f(\mathbf{v})\mid\sum_{i=1}^{m}\mathbf{v}_{i}=j\right]\Pr_{\mathbf{v}% \sim I(k/m)}\left(\sum_{i=1}^{m}\mathbf{v}_{i}=j\right)≥ ∑ start_POSTSUBSCRIPT italic_j = italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT blackboard_E start_POSTSUBSCRIPT bold_v ∼ italic_I ( italic_k / italic_m ) end_POSTSUBSCRIPT [ italic_f ( bold_v ) ∣ ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT bold_v start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = italic_j ] roman_Pr start_POSTSUBSCRIPT bold_v ∼ italic_I ( italic_k / italic_m ) end_POSTSUBSCRIPT ( ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT bold_v start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = italic_j )
𝔼𝐯I(k/m)[f(𝐯)i=1m𝐯i=k]j=kmPr𝐯I(k/m)(i=1m𝐯i=j)absentsubscript𝔼similar-to𝐯𝐼𝑘𝑚delimited-[]conditional𝑓𝐯superscriptsubscript𝑖1𝑚subscript𝐯𝑖𝑘superscriptsubscript𝑗𝑘𝑚subscriptPrsimilar-to𝐯𝐼𝑘𝑚superscriptsubscript𝑖1𝑚subscript𝐯𝑖𝑗\displaystyle\geq\mathop{\mathbb{E}}_{\mathbf{v}\sim I(k/m)}\left[f(\mathbf{v}% )\mid\sum_{i=1}^{m}\mathbf{v}_{i}=k\right]\sum_{j=k}^{m}\Pr_{\mathbf{v}\sim I(% k/m)}\left(\sum_{i=1}^{m}\mathbf{v}_{i}=j\right)≥ blackboard_E start_POSTSUBSCRIPT bold_v ∼ italic_I ( italic_k / italic_m ) end_POSTSUBSCRIPT [ italic_f ( bold_v ) ∣ ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT bold_v start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = italic_k ] ∑ start_POSTSUBSCRIPT italic_j = italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT roman_Pr start_POSTSUBSCRIPT bold_v ∼ italic_I ( italic_k / italic_m ) end_POSTSUBSCRIPT ( ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT bold_v start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = italic_j )
=𝔼𝐯U(B(k))[f(𝐯)]Pr𝐯I(k/m)(i=1m𝐯ik)absentsubscript𝔼similar-to𝐯𝑈𝐵𝑘delimited-[]𝑓𝐯subscriptPrsimilar-to𝐯𝐼𝑘𝑚superscriptsubscript𝑖1𝑚subscript𝐯𝑖𝑘\displaystyle=\mathop{\mathbb{E}}_{\mathbf{v}\sim U(B(k))}\left[f(\mathbf{v})% \right]\Pr_{\mathbf{v}\sim I(k/m)}\left(\sum_{i=1}^{m}\mathbf{v}_{i}\geq k\right)= blackboard_E start_POSTSUBSCRIPT bold_v ∼ italic_U ( italic_B ( italic_k ) ) end_POSTSUBSCRIPT [ italic_f ( bold_v ) ] roman_Pr start_POSTSUBSCRIPT bold_v ∼ italic_I ( italic_k / italic_m ) end_POSTSUBSCRIPT ( ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT bold_v start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ≥ italic_k )
𝔼𝐯U(B(k))[f(𝐯)]12,absentsubscript𝔼similar-to𝐯𝑈𝐵𝑘delimited-[]𝑓𝐯12\displaystyle\geq\mathop{\mathbb{E}}_{\mathbf{v}\sim U(B(k))}\left[f(\mathbf{v% })\right]\frac{1}{2},≥ blackboard_E start_POSTSUBSCRIPT bold_v ∼ italic_U ( italic_B ( italic_k ) ) end_POSTSUBSCRIPT [ italic_f ( bold_v ) ] divide start_ARG 1 end_ARG start_ARG 2 end_ARG ,

where the last inequality follows from Lemma 2. ∎
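The inequality of Lemma 1 can also be checked by brute force on a toy instance. The sketch below is only an illustration under stated assumptions: $B(k)$ is read as the set of binary vectors of length $m$ with exactly $k$ ones and $I(p)$ as the product of $m$ Bernoulli($p$) distributions (both inferred from the conditioning identity used in the proof), and the test function `f` is a hypothetical example chosen so that its mean under $U(B(k))$ is nondecreasing in $k$.

```python
from fractions import Fraction
from itertools import combinations, product
from math import comb


def expect_fixed_ones(f, m, k):
    """Exact E[f(v)] for v uniform over binary vectors of length m with exactly k ones."""
    total = Fraction(0)
    for ones in combinations(range(m), k):
        v = [0] * m
        for i in ones:
            v[i] = 1
        total += f(v)
    return total / comb(m, k)


def expect_iid(f, m, p):
    """Exact E[f(v)] for v with m i.i.d. Bernoulli(p) entries."""
    total = Fraction(0)
    for v in product((0, 1), repeat=m):
        ones = sum(v)
        total += f(v) * p**ones * (1 - p) ** (m - ones)
    return total


def f(v):
    # Hypothetical nonnegative test function: at least two ones among the first three entries.
    return 1 if v[0] + v[1] + v[2] >= 2 else 0


m = 8
conditional = [expect_fixed_ones(f, m, k) for k in range(m + 1)]
# The monotonicity assumption used in the proof holds for this f.
assert all(a <= b for a, b in zip(conditional, conditional[1:]))
# The conclusion of Lemma 1 holds for every k.
for k in range(m + 1):
    assert conditional[k] <= 2 * expect_iid(f, m, Fraction(k, m))
```

Enumeration costs $2^{m}$ evaluations of `f`, so this is only practical for very small $m$; it is meant as a sanity check of the statement, not as a way to compute the bound.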

We make use of Lemma 1 to prove the following result.

Theorem 3.

Define the constant $\bar{\mu}=\mu(\mathcal{D})$ and let $k=\bar{\mu}m$. For any $z\geq 0$, it holds

\begin{align*}
&\Pr_{\mathbf{v}\sim U(B(k))}\left(\sup_{\mathcal{P}\in\mathcal{L}^{\star}}\frac{1}{m}\sum_{i=1}^{m}f_{\mathcal{P}}(s_{i})(\mathbf{v}_{i}-\bar{\mu})\geq z\right)\\
&\qquad\leq 2\Pr_{\mathbf{v}\sim I(\bar{\mu})}\left(\sup_{\mathcal{P}\in\mathcal{L}^{\star}}\frac{1}{m}\sum_{i=1}^{m}f_{\mathcal{P}}(s_{i})(\mathbf{v}_{i}-\bar{\mu})\geq z\right).
\end{align*}

Proof of Theorem 3.

Define the function $g:\{0,1\}^{m}\rightarrow\{0,1\}$ as

\[
g(\mathbf{v})=\mathds{1}\left[\sup_{\mathcal{P}\in\mathcal{L}^{\star}}\frac{1}{m}\sum_{i=1}^{m}f_{\mathcal{P}}(s_{i})(\mathbf{v}_{i}-\bar{\mu})\geq z\right].
\]

In order to apply Lemma 1, we prove that $\mathop{\mathbb{E}}_{\mathbf{v}\sim U(B(k))}\left[g(\mathbf{v})\right]$ is nondecreasing in $k$. This is equivalent to showing that

\[
\mathop{\mathbb{E}}_{\mathbf{v}\sim U(B(k))}\left[g(\mathbf{v})\right]\leq\mathop{\mathbb{E}}_{\mathbf{v}\sim U(B(k^{\prime}))}\left[g(\mathbf{v})\right],
\]

where $k^{\prime}=k+j$, for any pair of integers $k\in[0,m]$ and $j\in[0,m-k]$. To do so, we build a coupling $\pi(k,k^{\prime})$ between the two distributions $U(B(k))$ and $U(B(k^{\prime}))$ as follows. For any $\mathbf{v}$ taken from $U(B(k))$, define $\mathbf{v}^{\prime}$ as a copy of $\mathbf{v}$ (i.e., such that $\mathbf{v}_{i}=\mathbf{v}^{\prime}_{i},\forall i\in[1,m]$) that is modified according to the following procedure: $j$ times, select uniformly at random an index $i$ such that $\mathbf{v}^{\prime}_{i}=0$, and set $\mathbf{v}^{\prime}_{i}$ to $1$. Denote the pair $\mathbf{v},\mathbf{v}^{\prime}$ sampled according to $\pi(k,k^{\prime})$ as the output of this procedure.
It is immediate to observe that $\mathbf{v}^{\prime}\sim U(B(k^{\prime}))$ (i.e., that the marginal distribution of $\mathbf{v}^{\prime}$ is $U(B(k^{\prime}))$), and that $\mathbf{v}_{i}\leq\mathbf{v}^{\prime}_{i},\forall i\in[1,m]$. This implies that, for any pair of vectors $\mathbf{v},\mathbf{v}^{\prime}\sim\pi(k,k^{\prime})$, it holds

\[
\sup_{\mathcal{P}\in\mathcal{L}^{\star}}\frac{1}{m}\sum_{i=1}^{m}f_{\mathcal{P}}(s_{i})(\mathbf{v}_{i}-\bar{\mu})\leq\sup_{\mathcal{P}\in\mathcal{L}^{\star}}\frac{1}{m}\sum_{i=1}^{m}f_{\mathcal{P}}(s_{i})(\mathbf{v}^{\prime}_{i}-\bar{\mu}).
\]

A consequence of this fact is

\begin{align*}
\mathop{\mathbb{E}}_{\mathbf{v}^{\prime}\sim U(B(k+j))}\left[g(\mathbf{v}^{\prime})\right]
&=\Pr_{\mathbf{v}^{\prime}\sim U(B(k+j))}\left(\sup_{\mathcal{P}\in\mathcal{L}^{\star}}\frac{1}{m}\sum_{i=1}^{m}f_{\mathcal{P}}(s_{i})(\mathbf{v}^{\prime}_{i}-\bar{\mu})\geq z\right)\\
&=\Pr_{\mathbf{v},\mathbf{v}^{\prime}\sim\pi(k,k^{\prime})}\left(\sup_{\mathcal{P}\in\mathcal{L}^{\star}}\frac{1}{m}\sum_{i=1}^{m}f_{\mathcal{P}}(s_{i})(\mathbf{v}^{\prime}_{i}-\bar{\mu})\geq z\right)\\
&\geq\Pr_{\mathbf{v},\mathbf{v}^{\prime}\sim\pi(k,k^{\prime})}\left(\sup_{\mathcal{P}\in\mathcal{L}^{\star}}\frac{1}{m}\sum_{i=1}^{m}f_{\mathcal{P}}(s_{i})(\mathbf{v}_{i}-\bar{\mu})\geq z\right)\\
&=\Pr_{\mathbf{v}\sim U(B(k))}\left(\sup_{\mathcal{P}\in\mathcal{L}^{\star}}\frac{1}{m}\sum_{i=1}^{m}f_{\mathcal{P}}(s_{i})(\mathbf{v}_{i}-\bar{\mu})\geq z\right)\\
&=\mathop{\mathbb{E}}_{\mathbf{v}\sim U(B(k))}\left[g(\mathbf{v})\right].
\end{align*}

Therefore, $g$ is nonnegative and $\mathop{\mathbb{E}}_{\mathbf{v}\sim U(B(k))}\left[g(\mathbf{v})\right]$ is nondecreasing in $k$. We apply Lemma 1 to the function $g$, obtaining the statement. ∎
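The coupling used in the proof of Theorem 3 can be simulated directly. The following sketch is an illustration under assumptions that are not stated in this section: the pattern functions $f_{\mathcal{P}}(s_{i})$ are taken to be 0/1 indicators (any nonnegative values would work equally well), the pattern family is a small random one, and the function names are hypothetical. Drawing $j$ distinct zero positions at once is used in place of the $j$ sequential draws of the proof; the two procedures yield the same distribution.

```python
import random
from fractions import Fraction


def sup_deviation(indicators, v, mu_bar):
    """Max over patterns of (1/m) * sum_i f_P(s_i) * (v_i - mu_bar)."""
    m = len(v)
    return max(sum(f * (x - mu_bar) for f, x in zip(row, v)) / m for row in indicators)


def sample_coupled_pair(m, k, j, rng):
    """Draw v uniform with k ones, then v' by switching j uniformly chosen zeros of v to one."""
    ones = set(rng.sample(range(m), k))
    v = [1 if i in ones else 0 for i in range(m)]
    zeros = [i for i in range(m) if v[i] == 0]
    flipped = set(rng.sample(zeros, j))
    v_prime = [1 if (v[i] == 1 or i in flipped) else 0 for i in range(m)]
    return v, v_prime


rng = random.Random(0)
m, k, j = 20, 6, 4
mu_bar = Fraction(k, m)
# Hypothetical pattern indicators f_P(s_i) in {0, 1}, one row per pattern.
indicators = [[rng.randint(0, 1) for _ in range(m)] for _ in range(15)]

for _ in range(200):
    v, v_prime = sample_coupled_pair(m, k, j, rng)
    assert sum(v) == k and sum(v_prime) == k + j
    assert all(a <= b for a, b in zip(v, v_prime))  # v <= v' entrywise
    # The supremum deviation can only grow when moving from v to v'.
    assert sup_deviation(indicators, v, mu_bar) <= sup_deviation(indicators, v_prime, mu_bar)
```

The last assertion is the pointwise inequality on which the monotonicity of $\mathop{\mathbb{E}}_{\mathbf{v}\sim U(B(k))}\left[g(\mathbf{v})\right]$ rests; it relies on the nonnegativity assumed for the indicator values.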

We note that Theorem 3 is precisely what we seek: it implies that large supremum deviations that are unlikely in the independent resamples distribution $\mathbf{v}\sim I(\bar{\mu})$ are also unlikely in the conditional distribution $\mathbf{v}\sim U(B(k))$. By using Theorem 3, we prove the following result, which implies strong concentration of the supremum deviation of pattern qualities w.r.t. their expectations, taken w.r.t. independent resamples of the target labels rather than permutations.

Theorem 4.

Define $\bar{\mu}=\mu(\mathcal{D})$, and

\[
\omega=(1-\bar{\mu})\min\Bigl\{\bar{\mu},\sup_{\mathcal{P}\in\mathcal{L}}\frac{1}{m}\sum_{i=1}^{m}f_{\mathcal{P}}(s_{i})\Bigr\}.
\]

With probability at least $1-\delta$ over $\mathbf{v}\sim U(B(k))$, it holds

\begin{align*}
&\sup_{\mathcal{P}\in\mathcal{L}^{\star}}\frac{1}{m}\sum_{i=1}^{m}f_{\mathcal{P}}(s_{i})(\mathbf{v}_{i}-\bar{\mu})\\
&\qquad\leq\mathop{\mathbb{E}}_{\mathbf{v}\sim I(\bar{\mu})}\left[\sup_{\mathcal{P}\in\mathcal{L}^{\star}}\frac{1}{m}\sum_{i=1}^{m}f_{\mathcal{P}}(s_{i})(\mathbf{v}_{i}-\bar{\mu})\right]+\sqrt{\frac{2\omega\log\left(\frac{2}{\delta}\right)}{m}}.
\end{align*}
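To give a sense of scale for the additive term of Theorem 4, the sketch below evaluates $\omega$ and $\sqrt{2\omega\log(2/\delta)/m}$ for given inputs. The function and parameter names are hypothetical; `max_mean_f` stands for $\sup_{\mathcal{P}\in\mathcal{L}}\frac{1}{m}\sum_{i=1}^{m}f_{\mathcal{P}}(s_{i})$, which is assumed to be supplied by the caller, and the numeric inputs are arbitrary example values rather than figures from the paper.

```python
from math import log, sqrt


def omega(mu_bar: float, max_mean_f: float) -> float:
    """omega = (1 - mu_bar) * min(mu_bar, max_mean_f), as defined in Theorem 4."""
    return (1 - mu_bar) * min(mu_bar, max_mean_f)


def deviation_term(mu_bar: float, max_mean_f: float, m: int, delta: float) -> float:
    """Additive term sqrt(2 * omega * log(2 / delta) / m) of Theorem 4."""
    return sqrt(2 * omega(mu_bar, max_mean_f) * log(2 / delta) / m)


# Arbitrary example inputs: 30% positive labels, patterns covering at most 10% of the data.
small_sample = deviation_term(mu_bar=0.3, max_mean_f=0.1, m=10_000, delta=0.05)
large_sample = deviation_term(mu_bar=0.3, max_mean_f=0.1, m=1_000_000, delta=0.05)
assert large_sample < small_sample  # the term shrinks as 1 / sqrt(m)
```

Because of the minimum in $\omega$, the term is small whenever either the fraction of positive labels or the largest pattern mean is small.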

To prove Theorem 4, we need the following technical result regarding the concentration of functions of independent random variables.

Theorem 5 (Theorem 6.7 of (Boucheron et al., 2013)).

Let $g:\mathcal{Y}^{m}\rightarrow\mathbb{R}$ be a function, and let $X=(X_{1},\dots,X_{m})\in\mathcal{Y}^{m}$ be a collection of $m$ independent random variables. Define $\bar{X}^{j}=(X_{1},\dots,X_{j-1},X^{\prime}_{j},X_{j+1},\dots,X_{m})\in\mathcal{Y}^{m}$ as a copy of $X$, where its $j$-th element $X_{j}$ is replaced by an independent copy $X^{\prime}_{j}$. Assume that, for some constant $q\geq 0$,

\[
\sum_{j=1}^{m}\mathop{\mathbb{E}}\left[\left(g(X)-g(\bar{X}^{j})\right)_{+}^{2}\>|\>X\right]\leq q
\]

holds almost surely. Then, it holds, for all $t\geq 0$,

\[
\Pr\Bigl(g(X)\geq\mathop{\mathbb{E}}_{X}\left[g(X)\right]+t\Bigr)\leq\exp(-t^{2}/(2q)).
\]

We use these results to prove Theorem 4.

Proof of Theorem 4.

Define the function $g:\{0,1\}^{m}\rightarrow\mathbb{R}$ as

\[
g(\mathbf{v})=\sup_{\mathcal{P}\in\mathcal{L}^{\star}}\frac{1}{m}\sum_{i=1}^{m}f_{\mathcal{P}}(s_{i})(\mathbf{v}_{i}-\bar{\mu}).
\]

We first note that, from the result of Theorem 3, it is sufficient to show that, choosing the constants $t$ and $z$ as

\[
t=\sqrt{\frac{2\omega\log(\frac{2}{\delta})}{m}},\;\;\;z=\mathop{\mathbb{E}}_{\mathbf{v}\sim I(\bar{\mu})}\left[g(\mathbf{v})\right]+t,
\tag{3}
\]

it holds

\[
\Pr_{\mathbf{v}\sim I(\bar{\mu})}\left(g(\mathbf{v})\geq z\right)\leq\delta/2.
\]

To do so, we make use of Theorem 5. Define $\bar{\mathbf{v}}^{j}$ as a copy of $\mathbf{v}$, where its $j$-th element $\mathbf{v}_{j}$ is replaced by an independent copy $\mathbf{v}^{\prime}_{j}$. We want to bound

\[
\sum_{j=1}^{m}\mathop{\mathbb{E}}\left[\left(g(\mathbf{v})-g(\bar{\mathbf{v}}^{j})\right)_{+}^{2}\>|\>\mathbf{v}\right]
\]

from above by some constant $q\geq 0$. Define $\mathcal{P}^{\star}$ as one of the elements of $\mathcal{L}^{\star}$ that achieve the supremum in $g(\mathbf{v})$. We observe that

\begin{align*}
\left(g(\mathbf{v})-g(\bar{\mathbf{v}}^{j})\right)_{+}
&\leq\biggl(\frac{1}{m}\sum_{i=1}^{m}f_{\mathcal{P}^{\star}}(s_{i})(\mathbf{v}_{i}-\bar{\mu})-\frac{1}{m}\sum_{i=1}^{m}f_{\mathcal{P}^{\star}}(s_{i})(\bar{\mathbf{v}}^{j}_{i}-\bar{\mu})\biggr)_{+}\\
&=\biggl(\frac{1}{m}f_{\mathcal{P}^{\star}}(s_{j})(\mathbf{v}_{j}-\bar{\mu})-\frac{1}{m}f_{\mathcal{P}^{\star}}(s_{j})(\mathbf{v}^{\prime}_{j}-\bar{\mu})\biggr)_{+}\\
&=\biggl(\frac{1}{m}f_{\mathcal{P}^{\star}}(s_{j})(\mathbf{v}_{j}-\mathbf{v}^{\prime}_{j})\biggr)_{+}.\tag{4}
\end{align*}

We now observe that $\mathbf{v}'_j = 0$ is the only possible value of $\mathbf{v}'_j$ for which (4) is strictly positive. Therefore, we have

\begin{align*}
\sum_{j=1}^{m} \mathbb{E}\left[\left(g(\mathbf{v})-g(\bar{\mathbf{v}}^{j})\right)_{+}^{2} \,\middle|\, \mathbf{v}\right]
&\leq \sum_{j=1}^{m} \mathbb{E}\left[\left(\frac{1}{m} f_{\mathcal{P}^{\star}}(s_j)(\mathbf{v}_j-\mathbf{v}'_j)\right)_{+}^{2} \,\middle|\, \mathbf{v}\right] \\
&= (1-\bar{\mu}) \sum_{j=1}^{m} \left(\frac{1}{m} f_{\mathcal{P}^{\star}}(s_j)\mathbf{v}_j\right)^{2} \\
&= \frac{(1-\bar{\mu})}{m} \sum_{j=1}^{m} \left(\frac{1}{m} f_{\mathcal{P}^{\star}}(s_j)\mathbf{v}_j\right) \\
&\leq \frac{(1-\bar{\mu})}{m} \sup_{\mathcal{P}\in\mathcal{L}^{\star}} \sum_{j=1}^{m} \left(\frac{1}{m} f_{\mathcal{P}}(s_j)\mathbf{v}_j\right) \leq \frac{\omega}{m}.
\end{align*}

We apply Theorem 5 to the function $g$ with $q=\omega/m$, obtaining that

\[
\Pr_{\mathbf{v}\sim I(\bar{\mu})}\Bigl(g(\mathbf{v}) \geq \mathbb{E}_{\mathbf{v}\sim I(\bar{\mu})}[g(\mathbf{v})] + t\Bigr) \leq \exp\bigl(-mt^{2}/(2\omega)\bigr).
\]

Setting $t$ as in (3), the probability above is at most $\delta/2$, which yields the statement. ∎
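As a quick check of this last step, the tail bound can be inverted for a generic failure probability. The symbol $\delta'$ below is introduced here only for illustration and is not part of the original statement; the value of $t$ actually used in the proof is the one given in (3).

```latex
\[
\exp\!\left(-\frac{m t^{2}}{2\omega}\right) \leq \delta'
\quad\Longleftrightarrow\quad
t \geq \sqrt{\frac{2\omega \ln(1/\delta')}{m}}
\qquad (t \geq 0,\ \delta' \in (0,1)).
\]
```

For instance, assuming $\log$ denotes the natural logarithm, the choice $\delta' = \delta/4$ gives $t = \sqrt{2\omega\ln(4/\delta)/m}$, which has the same form as the second term of $\varepsilon$ in Corollary 7.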

Note that Theorem 4 provides a probabilistic upper bound, holding with probability at least $1-\delta$, to the maximum observed value of the pattern quality $\sup_{\mathcal{P}\in\mathcal{L}^{\star}} \frac{1}{m}\sum_{i=1}^{m} f_{\mathcal{P}}(s_i)(\mathbf{v}_i-\bar{\mu})$ when the conditional distribution $\mathbf{v}\sim U(B(k))$ is considered, in terms of the expectation of the maximum observed value of the pattern quality under the resampled distribution $\mathbf{v}\sim I(\bar{\mu})$. The following result then proves that the estimate $\tilde{d}(\mathcal{R}^{\star},\check{\mu})$ of the expected deviation in the upper bound above is accurate: with high probability, the expectation exceeds it by at most an additive term of order $1/\sqrt{cm}$.

Theorem 6.

With probability at least $1-\delta/4$ over $\mathcal{R}^{\star}$, it holds

\[
\mathbb{E}_{\mathbf{v}\sim I(\bar{\mu})}\left[\sup_{\mathcal{P}\in\mathcal{L}^{\star}} \frac{1}{m}\sum_{i=1}^{m} f_{\mathcal{P}}(s_i)(\mathbf{v}_i-\bar{\mu})\right] \leq \tilde{d}(\mathcal{R}^{\star},\check{\mu}) + \sqrt{\frac{\log\bigl(\frac{4}{\delta}\bigr)}{2cm}}.
\]

By combining the results above, we prove the guarantees provided by FSR for the task of finding significant patterns when conditional testing is used. In particular, the following Corollary proves that, when boundStatistic returns the value $\varepsilon$ defined below, then the output set $O$ of significant patterns returned by FSR has FWER bounded by the user-defined parameter $\delta$.

Corollary 7.

Fix $\delta\in(0,1)$ and $c\geq 1$. Let $O$ be the output of FSR with input parameters $\mathcal{L}$, $\mathcal{D}$, $c$, $\delta$, and let

\[
\varepsilon = \tilde{d}(\mathcal{R}^{\star},\check{\mu}) + \sqrt{\frac{2\omega\log\bigl(\frac{4}{\delta}\bigr)}{m}} + \sqrt{\frac{\log\bigl(\frac{4}{\delta}\bigr)}{2cm}}
\]

be the value returned by boundStatistic in line 1. Then, the set $O$ has FWER $\leq\delta$ under the conditional null distribution.
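The value of $\varepsilon$ in Corollary 7 is straightforward to evaluate once $\tilde{d}(\mathcal{R}^{\star},\check{\mu})$ and $\omega$ are available. The following is a minimal sketch, not the authors' implementation: the function name `epsilon_conditional` and its argument names are hypothetical, and $\log$ is assumed to be the natural logarithm.

```python
import math


def epsilon_conditional(d_tilde: float, omega: float, delta: float, c: int, m: int) -> float:
    """Evaluate the three-term bound of Corollary 7.

    d_tilde : estimate of the expected supremum from the c resampled datasets
    omega   : variance-like quantity from Theorem 4 (assumed already computed)
    delta   : target FWER level in (0, 1)
    c       : number of resampled datasets (>= 1)
    m       : number of transactions (>= 1)
    """
    if not 0.0 < delta < 1.0:
        raise ValueError("delta must be in (0, 1)")
    if c < 1 or m < 1:
        raise ValueError("c and m must be at least 1")
    log_term = math.log(4.0 / delta)  # natural log, by assumption
    concentration = math.sqrt(2.0 * omega * log_term / m)
    resampling = math.sqrt(log_term / (2.0 * c * m))
    return d_tilde + concentration + resampling
```

Only the last term depends on $c$, and it decays as $1/\sqrt{cm}$, so for large $m$ increasing the number of resamples changes $\varepsilon$ very little.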

Note that obtaining the value of $\varepsilon$ returned by boundStatistic requires computing $\sup_{\mathcal{P}\in\mathcal{L}}\bigl\{\bar{\mathsf{q}}_{\mathcal{P}}(\mathcal{D}^{\star}_{j},\check{\mu})\bigr\}$, which is costly, but only needs to be performed on $c$ resampled datasets. As we will show in our experimental evaluation (see Section 5), small values of $c$ suffice. Moreover, the maximum frequency of a pattern in the language $\mathcal{L}$ is required in order to compute $\omega$ (see Theorem 4). Such maximum frequency can be computed very efficiently in most data mining tasks. For example, in itemset mining it corresponds to the frequency of the most frequent item; in subgroup mining, it is equal to $1$ whenever a continuous feature is present and the conditions defining the pattern language $\mathcal{L}$ include inequalities.
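To make the cost structure concrete, the sketch below computes the average of the per-resample suprema by brute force over an explicit list of patterns. It rests on several assumptions that are not taken from the text: the pattern language is a small finite list of Python predicates (a real miner would search the language with pruning), resampled labels are drawn i.i.d. Bernoulli with a user-supplied parameter `p`, and the quality on a resampled dataset is taken to be $\frac{1}{m}\sum_{i} f_{\mathcal{P}}(s_i)(\ell^{\star}_i-\check{\mu})$. All function names are hypothetical.

```python
import random
from typing import Callable, List, Sequence

Pattern = Callable[[Sequence[float]], bool]  # f_P(s): does transaction s support P?


def resample_labels(m: int, p: float, rng: random.Random) -> List[int]:
    """Draw m i.i.d. Bernoulli(p) target labels (assumed resampling scheme)."""
    return [1 if rng.random() < p else 0 for _ in range(m)]


def quality(pattern: Pattern, transactions: Sequence[Sequence[float]],
            labels: Sequence[int], mu_check: float) -> float:
    """(1/m) * sum over supporting transactions of (label - mu_check)."""
    m = len(transactions)
    return sum(labels[i] - mu_check for i in range(m) if pattern(transactions[i])) / m


def d_tilde(patterns: Sequence[Pattern], transactions: Sequence[Sequence[float]],
            p: float, mu_check: float, c: int, seed: int = 0) -> float:
    """Average, over c resampled label vectors, of the maximum pattern quality."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(c):
        labels = resample_labels(len(transactions), p, rng)
        total += max(quality(P, transactions, labels, mu_check) for P in patterns)
    return total / c
```

The expensive step is the inner `max`, which stands in for a full mining pass; it runs exactly `c` times, which is why small values of $c$ matter in practice.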

Power analysis. The results above show that FSR rigorously controls the probability of false positives (i.e., patterns for which the null hypothesis holds but that are wrongly reported in output as significant). However, they do not provide guarantees on the power of FSR, that is, its ability to report patterns $\mathcal{P}$ with sufficiently high quality $\mathsf{q}_{\mathcal{P}}$. The following result provides guarantees on the power of FSR-C, the version of FSR that uses conditional testing, for the pattern language of subgroups. Our analysis is based on a probabilistic upper bound to $\tilde{d}(\mathcal{R}^{\star},\check{\mu})$, obtained from bounds to the pseudodimension (Li et al., 2001; Pollard, 2012; Shalev-Shwartz and Ben-David, 2014) of subgroups, and an advanced concentration bound for sums of dependent random variables (Dubhashi and Panconesi, 2009), that hold under mild (but necessary) assumptions on the distribution of alternative hypotheses (the set of patterns with $\mathsf{q}_{\mathcal{P}}>0$). More precisely, we assume that the target labels of the transactions that support a pattern $\mathcal{P}$ with $\mathsf{q}_{\mathcal{P}}>0$ are distributed according to a noncentral hypergeometric distribution (Wallenius, 1963), i.e., a biased version of the standard hypergeometric distribution. We provide the proofs and additional details in Section A.5.

Theorem 8.

Fix $\delta\in(0,1)$ and $c,z\geq 1$. Let $O$ be the output of FSR with input parameters $\mathcal{L}$, $\mathcal{D}$, $c$, and $\delta$, where $\mathcal{L}$ is the language of subgroups composed of conjunctions with at most $z$ conditions over $d$ continuous features. Then, with probability at least $1-\delta$, $O$ contains all patterns with quality $\mathsf{q}_{\mathcal{P}}$ satisfying

\begin{align*}
\mathsf{q}_{\mathcal{P}} \geq{} & \sqrt{\frac{2\hat{\omega}z\ln\bigl(\frac{e^{3}dm^{2}}{4z^{3}}\bigr)}{m}} + \sqrt{\frac{\ln\bigl(\frac{2}{\delta}\bigr)}{2cm}} + \frac{z\ln\bigl(\frac{e^{3}dm^{2}}{4z^{3}}\bigr)}{3m} \\
& + \sqrt{\frac{2\omega\log\bigl(\frac{4}{\delta}\bigr)}{m}} + \sqrt{\frac{\log\bigl(\frac{4}{\delta}\bigr)}{2cm}} + \sqrt{\frac{2\hat{\mathsf{f}}(\mathcal{D})z\ln\bigl(\frac{e^{3}dm^{2}}{2z^{3}\delta}\bigr)}{m}},
\end{align*}

where $\hat{\mathsf{f}}(\mathcal{D})=\sup_{\mathcal{P}\in\mathcal{L}}\mathsf{f}_{\mathcal{P}}(\mathcal{D})$ and $\hat{\omega}=\hat{\mu}(1-\check{\mu})\hat{\mathsf{f}}(\mathcal{D})$.
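To get a feel for how the quality threshold of Theorem 8 scales with the dataset size, the six terms can be evaluated numerically. The sketch below is only illustrative: the function name `power_threshold` and its argument names are hypothetical, $\omega$, $\hat{\omega}$, and $\hat{\mathsf{f}}(\mathcal{D})$ are passed in as already-computed numbers, and both $\ln$ and $\log$ are assumed to be natural logarithms.

```python
import math


def power_threshold(m: int, c: int, delta: float, z: int, d: int,
                    omega: float, omega_hat: float, f_hat: float) -> float:
    """Sum of the six terms on the right-hand side of Theorem 8."""
    if not 0.0 < delta < 1.0:
        raise ValueError("delta must be in (0, 1)")
    arg = math.e ** 3 * d * m ** 2 / (4 * z ** 3)
    if arg <= 1.0:
        raise ValueError("m is too small for the logarithmic term to be positive")
    ln_dim = math.log(arg)                                            # ln(e^3 d m^2 / (4 z^3))
    ln_dim_delta = math.log(math.e ** 3 * d * m ** 2 / (2 * z ** 3 * delta))
    log4 = math.log(4.0 / delta)
    first_row = (math.sqrt(2.0 * omega_hat * z * ln_dim / m)
                 + math.sqrt(math.log(2.0 / delta) / (2.0 * c * m))
                 + z * ln_dim / (3.0 * m))
    second_row = (math.sqrt(2.0 * omega * log4 / m)
                  + math.sqrt(log4 / (2.0 * c * m))
                  + math.sqrt(2.0 * f_hat * z * ln_dim_delta / m))
    return first_row + second_row
```

Every term is of order $\sqrt{\ln(m)/m}$ or smaller, so with the other parameters held fixed the threshold shrinks as $m$ grows, meaning weaker associations become detectable on larger datasets.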

4.2. FSR for Unconditional Testing

We now describe the details of procedures boundTarget and boundStatistic for the version of FSR that uses unconditional testing, which we refer to as FSR-U. As a reminder, in our scenario an unconditional test assumes that the average target value and the pattern frequencies $\mathsf{f}_{\mathcal{P}}(\mathcal{D})$ are observations of random variables, whose expected values are unknown. In particular, here we consider the transactions $\{(s_1,\ell_1),\dots,(s_m,\ell_m)\}$ constituting the dataset $\mathcal{D}$ as i.i.d. samples from an unknown distribution $\gamma$. This scenario is more complex than the conditional testing one, since we need to account for the unknown deviation of all observed values when computing a bound to the maximum pattern quality under the null hypothesis. However, we show that the use of resampled datasets allows us to efficiently take into account such deviations.

For boundTarget, recall that $\mu$ is the probability that a sample from the unknown distribution $\gamma$ has target label $\ell$ equal to $1$, such that $\mu=\mathbb{E}_{\mathcal{D}}\left[\mu(\mathcal{D})\right]=\Pr_{(s,\ell)\sim\gamma}\left(\ell=1\right)$. More importantly, in this setting $\mu$ is unknown (as $\gamma$ is). However, given that the samples in $\mathcal{D}$ are i.i.d., the following result provides a probabilistic bound $\varepsilon_T$ to the deviation of the observed value $\mu(\mathcal{D})$ from its expectation $\mu$.

Lemma 9.

Let $\mathcal{D}$ be a collection of $m$ samples taken i.i.d. from $\gamma$. For $\delta\in(0,1)$, it holds with probability $\geq 1-\delta/4$

\[
\lvert\mu(\mathcal{D})-\mu\rvert \leq \varepsilon_T \doteq \sqrt{\frac{2\min\bigl\{\mu(\mathcal{D}),\frac{1}{4}\bigr\}\ln\left(\frac{8}{\delta}\right)}{m}} + \frac{2\ln\left(\frac{8}{\delta}\right)}{m}.
\]

boundTarget returns the value $\varepsilon_T$ as defined in Lemma 9, which requires the values $\mu(\mathcal{D})$, $\delta$, and $m$.
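Since $\varepsilon_T$ depends only on $\mu(\mathcal{D})$, $\delta$, and $m$, it can be computed in constant time. A minimal sketch of the formula in Lemma 9 follows; the Python name `bound_target` and its argument names are chosen here for illustration and are not taken from the authors' code.

```python
import math


def bound_target(mu_D: float, delta: float, m: int) -> float:
    """Return eps_T from Lemma 9.

    mu_D  : observed fraction of transactions with target label 1
    delta : confidence parameter in (0, 1)
    m     : number of transactions (>= 1)
    """
    if not 0.0 <= mu_D <= 1.0:
        raise ValueError("mu_D must be in [0, 1]")
    if not 0.0 < delta < 1.0:
        raise ValueError("delta must be in (0, 1)")
    if m < 1:
        raise ValueError("m must be at least 1")
    log_term = math.log(8.0 / delta)
    # Variance-dependent term, capped at the maximum Bernoulli variance 1/4.
    variance_part = math.sqrt(2.0 * min(mu_D, 0.25) * log_term / m)
    # Lower-order term that decays as 1/m.
    linear_part = 2.0 * log_term / m
    return variance_part + linear_part
```

When the target is rare (small $\mu(\mathcal{D})$), the first term shrinks accordingly, which is where this bound improves on a variance-agnostic one.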

For boundStatistic, the computation of the upper bound $\varepsilon$ to the maximum deviation between the observed pattern qualities and their expectations is more involved than in the conditional case. We show that the quality $\mathsf{q}_{\mathcal{P}}$ of a pattern $\mathcal{P}$, which depends on several unknown quantities (see Section 3.1), can be sharply estimated from a dataset $\mathcal{D}$ provided that $\mu$ is known, which is not the case for the unconditional distribution. We then show that using $\mu(\mathcal{D})$ in place of $\mu$ provides a good estimate of $\mathsf{q}_{\mathcal{P}}$, and prove a probabilistic upper bound on the difference between the two estimates.

First, we introduce the function family that we use to analyze the supremum deviation of the empirical quality of patterns from the dataset $\mathcal{D}$. We define the family of functions $g^{*}_{\mathcal{P}}:\mathcal{X}\times\{0,1\}\rightarrow[-\mu,1-\mu]$, where $g^{*}_{\mathcal{P}}$ is defined as $g^{*}_{\mathcal{P}}(s,\ell)=f_{\mathcal{P}}(s)(\ell-\mu)$. Note that $g^{*}_{\mathcal{P}}(s,\ell)$ corresponds to the function $g_{\mathcal{P}}(s,\ell)$ used by the FSR statistic where the unknown value $\mu$ is used in place of its estimate $\mu(\mathcal{D})$ obtained from dataset $\mathcal{D}$.
Define the estimator $\mathsf{q}_{\mathcal{P}}(\mathcal{D})$ of the quality $\mathsf{q}_{\mathcal{P}}$ of $\mathcal{P}$ from $\mathcal{D}$ as the average of $g^{*}_{\mathcal{P}}$ over $\mathcal{D}$, that is

\[
\mathsf{q}_{\mathcal{P}}(\mathcal{D})=\frac{1}{m}\sum_{i=1}^{m} g^{*}_{\mathcal{P}}(s_i,\ell_i).
\]

Note that $\mathbb{E}_{\mathcal{D}}\left[\mathsf{q}_{\mathcal{P}}(\mathcal{D})\right]=\mathsf{q}_{\mathcal{P}}$, that is, $\mathsf{q}_{\mathcal{P}}(\mathcal{D})$ is an unbiased estimator of $\mathsf{q}_{\mathcal{P}}$. However, $\mathsf{q}_{\mathcal{P}}(\mathcal{D})$ depends on the unknown quantity $\mu$. Even if $\mathsf{q}_{\mathcal{P}}(\mathcal{D})\neq\bar{\mathsf{q}}_{\mathcal{P}}(\mathcal{D})$ (since $\mu(\mathcal{D})$ may be $\neq\mu$), FSR-U exploits the fact that $\mu(\mathcal{D})$ is sharply concentrated around $\mu$, as proved in Lemma 9.
This implies that the maximum deviation $\sup_{\mathcal{P}\in\mathcal{L}}\lvert\mathsf{q}_{\mathcal{P}}(\mathcal{D})-\bar{\mathsf{q}}_{\mathcal{P}}(\mathcal{D})\rvert$ over all patterns $\mathcal{P}\in\mathcal{L}$ can be sharply estimated from $\mathcal{D}$. To this aim, we prove the following result.

Theorem 10.

Let $\mathcal{D}$ be a collection of $m$ samples taken i.i.d. from $\gamma$. For $\delta\in(0,1)$, with probability $\geq 1-\delta/4$ it holds

\[
\lvert\mathsf{q}_{\mathcal{P}}(\mathcal{D})-\bar{\mathsf{q}}_{\mathcal{P}}(\mathcal{D})\rvert \leq \varepsilon_T\,\mathsf{f}_{\mathcal{P}}(\mathcal{D}), \quad \forall\mathcal{P}\in\mathcal{L}.
\]
Proof.

Using Lemma 9, $\forall\mathcal{P}\in\mathcal{L}$ it holds with prob. $\geq 1-\delta/4$

\begin{align*}
\lvert\mathsf{q}_{\mathcal{P}}(\mathcal{D})-\bar{\mathsf{q}}_{\mathcal{P}}(\mathcal{D})\rvert
&= \left\lvert\frac{1}{m}\sum_{i=1}^{m} g^{*}_{\mathcal{P}}(s_i,\ell_i) - \frac{1}{m}\sum_{i=1}^{m} g_{\mathcal{P}}(s_i,\ell_i)\right\rvert \\
&= \left\lvert\frac{1}{m}\sum_{i=1}^{m} f_{\mathcal{P}}(s_i)(\mu(\mathcal{D})-\mu)\right\rvert \leq \varepsilon_T\,\mathsf{f}_{\mathcal{P}}(\mathcal{D}). \qed
\end{align*}

We remark that the definition of $\mathsf{q}_{\mathcal{P}}(\mathcal{D})$ is crucial to the analysis of our algorithm FSR-U, as $\mathsf{q}_{\mathcal{P}}(\mathcal{D})$ is an average of $m$ independent random variables, while $\bar{\mathsf{q}}_{\mathcal{P}}(\mathcal{D})$ is not (since it depends on $\mu(\mathcal{D})$, which is estimated from the observations in the whole dataset).
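The identity behind Theorem 10, namely that the two estimators differ by exactly $\mathsf{f}_{\mathcal{P}}(\mathcal{D})\,(\mu(\mathcal{D})-\mu)$, can be checked numerically on synthetic data. The sketch below is a simulation only: in it $\mu$ is known because we choose it, whereas in the unconditional setting it is not. The single pattern (a threshold on one uniform feature) and the function name `check_identity` are hypothetical.

```python
import random
from typing import Tuple


def check_identity(m: int = 1000, mu: float = 0.3, seed: int = 0) -> Tuple[float, float]:
    """Return (q - q_bar, f_P(D) * (mu(D) - mu)) for one synthetic dataset."""
    rng = random.Random(seed)
    features = [rng.random() for _ in range(m)]
    labels = [1 if rng.random() < mu else 0 for _ in range(m)]
    support = [1 if x < 0.4 else 0 for x in features]  # f_P(s_i) for the toy pattern

    mu_D = sum(labels) / m    # observed average target value
    freq = sum(support) / m   # observed pattern frequency f_P(D)

    # Estimator centred on the true mu (average of independent terms).
    q = sum(f * (l - mu) for f, l in zip(support, labels)) / m
    # Estimator centred on the observed mu(D) (terms are coupled through mu_D).
    q_bar = sum(f * (l - mu_D) for f, l in zip(support, labels)) / m

    return q - q_bar, freq * (mu_D - mu)


lhs, rhs = check_identity()
print(abs(lhs - rhs) < 1e-12)
```

Because the difference is exactly proportional to the pattern frequency, patterns with low frequency pay a proportionally smaller price for replacing $\mu$ with $\mu(\mathcal{D})$.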

Recall the definition of the set $\mathcal{L}^{\star}$ of non-significant patterns: $\mathcal{L}^{\star}=\left\{\mathcal{P}\in\mathcal{L}:\mathsf{q}_{\mathcal{P}}=0\right\}$. To output significant patterns while controlling the FWER below $\delta$, our goal is to bound the supremum deviation $\sup_{\mathcal{P}\in\mathcal{L}^{\star}}\left\{\bar{\mathsf{q}}_{\mathcal{P}}(\mathcal{D})\right\}$ (Eq. (2)) below some value $\eta$ with probability at least $1-\delta$, for some $\delta\in(0,1)$, and provide in output all patterns with $\bar{\mathsf{q}}_{\mathcal{P}}(\mathcal{D})\geq\eta$.
To bound (2), we study the surrogate quantity $\sup_{\mathcal{P}\in\mathcal{L}^{\star}}\left\{\mathsf{q}_{\mathcal{P}}(\mathcal{D})\right\}$, as Theorem 10 guarantees a small bound on $\lvert\mathsf{q}_{\mathcal{P}}(\mathcal{D})-\bar{\mathsf{q}}_{\mathcal{P}}(\mathcal{D})\rvert$. Consider the collection $\mathcal{R}^{\star}=\left\{\mathcal{D}^{\star}_{1},\dots,\mathcal{D}^{\star}_{c}\right\}$ of $c\geq 1$ i.i.d. resampled datasets computed by FSR (line 1). We prove the following result.

Theorem 11.

Let $\mathcal{D}$ be a dataset of $m$ samples taken i.i.d. from a distribution $\gamma$, and $\mathcal{R}^{\star}$ a collection of $c\geq 1$ i.i.d. resamples of the target labels of $\mathcal{D}$. For any $\delta\in(0,1)$, define $\nu_{T}\geq\mu(1-\mu)$, $\nu\geq\nu_{T}\sup_{\mathcal{P}\in\mathcal{L}^{\star}}\left\{\mathbb{E}_{\mathcal{D}}\left[\mathsf{f}_{\mathcal{P}}(\mathcal{D})\right]\right\}$, and $\varepsilon$ as

\begin{align*}
\tilde{d}(\mathcal{R}^{\star},\check{\mu}) &= \frac{1}{c}\sum_{j=1}^{c}\sup_{\mathcal{P}\in\mathcal{L}}\bigl\{\bar{\mathsf{q}}_{\mathcal{P}}(\mathcal{D}^{\star}_{j},\check{\mu})\bigr\}, \\
\hat{r} &= \tilde{d}(\mathcal{R}^{\star},\check{\mu})+\sqrt{\frac{\ln\bigl(\frac{4}{\delta}\bigr)}{2cm}}, \\
\hat{d} &= \hat{r}+\sqrt{\left(\frac{2\nu_{T}\ln\bigl(\frac{4}{\delta}\bigr)}{m}\right)^{2}+\frac{2\hat{r}\ln\bigl(\frac{4}{\delta}\bigr)}{m}}+\frac{2\nu_{T}\ln\bigl(\frac{4}{\delta}\bigr)}{m}, \\
\varepsilon &\doteq \hat{d}+\sqrt{\frac{2\ln\bigl(\frac{4}{\delta}\bigr)\left(\nu+2\hat{d}\right)}{m}}+\frac{\ln\bigl(\frac{4}{\delta}\bigr)}{3m}. \tag{5}
\end{align*}

With probability at least $1-\delta$ over the choice of $\mathcal{D}$ and $\mathcal{R}^{\star}$ it holds $\sup_{\mathcal{P}\in\mathcal{L}^{\star}}\left\{\mathsf{q}_{\mathcal{P}}(\mathcal{D})\right\}\leq\varepsilon$.

We use the following simple upper bound for both $\nu_{T}$ and $\nu$: $\nu_{T}=\nu=\sup_{|x-\mu(\mathcal{D})|\leq\varepsilon_{T}}x(1-x)$, and note that while slightly more refined bounds are possible, we omit them to improve readability.
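The supremum in this bound has a simple closed form, since $x(1-x)$ is increasing below $1/2$ and decreasing above it. The following is a minimal sketch of that computation, not code from the FSR implementation: the function name `variance_upper_bound` is ours, and clipping the interval to $[0,1]$ is an assumption we make because $\mu$ is a probability.

```python
def variance_upper_bound(mu_d: float, eps_t: float) -> float:
    """Sup of x(1-x) over |x - mu_d| <= eps_t, with x clipped to [0, 1].

    mu_d plays the role of mu(D) and eps_t the role of eps_T;
    the clipping to [0, 1] is an assumption of this sketch.
    """
    lo = max(0.0, mu_d - eps_t)
    hi = min(1.0, mu_d + eps_t)
    if lo <= 0.5 <= hi:
        # The maximizer x = 1/2 lies inside the interval.
        return 0.25
    # Otherwise the sup is attained at the endpoint closest to 1/2.
    x = hi if hi < 0.5 else lo
    return x * (1.0 - x)


# Illustrative call: mu(D) = 0.241 as for adult in Table 1,
# with a made-up eps_T = 0.01.
nu = variance_upper_bound(0.241, 0.01)
```

The bound never exceeds $1/4$, so it helps most when the target is unbalanced, i.e., when $\mu(\mathcal{D})$ is far from $1/2$.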

Note that Theorem 11 does not require the knowledge of the unknown parameter $\mu$, but only of an upper bound $\hat{\mu}$ and of a lower bound $\check{\mu}$. Theorem 11 allows us to implement the procedure boundStatistic for the unconditional setting: boundStatistic computes $\varepsilon$ as in Eq. (5). Analogously to the conditional case, evaluating $\varepsilon$ requires computing $\sup_{\mathcal{P}\in\mathcal{L}}\bigl\{\bar{\mathsf{q}}_{\mathcal{P}}(\mathcal{D}^{\star}_{j},\check{\mu})\bigr\}$, which is costly, but only needs to be performed on $c$ resampled datasets, and $c$ is small in practice (as we show in Section 5).
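Once the $c$ suprema are available, evaluating Eq. (5) is plain arithmetic. The sketch below illustrates this under stated assumptions: the suprema are taken as given (the mining step that produces them is not shown), the name `bound_statistic` is ours and only mirrors boundStatistic, and the numeric inputs are made up for illustration.

```python
import math


def bound_statistic(sup_devs, m, delta, nu_t, nu):
    """Evaluate Eq. (5) from the per-resample suprema (illustrative sketch).

    sup_devs: the c values sup_P q_bar_P(D*_j, mu_check), assumed precomputed.
    m: number of transactions; delta: FWER level; nu_t, nu: variance bounds.
    """
    c = len(sup_devs)
    log_term = math.log(4.0 / delta)
    # d_tilde: average supremum deviation over the c resamples.
    d_tilde = sum(sup_devs) / c
    r_hat = d_tilde + math.sqrt(log_term / (2.0 * c * m))
    a = 2.0 * nu_t * log_term / m
    d_hat = r_hat + math.sqrt(a * a + 2.0 * r_hat * log_term / m) + a
    eps = (d_hat
           + math.sqrt(2.0 * log_term * (nu + 2.0 * d_hat) / m)
           + log_term / (3.0 * m))
    return eps


# Hypothetical inputs: three resamples on a dataset of 32561 transactions.
eps = bound_statistic([0.010, 0.012, 0.011], m=32561, delta=0.05,
                      nu_t=0.19, nu=0.19)
```

The number of resamples $c$ enters only through the average and the $\sqrt{\ln(4/\delta)/(2cm)}$ term, so for large $m$ increasing $c$ changes $\varepsilon$ only slightly.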

By combining the results in this section, we prove the guarantees provided by FSR for the task of finding significant patterns when unconditional testing is used. In particular, the following Corollary proves that, when boundStatistic returns the value $\varepsilon$ defined in Eq. (5), the output set $O$ of significant patterns returned by FSR-U has FWER bounded by the user-defined parameter $\delta$.

Corollary 12.

Fix $\delta\in(0,1)$ and $c\geq 1$. Let $O$ be the output of FSR with input parameters $\mathcal{L}$, $\mathcal{D}$, $c$, $\delta$, and let $\varepsilon$ be as in Eq. (5). Then, the set $O$ has FWER $\leq\delta$ under the unconditional null distribution.

Power analysis. The following result provides guarantees on the power of FSR-U, the version of FSR that uses unconditional testing, building on the probabilistic upper bound to $\tilde{d}(\mathcal{R}^{\star},\check{\mu})$ proved in Section A.5, and on concentration bounds based on the pseudodimension of the language of subgroups.

Theorem 13.

Fix $\delta\in(0,1)$ and $c,z\geq 1$. Let $O$ be the output of FSR with input parameters $\mathcal{L}$, $\mathcal{D}$, $c$, and $\delta$, where $\mathcal{L}$ is the language of subgroups composed of conjunctions with at most $z$ conditions over $d$ continuous features. Define $\hat{\omega}=\hat{\mu}(1-\check{\mu})\sup_{\mathcal{P}\in\mathcal{L}}\mathsf{f}_{\mathcal{P}}(\mathcal{D})$, and let $\varepsilon$ be defined as in Theorem 11, where $\hat{r}$ is replaced by

\begin{align*}
\hat{r}=\sqrt{\frac{2\hat{\omega}z\ln\bigl(\frac{e^{3}m^{2}d}{4z^{3}}\bigr)}{m}}+\sqrt{\frac{\ln\bigl(\frac{3}{\delta}\bigr)}{2cm}}+\frac{z\ln\bigl(\frac{e^{3}m^{2}d}{4z^{3}}\bigr)}{3m}+\sqrt{\frac{\ln\bigl(\frac{4}{\delta}\bigr)}{2cm}}.
\end{align*}

Then, define $\mathsf{f}_{\mathcal{P}}=\mathbb{E}_{\mathcal{D}}[\mathsf{f}_{\mathcal{P}}(\mathcal{D})]$. With probability at least $1-\delta$, $O$ contains all patterns $\mathcal{P}$ with quality $\mathsf{q}_{\mathcal{P}}$ satisfying

\begin{align*}
\mathsf{q}_{\mathcal{P}}\geq\varepsilon+2\varepsilon_{T}\Biggl(\mathsf{f}_{\mathcal{P}}+\sqrt{\frac{z\ln\bigl(\frac{2ed}{z}\bigr)+\ln\bigl(\frac{3}{\delta}\bigr)}{2m}}\Biggr)+\sqrt{\frac{z\ln\bigl(\frac{2ed}{z}\bigr)+\ln\bigl(\frac{3}{\delta}\bigr)}{2m}}.
\end{align*}
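To get a feeling for how the threshold of Theorem 13 scales with $m$, $d$, $z$, and $c$, its right-hand side can be evaluated numerically. The sketch below is only an illustration under stated assumptions: the function name `power_threshold` is ours, $\hat{\omega}$, $\nu_T$, $\nu$, $\varepsilon_T$, and $\mathsf{f}_{\mathcal{P}}$ are passed in as plain numbers, and all values in the example call are made up.

```python
import math


def power_threshold(m, d, z, c, delta, omega_hat, nu_t, nu, eps_t, f_p):
    """Right-hand side of the power condition of Theorem 13 (sketch)."""
    log3 = math.log(3.0 / delta)
    log4 = math.log(4.0 / delta)
    # r_hat as redefined in Theorem 13 (no resampled suprema involved).
    growth = z * math.log(math.e ** 3 * m ** 2 * d / (4.0 * z ** 3))
    r_hat = (math.sqrt(2.0 * omega_hat * growth / m)
             + math.sqrt(log3 / (2.0 * c * m))
             + growth / (3.0 * m)
             + math.sqrt(log4 / (2.0 * c * m)))
    # d_hat and eps follow Eq. (5) with this r_hat plugged in.
    a = 2.0 * nu_t * log4 / m
    d_hat = r_hat + math.sqrt(a * a + 2.0 * r_hat * log4 / m) + a
    eps = (d_hat
           + math.sqrt(2.0 * log4 * (nu + 2.0 * d_hat) / m)
           + log4 / (3.0 * m))
    # Deviation term shared by the last two summands of the condition.
    s = math.sqrt((z * math.log(2.0 * math.e * d / z) + log3) / (2.0 * m))
    return eps + 2.0 * eps_t * (f_p + s) + s


# Hypothetical setting: 10 continuous features, at most 3 conditions.
t = power_threshold(m=10_000, d=10, z=3, c=10, delta=0.05,
                    omega_hat=0.05, nu_t=0.25, nu=0.25, eps_t=0.02, f_p=0.1)
```

With $\varepsilon_T$ held fixed, every other summand shrinks as $m$ grows, so patterns with smaller quality $\mathsf{q}_{\mathcal{P}}$ become detectable on larger datasets.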

5. Experiments

This section presents the results of our experiments. The goal of our experimental evaluation is to assess FSR’s capabilities of discovering significant patterns with high statistical power, analyzing efficiently large real-world datasets with few-shot resampling, in both conditional (FSR-C) and unconditional (FSR-U) settings.

Pattern Language. In our experimental evaluation we focus on the problem of discovering significant subgroups from large real-world datasets with mixed feature types (both categorical and continuous). The language $\mathcal{L}$ is composed of conjunctions of up to $z$ conditions on the features of the data (Atzmueller, 2015), where $z$ is a fixed parameter (see below). These conditions are either equalities (for categorical features), inequalities, or intervals (on continuous features).
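As a concrete illustration of this language, a subgroup can be represented as a list of predicates that must all hold on a transaction. The sketch below is not the pysubgroup representation: the helper names, the toy features (`sex`, `age`, `hours`), and the use of the fraction of covered transactions as the frequency of a subgroup are all assumptions made for the example.

```python
def equals(feature, value):
    """Equality condition, for categorical features."""
    return lambda row: row[feature] == value


def at_most(feature, threshold):
    """Inequality condition, for continuous features."""
    return lambda row: row[feature] <= threshold


def in_interval(feature, lo, hi):
    """Interval condition, for continuous features."""
    return lambda row: lo <= row[feature] <= hi


def covers(conditions, row):
    """A subgroup covers a row when all conditions of its conjunction hold."""
    return all(cond(row) for cond in conditions)


def frequency(conditions, rows):
    """Fraction of rows covered by the subgroup."""
    return sum(covers(conditions, row) for row in rows) / len(rows)


# Toy dataset with one categorical and two continuous features.
rows = [
    {"sex": "F", "age": 31, "hours": 40},
    {"sex": "M", "age": 45, "hours": 50},
    {"sex": "F", "age": 52, "hours": 38},
    {"sex": "F", "age": 28, "hours": 60},
]

# A subgroup with two conditions, i.e., admissible for any z >= 2.
subgroup = [equals("sex", "F"), in_interval("age", 25, 40)]
freq = frequency(subgroup, rows)  # covers the first and last rows: 0.5
```

Since thresholds and interval endpoints range over the reals, the number of distinct conjunctions is unbounded even for small $z$.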

Datasets. We tested FSR on $12$ standard benchmark and real-world datasets from UCI (https://archive.ics.uci.edu/) used to evaluate subgroup discovery algorithms. The statistics of the datasets are described in Table 1. These datasets cover a wide range of sizes, dimensionalities, and application domains. The column $z$ of Table 1 reports the maximum number of conjunction terms for the subgroups in the language $\mathcal{L}$ for each dataset.

Implementation of FSR. We implemented FSR in Python. The code and the scripts to reproduce all experiments are available online (https://github.com/VandinLab/FSR). To mine subgroups, we make use of a fast depth-first enumeration algorithm included in the library pysubgroup (https://github.com/flemmerich/pysubgroup) (Lemmerich and Becker, 2019).

Baselines. Since our algorithm FSR is the first algorithm that can be used for mining significant patterns with both conditional and unconditional testing, we consider different baselines for conditional testing and for unconditional testing.

For conditional testing, we compare FSR-C with a variant of TopKWY (Pellegrina and Vandin, 2020), the state-of-the-art method for significant pattern mining with conditional testing, based on the WY permutation testing procedure (Westfall and Young, 1993). The original implementation of TopKWY is only tailored to identify significant itemsets and subgraphs, and does not support subgroups from categorical and continuous features; however, we note that it is fairly simple to adapt its strategy to this case. In particular, we extended TopKWY to identify significant subgroups by estimating the distribution of the supremum deviation using permuted datasets (instead of the $p$-values as done in the original TopKWY implementation (Pellegrina and Vandin, 2020)). That is, our variant of TopKWY considers the same statistic as FSR-C (i.e., the supremum deviation), but instead of generating resamples and taking the average of supremum deviations as done by FSR-C, it efficiently computes its $\delta$-quantiles considering permutations of the labels. In addition, in our variant of TopKWY we use pysubgroup to mine subgroups. Note that since our variant of TopKWY and FSR-C share the same, equally optimized, procedure to explore the search space of the language $\mathcal{L}$, all comparisons in terms of running times are fair. For TopKWY we use $10^{3}$ permutations, a good trade-off in terms of running time and accuracy for estimating the $\delta$-quantile for typical values of $\delta$ (e.g., $0.05$) (Llinares-López et al., 2015).
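The permutation-based side of this comparison can be sketched generically as follows. This is a plain permutation quantile, not the optimized TopKWY procedure: the pattern language is replaced by a small explicit list of covers, and the statistic (the absolute centered label sum over a cover, divided by $m$) is a stand-in of ours for the quality measure, which is defined elsewhere in the paper. All names are hypothetical.

```python
import random


def sup_deviation(covers, labels):
    """Largest stand-in statistic over all covers (lists of row indices)."""
    m = len(labels)
    mu = sum(labels) / m
    return max(abs(sum(labels[i] - mu for i in cover)) / m for cover in covers)


def permutation_threshold(covers, labels, delta, n_perm, seed=0):
    """Value exceeded by at most a delta fraction of the permuted suprema."""
    rng = random.Random(seed)
    shuffled = list(labels)
    sups = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # one permutation of the target labels
        sups.append(sup_deviation(covers, shuffled))
    sups.sort()
    allowed = int(delta * n_perm)  # permutations allowed above the threshold
    return sups[n_perm - allowed - 1]


# Toy data: 20 transactions, 5 positive labels, 4 explicit covers.
labels = [1] * 5 + [0] * 15
covers = [list(range(0, 5)), list(range(3, 12)), [0, 19], list(range(10, 20))]
threshold = permutation_threshold(covers, labels, delta=0.05, n_perm=200)
```

Each permutation costs one full evaluation of the supremum, so the running time of this approach grows with the number of permutations, whereas a method averaging over $c$ resamples pays for only $c$ such evaluations.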

Regarding unconditional testing, we note that FSR-U is the first method to discover significant patterns in a (fully) unconditional setting. Therefore, a baseline that may seem reasonable to control the FWER is the standard Bonferroni correction: each pattern $\mathcal{P}$ is flagged as significant if the probability (under the null hypothesis) of observing a quality greater than or equal to the one measured in the data is at most $\delta/|\mathcal{L}|$, where $|\mathcal{L}|$ is the size of the language $\mathcal{L}$ (see also Section 1). Note, however, that this simple method is not useful for subgroups as $|\mathcal{L}|$ is infinite (e.g., as the number of inequalities over a continuous feature is unbounded). We remark that, while there exist other correction procedures more powerful than Bonferroni (e.g., the Holm procedure), they all require fixing a finite set of hypotheses, and therefore do not apply to our setting. Therefore, we compare FSR-U with a novel non-trivial baseline, which we call FSR-U-UB, that resolves this issue. We describe FSR-U-UB at a high level and defer additional details to Section A.6. For FSR-U-UB, we use part of the concentration bounds developed for FSR-U to upper bound the supremum deviation in terms of its expectation (taken w.r.t. the resamples but conditionally on the transactions, see Theorem 11 and Section A.6), but instead of estimating the expected supremum deviation with $c$ resamples (as done by FSR-U), we compute an upper bound to it via a union bound over the (finite) number of distinct subgroups that are observed in the data in at least one transaction, i.e., we consider the finite projection of $\mathcal{L}$ on $\mathcal{D}$, instead of all $|\mathcal{L}|$ possible subgroups.
Note that this approach bounds the FWER while being much less conservative than Bonferroni, since it corrects for the size of the projection instead of the size of $\mathcal{L}$. By comparing FSR-U to FSR-U-UB we directly evaluate the advantage of computing bounds from resampled datasets that consider dependencies among patterns, as done by FSR-U.
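The gap between $|\mathcal{L}|$ and the size of its projection on $\mathcal{D}$ can be seen on a single continuous feature. The sketch below counts distinct non-empty covers, which is our reading of "distinct subgroups observed in at least one transaction"; the function name and the toy data are hypothetical, and the actual FSR-U-UB bound (Section A.6) is not reproduced here.

```python
def at_most(threshold):
    """Inequality condition v <= threshold on a single continuous feature."""
    return lambda v: v <= threshold


def projection_size(patterns, values):
    """Number of distinct non-empty covers induced by patterns on the data."""
    distinct = set()
    for pattern in patterns:
        cover = frozenset(i for i, v in enumerate(values) if pattern(v))
        if cover:  # keep only patterns observed in at least one transaction
            distinct.add(cover)
    return len(distinct)


# Four transactions with three distinct feature values.
values = [0.2, 0.5, 0.5, 0.9]
# 1001 different inequalities, standing in for the unbounded family.
patterns = [at_most(t / 1000) for t in range(1001)]
size = projection_size(patterns, values)  # 3 distinct non-empty covers
```

Refining the grid of thresholds adds more patterns but no new covers, so a union bound over the projection stays finite while $|\mathcal{L}|$ does not.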

Experimental setup. All the experiments were run on a machine with a 2.30 GHz Intel Xeon CPU and 512 GB of RAM, running Ubuntu 20.04. In all experiments we use $\delta=0.05$ (i.e., we control the FWER below $0.05$). We repeated all experiments $10$ times, and report averages $\pm$ standard deviations over the $10$ repetitions.

Table 1. Statistics of the datasets considered in our experiments. $m$ is the number of transactions, $d$ is the number of features (categorical/continuous), $\mu(\mathcal{D})$ is the fraction of transactions with target equal to $1$, $z$ is the maximum number of conjunction terms in the language $\mathcal{L}$.

| $\mathcal{D}$ | $m$ | $d$ (categorical/continuous) | $\mu(\mathcal{D})$ | $z$ |
|---|---|---|---|---|
| abalone | 4177 | 1/7 | 0.663 | 5 |
| adult | 32561 | 8/6 | 0.241 | 5 |
| bank | 41188 | 10/10 | 0.113 | 3 |
| brain-cancer | 862 | 22/1 | 0.421 | 5 |
| cancer-rna-seq | 801 | 0/20531 | 0.375 | 2 |
| covtype | 581012 | 0/54 | 0.365 | 3 |
| gisette | 7000 | 0/5000 | 0.500 | 2 |
| HIGGS | 11000000 | 0/28 | 0.529 | 3 |
| kdd-cup | 95370 | 73/405 | 0.050 | 2 |
| mushroom | 8124 | 22/0 | 0.482 | 5 |
| SUSY | 5000000 | 0/18 | 0.457 | 3 |
| theorem-prover | 3059 | 0/51 | 0.420 | 3 |

Figure 1. Effect of the number of resamples $c$ on the deviation bounds (a)-(b) and running times (c)-(d) for FSR algorithms.

Figure 2. Comparison of FSR-C with TopKWY in terms of deviation bounds (a), running times (b), and number of results (c)-(d).

Figure 3. Comparison of FSR-U with the baseline FSR-U-UB (Section A.6) in terms of deviation bounds (a), running times (b), and number of results (c)-(d).

Impact of parameters on FSR. In the first set of experiments we evaluate the effect of the number of resamples $c$ on the deviation bound $\varepsilon$ computed by FSR and its running time. We consider both FSR-C and FSR-U, respectively designed to compute significant patterns with conditional and unconditional testing. To ease the presentation, for this first experiment we focus on $3$ of the datasets we considered; the results for the other datasets are very similar. In Figure 1-(a) we show the deviation bounds computed by FSR-C for different values of $c$, while Figure 1-(b) is analogous for FSR-U. From these plots we clearly conclude that using $c=10$ resamples is sufficient to obtain a small deviation bound, and that using more than $10$ resamples is only marginally beneficial as all curves flatten. Remarkably, for all datasets (and in particular for the larger dataset adult), and for both methods, even using one resample is enough to compute a meaningful deviation bound. This is in striking contrast with state-of-the-art methods based on permutation testing, which instead require $10^{3}$ to $10^{4}$ permutations of the target labels to properly estimate the $\delta$-quantile. Furthermore, we observe that the deviation bounds computed by FSR-C, in the conditional setting, are smaller than the bounds computed by FSR-U, for the unconditional scenario; this confirms the fact that the assumptions made on the process generating the data have an appreciable effect on this aspect, a consequence of properly taking into account the uncertainty of the collected data.

Figures 1-(c)-(d) show the running time of the two methods FSR-C and FSR-U as functions of $c$. Not surprisingly, we clearly observe that the running time increases linearly with the number of resamples $c$ for both methods. Therefore, we expect that processing a small number of resampled datasets is extremely advantageous in terms of running time. We also observe that FSR-U is faster than FSR-C for all values of $c$. This is due to the fact that FSR-C, by computing a smaller deviation bound, explores a wider portion of the pattern language $\mathcal{L}$. In any case, using at most $10$ resamples is always feasible (since both algorithms terminate after at most $17$ minutes for these datasets). From these observations, we fix $c$ to $10$ for FSR-C and FSR-U in all our experiments.

Evaluation of FSR-C. In this experiment we report the performance of FSR-C in identifying significant patterns with conditional testing, comparing it with a variant of TopKWY, the state-of-the-art method to identify significant patterns with permutation testing. In Figure 2 we show the deviation bounds (a), the running times (b), and the number of reported results (c)-(d) for both algorithms. From Figure 2-(a), we observe that the deviation bounds computed by FSR-C are larger than the ones computed by TopKWY. This is not surprising, as the deviation bound returned by TopKWY, which is based on the WY procedure, is a very accurate estimate of the optimal corrected threshold to bound the FWER (since the WY estimator converges asymptotically to it (Meinshausen et al., 2011)); on the other hand, FSR-C computes an upper bound to such quantity with stronger probabilistic guarantees (i.e., the bound does not only converge asymptotically and in expectation, but rather holds in finite samples with high probability). While the deviation bounds computed by TopKWY are smaller than those of FSR-C, from Figure 2-(b) we observe a significant gap in terms of running time between the two methods. In fact, TopKWY requires two orders of magnitude more time than FSR-C to compute its deviation bound; this is mainly due to the fact that TopKWY has to process two orders of magnitude more permutations of the target label than FSR-C. Note that using $10^{4}$ permutations (instead of $10^{3}$, to have a more accurate estimation of the $\delta$-quantile, as often done in practice) in TopKWY would result in an even more substantial gap.

We now compare the two methods in terms of number of reported significant patterns. To do so, we follow a typical subgroup discovery analysis, based on the task of identifying the $k$ most “interesting” subgroups in terms of quality: first, we mine the set of top-$k$ subgroups with highest quality; then, we count the number of them that are flagged as significant by the two methods. We use $k\in\{5\cdot 10^{3},10^{4}\}$. Figures 2-(c)-(d) show the number of patterns that are reported in output by both methods. We observe that, for $k=5\cdot 10^{3}$, for $6$ of the $12$ datasets we considered, both algorithms output all the top-$k$ patterns, i.e., they have the same output; for such datasets TopKWY reports all top-$k$ patterns also for $k=10^{4}$, while FSR-C outputs the same set of results for all but two datasets, for which it reports more than $70\%$ of them. For the remaining $6$ datasets, which consist of the 4 largest datasets (in terms of the number $m$ of transactions) and the 2 datasets with the highest number of features, TopKWY could not complete in reasonable time (i.e., we stopped it after $10$ days), while FSR-C finished the analysis while returning a large number of significant results (i.e., either all the top-$k$ patterns or more than 600 of them for the kdd-cup dataset). For the most challenging datasets (HIGGS, with $11$ million transactions, and cancer-rna-seq, with $>10^{4}$ continuous features), FSR-C concludes after 5 days, while TopKWY would require (approximately) $1.5$ years of computation.
Therefore, in such cases FSR-C enables the analysis of these challenging instances while identifying many significant patterns. These results confirm that FSR-C, while providing a slightly more conservative upper bound to the supremum deviation, is still capable of discovering the same (or almost the same) most significant patterns, while being more than two orders of magnitude faster, and reporting many patterns as significant for challenging instances out of reach for the state-of-the-art. We conclude that FSR-C provides an excellent trade-off between the number of patterns identified as significant and the computational requirement of the analysis.
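The counting step of this top-$k$ analysis reduces to a threshold comparison once the qualities and the deviation bound are known. The following minimal sketch assumes the qualities of the mined subgroups are available as a plain list; the function name and the numbers are made up for illustration.

```python
import heapq


def count_significant_top_k(qualities, k, eps):
    """Count how many of the k highest-quality patterns reach the bound eps."""
    top_k = heapq.nlargest(k, qualities)
    return sum(q >= eps for q in top_k)


# Hypothetical qualities of five mined subgroups and a hypothetical bound.
qualities = [0.30, 0.05, 0.12, 0.20, 0.01]
n_significant = count_significant_top_k(qualities, k=3, eps=0.15)
```

A larger deviation bound can only decrease this count, which is why a slightly more conservative bound may still flag the same top-$k$ patterns when their qualities are well above it.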

Evaluation of FSR-U. We now evaluate the performance of FSR-U in discovering significant patterns with unconditional testing. We show the results in Figure 3. We compare FSR-U with the baseline FSR-U-UB (described above) in terms of deviation bounds (a), running time (b), and number of results (c)-(d). From Figure 3-(a) we clearly observe that FSR-U computes deviation bounds that are always smaller than those of FSR-U-UB. We conclude that processing the resamples of the target label provides much more accurate deviation bounds w.r.t. more standard techniques (i.e., a Bonferroni correction). This is a consequence of taking into account the dependencies among patterns when upper bounding the expected supremum deviation. In terms of running time, FSR-U always concludes in reasonable time (similarly to FSR-C), while FSR-U-UB is faster (since it does not consider any resamples). On the other hand, in many cases the baseline FSR-U-UB terminates quickly but without reporting anything in output: Figures 3-(c)-(d) show that for $5$ datasets it does not report significant patterns, while for the other datasets it outputs a significantly smaller number (e.g., for adult, FSR-U finds almost an order of magnitude more results than FSR-U-UB). This experiment shows that FSR-U is a practical and powerful method to discover significant patterns with unconditional testing, and that it significantly improves over more standard techniques such as Bonferroni correction.

Application to Neural Network interpretation. In this final experiment we evaluate a practical application of FSR to the task of Neural Network interpretation (Fischer et al., 2021). More precisely, we consider the MNIST dataset (LeCun and Cortes, 2010) and train a Convolutional Neural Network (CNN), with the goal of identifying correlations between the activation values of neurons and the predicted target. To do so, we evaluate the association of the activation values of neurons in a convolutional filter with a binary target, composed of drawings of digits made of straight lines only ($1$ and $7$), versus the other digits. While we defer most details of this experiment to Section A.7, we observed that FSR successfully identifies interpretable activation patterns of several neurons, while requiring orders of magnitude less time than previous methods (as discussed previously).

6. Conclusions

We presented FSR, a novel algorithm to identify statistically significant patterns with rigorous bounds on the FWER. FSR uses a few-shot resampling strategy, which leads to an efficient and practical approach that can be used for both conditional and unconditional testing. Our experimental evaluation shows that FSR is an effective and accurate method for significant pattern discovery, while significantly reducing the computational cost of state-of-the-art multiple comparisons procedures, such as permutation testing, which hardly scale to complex languages, such as subgroups, and large datasets. While the experiments presented in this work are focused on subgroups, we expect the relative improvements obtained by FSR to directly transfer to other pattern types, given the generality of our framework and of the design of our resampling procedures, which are shared by all types of patterns.

Acknowledgements.
This work was supported by the “National Center for HPC, Big Data, and Quantum Computing”, project CN00000013, and by the PRIN Project n. 2022TS4Y3N - EXPAND: scalable algorithms for EXPloratory Analyses of heterogeneous and dynamic Networked Data, funded by the Italian Ministry of University and Research (MUR).

References

  • Aggarwal et al. (2010) Charu C Aggarwal, Yao Li, Philip S Yu, and Ruoming Jin. 2010. On dense pattern mining in graph streams. Proceedings of the VLDB Endowment 3, 1-2 (2010), 975–984.
  • Agrawal et al. (1993) Rakesh Agrawal, Tomasz Imieliński, and Arun Swami. 1993. Mining association rules between sets of items in large databases. In Proceedings of the 1993 ACM SIGMOD international conference on Management of data. 207–216.
  • Al Hasan and Zaki (2009) Mohammad Al Hasan and Mohammed J Zaki. 2009. Output space sampling for graph patterns. Proceedings of the VLDB Endowment 2, 1 (2009), 730–741.
  • Atzmueller (2015) Martin Atzmueller. 2015. Subgroup discovery. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 5, 1 (2015), 35–49.
  • Barnard (1945) GA Barnard. 1945. A new test for 2×2 tables. Nature 156, 3954 (1945).
  • Benjamini and Hochberg (1995) Yoav Benjamini and Yosef Hochberg. 1995. Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal statistical society: series B (Methodological) 57, 1 (1995), 289–300.
  • Bonferroni (1936) Carlo Bonferroni. 1936. Teoria statistica delle classi e calcolo delle probabilità. Pubblicazioni del R Istituto Superiore di Scienze Economiche e Commerciali di Firenze 8 (1936), 3–62.
  • Boucheron et al. (2013) Stéphane Boucheron, Gábor Lugosi, and Pascal Massart. 2013. Concentration inequalities: A nonasymptotic theory of independence. Oxford university press.
  • Cao et al. (2019) Lei Cao, Yizhou Yan, Samuel Madden, Elke A Rundensteiner, and Mathan Gopalsamy. 2019. Efficient discovery of sequence outlier patterns. Proceedings of the VLDB Endowment 12, 8 (2019), 920–932.
  • Ceccarello and Gamper (2022) Matteo Ceccarello and Johann Gamper. 2022. Fast and Scalable Mining of Time Series Motifs with Probabilistic Guarantees. Proceedings of the VLDB Endowment 15, 13 (2022), 3841–3853.
  • Chen et al. (2009) Chen Chen, Cindy X Lin, Matt Fredrikson, Mihai Christodorescu, Xifeng Yan, and Jiawei Han. 2009. Mining graph patterns efficiently via randomized summaries. Proceedings of the VLDB Endowment 2, 1 (2009), 742–753.
  • Dalleiger and Vreeken (2022) Sebastian Dalleiger and Jilles Vreeken. 2022. Discovering significant patterns under sequential false discovery control. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining.
  • Dubhashi and Panconesi (2009) Devdatt P Dubhashi and Alessandro Panconesi. 2009. Concentration of measure for the analysis of randomized algorithms. Cambridge University Press.
  • Fischer et al. (2021) Jonas Fischer, Anna Olah, and Jilles Vreeken. 2021. What’s in the Box? Exploring the Inner Life of Neural Networks with Robust Rules. In Proceedings of the 38th International Conference on Machine Learning (Proceedings of Machine Learning Research), Vol. 139. PMLR, 3352–3362.
  • Fisher (1922) Ronald A Fisher. 1922. On the interpretation of χ² from contingency tables, and the calculation of P. Journal of the royal statistical society 85, 1 (1922).
  • Hämäläinen and Webb (2019) Wilhelmiina Hämäläinen and Geoffrey I Webb. 2019. A tutorial on statistically sound pattern discovery. Data Mining and Knowledge Discovery 33 (2019), 325–377.
  • Han et al. (2007) Jiawei Han, Hong Cheng, Dong Xin, and Xifeng Yan. 2007. Frequent pattern mining: current status and future directions. Data mining and knowledge discovery 15, 1 (2007), 55–86.
  • Ho et al. (2022) Nguyen Thi Thao Ho, Torben Bach Pedersen, et al. 2022. Efficient temporal pattern mining in big time series using mutual information. Proceedings of the VLDB Endowment 15, 3 (2022), 673–685.
  • Holm (1979) Sture Holm. 1979. A simple sequentially rejective multiple test procedure. Scandinavian journal of statistics (1979), 65–70.
  • Jogdeo and Samuels (1968) Kumar Jogdeo and Stephen M Samuels. 1968. Monotone convergence of binomial probabilities and a generalization of Ramanujan’s equation. The Annals of Mathematical Statistics 39, 4 (1968), 1191–1195.
  • Kalofolias et al. (2017) Janis Kalofolias, Mario Boley, and Jilles Vreeken. 2017. Efficiently discovering locally exceptional yet globally representative subgroups. In 2017 IEEE International Conference on Data Mining (ICDM). IEEE.
  • LeCun and Cortes (2010) Yann LeCun and Corinna Cortes. 2010. MNIST handwritten digit database. http://yann.lecun.com/exdb/mnist/
  • Lemmerich and Becker (2019) Florian Lemmerich and Martin Becker. 2019. pysubgroup: Easy-to-use subgroup discovery in python. In ECML PKDD 2018. Springer, 658–662.
  • Li et al. (2001) Yi Li, Philip M Long, and Aravind Srinivasan. 2001. Improved bounds on the sample complexity of learning. J. Comput. System Sci. 62, 3 (2001), 516–527.
  • Llinares-López et al. (2015) Felipe Llinares-López, Mahito Sugiyama, Laetitia Papaxanthos, and Karsten Borgwardt. 2015. Fast and memory-efficient significant pattern mining via permutation testing. In Proceedings of the 21th ACM SIGKDD international conference on knowledge discovery and data mining. 725–734.
  • Löffler and Phillips (2009) Maarten Löffler and Jeff M Phillips. 2009. Shape fitting on point sets with probability distributions. In Algorithms-ESA 2009: 17th Annual European Symposium, Copenhagen, Denmark, September 7-9, 2009. Proceedings 17. Springer, 313–324.
  • McDiarmid (1989) Colin McDiarmid. 1989. On the method of bounded differences. Surveys in combinatorics 141, 1 (1989), 148–188.
  • Meinshausen et al. (2011) Nicolai Meinshausen, Marloes H Maathuis, and Peter Bühlmann. 2011. Asymptotic optimality of the Westfall-Young permutation procedure for multiple testing under dependence. The Annals of Statistics (2011), 3369–3391.
  • Minato et al. (2014) Shin-ichi Minato, Takeaki Uno, Koji Tsuda, Aika Terada, and Jun Sese. 2014. A fast method of statistical assessment for combinatorial hypotheses based on frequent itemset enumeration. In ECML PKDD 2014. Springer.
  • Pellegrina et al. (2022) Leonardo Pellegrina, Cyrus Cousins, Fabio Vandin, and Matteo Riondato. 2022. MCRapper: Monte-Carlo Rademacher averages for poset families and approximate pattern mining. ACM Transactions on Knowledge Discovery from Data (TKDD) 16, 6 (2022), 1–29.
  • Pellegrina et al. (2019a) Leonardo Pellegrina, Matteo Riondato, and Fabio Vandin. 2019a. Hypothesis Testing and Statistically-sound Pattern Mining. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD ’19). ACM, New York, NY, USA, 3215–3216.
  • Pellegrina et al. (2019b) Leonardo Pellegrina, Matteo Riondato, and Fabio Vandin. 2019b. SPuManTE: Significant Pattern Mining with Unconditional Testing. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD ’19). ACM, New York, NY, USA, 1528–1538.
  • Pellegrina and Vandin (2020) Leonardo Pellegrina and Fabio Vandin. 2020. Efficient mining of the most significant patterns with permutation testing. Data Mining and Knowledge Discovery 34 (2020), 1201–1234.
  • Pietracaprina and Vandin (2007) Andrea Pietracaprina and Fabio Vandin. 2007. Efficient incremental mining of top-K frequent closed itemsets. In International Conference on Discovery Science. Springer, 275–280.
  • Pollard (2012) David Pollard. 2012. Convergence of stochastic processes. Springer Science & Business Media.
  • Riondato and Vandin (2020) Matteo Riondato and Fabio Vandin. 2020. MiSoSouP: Mining interesting subgroups with sampling and pseudodimension. ACM Transactions on Knowledge Discovery from Data (TKDD) 14, 5 (2020), 1–31.
  • Santoro et al. (2020) Diego Santoro, Andrea Tonon, and Fabio Vandin. 2020. Mining Sequential Patterns with VC-Dimension and Rademacher Complexity. Algorithms 13, 5 (2020), 123.
  • Shalev-Shwartz and Ben-David (2014) Shai Shalev-Shwartz and Shai Ben-David. 2014. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press.
  • Terada et al. (2015) Aika Terada, Hanyoung Kim, and Jun Sese. 2015. High-speed Westfall-Young permutation procedure for genome-wide association studies. In Proceedings of the 6th ACM Conference on Bioinformatics, Computational Biology and Health Informatics. 17–26.
  • Terada et al. (2013) Aika Terada, Mariko Okada-Hatakeyama, Koji Tsuda, and Jun Sese. 2013. Statistical significance of combinatorial regulations. Proceedings of the National Academy of Sciences 110, 32 (2013), 12996–13001.
  • Van Leeuwen and Knobbe (2012) Matthijs Van Leeuwen and Arno Knobbe. 2012. Diverse subgroup set discovery. Data Mining and Knowledge Discovery 25 (2012), 208–242.
  • Wallenius (1963) Kenneth Ted Wallenius. 1963. Biased sampling: the noncentral hypergeometric probability distribution. Technical Report. https://purl.stanford.edu/wh056vj9347
  • Webb (2006) Geoffrey I Webb. 2006. Discovering significant rules. In Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining. 434–443.
  • Webb (2007) Geoffrey I Webb. 2007. Discovering significant patterns. Machine learning 68 (2007), 1–33.
  • Webb (2008) Geoffrey I Webb. 2008. Layered critical values: a powerful direct-adjustment approach to discovering significant patterns. Machine Learning 71 (2008), 307–323.
  • Westfall and Young (1993) Peter H Westfall and S Stanley Young. 1993. Resampling-based multiple testing: Examples and methods for p-value adjustment. John Wiley & Sons.

Appendix A Appendix

A.1. Relation to Quality Measures

Interesting subgroups are often identified using a quality measure, defined by combining the generality and the unusualness of a pattern. The generality $\mathsf{f}_{\mathcal{P}}(\mathcal{D})$ of a pattern $\mathcal{P}$ in the dataset $\mathcal{D}$ is the fraction of samples of $\mathcal{D}$ that support $\mathcal{P}$:

$$\mathsf{f}_{\mathcal{P}}(\mathcal{D}) = \frac{1}{m}\sum_{i=1}^{m} \mathds{1}\left[\mathcal{P}\in s_{i}\right] = \frac{|\mathsf{C}_{\mathcal{P}}(\mathcal{D})|}{m}.$$

For any bag of samples $B \subseteq \mathcal{D}$, define the average target value $\mu(B)$ of samples $t \in B$ as

$$\mu(B) = \frac{1}{|B|}\sum_{(s,\ell)\in B} \ell.$$

The unusualness $\mathsf{u}_{\mathcal{P}}(\mathcal{D})$ of the pattern $\mathcal{P}$ on the dataset $\mathcal{D}$ is defined as the difference between the average target value of the samples in $\mathsf{C}_{\mathcal{P}}(\mathcal{D})$ and the average target value in the entire dataset $\mathcal{D}$:

$$\mathsf{u}_{\mathcal{P}}(\mathcal{D}) = \mu(\mathsf{C}_{\mathcal{P}}(\mathcal{D})) - \mu(\mathcal{D}).$$

The $\alpha$-quality $\bar{\mathsf{q}}^{*}_{\mathcal{P},\alpha}(\mathcal{D})$ of a pattern $\mathcal{P}$ on a dataset $\mathcal{D}$ is defined as

$$\bar{\mathsf{q}}^{*}_{\mathcal{P},\alpha}(\mathcal{D}) = \mathsf{f}_{\mathcal{P}}(\mathcal{D})^{\alpha}\, \mathsf{u}_{\mathcal{P}}(\mathcal{D}).$$

Commonly used quality measures are the $1$-quality, the $\frac{1}{2}$-quality, and the $2$-quality.

Given a subgroup $\mathcal{P}$, its quality $\mathsf{q}_{\mathcal{P}}$ can be written as

$$
\begin{aligned}
\mathsf{q}_{\mathcal{P}} &= \Pr_{(s,\ell)\sim\gamma}\left(\mathcal{P}\in s \wedge \ell=1\right) - \Pr_{(s,\ell)\sim\gamma}\left(\mathcal{P}\in s\right)\Pr_{(s,\ell)\sim\gamma}\left(\ell=1\right) \\
&= \Pr_{(s,\ell)\sim\gamma}\left(\mathcal{P}\in s\right)\left(\Pr_{(s,\ell)\sim\gamma}\left(\ell=1 \mid \mathcal{P}\in s\right) - \Pr_{(s,\ell)\sim\gamma}\left(\ell=1\right)\right).
\end{aligned}
$$

Note that the generality $\mathsf{f}_{\mathcal{P}}(\mathcal{D})$ is the estimate (on dataset $\mathcal{D}$) of $\Pr_{(s,\ell)\sim\gamma}\left(\mathcal{P}\in s\right)$, and that the unusualness $\mathsf{u}_{\mathcal{P}}(\mathcal{D})$ is the estimate (on dataset $\mathcal{D}$) of $\Pr_{(s,\ell)\sim\gamma}\left(\ell=1 \mid \mathcal{P}\in s\right) - \Pr_{(s,\ell)\sim\gamma}\left(\ell=1\right)$, where $\Pr_{(s,\ell)\sim\gamma}\left(\ell=1 \mid \mathcal{P}\in s\right)$ is estimated by $\mu(\mathsf{C}_{\mathcal{P}}(\mathcal{D}))$ and $\Pr_{(s,\ell)\sim\gamma}\left(\ell=1\right)$ is estimated by $\mu(\mathcal{D})$.
This shows that the $1$-quality $\bar{\mathsf{q}}^{*}_{\mathcal{P},1}(\mathcal{D}) = \mathsf{f}_{\mathcal{P}}(\mathcal{D})\,\mathsf{u}_{\mathcal{P}}(\mathcal{D})$ corresponds to an estimate, obtained from $\mathcal{D}$, of the quality $\mathsf{q}_{\mathcal{P}}$ of subgroup $\mathcal{P}$. With this relation in mind, we can consider mining subgroups with high $1$-quality $\bar{\mathsf{q}}^{*}_{\mathcal{P},1}(\mathcal{D})$ as a heuristic for finding significant subgroups, which ignores the random fluctuations of the estimates $\mathsf{f}_{\mathcal{P}}(\mathcal{D})$ and $\mathsf{u}_{\mathcal{P}}(\mathcal{D})$.
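To make the relation between the $1$-quality and the plug-in estimate of $\mathsf{q}_{\mathcal{P}}$ concrete, the following is a minimal sketch on a toy dataset. The representation of samples as (itemset, label) pairs, the use of itemset containment as the support relation, the convention for patterns with empty cover, and all function names are assumptions made for illustration only; they are not part of the implementation evaluated in the paper.

```python
from typing import FrozenSet, List, Tuple

# Hypothetical toy representation: a sample is (set of items, binary target label).
Sample = Tuple[FrozenSet[str], int]


def cover(pattern: FrozenSet[str], data: List[Sample]) -> List[Sample]:
    """Samples of `data` that support `pattern` (here: itemset containment)."""
    return [(s, label) for (s, label) in data if pattern <= s]


def mu(bag: List[Sample]) -> float:
    """Average target value of a non-empty bag of samples."""
    return sum(label for _, label in bag) / len(bag)


def generality(pattern: FrozenSet[str], data: List[Sample]) -> float:
    return len(cover(pattern, data)) / len(data)


def unusualness(pattern: FrozenSet[str], data: List[Sample]) -> float:
    covered = cover(pattern, data)
    # Assumption of this sketch: a pattern with empty cover has unusualness 0.
    return mu(covered) - mu(data) if covered else 0.0


def alpha_quality(pattern: FrozenSet[str], data: List[Sample], alpha: float = 1.0) -> float:
    return generality(pattern, data) ** alpha * unusualness(pattern, data)


def plugin_quality(pattern: FrozenSet[str], data: List[Sample]) -> float:
    """Plug-in estimate of Pr(P in s and l = 1) - Pr(P in s) * Pr(l = 1)."""
    joint = sum(label for _, label in cover(pattern, data)) / len(data)
    return joint - generality(pattern, data) * mu(data)


toy = [(frozenset(items), label) for items, label in [
    ("ab", 1), ("a", 1), ("b", 0), ("ac", 0),
    ("c", 0), ("bc", 0), ("abc", 1), ("c", 0),
]]
p = frozenset("a")
print(generality(p, toy), unusualness(p, toy), alpha_quality(p, toy, 1.0))
print(plugin_quality(p, toy))
```

On this toy dataset the pattern {a} has generality 0.5 and unusualness 0.375, so its $1$-quality is 0.1875, which coincides with the plug-in estimate of $\mathsf{q}_{\mathcal{P}}$ computed directly from the empirical frequencies.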

A.2. Comparison with Permutation Approaches

In this section we provide a more detailed comparison between our few-shot approach and commonly used permutation approaches.

Our few-shot approach leverages our analytical results (e.g., Theorem 6) to obtain high-probability bounds on the maximum deviation of the patterns' quality by estimating only the expectation of this maximum deviation. Estimating this expectation requires only a small number $c$ of resampled datasets, such as the value $c = 10$ used in our experimental evaluation.

The same approach cannot be used by permutation approaches (e.g., (Llinares-López et al., 2015; Pellegrina and Vandin, 2020; Terada et al., 2015)), since they estimate the $\delta$-quantile of the distribution of the maximum deviation, that is, the value $q$ that the maximum deviation exceeds with probability $\delta$, for a (relatively) small value of $\delta$. Accurately estimating this quantile requires many more permutations than estimating the expectation (as done by our approach). For example, if only $10$ permutations (e.g., corresponding to the value $c$ used in our experiments) are used to estimate the $\delta$-quantile, with $\delta = 0.05$, the WY procedure returns the maximum deviation over the $10$ permutations (i.e., the element in position $\lceil \delta \cdot 10 \rceil = 1$ in the list of deviations, sorted in decreasing order). This implies that, with probability $> \frac{1}{2}$, the FWER will not be controlled at level $\delta$ (since the probability that the deviation of one permutation is below the $\delta$-quantile is $0.95$, and the probability that all $10$ deviations are below the $\delta$-quantile, so that the returned threshold is too small, is $0.95^{10} \approx 0.599 > \frac{1}{2}$). Moreover, with probability $> \frac{1}{3}$, the FWER will not be controlled even at level $2\delta = 0.1$ (since $0.9^{10} \approx 0.3487 > \frac{1}{3}$).
For this reason, previous works suggest using at least $10^{3}$ permutations for permutation approaches (as we do in our experiments), while $10^{4}$ is the suggested number of permutations to obtain a stable FWER estimate ((Terada et al., 2015; Llinares-López et al., 2015; Pellegrina and Vandin, 2020) all use $10^{4}$).
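The arithmetic behind the two failure probabilities can be checked with a short sketch. It assumes that the permutation statistics are i.i.d. and that each one independently falls on the non-extreme side of the $\delta$-quantile with probability $1 - \delta$ (no ties); the function name is hypothetical.

```python
def prob_all_on_same_side(delta: float, n_perm: int) -> float:
    """Probability that all n_perm i.i.d. permutation statistics fall on the
    non-extreme side of the delta-quantile, each with probability 1 - delta."""
    return (1.0 - delta) ** n_perm


for level in (0.05, 0.1):
    # Rounded to 4 digits for readability.
    print(level, round(prob_all_on_same_side(level, 10), 4))
# prints:
# 0.05 0.5987
# 0.1 0.3487
```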

A.3. Proofs of Section 4.1

This section presents additional proofs for the results of Section 4.1.

First, we need the following technical result.

Theorem 1 (McDiarmid’s inequality (McDiarmid, 1989)).

Let $\mathcal{Y}$ be a domain, and let $g : \mathcal{Y}^{m} \rightarrow \mathbb{R}$ be a function such that, for each $i$, $1 \leq i \leq m$, there is a nonnegative constant $c_{i}$ such that:

$$\sup_{\substack{\langle x_{1},\dotsc,x_{m}\rangle \in \mathcal{Y}^{m} \\ x_{i}^{\prime} \in \mathcal{Y}}} \lvert g(x_{1},\dotsc,x_{m}) - g(x_{1},\dotsc,x_{i-1},x_{i}^{\prime},x_{i+1},\dotsc,x_{m}) \rvert \leq c_{i}.$$

Let $x_{1},\dotsc,x_{m}$ be $m$ independent random variables such that $\langle x_{1},\dotsc,x_{m}\rangle \in \mathcal{Y}^{m}$. Then, for $C = \sum_{i=1}^{m} c_{i}^{2}$ and any $t > 0$, it holds

$$\Pr\Bigl( \mathbb{E}[g] > g(x_{1},\dotsc,x_{m}) + t \Bigr) \leq e^{-2t^{2}/C}.$$
Proof of Theorem 6.

First, we note that in the conditional setting it holds $\hat{\mu} = \check{\mu} = \bar{\mu}$. Then, from the fact $\mathcal{L}^{\star} \subseteq \mathcal{L}$, we have

$$
\begin{aligned}
&\mathbb{E}_{\mathbf{v}\sim I(\bar{\mu})}\left[\sup_{\mathcal{P}\in\mathcal{L}^{\star}} \frac{1}{m}\sum_{i=1}^{m} f_{\mathcal{P}}(s_{i})(\mathbf{v}_{i}-\bar{\mu})\right] \\
&\leq \mathbb{E}_{\mathbf{v}\sim I(\bar{\mu})}\left[\sup_{\mathcal{P}\in\mathcal{L}} \frac{1}{m}\sum_{i=1}^{m} f_{\mathcal{P}}(s_{i})(\mathbf{v}_{i}-\bar{\mu})\right] = \mathbb{E}_{\mathcal{R}^{\star}}\bigl[\tilde{d}(\mathcal{R}^{\star},\check{\mu})\bigr],
\end{aligned}
$$

therefore we focus on the concentration of $\tilde{d}(\mathcal{R}^{\star},\check{\mu})$ around its expectation taken w.r.t. $\mathcal{R}^{\star}$. Our proof is based on McDiarmid's inequality (Theorem 1). Define the function $g(\mathcal{R}^{\star}) = \tilde{d}(\mathcal{R}^{\star},\check{\mu})$, and note that modifying any $\xi_{i,j}$, for any pair $i,j$, changes $g(\mathcal{R}^{\star})$ by at most $1/(cm)$. Therefore, defining $C = \sum_{i}\sum_{j} (1/(cm))^{2} = 1/(cm)$, from Theorem 1 it holds

$$\Pr\Biggl( \mathbb{E}_{\mathcal{R}^{\star}}\bigl[\tilde{d}(\mathcal{R}^{\star},\check{\mu})\bigr] > \tilde{d}(\mathcal{R}^{\star},\check{\mu}) + \sqrt{\frac{\log\bigl(\frac{4}{\delta}\bigr)}{2cm}} \Biggr) \leq \delta/4.$$
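The choice of the deviation term can be verified numerically: with $cm$ bounded-difference constants equal to $1/(cm)$, setting $t = \sqrt{\log(4/\delta)/(2cm)}$ makes the McDiarmid tail $e^{-2t^{2}/C}$ equal to $\delta/4$. The sketch below is only an illustrative check; the function name and the parameter values ($c = 10$, $m = 10^{4}$, $\delta = 0.05$) are assumptions chosen for the example.

```python
import math


def mcdiarmid_deviation(c: int, m: int, delta: float) -> float:
    """Deviation t such that exp(-2 t^2 / C) = delta / 4, where C = 1 / (c m)."""
    # c * m variables, each with bounded difference 1 / (c * m).
    big_c = c * m * (1.0 / (c * m)) ** 2
    return math.sqrt(big_c * math.log(4.0 / delta) / 2.0)


c, m, delta = 10, 10_000, 0.05
t = mcdiarmid_deviation(c, m, delta)
tail = math.exp(-2.0 * t ** 2 * (c * m))  # exp(-2 t^2 / C) with C = 1 / (c m)
print(t, tail, delta / 4)
```

Since the deviation term scales as $1/\sqrt{cm}$, increasing the number $c$ of resampled datasets tightens it only at a square-root rate, which is consistent with using a small $c$.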

A.4. Proofs of Section 4.2

This section presents the proofs for the results of Section 4.2.

Proof of Lemma 9.

Since $\mu(\mathcal{D})$ is the average of $m$ independent and bounded random variables, Hoeffding's and Bernstein's inequalities (Boucheron et al., 2013) yield, respectively, that

$$
\begin{aligned}
|\mu - \mu(\mathcal{D})| &\leq \sqrt{\frac{\ln\left(\frac{8}{\delta}\right)}{2m}}, \\
|\mu - \mu(\mathcal{D})| &\leq \sqrt{\frac{2\mu(\mathcal{D})\ln\left(\frac{8}{\delta}\right)}{m}} + \frac{2\ln\left(\frac{8}{\delta}\right)}{m},
\end{aligned}
$$

hold simultaneously with probability $\geq 1 - \delta/4$. The statement follows from the observation that their minimum is $\leq \varepsilon_{T}$. ∎
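As an illustration of how the two bounds compare, the following sketch evaluates both right-hand sides and takes their minimum. The function names and the example values are hypothetical, and the sketch does not reproduce the definition of $\varepsilon_{T}$ from Lemma 9; it only computes the two expressions displayed in the proof.

```python
import math


def hoeffding_radius(m: int, delta: float) -> float:
    """First bound: sqrt(ln(8 / delta) / (2 m))."""
    return math.sqrt(math.log(8.0 / delta) / (2.0 * m))


def bernstein_radius(mu_hat: float, m: int, delta: float) -> float:
    """Second bound: sqrt(2 mu_hat ln(8 / delta) / m) + 2 ln(8 / delta) / m."""
    log_term = math.log(8.0 / delta)
    return math.sqrt(2.0 * mu_hat * log_term / m) + 2.0 * log_term / m


def min_radius(mu_hat: float, m: int, delta: float) -> float:
    """Minimum of the two bounds on |mu - mu(D)|."""
    return min(hoeffding_radius(m, delta), bernstein_radius(mu_hat, m, delta))


m, delta = 10_000, 0.05
for mu_hat in (0.01, 0.5):
    print(mu_hat, hoeffding_radius(m, delta), bernstein_radius(mu_hat, m, delta))
```

With these example values, the variance-sensitive bound is the smaller one when the empirical target mean is small ($\mu(\mathcal{D}) = 0.01$), while the Hoeffding-type bound is smaller for a balanced target ($\mu(\mathcal{D}) = 0.5$), which motivates taking the minimum of the two.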

Proof of Theorem 11.

Recall that $\mathcal{R}^{\star} = \left\{\mathcal{D}^{\star}_{1},\dots,\mathcal{D}^{\star}_{c}\right\}$ is a collection of $c \geq 1$ i.i.d. resampled datasets, each obtained by resampling the target labels of $\mathcal{D}$ while maintaining the same features of $\mathcal{D}$. That is, each resampled dataset is $\mathcal{D}^{\star}_{j} = \left\{(s_{1},\xi_{1,j}),\dots,(s_{m},\xi_{m,j})\right\}$, where

$$\xi_{i,j} \sim \mathrm{Bern}(p), \quad \forall i \in [1,m], \ \forall j \in [1,c],$$

and $\mathrm{Bern}(p)$ is the Bernoulli distribution with parameter $p$. In the proof we set $p = \mu = \mathbb{E}_{\mathcal{D}}[\mu(\mathcal{D})]$. Then, we define the collection of resampled datasets $\hat{\mathcal{R}}^{\star} = \{\hat{\mathcal{D}}^{\star}_{1},\dots,\hat{\mathcal{D}}^{\star}_{c}\}$ similarly to $\mathcal{R}^{\star}$, where the parameter $\mu$ is replaced by its upper bound $\hat{\mu}$ (note that the roles of $\mathcal{R}^{\star}$ and $\hat{\mathcal{R}}^{\star}$ are swapped in the statement and in Alg. 1).
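The construction of the resampled datasets can be sketched as follows: the features $s_{1},\dots,s_{m}$ are kept fixed, and each label is drawn independently from $\mathrm{Bern}(p)$. This is a minimal illustration under stated assumptions; the function name, the list-based dataset representation, and the seeding scheme are hypothetical and not taken from the implementation of Alg. 1.

```python
import random
from typing import Any, List, Sequence, Tuple


def resample_labels(features: Sequence[Any], p: float, c: int,
                    seed: int = 0) -> List[List[Tuple[Any, int]]]:
    """Return c resampled datasets: same features, i.i.d. Bernoulli(p) labels."""
    rng = random.Random(seed)
    return [
        # rng.random() is uniform in [0, 1), so the label is 1 with probability p.
        [(s, 1 if rng.random() < p else 0) for s in features]
        for _ in range(c)
    ]


features = ["s1", "s2", "s3", "s4", "s5"]
resampled = resample_labels(features, p=0.3, c=10)
print(len(resampled), len(resampled[0]))
```

Calling the same function with $p = \hat{\mu}$ instead of $p = \mu$ gives the analogous collection in which the parameter is replaced by its upper bound.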

Let $Y=\mathop{\mathbb{E}}_{\mathcal{D}}\bigl[\sup_{\mathcal{P}\in\mathcal{L}^{\star}}\{\mathsf{q}_{\mathcal{P}}(\mathcal{D})\}\bigr]$, and observe that it holds $\sup_{\mathcal{P}\in\mathcal{L}^{\star}}Var_{\mathcal{D}}(\mathsf{q}_{\mathcal{P}}(\mathcal{D}))\leq\nu$. We apply Bousquet's inequality (Theorem 12.5 of (Boucheron et al., 2013)) to prove that

\[
\sup_{\mathcal{P}\in\mathcal{L}^{\star}}\{\mathsf{q}_{\mathcal{P}}(\mathcal{D})\}\leq Y+\sqrt{\frac{2\ln\bigl(\frac{4}{\delta}\bigr)\left(\nu+2Y\right)}{m}}+\frac{\ln\bigl(\frac{4}{\delta}\bigr)}{3m},
\]

with probability $\geq 1-\delta/4$. We now show how to upper bound $Y$. First, note that

\begin{align*}
\mathop{\mathbb{E}}_{\mathcal{D}}\bigl[\sup_{\mathcal{P}\in\mathcal{L}^{\star}}\{\mathsf{q}_{\mathcal{P}}(\mathcal{D})\}\bigr]
&=\mathop{\mathbb{E}}_{\mathcal{D}^{\star}_{j}}\bigl[\sup_{\mathcal{P}\in\mathcal{L}^{\star}}\{\mathsf{q}_{\mathcal{P}}(\mathcal{D}^{\star}_{j})\}\bigr]\\
&=\mathop{\mathbb{E}}_{\mathcal{R}^{\star}}\biggl[\frac{1}{c}\sum_{j=1}^{c}\sup_{\mathcal{P}\in\mathcal{L}^{\star}}\{\mathsf{q}_{\mathcal{P}}(\mathcal{D}^{\star}_{j})\}\biggr]
\end{align*}

by definition of $\mathcal{L}^{\star}$ and $\mathcal{R}^{\star}$. Then, it follows that

\[
\mathop{\mathbb{E}}_{\mathcal{R}^{\star}}\biggl[\frac{1}{c}\sum_{j=1}^{c}\sup_{\mathcal{P}\in\mathcal{L}^{\star}}\{\mathsf{q}_{\mathcal{P}}(\mathcal{D}^{\star}_{j})\}\biggr]\leq\mathop{\mathbb{E}}_{\mathcal{R}^{\star}}\biggl[\frac{1}{c}\sum_{j=1}^{c}\sup_{\mathcal{P}\in\mathcal{L}}\{\mathsf{q}_{\mathcal{P}}(\mathcal{D}^{\star}_{j})\}\biggr],
\]

since $\mathcal{L}^{\star}\subseteq\mathcal{L}$. We now prove bounds to the concentration of $\mathop{\mathbb{E}}_{\mathcal{R}^{\star}}\bigl[\frac{1}{c}\sum_{j=1}^{c}\sup_{\mathcal{P}\in\mathcal{L}}\{\mathsf{q}_{\mathcal{P}}(\mathcal{D}^{\star}_{j})\}\bigr]$ w.r.t. the set of features $\mathcal{A}$. Let $\mathcal{D}^{\star}=(\mathcal{A},\mathcal{T}^{\star})$ be a generic $\mathcal{D}^{\star}\in\mathcal{R}^{\star}$, with $\mathcal{T}^{\star}=\{\xi_{1},\dots,\xi_{m}\}$. Define the random variable $Z$ as

\[
Z=\mathop{\mathbb{E}}_{\mathcal{T}^{\star}}\bigl[\sup_{\mathcal{P}\in\mathcal{L}}\{\mathsf{q}_{\mathcal{P}}(\mathcal{D}^{\star})\}\bigr],
\]

where the expectation is w.r.t. $\mathcal{T}^{\star}$, conditioning on $\mathcal{A}$. To show the concentration of $Z$ w.r.t. its expectation $\mathop{\mathbb{E}}_{\mathcal{A}}[Z]$, we prove that $Z$ is a self-bounding function (Boucheron et al., 2013). Define the random variable $Z_{j}$, for $j\in[1,m]$, as

\[
Z_{j}=\mathop{\mathbb{E}}_{\mathcal{T}^{\star}}\biggl[\sup_{\mathcal{P}\in\mathcal{L}}\biggl\{\frac{1}{m}\sum_{i=1,i\neq j}^{m}f_{\mathcal{P}}(s_{i})(\xi_{i}-\mu)\biggr\}\biggr]
\]

First, note that $Z\geq 0$:

\[
Z=\mathop{\mathbb{E}}_{\mathcal{T}^{\star}}\bigl[\sup_{\mathcal{P}\in\mathcal{L}}\{\mathsf{q}_{\mathcal{P}}(\mathcal{D}^{\star})\}\bigr]\geq\sup_{\mathcal{P}\in\mathcal{L}}\bigl\{\mathop{\mathbb{E}}_{\mathcal{T}^{\star}}\bigl[\mathsf{q}_{\mathcal{P}}(\mathcal{D}^{\star})\bigr]\bigr\}\geq 0.
\]

Then, we prove that $Z-Z_{j}\geq 0$. Define $\hat{f}$ as one of the functions that attain the supremum for $Z_{j}$, for any choice of $\mathcal{T}^{\star}$. We have

\begin{align*}
Z_{j}-Z&\leq Z_{j}-\mathop{\mathbb{E}}_{\mathcal{T}^{\star}}\biggl[\frac{1}{m}\sum_{i=1}^{m}\hat{f}(s_{i})(\xi_{i}-\mu)\biggr]\\
&=\mathop{\mathbb{E}}_{\xi_{j}}\biggl[-\frac{1}{m}\hat{f}(s_{j})(\xi_{j}-\mu)\biggr]=0.
\end{align*}

We now show that $Z-Z_{j}\leq\mu(1-\mu)/m$. We have

\begin{align*}
Z&\leq\mathop{\mathbb{E}}_{\mathcal{T}^{\star}}\biggl[\sup_{\mathcal{P}\in\mathcal{L}}\biggl\{\frac{1}{m}\sum_{i=1,i\neq j}f_{\mathcal{P}}(s_{i})(\xi_{i}-\mu)\biggr\}+\sup_{\mathcal{P}\in\mathcal{L}}\left\{\frac{1}{m}f_{\mathcal{P}}(s_{j})(\xi_{j}-\mu)\right\}\biggr]\\
&\leq Z_{j}+\frac{\mu(1-\mu)}{m}.
\end{align*}
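As a worked step for the last inequality, the following sketch bounds the term with index $j$ alone. It assumes $0\leq f_{\mathcal{P}}(s_{j})\leq 1$ for every pattern and that $\xi_{j}$ is Bernoulli with parameter $\mu$; the range of $f_{\mathcal{P}}$ is an assumption made here for illustration and is not restated in this part of the proof. When $\xi_{j}=0$ every term in the supremum is at most $0$, and when $\xi_{j}=1$ every term is at most $(1-\mu)/m$, so

\[
\mathop{\mathbb{E}}_{\xi_{j}}\biggl[\sup_{\mathcal{P}\in\mathcal{L}}\left\{\frac{1}{m}f_{\mathcal{P}}(s_{j})(\xi_{j}-\mu)\right\}\biggr]\leq\frac{1}{m}\mathop{\mathbb{E}}_{\xi_{j}}\bigl[\max\{0,\xi_{j}-\mu\}\bigr]=\frac{\mu(1-\mu)}{m}.
\]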

We now prove that $\sum_{j=1}^{m}(Z-Z_{j})\leq Z$. It holds

\begin{align*}
\sum_{j=1}^{m}Z_{j}&\geq\mathop{\mathbb{E}}_{\mathcal{T}^{\star}}\biggl[\sup_{\mathcal{P}\in\mathcal{L}}\biggl\{\frac{1}{m}\sum_{j=1}^{m}\sum_{i=1,i\neq j}^{m}f_{\mathcal{P}}(s_{i})(\xi_{i}-\mu)\biggr\}\biggr]\\
&=\mathop{\mathbb{E}}_{\mathcal{T}^{\star}}\biggl[\sup_{\mathcal{P}\in\mathcal{L}}\biggl\{\frac{m-1}{m}\sum_{i=1}^{m}f_{\mathcal{P}}(s_{i})(\xi_{i}-\mu)\biggr\}\biggr]\\
&=(m-1)Z.
\end{align*}
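The claimed condition follows from this chain by a one-line rearrangement, spelled out here for completeness:

\[
\sum_{j=1}^{m}(Z-Z_{j})=mZ-\sum_{j=1}^{m}Z_{j}\leq mZ-(m-1)Z=Z.
\]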

As $Z$ is a self-bounding function, we have that

\[
\Pr\left(\mathop{\mathbb{E}}[Z]-Z\geq q\right)\leq\exp\left(\frac{-mq^{2}}{2\mu(1-\mu)\mathop{\mathbb{E}}[Z]}\right).
\]
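A tail bound of this form can be inverted numerically. The sketch below is only illustrative: it sets the right-hand side equal to $\delta/4$, solves for $q$, and then solves the resulting implicit inequality $\mathbb{E}[Z]\leq Z+q(\mathbb{E}[Z])$ as a quadratic in $\sqrt{\mathbb{E}[Z]}$. The function names and the closed form are assumptions made for this example, and the result is not claimed to coincide with the exact expression the paper uses for $\hat{d}$.

```python
import math


def deviation_for_confidence(mu, m, delta, expected_z):
    """Value of q that makes exp(-m q^2 / (2 mu (1-mu) E[Z])) equal to delta/4."""
    return math.sqrt(
        2.0 * mu * (1.0 - mu) * expected_z * math.log(4.0 / delta) / m
    )


def fixed_point_upper_bound(z, mu, m, delta):
    """Largest value e satisfying e <= z + sqrt(a * e), a = 2 mu (1-mu) ln(4/delta) / m.

    Writing x = sqrt(e), the constraint is x^2 - sqrt(a) x - z <= 0, whose
    largest root squared equals z + a/2 + sqrt(a^2/4 + a z).
    """
    a = 2.0 * mu * (1.0 - mu) * math.log(4.0 / delta) / m
    return z + a / 2.0 + math.sqrt(a * a / 4.0 + a * z)


# Hypothetical inputs, chosen only to exercise the functions.
upper = fixed_point_upper_bound(z=0.02, mu=0.3, m=1000, delta=0.05)
gap = upper - (0.02 + deviation_for_confidence(0.3, 1000, 0.05, upper))
```

Here `gap` is numerically zero, which confirms that the returned value is the fixed point of the inequality rather than a looser bound.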

Imposing the r.h.s. $\leq\delta/4$, solving for $q$, and finding the fixed point of the inequality we obtain $\hat{d}$. Let $\tilde{d}(\mathcal{R}^{\star})=\tilde{d}(\mathcal{R}^{\star},\mu(\mathcal{D}))$. We apply McDiarmid's inequality (Theorem 1) to show that $\mathop{\mathbb{E}}_{\mathcal{R}^{\star}}[\tilde{d}(\mathcal{R}^{\star})\>|\>\mathcal{A}]\leq\hat{r}$ with probability $\geq 1-\delta/4$. In fact, define the function $g(\mathcal{R}^{\star})=\tilde{d}(\mathcal{R}^{\star})$, and note that modifying any $\xi_{i,j}$, for any pair $i,j$, changes $g(\mathcal{R}^{\star})$ by at most $1/(cm)$.
Therefore, defining $C=\sum_{i}\sum_{j}(1/(cm))^{2}=1/(cm)$, the upper bound $\hat{r}$ to $\mathop{\mathbb{E}}_{\mathcal{R}^{\star}}[\tilde{d}(\mathcal{R}^{\star})\>|\>\mathcal{A}]$ holds by Theorem 1. We now need to prove that the upper bound $\hat{d}$ computed using $\hat{\mathcal{R}}^{\star}$ is (probabilistically) not smaller than the one computed using $\mathcal{R}^{\star}$. This is equivalent to showing that, for all $x$,

\[
\Pr_{\hat{\mathcal{R}}^{\star}}\left(\tilde{d}(\hat{\mathcal{R}}^{\star})>x\right)\geq\Pr_{\mathcal{R}^{\star}}\left(\tilde{d}(\mathcal{R}^{\star})>x\right).
\]

Equivalently, the probability of underestimating $\mathop{\mathbb{E}}_{\mathcal{R}^{\star}}[\tilde{d}(\mathcal{R}^{\star})]$ using $\tilde{d}(\mathcal{R}^{\star})$ does not increase when using $\tilde{d}(\hat{\mathcal{R}}^{\star})$. Since the two probabilities are taken w.r.t. two different sample spaces, it is not possible to compare them directly. Therefore, we build an appropriate coupling between the two distributions. Define an $m\times c$ matrix $v$ of $mc$ i.i.d. Bernoulli random variables, such that $\Pr(v_{i,j}=1)=\varepsilon_{T}/(1-\mu)$ for all $i,j$. (Note that we assume $0<\mu<1$ and $0\leq\check{\mu}\leq\mu\leq\hat{\mu}\leq 1$, otherwise the statement holds trivially.) We observe that

\[
\Pr_{\hat{\mathcal{R}}^{\star}}\left(\tilde{d}(\hat{\mathcal{R}}^{\star})>x\right)=\Pr_{\hat{\mathcal{R}}^{\star}}\left(\frac{1}{c}\sum_{j=1}^{c}\sup_{\mathcal{P}\in\mathcal{L}}\left\{\frac{1}{m}\sum_{i=1}^{m}f_{\mathcal{P}}(s_{i})\left(\hat{\xi}_{i,j}-\mu\right)\right\}>x\right),
\]

where $\hat{\xi}_{i,j}$ are i.i.d. Bernoulli with $\Pr(\hat{\xi}_{i,j}=1)=\hat{\mu}$ for all $i,j$. We build the following coupling between the distributions of $\hat{\mathcal{R}}^{\star}$ and $\mathcal{R}^{\star}$, using the fact that $\hat{\xi}_{i,j}\sim\max\{\xi_{i,j},v_{i,j}\}$. This allows us to obtain the lower bound stated above:

\begin{align*}
&\Pr_{\hat{\mathcal{R}}^{\star}}\left(\frac{1}{c}\sum_{j=1}^{c}\sup_{\mathcal{P}\in\mathcal{L}}\left\{\frac{1}{m}\sum_{i=1}^{m}f_{\mathcal{P}}(s_{i})\left(\hat{\xi}_{i,j}-\mu\right)\right\}>x\right)\\
&=\Pr_{\mathcal{R}^{\star},v}\left(\frac{1}{c}\sum_{j=1}^{c}\sup_{\mathcal{P}\in\mathcal{L}}\left\{\frac{1}{m}\sum_{i=1}^{m}f_{\mathcal{P}}(s_{i})\left(\max\{\xi_{i,j},v_{i,j}\}-\mu\right)\right\}>x\right)\\
&\geq\Pr_{\mathcal{R}^{\star},v}\left(\frac{1}{c}\sum_{j=1}^{c}\sup_{\mathcal{P}\in\mathcal{L}}\left\{\frac{1}{m}\sum_{i=1}^{m}f_{\mathcal{P}}(s_{i})\left(\xi_{i,j}-\mu\right)\right\}>x\right)\\
&=\Pr_{\mathcal{R}^{\star}}\left(\tilde{d}(\mathcal{R}^{\star})>x\right).
\end{align*}
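The coupling argument can be checked on a toy instance. The sketch below is a minimal simulation under stated assumptions: labels are drawn with Python's standard `random` module, the pattern indicators are an arbitrary nonnegative 0/1 family invented for this example, and the inflated parameter is taken to be $\mu+\varepsilon_{T}$ (the value implied by $\Pr(v_{i,j}=1)=\varepsilon_{T}/(1-\mu)$). It illustrates that $\max\{\xi,v\}$ dominates $\xi$ pointwise, so the supremum statistic cannot decrease.

```python
import random


def coupled_labels(mu, eps_t, n, seed=0):
    """Draw n coupled pairs (xi, xi_hat) with xi_hat = max(xi, v)."""
    rng = random.Random(seed)
    p_v = eps_t / (1.0 - mu)  # Pr(v = 1), as in the proof
    xi = [int(rng.random() < mu) for _ in range(n)]
    v = [int(rng.random() < p_v) for _ in range(n)]
    xi_hat = [max(a, b) for a, b in zip(xi, v)]
    return xi, xi_hat


def coupled_mean(mu, eps_t):
    """Exact Pr(max(xi, v) = 1) for independent xi and v."""
    p_v = eps_t / (1.0 - mu)
    return 1.0 - (1.0 - mu) * (1.0 - p_v)


def sup_statistic(indicators, labels, mu):
    """sup over patterns of (1/m) * sum_i f_P(s_i) * (label_i - mu)."""
    m = len(labels)
    return max(
        sum(f * (y - mu) for f, y in zip(row, labels)) / m
        for row in indicators
    )


# Hypothetical parameters and a toy family of three patterns.
mu, eps_t, n = 0.3, 0.05, 10000
xi, xi_hat = coupled_labels(mu, eps_t, n)
indicators = [[int(i % k == 0) for i in range(n)] for k in (2, 3, 5)]
stat_original = sup_statistic(indicators, xi, mu)
stat_inflated = sup_statistic(indicators, xi_hat, mu)
```

Because every indicator is nonnegative and `xi_hat[i] >= xi[i]` for each index, `stat_inflated >= stat_original` holds for every draw, which is the pointwise inequality behind the third line of the chain above.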

Then, it is immediate to observe that

\[
\Pr_{\hat{\mathcal{R}}^{\star}}\left(\tilde{d}(\hat{\mathcal{R}}^{\star},\check{\mu})>x\right)\geq\Pr_{\hat{\mathcal{R}}^{\star}}\left(\tilde{d}(\hat{\mathcal{R}}^{\star})>x\right),
\]

since $\mu$ is replaced by its lower bound $\check{\mu}$ in the definition of $\tilde{d}(\hat{\mathcal{R}}^{\star})$, as $\tilde{d}(\hat{\mathcal{R}}^{\star},\check{\mu})\geq\tilde{d}(\hat{\mathcal{R}}^{\star})$ for all $\hat{\mathcal{R}}^{\star}$. The statement follows by observing that all other quantities in the definition of $\varepsilon$ are constants independent of $\mu$, and from a union bound over the $3$ concentration bounds considered in the proof and the event "$|\mu-\mu(\mathcal{D})|\leq\varepsilon_{T}$", each of them true with probability $\geq 1-\delta/4$. ∎
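As a side remark on the McDiarmid step used in the proof above, the following worked computation shows the order of the deviation term. It assumes that Theorem 1 is the standard one-sided bounded-differences inequality, $\Pr(\mathop{\mathbb{E}}[g]-g\geq t)\leq\exp(-2t^{2}/C)$ with $C$ the sum of the squared bounded differences; the exact definition of $\hat{r}$ is the one given in the theorem statement and is not reproduced here. With $C=1/(cm)$, imposing the right-hand side equal to $\delta/4$ gives

\[
\exp\left(-2t^{2}cm\right)=\frac{\delta}{4}\quad\Longrightarrow\quad t=\sqrt{\frac{\ln\bigl(\frac{4}{\delta}\bigr)}{2cm}}.
\]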

A.5. Power Analysis

In this section we prove the results on the power of FSR stated in Sections 4.1 and 4.2.

We first provide a probabilistic upper bound to d~(,μˇ)~𝑑superscriptˇ𝜇\tilde{d}(\mathcal{R}^{\star},\check{\mu})over~ start_ARG italic_d end_ARG ( caligraphic_R start_POSTSUPERSCRIPT ⋆ end_POSTSUPERSCRIPT , overroman_ˇ start_ARG italic_μ end_ARG ), the estimate of the supremum deviation of false discoveries computed by Algorithm 1 (in line 1). This result can be applied to general languages; we then show how to apply it to the language of subgroups. We define N(𝒟)subscript𝑁𝒟N_{\mathcal{L}}(\mathcal{D})italic_N start_POSTSUBSCRIPT caligraphic_L end_POSTSUBSCRIPT ( caligraphic_D ) as the number of distinct projections of the language \mathcal{L}caligraphic_L on the dataset 𝒟𝒟\mathcal{D}caligraphic_D:

N(𝒟)=|{{i:𝒫si},𝒫}|.subscript𝑁𝒟conditional-set𝑖𝒫subscript𝑠𝑖𝒫\displaystyle N_{\mathcal{L}}(\mathcal{D})=\lvert\left\{\{i:\mathcal{P}\in s_{% i}\},\mathcal{P}\in\mathcal{L}\right\}\rvert.italic_N start_POSTSUBSCRIPT caligraphic_L end_POSTSUBSCRIPT ( caligraphic_D ) = | { { italic_i : caligraphic_P ∈ italic_s start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT } , caligraphic_P ∈ caligraphic_L } | .

Note that, differently from $|\mathcal{L}|$, $N_{\mathcal{L}}(\mathcal{D})$ is always a finite value (a trivial upper bound is $N_{\mathcal{L}}(\mathcal{D}) \leq 2^{m}$).
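To illustrate the definition, the following minimal sketch counts the distinct projections of a small language of itemsets on a toy dataset. The dataset, the choice of itemsets as the language, and the helper name `count_distinct_projections` are assumptions made for this example only; they are not taken from the FSR implementation.

```python
from itertools import combinations

def count_distinct_projections(transactions, patterns, supports):
    """Count the distinct index sets {i : P is supported by s_i} over all patterns P."""
    projections = set()
    for pattern in patterns:
        cover = frozenset(
            i for i, s in enumerate(transactions) if supports(pattern, s)
        )
        projections.add(cover)
    return len(projections)

# Toy dataset (hypothetical): each transaction is a set of items.
transactions = [{"a", "b"}, {"a", "b", "c"}, {"c"}, {"a", "b"}]
items = sorted(set().union(*transactions))

# Language used for illustration: all non-empty itemsets over the observed items.
patterns = [
    frozenset(combo)
    for size in range(1, len(items) + 1)
    for combo in combinations(items, size)
]

# An itemset is supported by a transaction when it is a subset of it.
n_proj = count_distinct_projections(transactions, patterns, lambda p, s: p <= s)
print(len(patterns), n_proj, 2 ** len(transactions))
```

In this toy case the seven itemsets collapse to three distinct projections, well below the trivial bound $2^{m} = 16$.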

Theorem 2.

Let $\lambda \in (0,1)$, and $\hat{\omega} = \hat{\mu}(1-\check{\mu}) \sup_{\mathcal{P} \in \mathcal{L}} \mathsf{f}_{\mathcal{P}}(\mathcal{D})$. The value of $\tilde{d}(\mathcal{R}^{\star},\check{\mu})$ computed by FSR in line 1 of Algorithm 1 is

$$\tilde{d}(\mathcal{R}^{\star},\check{\mu}) \leq \sqrt{\frac{2\hat{\omega}\ln(N_{\mathcal{L}}(\mathcal{D}))}{m}} + \sqrt{\frac{\ln(\frac{1}{\lambda})}{2cm}} + \frac{\ln(N_{\mathcal{L}}(\mathcal{D}))}{3m}$$

with probability at least $1-\lambda$.

Proof.

To obtain the statement, we first prove an upper bound to the expectation $\mathbb{E}_{\mathbf{v} \sim I(\hat{\mu})}[\tilde{d}(\mathcal{R}^{\star},\check{\mu})]$, that is

$$\mathbb{E}_{\mathbf{v} \sim I(\hat{\mu})}[\tilde{d}(\mathcal{R}^{\star},\check{\mu})] \leq \sqrt{\frac{2\hat{\omega}\ln(N_{\mathcal{L}}(\mathcal{D}))}{m}} + \frac{\ln(N_{\mathcal{L}}(\mathcal{D}))}{3m},$$

and then conclude with a concentration bound for $\tilde{d}(\mathcal{R}^{\star},\check{\mu})$ w.r.t. its expectation $\mathbb{E}_{\mathbf{v} \sim I(\hat{\mu})}[\tilde{d}(\mathcal{R}^{\star},\check{\mu})]$, using derivations analogous to those of the proof of Theorem 6. For any $\mathcal{P} \in \mathcal{L}$, define $X_{\mathcal{P}} = \frac{1}{m}\sum_{i=1}^{m} f_{\mathcal{P}}(s_{i})(\mathbf{v}_{i} - \check{\mu})$.
Let $\mathcal{P}_{1},\mathcal{P}_{2}$ be two patterns such that the projection of $\mathcal{P}_{1}$ on $\mathcal{D}$ is equal to the projection of $\mathcal{P}_{2}$ on $\mathcal{D}$, i.e., it holds $\{i : \mathcal{P}_{1} \in s_{i}\} = \{i : \mathcal{P}_{2} \in s_{i}\}$. This implies that $\sup_{i \in \{1,2\}} X_{\mathcal{P}_{i}} = X_{\mathcal{P}_{1}}$.
Therefore, we can rewrite the supremum within $\mathbb{E}_{\mathbf{v} \sim I(\hat{\mu})}[\tilde{d}(\mathcal{R}^{\star},\check{\mu})]$ over the set of patterns with distinct projections, recalling that the number of such distinct projections is $N_{\mathcal{L}}(\mathcal{D})$. Now, for any $\mathcal{P} \in \mathcal{L}$, Bernstein's inequality (Theorem 2.10 in (Boucheron et al., 2013)) implies that $X_{\mathcal{P}}$ is a sub-gamma random variable (Section 2.4 (Boucheron et al., 2013)), such that $X_{\mathcal{P}} \in \Gamma_{+}(u,b)$ with $u = \hat{\omega}/m$ and $b = 1/(3m)$, since it is an average of i.i.d. random variables $f_{\mathcal{P}}(s_{i})(\mathbf{v}_{i} - \check{\mu})$ that are bounded in the interval $[-\check{\mu}, 1-\check{\mu}]$ and have variance $\leq \hat{\omega}$.
Consequently, we apply a maximal inequality (Corollary 2.6 in (Boucheron et al., 2013)) to upper bound the expected maximum of sub-gamma random variables, obtaining that $\mathbb{E}[\max_{\mathcal{P}} X_{\mathcal{P}}] \leq \sqrt{2u\ln(N_{\mathcal{L}}(\mathcal{D}))} + b\ln(N_{\mathcal{L}}(\mathcal{D}))$. Substituting $u = \hat{\omega}/m$ and $b = 1/(3m)$ yields the upper bound to the expectation given above. The statement follows from the application of Theorem 1 to $\tilde{d}(\mathcal{R}^{\star},\check{\mu})$, following the same steps of the proof of Theorem 6. ∎

To upper bound $N_{\mathcal{L}}(\mathcal{D})$ for the language of subgroups, we prove the following. Note that we focus on the case of subgroups with continuous features, since every categorical feature $f_{c}$ can be converted to a discrete one $f_{d}$ (assigning a random order to the distinct elements), and observing that each equality condition "$f_{c} = a$" is equivalent to the interval "$f_{d} \in [a-x, a+x]$" for some $x > 0$. Therefore, the language of subgroups over continuous features contains the language over datasets with both continuous and categorical features, thus all results for the former apply to the latter. Note that the reverse direction is not true in general.

Lemma 3.

Let $\mathcal{L}$ be the language of subgroups composed of conjunctions with at most $z$ conditions over $d$ continuous features. Then, it holds $N_{\mathcal{L}}(\mathcal{D}) \leq \left(\frac{e^{3} d m^{2}}{4 z^{3}}\right)^{z}$.

Proof.

Consider a dataset $\mathcal{D}$ with $d$ continuous features, and let $[1,d]$ be the indices of these features. Consider any $A \subseteq [1,d]$ with $|A| = v \leq z$, and define the language $\mathcal{L}_{A} \subseteq \mathcal{L}$ as all subgroups with $v$ conjunction terms involving conditions on the features of $A$, such as inequalities or intervals. Equivalently, we can see the projection of $\mathcal{L}_{A}$ over the transactions of $\mathcal{D}$ as the class of axis-aligned rectangles in $\mathbb{R}^{v}$. It is known that the VC-dimension of this class is $2v$ (Problem 6.5 in (Shalev-Shwartz and Ben-David, 2014)).
Therefore, from Sauer-Shelah-Perles' Lemma (Lemma 6.10 in (Shalev-Shwartz and Ben-David, 2014)), the number $N_{\mathcal{L}_{A}}(\mathcal{D})$ of distinct projections of $\mathcal{L}_{A}$ on $\mathcal{D}$ is $N_{\mathcal{L}_{A}}(\mathcal{D}) \leq \sum_{i=1}^{2v} \binom{m}{i} \leq \left(\frac{em}{2v}\right)^{2v}$. From a union bound,

$$N_{\mathcal{L}}(\mathcal{D}) \leq \sum_{A} N_{\mathcal{L}_{A}}(\mathcal{D}) \leq \sum_{v=1}^{z} \binom{d}{v} \left(\frac{em}{2v}\right)^{2v} \leq \left(\frac{em}{2z}\right)^{2z} \left(\frac{ed}{z}\right)^{z},$$

obtaining the statement. ∎
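As a sanity check of Lemma 3 in the simplest setting, the sketch below enumerates by brute force the distinct projections of closed intervals over a single continuous feature ($d = 1$, $z = 1$) and compares their number with the bound of the lemma. The data values and the function names are hypothetical and chosen only for illustration.

```python
import math

def interval_projections(values):
    """Distinct index sets selected by closed intervals [lo, hi] on one feature."""
    cuts = sorted(set(values))
    # Start with the empty projection: an interval containing no data point.
    projections = {frozenset()}
    # Interval endpoints can be restricted to observed values without
    # losing any projection.
    for lo_idx in range(len(cuts)):
        for hi_idx in range(lo_idx, len(cuts)):
            lo, hi = cuts[lo_idx], cuts[hi_idx]
            projections.add(
                frozenset(i for i, x in enumerate(values) if lo <= x <= hi)
            )
    return projections

def lemma3_bound(d, m, z):
    """Right-hand side of Lemma 3: (e^3 d m^2 / (4 z^3))^z."""
    return (math.e ** 3 * d * m ** 2 / (4 * z ** 3)) ** z

values = [0.3, 1.7, 2.2, 4.0, 5.5]  # m = 5 transactions, one feature
n_proj = len(interval_projections(values))
bound = lemma3_bound(d=1, m=len(values), z=1)
print(n_proj, bound)
```

With five distinct values there are $15$ non-empty contiguous ranges plus the empty projection, far fewer than the roughly $125.5$ allowed by the bound, which is expected since the lemma is a worst-case estimate.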

Combining Lemma 3 and Theorem 2, we obtain the following Corollary.

Corollary 4.

Let $\mathcal{L}$ be the language of subgroups composed of conjunctions with at most $z$ conditions over $d$ continuous features. Then the value $\tilde{d}(\mathcal{R}^{\star},\check{\mu})$ computed by FSR in line 1 of Algorithm 1 is

$$\tilde{d}(\mathcal{R}^{\star},\check{\mu}) \leq \sqrt{\frac{2\hat{\omega} z \ln\left(\frac{e^{3} d m^{2}}{4 z^{3}}\right)}{m}} + \sqrt{\frac{\ln(\frac{1}{\lambda})}{2cm}} + \frac{z \ln\left(\frac{e^{3} d m^{2}}{4 z^{3}}\right)}{3m}$$

with probability at least $1-\lambda$.
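To get a feeling for the magnitude of these bounds, the following sketch evaluates the right-hand sides of Theorem 2 and Corollary 4 numerically. All function names and the parameter values are assumptions made for illustration; in particular, $c$ is treated simply as the constant appearing in the bound, since it is not defined in this section.

```python
import math

def theorem2_bound(omega_hat, log_n_proj, m, c, lam):
    """Right-hand side of Theorem 2, given log_n_proj = ln(N_L(D))."""
    return (
        math.sqrt(2 * omega_hat * log_n_proj / m)
        + math.sqrt(math.log(1 / lam) / (2 * c * m))
        + log_n_proj / (3 * m)
    )

def subgroup_log_projections(d, m, z):
    """Logarithm of the Lemma 3 bound: z * ln(e^3 d m^2 / (4 z^3))."""
    return z * math.log(math.e ** 3 * d * m ** 2 / (4 * z ** 3))

def corollary4_bound(omega_hat, d, m, z, c, lam):
    """Right-hand side of Corollary 4 (Theorem 2 combined with Lemma 3)."""
    return theorem2_bound(omega_hat, subgroup_log_projections(d, m, z), m, c, lam)

# Hypothetical parameters: 10 features, at most 3 conditions per subgroup.
small = corollary4_bound(omega_hat=0.1, d=10, m=10**4, z=3, c=10, lam=0.05)
large = corollary4_bound(omega_hat=0.1, d=10, m=10**6, z=3, c=10, lam=0.05)
print(small, large)
```

Under these assumed values the bound shrinks by roughly an order of magnitude when $m$ grows from $10^{4}$ to $10^{6}$, consistent with the leading $\sqrt{\ln(m)/m}$ behaviour of the first term.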

A.5.1. Power analysis of FSR-C

To prove Theorem 8, regarding the power of FSR-C, we first describe the model we assume for the distribution of the alternative hypotheses, i.e., the set of patterns with $\mathsf{q}_{\mathcal{P}} > 0$. We assume that the quality of patterns correlated with the target follows Wallenius' noncentral hypergeometric distribution (Wallenius, 1963), a generalization of the hypergeometric distribution that models a biased random sampling of a contingency table with fixed marginals. We define the model $W_{n}(\{m_{i}, w_{i}\})$ that describes the distribution of a sequence of binary random variables $\ell_{1},\dots,\ell_{n}$ for the weighted sampling of $n$ elements from a set of $m_{1}$ items with label $1$ and $m_{0}$ items with label $0$; the parameters $w_{0}$ and $w_{1}$, with $1 \leq w_{0} < w_{1}$, are respectively the weights of items with label $0$ and $1$. The fact that $w_{1} > w_{0}$ expresses the bias toward sampling items with label $1$.
The first element $\ell_{1}$ is sampled according to the weighted proportion of the items, such that

$$\Pr\left(\ell_{1} = 1\right) = \frac{m_{1} w_{1}}{m_{1} w_{1} + m_{0} w_{0}}.$$

The second element $\ell_{2}$ is taken according to the weighted proportion of the remaining items, therefore depending on the outcome of the first choice. In general, we obtain that

$$\Pr\left(\ell_{i} = 1\right) = \frac{m_{1}^{i} w_{1}}{m_{1}^{i} w_{1} + m_{0}^{i} w_{0}},$$

where $m_{1}^{i} = m_{1} - \sum_{j=1}^{i-1} \ell_{j}$ and $m_{0}^{i} = m_{0} - \sum_{j=1}^{i-1} (1 - \ell_{j})$ are, respectively, the number of remaining items with label $1$ and label $0$, i.e., the ones that are not sampled in previous steps. We may observe that, as the bias goes to $0$ (i.e., $w_{1} \rightarrow w_{0}$), this distribution converges to the (standard) hypergeometric distribution. We remark that a direct calculation of the mean and the probability mass function of the noncentral hypergeometric distribution above, and therefore the computation of exact tail bounds, is extremely unwieldy. Moreover, the random variables $\{\ell_{i}, i \in [1,n]\}$ are clearly not independent, therefore standard concentration results (e.g., Chernoff-Hoeffding bounds) do not apply directly.
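The sequential sampling process just described can be sketched in a few lines. This is only an illustrative simulation of the model $W_{n}(\{m_{i}, w_{i}\})$ under the stated update rule; the function name `sample_biased_labels` and the parameter values are hypothetical.

```python
import random

def sample_biased_labels(n, m1, m0, w1, w0, rng):
    """Draw l_1, ..., l_n without replacement from m1 items with label 1
    (weight w1) and m0 items with label 0 (weight w0)."""
    if n > m1 + m0:
        raise ValueError("cannot sample more items than are available")
    labels = []
    rem1, rem0 = m1, m0  # remaining items with label 1 and label 0
    for _ in range(n):
        # Weighted proportion of the remaining label-1 items.
        p_one = rem1 * w1 / (rem1 * w1 + rem0 * w0)
        label = 1 if rng.random() < p_one else 0
        labels.append(label)
        if label == 1:
            rem1 -= 1
        else:
            rem0 -= 1
    return labels

rng = random.Random(0)
labels = sample_biased_labels(n=20, m1=30, m0=70, w1=3.0, w0=1.0, rng=rng)
print(sum(labels), "items with label 1 out of", len(labels))
```

Setting `w1 == w0` in this sketch removes the bias, and the number of sampled label-1 items then follows the standard hypergeometric distribution, as noted above.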

To overcome these issues, we leverage an advanced concentration bound for martingales, which also applies to random variables that are not necessarily independent (Dubhashi and Panconesi, 2009). We use the following version of the method of bounded differences (McDiarmid, 1989). For a given set of random variables $X_{1},\dots,X_{n}$, we use $\mathbf{X}_{i}$ to denote the set $\{X_{j}, j \in [1,i]\}$.

Definition 5.

A function $f$ satisfies the Averaged Lipschitz Condition (ALC) with parameters $c_{i}, i \in [n]$, with respect to the random variables $X_{1},\dots,X_{n}$ if for any $a_{i}, a^{\prime}_{i}$,

$$\lvert \mathbb{E}\left[f \mid \mathbf{X}_{i-1}, X_{i} = a_{i}\right] - \mathbb{E}\left[f \mid \mathbf{X}_{i-1}, X_{i} = a^{\prime}_{i}\right] \rvert \leq c_{i},$$

for $1 \leq i \leq n$.

The following result establishes concentration bounds for functions that satisfy the ALC.

Theorem 6 (Corollary 5.1 (Dubhashi and Panconesi, 2009)).

Let $f$ satisfy the ALC with parameters $c_{i}, i \in [n]$, with respect to the random variables $X_{1},\dots,X_{n}$, and let $C = \sum_{i=1}^{n} c_{i}^{2}$. Then it holds

$$\Pr\left(f > \mathbb{E}[f] + t\right), \Pr\left(f < \mathbb{E}[f] - t\right) \leq \exp(-2t^{2}/C).$$

We now prove that the sum $\sum_{i=1}^{n} \ell_{i}$ is a function that satisfies the ALC, and thus is sharply concentrated around its expectation.

Theorem 7.

Let $\{\ell_{i}, i \in [1,n]\}$ be a set of random variables distributed according to $W_{n}(\{m_{i}, w_{i}\})$. Denote $X_{i} = \ell_{i}$ and the function $f = \sum_{i=1}^{n} X_{i}$. Then, it holds

$$\Pr\left(f > \mathbb{E}[f] + t\right), \Pr\left(f < \mathbb{E}[f] - t\right) \leq \exp(-2t^{2}/n).$$
Proof.

We prove that f𝑓fitalic_f satisfy the ALC. For any i[1,n]𝑖1𝑛i\in[1,n]italic_i ∈ [ 1 , italic_n ],

𝔼[j=1nXj𝐗i1,Xi=1]𝔼[j=1nXj𝐗i1,Xi=0]𝔼delimited-[]conditionalsuperscriptsubscript𝑗1𝑛subscript𝑋𝑗subscript𝐗𝑖1subscript𝑋𝑖1𝔼delimited-[]conditionalsuperscriptsubscript𝑗1𝑛subscript𝑋𝑗subscript𝐗𝑖1subscript𝑋𝑖0\displaystyle\mathop{\mathbb{E}}\Bigl{[}\sum_{j=1}^{n}X_{j}\mid\mathbf{X}_{i-1% },X_{i}=1\Bigr{]}-\mathop{\mathbb{E}}\Bigl{[}\sum_{j=1}^{n}X_{j}\mid\mathbf{X}% _{i-1},X_{i}=0\Bigr{]}blackboard_E [ ∑ start_POSTSUBSCRIPT italic_j = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT italic_X start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ∣ bold_X start_POSTSUBSCRIPT italic_i - 1 end_POSTSUBSCRIPT , italic_X start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = 1 ] - blackboard_E [ ∑ start_POSTSUBSCRIPT italic_j = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT italic_X start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ∣ bold_X start_POSTSUBSCRIPT italic_i - 1 end_POSTSUBSCRIPT , italic_X start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = 0 ]
=1+𝔼[j=i+1nXj𝐗i1,Xi=1]𝔼[j=i+1nXj𝐗i1,Xi=0]absent1𝔼delimited-[]conditionalsuperscriptsubscript𝑗𝑖1𝑛subscript𝑋𝑗subscript𝐗𝑖1subscript𝑋𝑖1𝔼delimited-[]conditionalsuperscriptsubscript𝑗𝑖1𝑛subscript𝑋𝑗subscript𝐗𝑖1subscript𝑋𝑖0\displaystyle=1+\mathop{\mathbb{E}}\Bigl{[}\sum_{j=i+1}^{n}X_{j}\mid\mathbf{X}% _{i-1},X_{i}=1\Bigr{]}-\mathop{\mathbb{E}}\Bigl{[}\sum_{j=i+1}^{n}X_{j}\mid% \mathbf{X}_{i-1},X_{i}=0\Bigr{]}= 1 + blackboard_E [ ∑ start_POSTSUBSCRIPT italic_j = italic_i + 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT italic_X start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ∣ bold_X start_POSTSUBSCRIPT italic_i - 1 end_POSTSUBSCRIPT , italic_X start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = 1 ] - blackboard_E [ ∑ start_POSTSUBSCRIPT italic_j = italic_i + 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT italic_X start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ∣ bold_X start_POSTSUBSCRIPT italic_i - 1 end_POSTSUBSCRIPT , italic_X start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = 0 ]
1+𝔼[j=i+1nXj𝐗i1,Xi=0]𝔼[j=i+1nXj𝐗i1,Xi=0]=1.absent1𝔼delimited-[]conditionalsuperscriptsubscript𝑗𝑖1𝑛subscript𝑋𝑗subscript𝐗𝑖1subscript𝑋𝑖0𝔼delimited-[]conditionalsuperscriptsubscript𝑗𝑖1𝑛subscript𝑋𝑗subscript𝐗𝑖1subscript𝑋𝑖01\displaystyle\leq 1+\mathop{\mathbb{E}}\Bigl{[}\sum_{j=i+1}^{n}X_{j}\mid% \mathbf{X}_{i-1},X_{i}=0\Bigr{]}-\mathop{\mathbb{E}}\Bigl{[}\sum_{j=i+1}^{n}X_% {j}\mid\mathbf{X}_{i-1},X_{i}=0\Bigr{]}=1.≤ 1 + blackboard_E [ ∑ start_POSTSUBSCRIPT italic_j = italic_i + 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT italic_X start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ∣ bold_X start_POSTSUBSCRIPT italic_i - 1 end_POSTSUBSCRIPT , italic_X start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = 0 ] - blackboard_E [ ∑ start_POSTSUBSCRIPT italic_j = italic_i + 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT italic_X start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ∣ bold_X start_POSTSUBSCRIPT italic_i - 1 end_POSTSUBSCRIPT , italic_X start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = 0 ] = 1 .

We then prove the other direction:

\begin{align*}
&\mathbb{E}\Bigl[\sum_{j=1}^{n}X_{j}\mid\mathbf{X}_{i-1},X_{i}=0\Bigr]-\mathbb{E}\Bigl[\sum_{j=1}^{n}X_{j}\mid\mathbf{X}_{i-1},X_{i}=1\Bigr]\\
&=\mathbb{E}\Bigl[\sum_{j=i+1}^{n}X_{j}\mid\mathbf{X}_{i-1},X_{i}=0\Bigr]-1-\mathbb{E}\Bigl[\sum_{j=i+1}^{n}X_{j}\mid\mathbf{X}_{i-1},X_{i}=1\Bigr]\\
&\leq 1+\mathbb{E}\Bigl[\sum_{j=i+1}^{n}X_{j}\mid\mathbf{X}_{i-1},X_{i}=1\Bigr]-1-\mathbb{E}\Bigl[\sum_{j=i+1}^{n}X_{j}\mid\mathbf{X}_{i-1},X_{i}=1\Bigr]\\
&=0.
\end{align*}

We conclude that $f$ satisfies the ALC with $c_{i}=1$, for all $1\leq i\leq n$, and the concentration bounds follow from Theorem 6. ∎

Using the result above, we prove a probabilistic lower bound to the observed quality $\bar{\mathsf{q}}_{\mathcal{P}}(\mathcal{D})$ of a pattern $\mathcal{P}$. For any $\mathcal{P}\in\mathcal{L}$, let $n=m\mathsf{f}_{\mathcal{P}}(\mathcal{D})$ be the number of transactions of $\mathcal{D}$ where $\mathcal{P}$ is supported, and assume w.l.o.g. that $\{i:\mathcal{P}\in s_{i},i\in[1,m]\}=[1,n]$. We define $X_{i}=\ell_{i}$ for all $i\in[1,n]$, where $\ell_{i}$ follows the biased sampling distribution described above.

Proposition 8.

It holds

\[
\Pr\left(\bar{\mathsf{q}}_{\mathcal{P}}(\mathcal{D})\leq\mathsf{q}_{\mathcal{P}}-t\right)\leq\exp\left(\frac{-2mt^{2}}{\mathsf{f}_{\mathcal{P}}(\mathcal{D})}\right).
\]
Proof.

Define the function $g=\frac{1}{m}\sum_{i=1}^{n}(\ell_{i}-\mu(\mathcal{D}))$. Considering the function $f$ in the statement of Theorem 7, we observe that it holds $g=f/m-\mu(\mathcal{D})$, and that $n=m\mathsf{f}_{\mathcal{P}}(\mathcal{D})$. From these observations, we apply Theorem 7 to the function $f=m(g+\mu(\mathcal{D}))$, obtaining the statement after simple manipulations of the r.h.s. ∎
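To get a feel for the tail bound of Proposition 8, the following minimal Python sketch evaluates its right-hand side. The function and argument names are illustrative assumptions and are not part of FSR; `freq` stands for $\mathsf{f}_{\mathcal{P}}(\mathcal{D})$.

```python
import math


def quality_lower_tail_bound(m: int, freq: float, t: float) -> float:
    """Right-hand side of Proposition 8: exp(-2 m t^2 / f_P(D)).

    m    : number of transactions in the dataset (hypothetical argument name)
    freq : observed frequency f_P(D) of the pattern, in (0, 1]
    t    : deviation of the observed quality below the true quality
    """
    if m <= 0 or not (0.0 < freq <= 1.0) or t < 0.0:
        raise ValueError("need m > 0, 0 < freq <= 1 and t >= 0")
    return math.exp(-2.0 * m * t * t / freq)
```

For a fixed deviation $t$, the bound is smaller for patterns with lower frequency, since fewer labels contribute to the quality.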

We now extend Proposition 8 to obtain a bound valid simultaneously for all patterns of the language $\mathcal{L}$.

Proposition 9.

Define $\hat{\mathsf{f}}(\mathcal{D})=\sup_{\mathcal{P}}\mathsf{f}_{\mathcal{P}}(\mathcal{D})$. With probability $\geq 1-\delta$, it holds

\[
\bar{\mathsf{q}}_{\mathcal{P}}(\mathcal{D})\geq\mathsf{q}_{\mathcal{P}}-\sqrt{\frac{2\hat{\mathsf{f}}(\mathcal{D})\ln(N_{\mathcal{L}}(\mathcal{D})/\delta)}{m}},\quad\forall\mathcal{P}\in\mathcal{L}.
\]
Proof.

Define the event

\[
E_{\mathcal{P}}=\text{``}\bar{\mathsf{q}}_{\mathcal{P}}(\mathcal{D})\leq\mathsf{q}_{\mathcal{P}}-\sqrt{\frac{2\hat{\mathsf{f}}(\mathcal{D})\ln(N_{\mathcal{L}}(\mathcal{D})/\delta)}{m}}\text{''}.
\]

The statement holds if $\Pr(\cup_{\mathcal{P}}E_{\mathcal{P}})\leq\delta$. Now, consider two patterns $\mathcal{P}_{1},\mathcal{P}_{2}$ that are supported by the same set of transactions: $\{i:\mathcal{P}_{1}\in s_{i}\}=\{i:\mathcal{P}_{2}\in s_{i}\}$. This implies that $\bar{\mathsf{q}}_{\mathcal{P}_{1}}(\mathcal{D})=\bar{\mathsf{q}}_{\mathcal{P}_{2}}(\mathcal{D})$ for all possible labels, but also that $\mathsf{q}_{\mathcal{P}_{1}}=\mathsf{q}_{\mathcal{P}_{2}}$, since the two qualities are functions of the same set of labels. From these observations, it holds $E_{\mathcal{P}_{1}}\cup E_{\mathcal{P}_{2}}=E_{\mathcal{P}_{1}}$. Therefore, the union over all patterns can be replaced by the union over the ones with distinct projections, whose number is at most $N_{\mathcal{L}}(\mathcal{D})$. From a union bound over these events, and using the tail bound of Proposition 8 with the fact $\mathsf{f}_{\mathcal{P}}(\mathcal{D})\leq\hat{\mathsf{f}}(\mathcal{D})$, we have, for a deviation $t$ in place of the square-root term of $E_{\mathcal{P}}$,

\[
\Pr\left(\cup_{\mathcal{P}}E_{\mathcal{P}}\right)\leq N_{\mathcal{L}}(\mathcal{D})\exp\bigl(-2mt^{2}/\hat{\mathsf{f}}(\mathcal{D})\bigr).
\]

Imposing the r.h.s. of the inequality to be $\leq\delta$, and solving for $t$, we obtain the statement. ∎
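As a numerical companion to Proposition 9, the sketch below computes the deviation term of its statement and checks it against the union bound used in the proof. All function and parameter names are hypothetical; `f_hat` stands for $\hat{\mathsf{f}}(\mathcal{D})$ and `n_projections` for $N_{\mathcal{L}}(\mathcal{D})$ (or any upper bound to it).

```python
import math


def uniform_quality_radius(m: int, f_hat: float, n_projections: float, delta: float) -> float:
    """Deviation term of Proposition 9: sqrt(2 f_hat ln(N / delta) / m)."""
    if m <= 0 or not (0.0 < f_hat <= 1.0):
        raise ValueError("need m > 0 and 0 < f_hat <= 1")
    if n_projections < 1 or not (0.0 < delta < 1.0):
        raise ValueError("need n_projections >= 1 and 0 < delta < 1")
    return math.sqrt(2.0 * f_hat * math.log(n_projections / delta) / m)


def union_tail_bound(m: int, f_hat: float, n_projections: float, t: float) -> float:
    """Union bound from the proof: N * exp(-2 m t^2 / f_hat)."""
    return n_projections * math.exp(-2.0 * m * t * t / f_hat)
```

Plugging the radius returned by `uniform_quality_radius` into `union_tail_bound` gives a value that does not exceed `delta`, which is the condition imposed in the proof.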

Finally, we prove the power guarantees for FSR-C.

Proof of Theorem 8.

To obtain the statement, we show that the condition $\mathsf{q}_{\mathcal{P}}\geq\xi$, for some $\xi>0$ sufficiently large, implies that $\mathcal{P}$ is reported in output with probability $\geq 1-\delta$. In fact, the implication $\mathsf{q}_{\mathcal{P}}\geq\xi\implies\bar{\mathsf{q}}_{\mathcal{P}}(\mathcal{D})\geq\xi-\varepsilon_{1}$ holds with probability $\geq 1-\delta/2$ for all $\mathcal{P}\in\mathcal{L}$, where $\varepsilon_{1}=\sqrt{\frac{2\hat{\mathsf{f}}(\mathcal{D})z\ln(\frac{e^{3}m^{2}d}{2z^{3}\delta})}{m}}$ is obtained from Proposition 9 (replacing $\delta$ by $\delta/2$, and using the upper bound to $N_{\mathcal{L}}(\mathcal{D})$ from Lemma 3). Consequently, we seek a sufficient condition to ensure that $\mathcal{P}$ is reported in output; this condition is $\bar{\mathsf{q}}_{\mathcal{P}}(\mathcal{D})\geq\xi-\varepsilon_{1}\geq\varepsilon$, that is, $\xi\geq\varepsilon_{1}+\varepsilon$, where $\varepsilon$ is as defined in Corollary 7. Combining this inequality with the upper bound to $\tilde{d}(\mathcal{R}^{\star},\check{\mu})$ of Theorem 2 (setting $\lambda=\delta/2$), we obtain the lower bound to $\mathsf{q}_{\mathcal{P}}$ in the statement, which holds with probability $\geq 1-\delta$ from a union bound. ∎

A.5.2. Power analysis of FSR-U

The following is a restatement of Theorem 5 of (Li et al., 2001), which provides uniform convergence bounds for families of functions with bounded pseudodimension (Pollard, 2012; Shalev-Shwartz and Ben-David, 2014).

Theorem 10.

Let $\mathcal{F}$ be a family of functions from a domain $\mathcal{X}$ to $[a,b]\subset\mathbb{R}$ with pseudodimension $P(\mathcal{F})\leq r$. Let $\mathcal{S}=\{s_{1},\dots,s_{m}\}$ be a random sample of size $m$ taken i.i.d. from a distribution $\gamma$. It holds

\[
\Pr\left(\sup_{f\in\mathcal{F}}\left\lvert\mathbb{E}_{\mathcal{S}}\left[\frac{1}{m}\sum_{i=1}^{m}f(s_{i})\right]-\frac{1}{m}\sum_{i=1}^{m}f(s_{i})\right\rvert>t\right)\leq\exp\left(r-\frac{t^{2}m}{\tilde{c}(b-a)^{2}}\right),
\]

where $\tilde{c}$ is an absolute constant.

We note that the absolute constant $\tilde{c}$ in the theorem above is estimated to be at most $0.5$ (Löffler and Phillips, 2009).
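The bound of Theorem 10 can be evaluated, and inverted for a target failure probability, with a few lines of Python. This is only a sketch under stated assumptions: the names are hypothetical, and the default `c_tilde=0.5` is the estimate quoted above rather than a proven constant.

```python
import math

C_TILDE = 0.5  # assumed value of the absolute constant, taken from the estimate quoted in the text


def pseudodim_tail_bound(r: float, m: int, t: float,
                         a: float = 0.0, b: float = 1.0,
                         c_tilde: float = C_TILDE) -> float:
    """Right-hand side of Theorem 10: exp(r - t^2 m / (c_tilde (b - a)^2))."""
    return math.exp(r - t * t * m / (c_tilde * (b - a) ** 2))


def pseudodim_radius(r: float, m: int, delta: float,
                     a: float = 0.0, b: float = 1.0,
                     c_tilde: float = C_TILDE) -> float:
    """Deviation t for which the bound of Theorem 10 equals delta.

    Obtained by solving exp(r - t^2 m / (c_tilde (b - a)^2)) = delta for t.
    """
    if m <= 0 or not (0.0 < delta < 1.0) or b <= a:
        raise ValueError("need m > 0, 0 < delta < 1 and a < b")
    return (b - a) * math.sqrt(c_tilde * (r + math.log(1.0 / delta)) / m)
```

With $[a,b]=[0,1]$ and $\tilde{c}=0.5$ the bound reduces to $\exp(r-2t^{2}m)$.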

Using the result above, we prove the following bounds to the supports and qualities for all subgroups.

Proposition 11.

Let $\mathcal{L}$ be the language of subgroups composed of conjunctions with at most $z$ conditions over $d$ continuous features, and define $\mathsf{f}_{\mathcal{P}}=\mathbb{E}_{\mathcal{D}}[\mathsf{f}_{\mathcal{P}}(\mathcal{D})]$. With probability $\geq 1-\delta$ w.r.t. $\mathcal{D}$ it holds, for all $\mathcal{P}\in\mathcal{L}$,

\begin{align*}
\mathsf{f}_{\mathcal{P}}(\mathcal{D})&\leq\mathsf{f}_{\mathcal{P}}+\sqrt{\frac{2z+\ln\bigl(\sum_{i=1}^{z}\binom{d}{i}\bigr)+\ln\bigl(\frac{2}{\delta}\bigr)}{2m}},\\
\mathsf{q}_{\mathcal{P}}(\mathcal{D})&\geq\mathsf{q}_{\mathcal{P}}-\sqrt{\frac{2z+\ln\bigl(\sum_{i=1}^{z}\binom{d}{i}\bigr)+\ln\bigl(\frac{2}{\delta}\bigr)}{2m}}.
\end{align*}
Proof.

Consider a dataset $\mathcal{D}$ with $d$ continuous features, and let $[1,d]$ be the indices of these features. Fix any $A\subseteq[1,d]$ with $|A|=v\leq z$, and define the language $\mathcal{L}_{A}\subseteq\mathcal{L}$ as all subgroups with $v$ conjunction terms involving conditions on the features of $A$, such as inequalities or intervals.

We first prove the bound for $\mathsf{f}_{\mathcal{P}}(\mathcal{D})$. For any $\mathcal{P}\in\mathcal{L}_{A}$, recall the functions $f_{\mathcal{P}}(s)=\mathds{1}\left[\mathcal{P}\in s\right]$, and let $\mathcal{F}_{A}=\{f_{\mathcal{P}}:\mathcal{P}\in\mathcal{L}_{A}\}$ be the family of such functions. It is immediate to observe that the average value of $f_{\mathcal{P}}$ over $\mathcal{D}$ is equal to $\mathsf{f}_{\mathcal{P}}(\mathcal{D})$. As discussed previously, $\mathcal{F}_{A}$ is equivalent to the class of axis-aligned rectangles in $\mathbb{R}^{v}$, with VC-dimension $2v$ (Problem 6.5 in (Shalev-Shwartz and Ben-David, 2014)). We apply Theorem 10 to the particular case of the binary function family $\mathcal{F}_{A}$ (with $r=2v$, $b-a=1$, and $\tilde{c}=0.5$), obtaining that

\[
\Pr\Bigl(\sup_{f_{\mathcal{P}}\in\mathcal{F}_{A}}\left\lvert\mathsf{f}_{\mathcal{P}}-\mathsf{f}_{\mathcal{P}}(\mathcal{D})\right\rvert>t\Bigr)\leq\exp(2v-2t^{2}m).
\]

From a union bound over all subsets $A\subseteq[1,d]$ with cardinality $\leq z$, we have

\begin{align*}
&\Pr\Bigl(\sup_{A}\sup_{f_{\mathcal{P}}\in\mathcal{F}_{A}}\left\lvert\mathsf{f}_{\mathcal{P}}-\mathsf{f}_{\mathcal{P}}(\mathcal{D})\right\rvert>t\Bigr)\\
&\leq\sum_{A}\Pr\Bigl(\sup_{f_{\mathcal{P}}\in\mathcal{F}_{A}}\left\lvert\mathsf{f}_{\mathcal{P}}-\mathsf{f}_{\mathcal{P}}(\mathcal{D})\right\rvert>t\Bigr)\leq\sum_{i=1}^{z}\binom{d}{i}\exp(2i-2t^{2}m).
\end{align*}

Setting the r.h.s. to be $\leq\delta/2$, we obtain the first bound.
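For completeness, the intermediate step can be spelled out as follows. This is a sketch of what we take to be the intended argument; it uses only the fact that $e^{2i}\leq e^{2z}$ for all $i\leq z$.

```latex
\begin{align*}
\sum_{i=1}^{z}\binom{d}{i}\exp(2i-2t^{2}m)
  &\leq \exp(2z-2t^{2}m)\sum_{i=1}^{z}\binom{d}{i}, \\
\exp(2z-2t^{2}m)\sum_{i=1}^{z}\binom{d}{i}\leq\frac{\delta}{2}
  &\iff 2t^{2}m \geq 2z+\ln\Bigl(\sum_{i=1}^{z}\binom{d}{i}\Bigr)+\ln\Bigl(\frac{2}{\delta}\Bigr) \\
  &\iff t \geq \sqrt{\frac{2z+\ln\bigl(\sum_{i=1}^{z}\binom{d}{i}\bigr)+\ln\bigl(\frac{2}{\delta}\bigr)}{2m}} .
\end{align*}
```

Taking $t$ equal to the last expression yields the deviation term in the statement of Proposition 11.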

We now focus on the bound for $\mathsf{q}_{\mathcal{P}}(\mathcal{D})$. For any $\mathcal{P}\in\mathcal{L}_{A}$, recall the functions $g_{\mathcal{P}}(s,\ell)=\mathds{1}\left[\mathcal{P}\in s\right](\ell-\mu)$, and let $\mathcal{G}_{A}=\{g_{\mathcal{P}}:\mathcal{P}\in\mathcal{L}_{A}\}$ be the family of such functions. It is immediate to observe that the average value of $g_{\mathcal{P}}$ over $\mathcal{D}$ is equal to $\mathsf{q}_{\mathcal{P}}(\mathcal{D})$. By using Lemma 3.6 of (Riondato and Vandin, 2020), we obtain that the pseudodimension of $\mathcal{G}_{A}$ is equal to the VC-dimension of $\mathcal{F}_{A}$, which is $2v$. Therefore, we apply Theorem 10 to $\mathcal{G}_{A}$, having

\[
\Pr\Bigl(\sup_{g_{\mathcal{P}}\in\mathcal{G}_{A}}\left\lvert\mathsf{q}_{\mathcal{P}}-\mathsf{q}_{\mathcal{P}}(\mathcal{D})\right\rvert>t\Bigr)\leq\exp(2v-2t^{2}m).
\]

We complete the proof following the same steps used for the frequencies. ∎
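The deviation term shared by the two bounds of Proposition 11 is easy to compute. The sketch below does so and also evaluates the union bound over feature subsets used in the proof; the function and parameter names are illustrative assumptions, and `math.comb` requires Python 3.8 or later.

```python
import math


def subgroup_radius(m: int, d: int, z: int, delta: float) -> float:
    """Deviation term of Proposition 11.

    sqrt((2z + ln(sum_{i=1..z} C(d, i)) + ln(2 / delta)) / (2m))
    """
    if m <= 0 or not (1 <= z <= d) or not (0.0 < delta < 1.0):
        raise ValueError("need m > 0, 1 <= z <= d and 0 < delta < 1")
    # Number of feature subsets A with 1 <= |A| <= z.
    n_feature_sets = sum(math.comb(d, i) for i in range(1, z + 1))
    numerator = 2 * z + math.log(n_feature_sets) + math.log(2.0 / delta)
    return math.sqrt(numerator / (2.0 * m))


def union_bound_over_feature_sets(m: int, d: int, z: int, t: float) -> float:
    """Union bound from the proof: sum_{i=1..z} C(d, i) exp(2i - 2 t^2 m)."""
    return sum(math.comb(d, i) * math.exp(2 * i - 2.0 * t * t * m)
               for i in range(1, z + 1))
```

Evaluating `union_bound_over_feature_sets` at the radius returned by `subgroup_radius` gives a value of at most `delta / 2`, matching the failure probability assigned to each of the two bounds.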

We are now ready to prove Theorem 13.

Proof of Theorem 13.

From the guarantees of Corollary 4, the value of $\varepsilon$ defined in the statement is an upper bound to the value of $\varepsilon$ computed by FSR-U in Algorithm 1 with probability $\geq 1-\delta/3$ (since $\lambda=\delta/3$ in the definition of $\hat{r}$). We now prove a condition on the quality $\mathsf{q}_{\mathcal{P}}$ for a pattern to be reported in output. From Proposition 11 (replacing $\delta/2$ by $\delta/3$), a pattern with $\mathsf{q}_{\mathcal{P}}\geq\xi$ has, with probability $\geq 1-\delta/3$, its estimate $\mathsf{q}_{\mathcal{P}}(\mathcal{D})\geq\xi-\varepsilon_{1}$, where

\[
\varepsilon_{1}=\sqrt{\frac{2z+\ln\bigl(\sum_{i=1}^{z}\binom{d}{i}\bigr)+\ln\bigl(\frac{3}{\delta}\bigr)}{2m}}.
\]

Moreover, from the relation between $\mathsf{q}_{\mathcal{P}}(\mathcal{D})$ and $\bar{\mathsf{q}}_{\mathcal{P}}(\mathcal{D})$, it holds $\bar{\mathsf{q}}_{\mathcal{P}}(\mathcal{D})\geq\xi-\varepsilon_{1}-\varepsilon_{T}\mathsf{f}_{\mathcal{P}}(\mathcal{D})$. Consequently, $\mathcal{P}$ is reported in output if $\bar{\mathsf{q}}_{\mathcal{P}}(\mathcal{D})\geq\xi-\varepsilon_{1}-\varepsilon_{T}\mathsf{f}_{\mathcal{P}}(\mathcal{D})\geq\varepsilon+\varepsilon_{T}\mathsf{f}_{\mathcal{P}}(\mathcal{D})$, that is, if $\xi\geq\varepsilon_{1}+2\varepsilon_{T}\mathsf{f}_{\mathcal{P}}(\mathcal{D})+\varepsilon$. Using the upper bound to $\mathsf{f}_{\mathcal{P}}(\mathcal{D})$ proved in Proposition 11 (replacing $\delta/2$ by $\delta/3$), the condition $\mathsf{q}_{\mathcal{P}}\geq\xi\geq\varepsilon_{1}+2\varepsilon_{T}(\mathsf{f}_{\mathcal{P}}+\varepsilon_{1})+\varepsilon$ is sufficient to guarantee that, with probability $\geq 1-\delta$, $\mathcal{P}$ is reported in output, obtaining the statement. ∎
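The sufficient condition derived in this proof can be turned into a small calculator. The sketch below is illustrative only: the names are hypothetical, and `eps` (the $\varepsilon$ of the theorem statement) and `eps_T` (the quantity $\varepsilon_{T}$) are defined elsewhere in the paper and are treated here as given inputs. The argument `freq` stands for the expected frequency $\mathsf{f}_{\mathcal{P}}$.

```python
import math


def epsilon_1(m: int, d: int, z: int, delta: float) -> float:
    """epsilon_1 from the proof: Proposition 11 with delta/2 replaced by delta/3."""
    if m <= 0 or not (1 <= z <= d) or not (0.0 < delta < 1.0):
        raise ValueError("need m > 0, 1 <= z <= d and 0 < delta < 1")
    n_feature_sets = sum(math.comb(d, i) for i in range(1, z + 1))
    numerator = 2 * z + math.log(n_feature_sets) + math.log(3.0 / delta)
    return math.sqrt(numerator / (2.0 * m))


def sufficient_quality(m: int, d: int, z: int, delta: float,
                       freq: float, eps: float, eps_T: float) -> float:
    """Threshold eps_1 + 2 eps_T (f_P + eps_1) + eps on the quality q_P.

    eps and eps_T are assumed to be computed elsewhere and are passed in as-is.
    """
    e1 = epsilon_1(m, d, z, delta)
    return e1 + 2.0 * eps_T * (freq + e1) + eps
```

A pattern whose quality $\mathsf{q}_{\mathcal{P}}$ is at least the returned threshold satisfies the sufficient condition of the proof.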

A.6. Baseline methods

In this Section we provide additional details on the baseline algorithm FSR-U-UB considered in our experimental evaluation. As anticipated in Section 5, it is not possible to apply standard statistical techniques (e.g., Bonferroni correction) to bound the FWER when the number $|\mathcal{L}|$ of tested hypotheses is unbounded. Therefore, to overcome this issue we proceed as follows. Recall that $N_{\mathcal{L}}(\mathcal{D})$ is the number of distinct projections of the language $\mathcal{L}$ on the dataset $\mathcal{D}$ (defined in Section A.5). The key idea for FSR-U-UB is to apply a Bonferroni correction on the $N_{\mathcal{L}}(\mathcal{D})$ distinct tested hypotheses only. We do so with the following result, which defines the value of $\varepsilon$ used by FSR-U-UB in Algorithm 1 to bound the FWER. Note that FSR-U-UB does not use any resamples, but uses $N_{\mathcal{L}}(\mathcal{D})$ to obtain an analytical value for $\varepsilon$ instead.

Theorem 12.

Let $\mathcal{D}$ be a dataset of $m$ samples taken i.i.d. from a distribution $\gamma$. For any $\delta\in(0,1)$, define $\nu_{T}\geq\mu(1-\mu)$, $\nu\geq\nu_{T}\sup_{\mathcal{P}\in\mathcal{L}^{\star}}\left\{\mathbb{E}_{\mathcal{D}}\left[\mathsf{f}_{\mathcal{P}}(\mathcal{D})\right]\right\}$, $\hat{N}\geq N_{\mathcal{L}}(\mathcal{D})$, and $\varepsilon$ as

r^=ln(4N^δ)2m^𝑟4^𝑁𝛿2𝑚\displaystyle\hat{r}=\sqrt{\frac{\ln(\frac{4\hat{N}}{\delta})}{2m}}over^ start_ARG italic_r end_ARG = square-root start_ARG divide start_ARG roman_ln ( divide start_ARG 4 over^ start_ARG italic_N end_ARG end_ARG start_ARG italic_δ end_ARG ) end_ARG start_ARG 2 italic_m end_ARG end_ARG
d^=r^+(2νTln(4δ)m)2+2r^ln(4δ)m+2νTln(4δ)m^𝑑^𝑟superscript2subscript𝜈𝑇4𝛿𝑚22^𝑟4𝛿𝑚2subscript𝜈𝑇4𝛿𝑚\displaystyle\hat{d}=\hat{r}+\sqrt{\left(\frac{2\nu_{T}\ln\bigl{(}\frac{4}{% \delta}\bigr{)}}{m}\right)^{2}+\frac{2\hat{r}\ln\bigl{(}\frac{4}{\delta}\bigr{% )}}{m}}+\frac{2\nu_{T}\ln\bigl{(}\frac{4}{\delta}\bigr{)}}{m}over^ start_ARG italic_d end_ARG = over^ start_ARG italic_r end_ARG + square-root start_ARG ( divide start_ARG 2 italic_ν start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT roman_ln ( divide start_ARG 4 end_ARG start_ARG italic_δ end_ARG ) end_ARG start_ARG italic_m end_ARG ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT + divide start_ARG 2 over^ start_ARG italic_r end_ARG roman_ln ( divide start_ARG 4 end_ARG start_ARG italic_δ end_ARG ) end_ARG start_ARG italic_m end_ARG end_ARG + divide start_ARG 2 italic_ν start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT roman_ln ( divide start_ARG 4 end_ARG start_ARG italic_δ end_ARG ) end_ARG start_ARG italic_m end_ARG
εd^+2ln(4δ)(ν+2d^)m+ln(4δ)3m.approaches-limit𝜀^𝑑24𝛿𝜈2^𝑑𝑚4𝛿3𝑚\displaystyle\varepsilon\doteq\hat{d}+\sqrt{\frac{2\ln\bigl{(}\frac{4}{\delta}% \bigr{)}\left(\nu+2\hat{d}\right)}{m}}+\frac{\ln\bigl{(}\frac{4}{\delta}\bigr{% )}}{3m}.italic_ε ≐ over^ start_ARG italic_d end_ARG + square-root start_ARG divide start_ARG 2 roman_ln ( divide start_ARG 4 end_ARG start_ARG italic_δ end_ARG ) ( italic_ν + 2 over^ start_ARG italic_d end_ARG ) end_ARG start_ARG italic_m end_ARG end_ARG + divide start_ARG roman_ln ( divide start_ARG 4 end_ARG start_ARG italic_δ end_ARG ) end_ARG start_ARG 3 italic_m end_ARG .

With probability at least $1-\delta$ over $\mathcal{D}$, it holds that

\[
\sup_{\mathcal{P} \in \mathcal{L}^{\star}} \left\{ \mathsf{q}_{\mathcal{P}}(\mathcal{D}) \right\} \leq \varepsilon.
\]
Proof.

We follow steps similar to those in the proof of Theorem 11; the key difference is the upper bound $\hat{r}$ on $\mathbb{E}_{\mathcal{R}^{\star}}[\tilde{d}(\mathcal{R}^{\star}) \mid \mathcal{A}]$. Using Hoeffding's inequality and a union bound, we have that, for any $t \geq 0$,

\[
\Pr_{\mathcal{R}^{\star}}\left( \tilde{d}(\mathcal{R}^{\star}) \geq t \right) \leq N_{\mathcal{L}}(\mathcal{D}) \exp\left(-2mt^{2}\right) \leq \hat{N} \exp\left(-2mt^{2}\right).
\]

This implies that, with probability $\geq 1-\delta/4$ over $\mathcal{R}^{\star}$, it holds that

\[
\tilde{d}(\mathcal{R}^{\star}) \leq \sqrt{\frac{\ln\left(\frac{4\hat{N}}{\delta}\right)}{2m}}.
\]
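For completeness, the threshold in the last display can be recovered by inverting the tail bound. The short derivation below is only a sketch of this routine step; it uses no symbols other than those of Theorem 12. Setting the right-hand side of the tail bound equal to $\delta/4$ and solving for $t \geq 0$ gives

```latex
\begin{align*}
\hat{N}\exp\left(-2mt^{2}\right) = \frac{\delta}{4}
&\iff 2mt^{2} = \ln\left(\frac{4\hat{N}}{\delta}\right) \\
&\iff t = \sqrt{\frac{\ln\left(\frac{4\hat{N}}{\delta}\right)}{2m}} = \hat{r}.
\end{align*}
```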

Taking the expectation w.r.t. $\mathcal{R}^{\star}$, conditionally on $\mathcal{A}$, proves that $\mathbb{E}_{\mathcal{R}^{\star}}[\tilde{d}(\mathcal{R}^{\star}) \mid \mathcal{A}] \leq \hat{r}$ and gives the statement. ∎
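To make the analytical bound of Theorem 12 concrete, the following minimal Python sketch evaluates $\hat{r}$, $\hat{d}$, and $\varepsilon$ from the quantities in the statement. The function name `fsr_u_ub_epsilon` and the numeric inputs in the usage lines are illustrative assumptions, not values taken from our experiments; in particular, the choice $\nu = \nu_{T} = 1/4$ is valid only under the assumption that $\mathsf{f}_{\mathcal{P}}$ takes values in $[0,1]$.

```python
import math


def fsr_u_ub_epsilon(m, delta, nu_t, nu, n_hat):
    """Evaluate (r_hat, d_hat, epsilon) as defined in Theorem 12.

    m      -- number of samples in the dataset
    delta  -- confidence parameter in (0, 1)
    nu_t   -- upper bound on mu * (1 - mu)
    nu     -- upper bound on nu_t * sup_P E[f_P(D)]
    n_hat  -- upper bound on the number of distinct projections N_L(D)
    """
    if m <= 0 or not (0.0 < delta < 1.0) or n_hat < 1:
        raise ValueError("need m > 0, 0 < delta < 1, and n_hat >= 1")

    log_term = math.log(4.0 / delta)

    # ln(4 * n_hat / delta), split into two logs so that very large integer
    # values of n_hat do not overflow when converted to float.
    r_hat = math.sqrt((math.log(n_hat) + log_term) / (2.0 * m))

    a = 2.0 * nu_t * log_term / m
    d_hat = r_hat + math.sqrt(a * a + 2.0 * r_hat * log_term / m) + a

    eps = (d_hat
           + math.sqrt(2.0 * log_term * (nu + 2.0 * d_hat) / m)
           + log_term / (3.0 * m))
    return r_hat, d_hat, eps


# Hypothetical inputs, chosen only for illustration.
r_hat, d_hat, eps = fsr_u_ub_epsilon(
    m=10_000, delta=0.05, nu_t=0.25, nu=0.25, n_hat=10**6
)
print(f"r_hat={r_hat:.4f}  d_hat={d_hat:.4f}  epsilon={eps:.4f}")
```

Since $\hat{N}$ enters only through $\ln(4\hat{N}/\delta)$, the bound grows slowly with the number of distinct projections and shrinks roughly as $1/\sqrt{m}$.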

A.7. Details on CNN interpretation

In this section we provide all details regarding the application of FSR to interpret the activation of neurons in a CNN. We consider the MNIST handwritten digit dataset, and train a CNN to predict the digit contained in each image. Following the experimental setup of (Fischer et al., 2021), we employ a simple CNN composed of: two convolutional layers, with respectively $10$ and $40$ filters and $3 \times 3$ kernels; each convolutional layer is followed by $2 \times 2$ max pooling and dropout with rate $0.25$; a fully connected layer with $64$ nodes and ReLU activations; a dropout layer with rate $0.5$; the output layer of size $10$ with softmax activations. We trained the network on the $60000$ images of the training set, using SGD with categorical cross-entropy loss, learning rate $0.01$, momentum $0.9$, $12$ epochs, and batch size $128$, obtaining an accuracy of $0.987$ on the $10000$ holdout instances. Then, analogously to (Fischer et al., 2021), we computed the activations of neurons in the first filter of the first convolutional layer on the $10000$ testing instances, obtaining $d = 676$ continuous features for $m = 10000$ transactions. As binary target, we use the value $1$ for all digits containing straight lines only (the digits $1$ and $7$), and $0$ for all other digits. Doing so, we obtain a fraction $\mu(\mathcal{D}) = 0.216$ of samples with label $1$. We ran FSR-C on this dataset, using $c = 10$ resamples and $z = 1$, to identify significant subgroups with FWER $\leq 0.05$. For each neuron, we identify the most significant subgroup containing a condition over the corresponding continuous feature $x$.
We focus on conditions of the type $x \geq t$ and $x < t$, where $t$ is any real-valued threshold; the condition $x \geq t$ denotes that the neuron $x$ is activated, with a value of at least $t$, while $x < t$ means that $x$ is inactive, with a value below $t$.
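The feature count and the label fraction reported in the setup above can be checked with simple arithmetic. The sketch below rests on two assumptions that are not stated explicitly in the text: that the first convolutional layer uses stride $1$ and no padding on the $28 \times 28$ MNIST images (which is consistent with $d = 676$), and that the MNIST test split has the commonly reported per-digit counts listed in `MNIST_TEST_COUNTS`. All names are illustrative.

```python
def conv_output_side(input_side, kernel_side, stride=1, padding=0):
    """Side length of a square convolution output (standard formula)."""
    return (input_side + 2 * padding - kernel_side) // stride + 1


# Assumption: 28x28 inputs, 3x3 kernel, stride 1, no padding.
side = conv_output_side(28, 3)
num_features = side * side  # one continuous feature per neuron of the filter

# Assumption: commonly reported per-digit counts of the MNIST test split.
MNIST_TEST_COUNTS = {
    0: 980, 1: 1135, 2: 1032, 3: 1010, 4: 982,
    5: 892, 6: 958, 7: 1028, 8: 974, 9: 1009,
}


def binary_target(digit):
    """Label 1 for digits made of straight lines only (1 and 7), else 0."""
    return 1 if digit in (1, 7) else 0


m = sum(MNIST_TEST_COUNTS.values())
positives = sum(
    count for digit, count in MNIST_TEST_COUNTS.items()
    if binary_target(digit) == 1
)
mu = positives / m

print(num_features)  # 676
print(m, round(mu, 3))  # 10000 0.216
```

Under these assumptions the computed values agree with $d = 676$, $m = 10000$, and $\mu(\mathcal{D}) = 0.216$ as reported above.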

In Figure 4 we show the results of this experiment. Figures 4-(a) and (b) show the average activation values of all neurons for samples with label $1$ (corresponding to the digits $1$ and $7$) and label $0$ (all other digits), respectively. Then, in Figure 4-(c) we show the neurons with activation conditions significantly associated with the target label $1$: neurons that are significantly activated, i.e., with conditions $x \geq t$, are shown in red, while neurons that are significantly inactive, i.e., with conditions $x < t$, are shown in blue. The color intensity (i.e., toward red or blue) depends on the threshold $t$: for activated neurons it saturates at the maximum activation value, while for inactive neurons it saturates at the minimum. Interestingly, we observe that the neurons in this filter are significantly activated in the presence of a central vertical stroke (red pixels), which specifically characterizes the digits $1$ and $7$ (Figure 4-(a)). We remark, however, that this observation is not immediate from the average activations alone, and may be lost when the activation values are binarized, since FSR identifies subgroups that precisely separate the two classes of digits by using granular threshold values. On the contrary, neurons that are significantly inactive (blue pixels) are the ones surrounding this vertical area, forming two rounded shapes, which are typical of the other digits (Figure 4-(b)). Interestingly, these blue regions seem to shape the negative of the digits $1$ and $7$.
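The remark on binarization can be illustrated with a toy example. The sketch below is not FSR and does not compute its quality statistic; it only reports, for a condition $x \geq t$ or $x < t$, the fraction of samples covered and the fraction of covered samples with label $1$. The activation values are synthetic and chosen by hand, and the helper name `subgroup_stats` is hypothetical.

```python
def subgroup_stats(values, labels, threshold, active=True):
    """Coverage and label-1 rate of the subgroup x >= t (active) or x < t."""
    members = [
        y for x, y in zip(values, labels)
        if (x >= threshold) == active
    ]
    coverage = len(members) / len(values)
    positive_rate = sum(members) / len(members) if members else float("nan")
    return coverage, positive_rate


# Synthetic activations of a single neuron (not real data): label-1 samples
# have higher activations, but every sample has a positive activation.
values = [0.62, 0.70, 0.81, 0.90, 0.30, 0.35, 0.48, 0.55]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Coarse binarization: the neuron counts as "active" whenever x > 0.
cov_bin, rate_bin = subgroup_stats(values, labels, threshold=1e-9)

# Granular threshold conditions x >= 0.6 and x < 0.6.
cov_act, rate_act = subgroup_stats(values, labels, threshold=0.6)
cov_ina, rate_ina = subgroup_stats(values, labels, threshold=0.6, active=False)

print(cov_bin, rate_bin)  # 1.0 0.5
print(cov_act, rate_act)  # 0.5 1.0
print(cov_ina, rate_ina)  # 0.5 0.0
```

In this toy setting the binarized feature covers every sample and carries no information on the label, while the condition with threshold $t = 0.6$ separates the two classes exactly.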

From this experiment we conclude that FSR can be effectively applied to identify complex patterns, such as the ones for the interpretation of the activations within a neural layer.

Figure 4. Average activation of neurons of the first filter in the first convolutional layer, for the digits $1$ and $7$ (a) and for the other digits (b). (c): neurons that are significantly activated (in red) and inactive (in blue) for the digits $1$ and $7$, identified by FSR.