Private Approximate Query over Horizontal Data Federation

Ala Eddine Laouir and Abdessamad Imine. Université de Lorraine, CNRS and Inria, Nancy, France. firstname.lastname@loria.fr
Abstract.

In many real-world scenarios, multiple data providers need to collaboratively perform analysis of their private data. The challenges of these applications, especially at the big data scale, are time and resource efficiency as well as end-to-end privacy with minimal loss of accuracy. Existing approaches rely primarily on cryptography, which improves privacy, but at the expense of query response time. However, current big data analytics frameworks require fast and accurate responses to large-scale queries, making cryptography-based solutions less suitable. In this work, we address the problem of combining Approximate Query Processing (AQP) and Differential Privacy (DP) in a private federated environment answering range queries on horizontally partitioned multidimensional data. We propose a new approach that uses a data distribution-aware online sampling technique to accelerate the execution of range queries and ensure end-to-end data privacy during and after analysis with minimal loss in accuracy. Through empirical evaluation, we show that our solution can provide up to 8 times faster processing than the basic non-secure solution while maintaining accuracy, formal privacy guarantees, and resilience to learning-based attacks.

Keywords: Differential privacy, Big Data, Federated data, Online sampling.

1. Introduction

The extensive reliance of individuals on software solutions in daily and professional life has led to an exponential growth of data collected by companies, corporations, government organisations, and even hospitals. These vast mines of data, if carefully and efficiently analysed, can provide valuable insights that guide decision-making and business development. In large-scale studies and research, the analysis must be conducted on several data sources to obtain meaningful conclusions. An example of such a case is during a pandemic, where many hospitals jointly conduct studies to have a global view of the problem.

One of the most common ways to analyse and explore these huge volumes of data is through OLAP tasks, where various aggregation queries (SUM, COUNT, etc.) can be issued to learn existing patterns and trends within the data. These aggregation queries may seem simple, but they are very time-consuming on big databases. The analysis of data from multiple data providers comes with two main challenges: privacy and resource/time efficiency. The privacy issue arises from the fact that this data is personal and sensitive, and sharing it with other parties can be very harmful to the individuals concerned. Governments impose many regulations and restrictions, such as GDPR, on how such sensitive data may be processed and shared. In a federated environment, where a joint study requires the collaboration of many data providers, data sharing is highly restricted. Each data provider must ensure the security and privacy of the data collected from their users during and after the analysis.

To satisfy the requirement of end-to-end privacy, many solutions have been proposed in the literature, and most of them rely on cryptography to ensure that no data leaks during the exchange and query evaluation. Secure multiparty computation (SMC) solutions (Bater et al., 2018, 2020) appear to be a prominent choice in federated environments. Others use oblivious operations (Bater et al., 2017) or secure hardware (Zheng et al., 2017; Eskandarian and Zaharia, 2017; Qiu et al., 2023) so that, during query evaluation, each data provider can maintain the confidentiality of their data. Additionally, for securing the end result of any OLAP query, Differential Privacy (DP) (Dwork et al., 2014) is generally considered the gold standard by government and private institutions (Team et al., 2017; Abowd, 2018; Bittau et al., 2017; Erlingsson et al., 2014). Due to its strong formal confidentiality guarantees, DP allows individuals to deny their participation in the database. These query evaluation solutions meet end-to-end security and privacy requirements in a federated environment. However, what they have in common is their reliance on encryption, which causes a large processing time overhead, whereas for time-sensitive tasks utility is measured by both accuracy and speed. They certainly address the privacy issue, but they are time and resource consuming.

The issue of reducing query response time has been widely addressed in the literature through Approximate Query Processing (AQP). Existing AQP methods can be classified into two types: online approximation and offline synopsis creation. Online approximation includes Online Aggregation-based solutions (Hellerstein et al., 1997; Li et al., 2016; Qin and Rusu, 2014), which continuously provide a fast and reliable approximation of the query, and solutions that apply online sampling to reduce the processed data and obtain an approximation from a sample (Zhang et al., 2016; Goiri et al., 2015; Song et al., 2018). In offline synopsis creation, views are generated offline using query workloads and/or data statistics (Acharya et al., 1999; Agarwal et al., 2013; Chaudhuri et al., 2007). In this area of research, the main focus is on efficiency, and privacy has not been considered.

In our work, we address the challenge of answering OLAP aggregation range queries in a federated environment, while preserving end-to-end privacy and improving resource and time consumption for query processing. Our solution relies heavily on differential privacy to secure collaboration and end results, and ensure no information leaks. To speed up queries, we implement a cluster-based sampling method using a well-known statistical estimator that provides accurate estimates for range queries (such as SUM and COUNT) while processing minimal data portions. While existing systems ensure either privacy or speedup for query approximation, to the best of our knowledge, our solution is the first to offer speedup over plain-text execution with end-to-end privacy in a federated environment. Our main contributions can be listed as follows:

  1. Definition of a lightweight collaboration method that determines optimal sampling decisions for data providers to maximize accuracy, without requiring access to their full datasets and without information leakage.

  2. Introduction of a data distribution-aware cluster sampling method with DP guarantees for individual privacy.

  3. Meticulous integration of DP at every step with minimal loss of precision.

  4. Extensive experimentation to empirically validate the performance of our approach in terms of accuracy and time efficiency.

  5. Extensive experimentation to ensure the resilience of our system against learning-based attacks.

Roadmap. The paper is structured as follows: Section 2 reviews existing works. Section 3 introduces the notions used throughout the paper. Section 4 gives a detailed description of the problem solved by our approach. Section 5 presents our proposed solution in detail. The extensive evaluation of our approach is given in Section 6. In Section 7, we discuss the limitations and possible extensions of our solution, and we conclude in Section 8 by outlining future work.

2. Related Works

Due to the increasing size and distribution of databases, querying and exploring such vast volumes for analytical purposes, quickly and without revealing sensitive information, has become a challenge. Here, we describe the state-of-the-art related to our work.

Approximate Query Processing (AQP). As the quality of a query is based on its accuracy and response time, especially for time-sensitive tasks like OLAP (Wang and Jajodia, 2008) and Business Intelligence (BI), approximating the query offers the best way to strike a balance between these two quality factors.

In the 1990s, Hellerstein et al. (1997) proposed a new interactive method for query processing that provides a quick initial answer with a certain error and refines it as processing continues. Other works followed in this direction (Xu et al., 2008; Li et al., 2016; Wu et al., 2010; Qin and Rusu, 2014), each enhancing specific aspects of the method, for example by adding support for GROUP BY or by proposing parallel and distributed versions. Another research direction focuses on processing a small subset of the original data, thereby reducing query run-time. In (Olken and Rotem, 1986, 1995; Piatetsky-Shapiro and Connell, 1984; Song et al., 2018), uniform row-level random sampling is applied online before query processing. Although row-level sampling may improve processing time for complex queries, it can introduce overhead and slow down queries when the sampling itself requires a full table scan, as with Bernoulli sampling (Haas and König, 2004). To avoid such overhead, the solutions from (Acharya et al., 1999; Agarwal et al., 2013; Chaudhuri et al., 2007) create the samples offline. Cluster sampling, also referred to as page sampling (Haas and König, 2004), is used to speed up aggregation queries in big databases. Methods in (Goiri et al., 2015; Zhang et al., 2016; Ahmadvand et al., 2019) use this sampling in the context of the Hadoop MapReduce framework (https://hadoop.apache.org), as it proves to be fast and I/O efficient compared to row-level sampling.

Federated query answering. Data is often distributed across multiple locations (e.g. data providers like hospitals and companies) and the collaboration among all parties is necessary to answer range aggregation queries. But for privacy and security reasons, each data provider cannot disclose their data to third parties.

Some solutions rely on secure hardware modules (i.e. enclaves), in which all sensitive code and data are processed. Methods in (Agrawal et al., 2006; Zheng et al., 2017; Eskandarian and Zaharia, 2017) focus on aggregation queries in this setting, and (Zheng et al., 2017; Eskandarian and Zaharia, 2017) use Intel’s SGX for secure processing. These solutions are generally efficient, but their reliance on trusted hardware and their vulnerability to side-channel attacks constitute a limitation. Recently, in (Qiu et al., 2023), the notion of Differential Obliviousness was used to mitigate the risk of side-channel attacks.

Other recent works presented Secure Multiparty Computation (SMC) query processing engines (Bater et al., 2017, 2018, 2020). These engines enable data providers to respond to OLAP queries securely by joining data with end-to-end privacy. Differential Privacy (DP) is used to perturb the final results, thereby mitigating inference attacks based on the results. Because these solutions incur computational overhead, (Bater et al., 2020) introduced online random sampling to improve secure computing performance by reducing the size of the shared data for query processing. In (Cao et al., 2021), sampling is performed offline to create a synopsis that further improves performance. Another solution (Liagouris et al., 2021) focused on reducing the cost of SMC operations, thus obtaining a significant improvement in performance. All of these SMC-based (or enclave-based) protocols rely on encryption, which prevents them from outperforming plain-text query execution. Even with the significant improvements introduced in recent years, they remain expensive for real-time queries on real-world big tables (Liagouris et al., 2021).

To highlight the scale of this problem, we performed a simulation (https://github.com/AlaEddineLaouir/Federated-Range-Queries.git) using a synthetic Adult dataset (Becker and Kohavi, 1996) horizontally distributed over 4 data providers as a federated environment. We ran a set of random range queries, which are the type of queries we focus on. For the query processing, we considered two solutions using SMC: (i) data providers sharing the rows and collectively evaluating the query; and (ii) data providers evaluating the query locally, sharing only their results, and computing the final result.

Figure 1. Runtime cost of data sharing in SMC.

We measured the time required to share the rows/results in SMC. The results in Figure 1 show that sharing only local results incurs an insignificant overhead of 0.04 seconds. On average, this is 440 times less than the time required for row sharing in SMC. Additionally, the cost of sharing only results remains constant and independent of the dataset, whereas the cost of sharing rows increases with larger tables.

In our work, we propose a framework to approximate query processing in a federated environment, enabling accelerated query execution compared to plain text execution while ensuring end-to-end Differential Privacy guarantees.

3. Preliminaries

In this section, we give the notation and briefly explain the notions used throughout the paper.

Data model. In a tabular database $T$ defined over a set of $n$ dimensions (or attributes) $D=\{d_1,d_2,\ldots,d_n\}$, each individual is a row with a value on each dimension. We assume that each dimension $d$ is associated with a domain $|d|$ containing discrete and totally ordered values; the size of the domain is $\|d\|$. For performance purposes during online analytics tasks, the table $T$ is transformed into multidimensional data (a count tensor) $T^a$ of dimensions $D^a \subset D$, which has an attribute Measure storing the number of aggregated rows of $T$. Figure 2 illustrates how to construct a count tensor $T^a$ from table $T$ by aggregating dimension Service. For simplicity, we use the term “table” for both “tabular data” and “count tensor”.

Figure 2. Count tensor

Queries. To analyze and extract insights from these tables, the analyst can issue aggregation queries, helping to explore the data and gain a general understanding of patterns and trends. In this work, we consider a range query Q𝑄Qitalic_Q defined as:

SELECT Aggregation FROM Table WHERE Range, where:

  • Aggregation is COUNT(*) or SUM(Measure).

  • Range is a set of intervals $r_d=[l_b^d,u_b^d]$, one on each dimension $d\in D^Q$ where $D^Q\subseteq D$ in Table; the interval $r_d$ selects every value $v\in|d|$ such that $l_b^d\leq v\leq u_b^d$.

In our work, we focus on COUNT and SUM queries because they are used in several analytics applications. For instance, in a big database aggregating per-stock order data for the NASDAQ exchange, these queries are typically used to analyze order data from past days. Additionally, aggregations, such as average, standard deviation, and variance, can be derived from COUNT and SUM.
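As an illustration of the data model and of the two aggregations, the following minimal sketch builds a count tensor by aggregating away one dimension and evaluates a range query on it. The rows, the `Stay` attribute and all function names are hypothetical; we also assume here that COUNT(*) counts the matching cells of the count tensor while SUM(Measure) recovers the number of underlying rows.

```python
from collections import Counter

def build_count_tensor(rows, kept_dims):
    """Aggregate rows over kept_dims; Measure = number of merged rows."""
    counts = Counter(tuple(row[d] for d in kept_dims) for row in rows)
    return [dict(zip(kept_dims, key), Measure=m) for key, m in counts.items()]

def in_range(cell, ranges):
    """True if the cell lies inside every interval [lower, upper] of the query."""
    return all(lo <= cell[d] <= hi for d, (lo, hi) in ranges.items())

def range_count(table, ranges):
    """COUNT(*): number of cells selected by the range."""
    return sum(1 for cell in table if in_range(cell, ranges))

def range_sum(table, ranges):
    """SUM(Measure): number of original rows selected by the range."""
    return sum(cell["Measure"] for cell in table if in_range(cell, ranges))

# Hypothetical rows of a table T with dimensions Age, Service and Stay.
rows = [
    {"Age": 25, "Service": "Cardiology", "Stay": 3},
    {"Age": 25, "Service": "Oncology", "Stay": 3},
    {"Age": 31, "Service": "Cardiology", "Stay": 5},
    {"Age": 47, "Service": "Oncology", "Stay": 2},
]

# Aggregating away Service yields a count tensor over (Age, Stay).
tensor = build_count_tensor(rows, kept_dims=("Age", "Stay"))
query = {"Age": (20, 40)}

cells = range_count(tensor, query)       # 2
individuals = range_sum(tensor, query)   # 3
print(cells, individuals, individuals / cells)  # average rows per selected cell
```

The last line shows how a derived aggregate (here an average) follows from the SUM and COUNT answers alone.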

Query Approximation and Sampling. The goal of query approximation is generally to speed up execution at the expense of answering the query exactly, while preserving answer accuracy as much as possible. Online sampling is employed for time-sensitive tasks to reduce the overhead of evaluating queries on large databases. Note that in this case, the sampling differs from one query to another. In statistical terms, random sampling is essentially the process of selecting a subpopulation $SP$ from the total population $P$, where a sampling rate $sr$ dictates the size of $SP$. This subpopulation contains sufficiently representative individuals and properties, capturing various characteristics of $P$ such that the analysis conducted on $SP$ can be generalized to $P$. All random sampling techniques can be categorized based on three main features:

  • Granularity: sampling elements are individuals or a bulk/cluster of individuals.

  • Uniformity: elements are sampled with equal/unequal probabilities.

  • Replacement: sampling elements can be chosen multiple times or only once.

Nowadays, modern systems typically split and store a big table $T$ as a set of smaller, manageable entities $T=\{C_1,C_2,\ldots,C_N\}$, where each entity has a maximum size $S$. The entity could be a table page (https://www.postgresql.org/docs/current/storage-page-layout.html), an HDFS file block (https://hadoop.apache.org/docs/r1.2.1/hdfs_design.html), etc.

In this paper, we call these storage entities Clusters, and we assume that our tables are already stored as a set of clusters. Given this storage format, sampling on databases can be done at two levels: row level or cluster level (Haas and König, 2004).

In tabular databases with range queries, it is particularly challenging to find an online sampling algorithm that offers speed-up while maintaining accuracy.
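To make the cluster-level option concrete, here is a minimal sketch of uniform cluster sampling without replacement combined with the classical expansion estimator (the partial count scaled by N/n). This is a generic textbook estimator chosen for illustration, not necessarily the one used by the approach of this paper; the synthetic data, the cluster size and the function names are assumptions.

```python
import random

def split_into_clusters(rows, cluster_size):
    """Store a table as fixed-size clusters (pages or blocks)."""
    return [rows[i:i + cluster_size] for i in range(0, len(rows), cluster_size)]

def cluster_sample_count(clusters, predicate, sampling_rate, rng):
    """Estimate a COUNT from a uniform sample of whole clusters."""
    n_sampled = max(1, round(sampling_rate * len(clusters)))
    sampled = rng.sample(clusters, n_sampled)  # uniform, without replacement
    partial = sum(1 for cluster in sampled for row in cluster if predicate(row))
    # Expansion estimator: scale the partial count by N / n.
    return partial * len(clusters) / n_sampled

rng = random.Random(7)
ages = [rng.randint(18, 90) for _ in range(10_000)]  # synthetic Age column
clusters = split_into_clusters(ages, cluster_size=100)

def in_range(age):
    return 20 <= age <= 40

exact = sum(1 for age in ages if in_range(age))
estimate = cluster_sample_count(clusters, in_range, sampling_rate=0.2, rng=rng)
print(exact, round(estimate))
```

Only the sampled clusters are read, which is where the speed-up comes from. The estimate is reliable here because the synthetic rows are spread evenly over the clusters; when the rows matching a range are concentrated in a few clusters, a uniform choice of clusters can miss them, which is one way to see the difficulty stated above.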

Data providers. For many real-world use cases, multiple organizations or institutions, called data providers, publish access to their databases for joint analysis. Let $\mathbb{S}$ be the set of data providers. In this work, we assume that a large table $T$ is horizontally distributed over $\mathbb{S}$ such that all data providers share the same schema (i.e. the set of dimensions) of $T$ but each holds different rows. All data providers use clusters of the same size to store their local tables. More importantly, for privacy reasons, data providers collaborate on joint analyses without revealing their data.

Differential Privacy (DP). DP is a privacy model that provides formal guarantees of indistinguishability, such that the query results do not yield much information about the presence or absence of any particular individual. Consequently, it hides information about which of the neighbouring tables (Dwork et al., 2014) was used to answer the query.

Definition 3.1 (Neighbouring Tables (Dwork et al., 2014)).

Two tables $T$ and $T'$ are neighbouring if we can obtain one of them by inserting at most one row into the other.

We use $d(T,T')$ to represent the distance between two tables $T$ and $T'$, and we say that two tables are neighbouring if their distance is $1$ or less.

Definition 3.2 ($(\epsilon,\delta)$-Differential Privacy (Dwork et al., 2014)).

A mechanism $M$ satisfies $(\epsilon,\delta)$-Differential Privacy (or $(\epsilon,\delta)$-DP) if, for any two neighbouring tables $T$, $T'$ and for any set $V$ of possible outputs of $M$:

$$\Pr\left[M(T)\in V\right]\leq \exp(\epsilon)\times\Pr\left[M(T')\in V\right]+\delta$$

where $\delta$ represents the failure probability. We refer to $(\epsilon,\delta)$ as the privacy budget.

In practice, $M$ is a randomized algorithm, which has many possible outputs for the same input. DP is commonly used to answer specific queries on databases. Let $f$ be a query on a table $T$ whose answer $f(T)$ is a number. The global sensitivity of $f$ is the maximum amount by which the output of $f$ can change between any two neighbouring tables.

Definition 3.3 (Global Sensitivity (Dwork et al., 2014)).

For any two neighbouring tables $T$ and $T'$, the global sensitivity of function $f$ is:

$$GS_f=\max_{T,T':\,d(T,T')\leq 1}\left\|f(T)-f(T')\right\|_1$$

where $\left\|\cdot\right\|_1$ is the $L_1$ norm.

For instance, if $f$ is a COUNT range query, then $GS_f$ is $1$.

The Laplace Mechanism is a randomized mechanism for enforcing $\epsilon$-DP (or $(\epsilon,0)$-DP, referred to as pure DP), which adds calibrated noise to the output of a function $f$ based on its global sensitivity $GS_f$.

Definition 3.4 (Laplace Mechanism (Dwork et al., 2014)).

The Laplace Mechanism adds noise to $f(T)$ as:

$$S=f(T)+\text{Lap}\left(\frac{GS_f}{\epsilon}\right)$$

where $GS_f$ is the global sensitivity of $f$, and $\text{Lap}(\alpha)$ denotes sampling from the Laplace distribution with center $0$ and scale $\alpha$.
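The definition above can be sketched in a few lines using only the Python standard library. The Laplace draw is obtained as the difference of two independent exponential variables with mean equal to the scale; the function names and the example count are hypothetical, and this naive floating-point sampler is meant for illustration only, not as a hardened implementation.

```python
import random

def laplace_noise(scale, rng):
    """Draw from Laplace(0, scale) as the difference of two exponentials."""
    rate = 1.0 / scale
    return rng.expovariate(rate) - rng.expovariate(rate)

def laplace_mechanism(true_answer, sensitivity, epsilon, rng):
    """Return true_answer + Lap(sensitivity / epsilon)."""
    if sensitivity <= 0 or epsilon <= 0:
        raise ValueError("sensitivity and epsilon must be positive")
    return true_answer + laplace_noise(sensitivity / epsilon, rng)

rng = random.Random(42)
true_count = 1250  # hypothetical exact answer of a COUNT range query
for epsilon in (0.1, 1.0):
    # A COUNT range query has global sensitivity 1.
    noisy = laplace_mechanism(true_count, sensitivity=1.0, epsilon=epsilon, rng=rng)
    print(epsilon, round(noisy, 2))
```

A smaller epsilon gives a larger noise scale (10 for epsilon = 0.1 versus 1 for epsilon = 1.0 in this example), which is the usual privacy versus accuracy trade-off.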

Unlike the Laplace Mechanism, which is used to release noisy numerical values, the Exponential Mechanism can be used for biased selection of elements from a set based on a scoring function while preserving $(\epsilon,0)$-DP (Dwork et al., 2014).

Definition 3.5 (Exponential Mechanism (Dwork et al., 2014)).

Given a set of elements $SE$ and a scoring function $L$, the Exponential Mechanism randomly selects $e\in SE$ with the probability of the element $e$ being proportional to:

$$\exp\left(\frac{\epsilon\times L(e)}{2\times\Delta_L}\right)$$

where $\Delta_L$ is the sensitivity of $L$.
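A minimal sketch of this selection rule is given below. The candidate names and their scores are hypothetical; subtracting the maximum score before exponentiating is only a numerical-stability device and leaves the selection probabilities unchanged.

```python
import math
import random

def exponential_mechanism(elements, score, sensitivity, epsilon, rng):
    """Pick one element with probability proportional to
    exp(epsilon * score(e) / (2 * sensitivity))."""
    scores = [score(e) for e in elements]
    top = max(scores)
    # Shifting all scores by the same constant does not change the ratios.
    weights = [math.exp(epsilon * (s - top) / (2.0 * sensitivity)) for s in scores]
    return rng.choices(elements, weights=weights, k=1)[0]

# Hypothetical candidates with hypothetical scores.
candidate_scores = {"c1": 0.0, "c2": 5.0, "c3": 10.0}
candidates = list(candidate_scores)

rng = random.Random(0)
picks = [
    exponential_mechanism(candidates, candidate_scores.get, 1.0, 2.0, rng)
    for _ in range(1000)
]
print({c: picks.count(c) for c in candidates})
```

With these scores and epsilon = 2, the highest-scoring candidate is selected most of the time, while the others keep a small but non-zero probability.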

Local and Smooth Sensitivity. In many applications of DP, the global sensitivity $GS_f$ cannot be bounded. In this case, there is an alternative definition of sensitivity called local sensitivity, where the maximum difference between the query’s results is based on a fixed database $T$ and any database $T'$ neighbouring it:

Definition 3.6 (Local Sensitivity (Nissim et al., 2007)).

Given a database $T$ and any of its possible neighbouring tables $T'$, the local sensitivity of function $f$ is:

$$LS_f(T)=\max_{T':\,d(T,T')\leq 1}\left\|f(T)-f(T')\right\|_1$$

where $\left\|\cdot\right\|_1$ is the $L_1$ norm.

The local sensitivity $LS_f(T)$ is often much smaller than the global sensitivity $GS_f$ because it is based on a specific instance of the data $T$. This also makes it unsafe to use, as it can leak information about the table $T$ on which it is based. Nissim et al. (2007) suggest the use of a smoothing function that finds a safe upper bound for $LS_f(T)$ and can be used to calibrate the randomness (noise) without any risk. These functions usually require that the local sensitivity be computed at any arbitrary distance $k$ from $T$.

Definition 3.7 (Local Sensitivity at Distance $k$ (Nissim et al., 2007)).

Given a table $T$, the local sensitivity of function $f$ at distance $k$ is:

$$LS_f(T)^k=\max_{T':\,d(T,T')\leq k}\left\|f(T)-f(T')\right\|_1$$

where $\left\|\cdot\right\|_1$ is the $L_1$ norm.

A safe approximate upper bound of $LS_f(T)$, denoted $S\_LS_f(T)$, which is insensitive to small variations of the data, can be obtained by the smooth sensitivity framework (Nissim et al., 2007).

Definition 3.8 (Smooth Sensitivity Framework (Nissim et al., 2007)).

$$S\_LS_f(T)=\max_{k=0,1,\ldots,n}\left\{\exp(-\beta k)\,LS_f(T)^k\right\}$$

where $\beta=\frac{\epsilon}{2\log(2/\delta)}$.

Once the maximum over the $n$ iterations has been computed, this upper bound can be used to calibrate the noise of the Laplace mechanism to ensure $(\epsilon,\delta)$-DP.
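The computation of this bound can be sketched as a simple loop over the distances k. The callable `ls_at_distance`, which is assumed to return the local sensitivity at distance k for the table at hand, is hypothetical, and so is the toy growth pattern used in the example; how to obtain these values depends on the query.

```python
import math

def smooth_sensitivity(ls_at_distance, epsilon, delta, max_k):
    """Return max over k = 0..max_k of exp(-beta * k) * LS^k,
    with beta = epsilon / (2 * log(2 / delta))."""
    beta = epsilon / (2.0 * math.log(2.0 / delta))
    return max(
        math.exp(-beta * k) * ls_at_distance(k) for k in range(max_k + 1)
    )

def toy_ls(k):
    """Hypothetical local sensitivity that grows with k and is capped at 50."""
    return min(1 + k, 50)

bound = smooth_sensitivity(toy_ls, epsilon=1.0, delta=1e-5, max_k=100)
print(toy_ls(0), round(bound, 3), toy_ls(100))
```

The bound always lies between the local sensitivity at distance 0 and the largest value taken by the local sensitivity over the explored distances, since each term is damped by a factor of at most 1.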

DP Properties. Combining several DP mechanisms is possible, and the privacy accounting is managed using the sequential and the parallel composition properties of DP. Let $M_1,\ldots,M_n$ be mechanisms satisfying $(\epsilon_1,\delta_1),\ldots,(\epsilon_n,\delta_n)$-DP.

Theorem 3.9 (Sequential Composition (Dwork et al., 2014)).

Applying $M_1,\ldots,M_n$ sequentially satisfies $\left(\sum_{j=1}^{n}\epsilon_j,\sum_{j=1}^{n}\delta_j\right)$-DP.

Theorem 3.10 (Parallel Composition (Dwork et al., 2014)).

A mechanism that applies $M_1,\ldots,M_n$ on disjoint parts of the data satisfies $\left(\max_{1\leq i\leq n}(\epsilon_i),\max_{1\leq i\leq n}(\delta_i)\right)$-DP.

The post-processing property states that it is safe to execute any function on the output of a DP mechanism.

Theorem 3.11 (Post-Processing (Dwork et al., 2014)).

For any $(\epsilon,\delta)$-DP mechanism $M$ and any function $f$, $f(M)$ satisfies $(\epsilon,\delta)$-DP.

In the context of online query answering, each query consumes $(\epsilon,\delta)$ to secure the results. In order to manage and limit the information released to the analyst, a total budget $(\xi,\psi)$ is given, which will be consumed by $N$ queries such that $\xi=N\epsilon$ and $\psi=N\delta$. The analyst can continue sending queries until their total budget is consumed.
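The budget accounting described above can be sketched with a small accountant that applies sequential composition to successive queries, plus a helper for the parallel case. The class, its method names and the numeric budget are hypothetical; a small relative tolerance is used only to absorb floating-point rounding.

```python
class BudgetAccountant:
    """Track a total privacy budget under sequential composition."""

    def __init__(self, total_epsilon, total_delta=0.0):
        self.total_epsilon = total_epsilon
        self.total_delta = total_delta
        self.spent_epsilon = 0.0
        self.spent_delta = 0.0

    def charge(self, epsilon, delta=0.0):
        """Consume (epsilon, delta) for one query or raise if exhausted."""
        new_epsilon = self.spent_epsilon + epsilon
        new_delta = self.spent_delta + delta
        tolerance = 1e-9  # relative slack for floating-point sums
        if (new_epsilon > self.total_epsilon * (1 + tolerance)
                or new_delta > self.total_delta * (1 + tolerance)):
            raise RuntimeError("privacy budget exhausted")
        self.spent_epsilon, self.spent_delta = new_epsilon, new_delta

def parallel_cost(costs):
    """Cost of mechanisms applied to disjoint parts of the data."""
    return max(e for e, _ in costs), max(d for _, d in costs)

# Total budget (1.0, 1e-5) and per-query cost (0.25, 2.5e-6): N = 4 queries.
accountant = BudgetAccountant(total_epsilon=1.0, total_delta=1e-5)
answered = 0
try:
    while True:
        accountant.charge(epsilon=0.25, delta=2.5e-6)
        answered += 1
except RuntimeError:
    pass
print(answered)  # 4
```

In a horizontally partitioned setting, `parallel_cost` reflects the fact that mechanisms run on disjoint parts of the data cost only the maximum of their individual budgets.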

Secure Multiparty Computation (SMC). SMC refers to cryptographic protocols that enable a set of independent parties to collaboratively evaluate a query without revealing their private inputs to each other. It also spares them from entrusting a third party with the union of their data for query evaluation. However, this assurance comes at a cost in resources and processing time: SMC is several times slower than insecure alternatives.

4. Problem Statement

Consider a federated system in which $n$ data providers pool their private data for analytical querying, and a private table $T$ (as in Figure 2) that is horizontally partitioned among the data providers as tables $T_{1},\ldots,T_{n}$. Each data provider wants to keep the individual tuples of its local table confidential, and only the schema of $T$ is public. Suppose an end user sends the following range query $Q$:

SELECT COUNT(*) FROM T WHERE Age BETWEEN 20 AND 40

where $Q$ is evaluated on the union of the tables stored at the data providers, $\bigcup_{i=1}^{n}T_{i}$. Although $Q$ may seem very simple at first glance, the volume of data in each $T_{i}$ makes evaluating $Q$ costly and time-consuming.

To address slow query response times, we can resort to Approximate Query Processing (AQP), which trades accuracy for speed via approximation. A straightforward AQP technique is random sampling: given a sampling rate $sr$, a set of tuples is drawn from $T$. For example, an end user can request an answer to $Q$ based on only $sr=20\%$ of $T$. Even for a single table $T$, obtaining a good approximation of $Q$ requires that the sampled tuples contain meaningful data in the ranges of $Q$. Random sampling can be done at the row or cluster level. Although cluster-level sampling is faster than row-level sampling, the cost of both grows linearly with the sampling rate: the larger the sample, the more accurate but slower the result, and vice versa.
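The following sketch illustrates row-level uniform random sampling for the example query. It is a toy, single-table illustration under assumed synthetic data; the function name `approx_count` and the generated `ages` column are hypothetical.

```python
import random


def approx_count(rows, predicate, sr, rng):
    """Estimate COUNT(*) of rows matching `predicate` from a uniform sample."""
    k = max(1, round(sr * len(rows)))      # sample size given the rate sr
    sample = rng.sample(rows, k)           # uniform sampling without replacement
    hits = sum(1 for row in sample if predicate(row))
    return hits * len(rows) / k            # scale the sample count back up


rng = random.Random(0)
# Synthetic Age column (assumption: uniform ages in [0, 99]).
ages = [rng.randint(0, 99) for _ in range(100_000)]

exact = sum(1 for a in ages if 20 <= a <= 40)
estimate = approx_count(ages, lambda a: 20 <= a <= 40, sr=0.2, rng=rng)
print("exact:", exact, "estimate:", round(estimate))
```

Only 20% of the rows are scanned to produce the estimate, which is the source of the speed-up; the price is the sampling error between `estimate` and `exact`.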

Suppose that $T$ is stored as a set of clusters. To obtain an accurate estimate of $Q$ while processing only part of the data, we use a statistical estimator (Lohr, 2009), which requires considering the distribution of rows across all clusters. The assumption that rows are uniformly distributed among clusters is rarely valid in real databases; rows generally follow a skewed distribution. Unequal probability cluster sampling, in which the probability of sampling a cluster is based on the data distribution for $Q$, is therefore more effective at providing good estimates.

Assume that each partition $T_{i}$ of $T$ is stored using clusters. How can unequal probability cluster sampling be applied in our federated context? Each cluster within each data provider should have a specific probability $p$ of being sampled to estimate $Q$, taking into account all other clusters (even those held by other data providers). Capturing the inter- and intra-provider data distribution in this way biases the sampling toward the clusters or data providers that hold most of the data related to $Q$. We refer to this as global sampling.

The other solution is local sampling, where each data provider computes the sampling probabilities for its own clusters (without considering other data providers). Here, the sample size is distributed uniformly across data providers, so no collaboration between them is required. This lack of awareness of the global data distribution makes local sampling less appealing than global sampling.
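The difference between global and local sampling can be sketched as follows. Each provider is represented by a list of per-cluster weights (for instance, the proportion of rows relevant to $Q$); the provider names, the weights, and the helper functions are assumptions chosen for illustration.

```python
from typing import Dict, List


def global_probabilities(providers: Dict[str, List[float]]) -> Dict[str, List[float]]:
    """Normalise each cluster weight by the total over ALL providers."""
    total = sum(sum(weights) for weights in providers.values())
    return {name: [w / total for w in weights]
            for name, weights in providers.items()}


def local_probabilities(providers: Dict[str, List[float]]) -> Dict[str, List[float]]:
    """Normalise each cluster weight within its own provider only."""
    return {name: [w / sum(weights) for w in weights]
            for name, weights in providers.items()}


# Hypothetical weights: provider A holds most of the data related to Q.
providers = {"A": [0.5, 0.25, 0.25], "B": [0.125, 0.125]}

global_p = global_probabilities(providers)
local_p = local_probabilities(providers)

# Expected share of the sample drawn from each provider.
share_global = {name: sum(p) for name, p in global_p.items()}
share_local = {name: 1 / len(providers) for name in providers}
print("global:", share_global)
print("local: ", share_local)
```

Under global sampling, provider A is expected to contribute about 80% of the sample, whereas local sampling splits the sample evenly regardless of where the relevant data lies.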

To apply global data distribution-aware sampling and approximation, data providers must share appropriate information about their data so that $Q$ can be estimated quickly and accurately. The data distribution is captured optimally if data providers have access to each other’s data and the sampling probabilities are computed collectively, but such collaboration incurs an overhead in processing time. The challenge is then to define small, summarized pieces of information that data providers can share and that suffice to capture the data distribution with negligible overhead. Once this global data distribution is captured, each data provider can locally sample clusters, estimate the query, and send its result. The results from all data providers are added together, and the final result is returned to the end user.

Another dimension of our problem concerns privacy and data protection. In the federated context, the end-to-end privacy property must be guaranteed. This essentially ensures that data is protected (i) during and after query execution, (ii) for intermediate results during collaboration, and (iii) for the final response. Differential Privacy (DP) is a widely accepted privacy model, typically applied to query results to prevent any inference about the presence or absence of individuals. As for the intermediate results produced during collaboration between data providers, they must also be protected, with each data provider seeking to prevent any leakage of information on its table. Even if the exchange is limited to summarized (aggregated) information, there will be no privacy guarantee. Thus, DP can also be used to publish intermediate results between data providers.

An alternative to DP is to use Secure Multiparty Computation (SMC) to implement the collaboration between data providers. This solution has two major drawbacks. First, if data providers use the summary information for sampling in SMC, then query approximation (which includes running the query on each cluster) must also be done in SMC, because the sampling is based on sensitive information and its results may disclose information to other data providers. Second, SMC relies heavily on cryptography, which significantly increases processing time and thereby defeats the purpose of approximation.

In this work, we aim to provide fast and accurate responses to range queries in a federated setup while preserving end-to-end privacy. The challenges we address are: defining a lightweight sampling algorithm considering data distribution for query approximation in a federated environment and carefully applying Differential Privacy to ensure end-to-end privacy with minimal loss of query accuracy.

5. Our solution

5.1. Overview

In our proposal, we combine DP with lightweight SMC to protect intermediate results during collaboration between data providers. This yields significantly better speed-up and achieves end-to-end privacy, while maintaining high-utility answers for online range queries. To achieve these goals, we propose an efficient and lightweight collaboration method that allows data providers to decide how many samples to draw from each of them, guided by the summary information shared during this collaboration. To integrate knowledge of the data distribution into our sampling and approximation steps, we use the probability proportional to size (pps) method (Lohr, 2009). Here, the probability $p$ of including (or sampling) a cluster $C$ is determined by the proportion $R$ of rows in $C$ falling within the ranges of the query $Q$. Computing $R$ exactly is expensive and incurs an overhead similar to running the query. To minimize the processing time of $Q$, we approximate the $R$ of each cluster $C$ using lightweight metadata associated with $C$.

Figure 3. Protocol and Architecture

Our solution has two main phases: offline data preprocessing and online query answering. In the offline phase, each data provider constructs global and individual metadata for its clusters. This metadata makes query approximation easier without imposing a significant overhead in processing time. All data providers agree on the same maximum cluster size $S$ (more details are given in Section 7) before initiating the system. The size $S$ may not reflect the actual size of their clusters, but it is used to calculate the $R$ of each cluster. The offline phase and metadata creation are detailed in Section 5.2, and Figure 3 (b) shows the general architecture of our system, with each data provider and its metadata.

Once preprocessing is complete for all data providers, the system goes online. In the online query answering phase, the end user interacts with an aggregator by sending a query $Q$ and a desired sampling rate $sr$, and receives a secure response in return. The aggregator manages the rest of the exchanges with the data providers. The query lifecycle (see Figure 3 (a)) and the collaboration (exchange of summary data) proceed as follows:

  1. First, the aggregator sends the query $Q$ to the data providers. Each data provider performs two tasks: i) identify the set of clusters $C^{Q}$ covering $Q$, with $N^{Q}=|C^{Q}|$; ii) compute the proportion $R$ of rows for each $C\in C^{Q}$. The data provider uses previously stored metadata to avoid overhead when performing these two tasks.

  2. Each data provider securely (using DP) sends to the aggregator the summarized data needed for collaboration: the number of clusters $N^{Q}$ and the average of proportions $\text{Avg}(\widehat{R})$, where $\widehat{R}=\{R_{1},\ldots,R_{N^{Q}}\}$.

  3. The aggregator computes and sends the best allocation (sample size $s$) for each data provider while respecting the total sample size given by $sr$.

  4. Each data provider tests the condition $N^{Q}<N^{min}$; if it holds, the data provider computes $Q$ “regularly”, without approximation. $N^{min}$ is a threshold set by each data provider so that the approximation is triggered only if the query is significantly large (more details about $N^{min}$ are given in Section 5.2).

  5. If the previous condition does not hold, each data provider randomly and securely (with DP) samples $C^{Q}_{S}$, where $C^{Q}_{S}\subset C^{Q}$.

  6. After sampling, each data provider estimates $Q$ over $C^{Q}_{S}$ locally and then securely sends the result to the aggregator with DP guarantees.

  7. Alternatively, data providers may use SMC to share their local estimations and “sensitivities”. The aggregator then obliviously sums the estimations and applies DP using the maximum sensitivity before safely releasing the final result.
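The message flow of steps 1 to 6 can be sketched as below. This is only a control-flow illustration under strong simplifying assumptions: DP noise, real sampling, and the SMC variant of step 7 are omitted, provider answers are stubbed with fixed numbers, and the names `Summary`, `ToyProvider`, `even_split`, and `run_query` are hypothetical. In particular, `even_split` is a placeholder and not the allocation method described in Section 5.3.1.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Summary:
    n_q: int       # number of clusters covering Q (DP-perturbed in practice)
    avg_r: float   # average proportion Avg(R) (DP-perturbed in practice)


@dataclass
class ToyProvider:
    n_q: int
    avg_r: float
    n_min: int
    exact_answer: float    # stand-in for running Q on all covering clusters
    approx_answer: float   # stand-in for sampling followed by estimation

    def summarize(self) -> Summary:                # steps 1-2
        return Summary(self.n_q, self.avg_r)

    def answer(self, sample_size: int) -> float:   # steps 4-6
        if self.n_q < self.n_min:
            return self.exact_answer               # small query: no approximation
        return self.approx_answer                  # would sample `sample_size` clusters


def even_split(summaries: List[Summary], sr: float) -> List[int]:
    """Placeholder for step 3: split the total sample size evenly."""
    total = round(sr * sum(s.n_q for s in summaries))
    base, extra = divmod(total, len(summaries))
    return [base + (1 if i < extra else 0) for i in range(len(summaries))]


def run_query(providers: List[ToyProvider], sr: float,
              allocate: Callable[[List[Summary], float], List[int]]) -> float:
    summaries = [p.summarize() for p in providers]               # steps 1-2
    sizes = allocate(summaries, sr)                              # step 3
    partials = [p.answer(s) for p, s in zip(providers, sizes)]   # steps 4-6
    return sum(partials)                                         # final aggregation


providers = [ToyProvider(3, 0.4, 5, 120.0, 0.0),
             ToyProvider(50, 0.2, 5, 0.0, 980.0)]
sizes = even_split([p.summarize() for p in providers], sr=0.2)
result = run_query(providers, sr=0.2, allocate=even_split)
print(sizes, result)
```

In this toy run, the first provider has fewer covering clusters than its threshold and answers exactly, while the second returns an (stubbed) approximate answer; the aggregator simply adds the two.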

In Section 5.2, we focus on the approximation via cluster sampling and the metadata created offline. Section 5.3 is then dedicated to the second phase of our solution. In Section 5.3.1, we describe the allocation step and how it preserves the same semantics as the naive collaboration method (sharing all data), keeping the sampling data distribution-aware without an overhead. In Section 5.3.2, we present the privacy-preserving sampling used locally by each data provider to create $C^{Q}_{S}$. In Section 5.3.3, we detail how to obtain calibrated DP noise for the end result obtained with a statistical estimator. Finally, in Section 5.4, we explain how the privacy budget for each query is managed and consumed.

5.2. Query Approximation and sampling

As mentioned in Section 5.1, our unequal probability sampling is based on the proportion $R$ of rows in cluster $C$ that correspond to $Q$. Computing the exact $R$ for each cluster is as costly as evaluating the query itself, rendering the approximation useless. Inspired by (Zhang et al., 2016), we only approximate $R$ to avoid an overhead in response time. Consider a query $Q$ defined by a set of ranges, one on each dimension: $Q=\{\forall d\in D^{Q} \mid r_{d}=[l_{b}^{d},u_{b}^{d}]\}$. We assume that the dimensions are not correlated (independent), and compute the sub-proportion $R^{d}$ on each dimension as follows:

\[ R^{d}=R^{d\geq}(l_{b}^{d})-R^{d\geq}(u_{b}^{d}) \]
where $R^{d\geq}(x)=\frac{|rows^{d}\geq x|}{S}$ and $S$ is the cluster size.

The proportion $R^{d}$ is computed from the proportions $R^{d\geq}(l_{b}^{d})$ and $R^{d\geq}(u_{b}^{d})$ of records whose values on dimension $d$ are $\geq l_{b}^{d}$ and $\geq u_{b}^{d}$, respectively. Based on the assumption of independence between dimensions, $R$ can be obtained as follows:

\[ R=\prod_{d\in D^{Q}}R^{d} \quad\text{and}\quad p_{j}=\frac{R_{j}}{\sum_{i=0}^{N^{Q}}R_{i}} \tag{1} \]

where $N^{Q}$ is the number of clusters covering $Q$. The approximated $R$ can then be used to obtain the sampling probability $p_{j}$ of the $j$-th cluster, as shown in Equation 1. Even this approximation requires many calculations, which may cause an overhead similar to computing the exact $R$. To bypass this limitation, we associate each cluster with a set of metadata that accelerates these computations for any given query (see Algorithm 1).
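The computation of $R^{d}$, $R$, and $p_{j}$ can be sketched in Python as follows. The layout of the precomputed proportions (a sorted list of distinct values with a parallel list of $R^{d\geq}$ values), the function names, and the example numbers are assumptions made for illustration; the lookup for a bound that is not itself a stored value uses the smallest stored value above it, which yields the same row count.

```python
from bisect import bisect_left
from math import prod


def proportion_geq(values, props, x):
    """R^{d>=}(x): proportion of rows whose value on d is >= x.

    `values` are the sorted distinct values of the dimension in the cluster
    and `props[i]` is the precomputed R^{d>=}(values[i]).
    """
    i = bisect_left(values, x)            # smallest stored value >= x
    return props[i] if i < len(values) else 0.0


def sub_proportion(values, props, lb, ub):
    """R^d = R^{d>=}(lb) - R^{d>=}(ub); as written, this covers [lb, ub)."""
    return proportion_geq(values, props, lb) - proportion_geq(values, props, ub)


def cluster_proportion(meta, query):
    """R = product of R^d over the queried dimensions (independence assumed)."""
    return prod(sub_proportion(*meta[d], lb, ub) for d, (lb, ub) in query.items())


def sampling_probabilities(rs):
    """p_j = R_j / sum of all R_i over the clusters covering Q."""
    total = sum(rs)
    return [r / total for r in rs]


# Hypothetical cluster with S = 8: Age rows are 10,20,20,30,40,50,60,70.
meta = {
    "Age": ([10, 20, 30, 40, 50, 60, 70],
            [1.0, 0.875, 0.625, 0.5, 0.375, 0.25, 0.125]),
    "Salary": ([1, 2, 3, 4], [1.0, 0.75, 0.5, 0.25]),
}
query = {"Age": (20, 40), "Salary": (2, 4)}

r = cluster_proportion(meta, query)
probs = sampling_probabilities([r, 0.0625])   # second R value is made up
print(r, probs)
```

Because every lookup is a binary search over precomputed values, no row of the cluster is read at query time.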

Algorithm 1 Cluster metadata
Require: $T=\{C_{1},C_{2},\ldots,C_{N}\}$: Set of clusters
1: Clusters_metas $\leftarrow$ create_global_meta()
2: for each $C\in T$ do
3:     cluster_meta $\leftarrow$ []
4:     datas_metas $\leftarrow$ create_datas_meta_($C$)
5:     for each $d\in D$ do
6:         for each $v\in|d|_{C}$ do
7:             $R^{d\geq}(v)\leftarrow$ portions_greater_($C,d,v$)
8:             datas_metas.add($d,v,R^{d\geq}(v)$)
9:         end for
10:         $v^{d}_{min},v^{d}_{max}\leftarrow$ min_max($|d|_{C}$)
11:         cluster_meta.add($v^{d}_{min},v^{d}_{max}$)
12:     end for
13:     Clusters_metas.add(cluster_meta)
14:     save(datas_metas)
15: end for
16: save(Clusters_metas)
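A possible in-memory rendering of Algorithm 1 in Python is sketched below. It is an assumption-laden illustration: clusters are lists of row dictionaries, metadata is returned rather than saved to files, and the function name `build_metadata` is hypothetical. The agreed maximum cluster size $S$ is used as the denominator of every proportion.

```python
def build_metadata(clusters, dims, S):
    """Build per-cluster proportion tables and global min/max metadata.

    clusters: list of clusters, each a list of row dicts
    dims:     dimension names to index
    S:        agreed maximum cluster size (denominator of the proportions)
    """
    clusters_metas = []     # global metadata: (min, max) per dimension
    all_datas_metas = []    # per cluster: {dimension: {value: R^{d>=}(value)}}
    for rows in clusters:
        cluster_meta = {}
        datas_metas = {}
        for d in dims:
            column = sorted(row[d] for row in rows)
            table = {}
            for idx, v in enumerate(column):
                # At the first occurrence of v, all rows from idx on are >= v.
                if v not in table:
                    table[v] = (len(column) - idx) / S
            datas_metas[d] = table
            cluster_meta[d] = (column[0], column[-1])   # min and max of d
        clusters_metas.append(cluster_meta)
        all_datas_metas.append(datas_metas)
    return clusters_metas, all_datas_metas


# Hypothetical single cluster with four rows and S = 4.
clusters = [[{"Age": 10}, {"Age": 20}, {"Age": 20}, {"Age": 30}]]
clusters_metas, all_datas_metas = build_metadata(clusters, ["Age"], S=4)
print(clusters_metas)
print(all_datas_metas)
```

Sorting each column once lets all proportions of a dimension be derived in a single pass, so the offline cost per cluster is dominated by the sort.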

For each cluster $C\in T$ and for each distinct value $v$ of each dimension $d\in D$ in $C$ (Lines 5,6 Algorithm 1), $R^{d\geq}(v)$ is stored in the dedicated metadata file of the cluster as an entry of the form $\{d,v,R^{d\geq}(v)\}$ (Line 8 Algorithm 1). These metadata are used by each data provider to quickly access the precomputed proportions that correspond to the ranges of a given $Q$, thus significantly reducing the overhead in the online phase. To further improve performance, Algorithm 1 stores additional global metadata about the clusters, Clusters_metas, enabling the system to easily identify the clusters $C^{Q}$ that correspond to $Q$ before even computing the proportions. In the dedicated global file Clusters_metas, for each dimension $d\in D$ in cluster $C$, Algorithm 1 (Lines 11,13) stores $v^{d}_{\min}$ ($v^{d}_{\max}$), the minimum (maximum) value of $d$ in $C$. Based on these metadata in Clusters_metas, the system is able to focus only on the small subset $C^{Q}$ of the database that actually contains rows matching $Q$, instead of $T$, thus reducing the processing time of $Q$. The set $C^{Q}$ is defined as follows:

\[ C^{Q}=\{\forall C\in T \mid \forall d\in D^{Q},\ [v^{d}_{min},v^{d}_{max}]\cap r_{d}\neq\emptyset\} \tag{2} \]
where $r_{d}$ is the interval of $Q$ in dimension $d$.
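The selection in Equation 2 amounts to an interval-overlap test on every queried dimension. A minimal sketch, assuming the global metadata is a list of per-cluster dictionaries mapping each dimension to its (min, max) pair; the function name `covering_clusters` and the example values are hypothetical.

```python
def covering_clusters(clusters_metas, query):
    """Indices of clusters whose [v_min, v_max] intersects r_d on every queried d.

    clusters_metas: list of {dimension: (v_min, v_max)}
    query:          {dimension: (lb, ub)} with closed intervals
    """
    selected = []
    for idx, meta in enumerate(clusters_metas):
        # Two closed intervals intersect iff each starts before the other ends.
        if all(meta[d][0] <= ub and lb <= meta[d][1]
               for d, (lb, ub) in query.items()):
            selected.append(idx)
    return selected


# Hypothetical global metadata for four clusters on one dimension.
metas = [{"Age": (0, 15)}, {"Age": (18, 35)}, {"Age": (40, 90)}, {"Age": (41, 95)}]
c_q = covering_clusters(metas, {"Age": (20, 40)})
n_q = len(c_q)
print(c_q, n_q)
```

Only the min/max pairs are consulted, so the test costs a few comparisons per cluster and never touches the rows themselves.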

Since we can identify the clusters $C^{Q}$ concerned by $Q$, it makes sense to approximate $Q$ only when $N^{Q}$ exceeds a certain threshold $N^{min}$. This threshold can be set independently by each data provider based on the size of the clusters, the processing time required for a single cluster, and the hardware and software infrastructure. The cost of storing these metadata is negligible compared to the actual table and clusters. We use the same data structure as (Zhang et al., 2016), which is very efficient. In Section 6, we show the space needed for each database.

Once the sampling is applied according to the probability computed using Equation 1, the Hansen-Hurwitz estimator (Lohr, 2009) is used to obtain the final estimation of Q𝑄Qitalic_Q. The estimation is done as follows:

\[ E(Q,C^{Q}_{S})=\frac{1}{N_{S}}\sum_{i=1}^{N_{S}}\frac{Q(C_{i})}{p_{i}} \tag{3} \]

where $p_{i}$ is the sampling probability of the $i$-th cluster and $Q(C_{i})$ is the result of executing the query on the $i$-th cluster.
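The estimator in Equation 3 can be sketched as below. The sketch assumes that clusters are drawn with replacement according to their probabilities, which is the setting in which the Hansen-Hurwitz estimator is usually stated; the function name and the per-cluster counts are made up for the example.

```python
import random


def hansen_hurwitz(cluster_counts, probs, n_s, rng):
    """Estimate the total of `cluster_counts` from n_s pps draws.

    cluster_counts[i] plays the role of Q(C_i) and probs[i] of p_i.
    In a real system only the drawn clusters would be queried.
    """
    drawn = rng.choices(range(len(cluster_counts)), weights=probs, k=n_s)
    return sum(cluster_counts[i] / probs[i] for i in drawn) / n_s


rng = random.Random(1)
# Hypothetical per-cluster results of Q, with probabilities exactly
# proportional to them (the ideal case for pps sampling).
counts = [250, 250, 500]
probs = [0.25, 0.25, 0.5]
estimate = hansen_hurwitz(counts, probs, n_s=2, rng=rng)
print(estimate)
```

When the probabilities are exactly proportional to the per-cluster results, every ratio $Q(C_{i})/p_{i}$ equals the true total, so the estimate is exact whichever clusters are drawn. The closer the approximated $R$ values track the true per-cluster results, the lower the variance of the estimate.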

5.3. Federated protocol

In this section, we will review all the steps of online query approximation and how we were able to carefully integrate DP into each step.

5.3.1. Allocation phase

In this step, the data providers $\mathbb{S}$ need to jointly decide the number of clusters to be sampled from each of them based on the distribution (the $R$’s) of the data related to $Q$. Upon receiving the query, each data provider identifies $C^{Q}$ and computes $R$ for each $C\in C^{Q}$ using the metadata stored locally. Each one then sends to the aggregator its $N^{Q}$ and $\text{Avg}(\widehat{R})$, where $\widehat{R}$ is the set of $R$’s of the clusters in $C^{Q}$ and Avg stands for average. $N^{Q}$ indicates the number of clusters within that data provider that overlap with $Q$, while $\text{Avg}(\widehat{R})$ is the average proportion of rows within those clusters that correspond to $Q$. Based on this information, we obtain an aggregated (summary) view of the distribution of records corresponding to $Q$ in each data provider.
Using these insights, the aggregator finds the best sample size $s_{i}$ for the $i$-th data provider by solving the optimization problem given in Equation 4, which aims to assign a larger allocation to the data provider with the most data related to $Q$.

\[
\begin{array}{ll}
\textbf{maximize} & \sum_{i=0}^{|\mathbb{S}|}\text{Avg}(\widehat{R})_{i}\times s_{i}\\
\textbf{where} & \sum_{i=0}^{|\mathbb{S}|}s_{i}=sr\times\sum_{i=0}^{|\mathbb{S}|}N^{Q}_{i}\\
\textbf{and} & sr\in\,]0,1[\ \text{is the sampling rate}\\
\textbf{and} & s_{i}\in\,]1,N^{Q}_{i}[
\end{array}
\tag{4}
\]

In Equation 4, the data provider that holds the most data related to $Q$ (i.e., has the largest $\text{Avg}(\widehat{R})_{i}$) gets a larger allocation, and thus samples more clusters to approximate $Q$ locally. This mirrors the behaviour of the original collaboration method (described in Section 4), in which sampling probabilities are computed globally and the clusters of the data provider with the larger $R$’s are more likely to be sampled than others (higher probabilities, Equation 3). With our collaboration method, we are therefore able to reproduce similar results and behaviour. It is important to highlight that comparing the $\text{Avg}(\widehat{R})$ of each data provider is only possible because we require all of them to use the same $S$ when computing the proportions during the metadata creation phase.
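One simple way to solve a problem of the shape of Equation 4 is a greedy fill, sketched below. The solver choice is an assumption of this example and is not stated in the text: because the objective is linear with one sum constraint and per-provider bounds, giving every provider a minimum and then filling providers in decreasing order of $\text{Avg}(\widehat{R})_{i}$ is optimal. The sketch also assumes integer sample sizes and treats the bounds as closed ($1\leq s_{i}\leq N^{Q}_{i}$); the function name `allocate` and the inputs are hypothetical.

```python
def allocate(avg_r, n_q, sr, lower=1):
    """Greedy allocation of sample sizes s_i across data providers.

    avg_r: (noisy) average proportions, one per provider
    n_q:   (noisy, rounded) numbers of covering clusters, one per provider
    sr:    sampling rate in ]0, 1[
    """
    total = round(sr * sum(n_q))                  # total sample size
    sizes = [min(lower, n) for n in n_q]          # every provider gets the minimum
    remaining = total - sum(sizes)
    if remaining < 0:
        raise ValueError("sampling rate too small for the minimum allocation")
    # Fill providers with the largest average proportion first.
    for i in sorted(range(len(avg_r)), key=lambda i: avg_r[i], reverse=True):
        extra = min(remaining, n_q[i] - sizes[i])
        sizes[i] += extra
        remaining -= extra
    return sizes


# Hypothetical summaries from three data providers.
sizes = allocate(avg_r=[0.6, 0.1, 0.3], n_q=[10, 40, 50], sr=0.2)
print(sizes)
```

In this example, the provider with the highest average proportion is sampled in full, the remainder goes to the next-best provider, and the provider with little relevant data keeps only the minimum.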

To solve the problem in Equation 4, each data provider shares its $N^{Q}$ and $\text{Avg}(\widehat{R})$. Both are sensitive pieces of information that may reveal insights about the individuals within the database. Even if the optimisation in Equation 4 is done over encrypted data, the released allocation $s_{i}$ might give a data provider insights about the other data providers. To secure the release of this information, each data provider uses the Laplace mechanism to ensure formal privacy guarantees. Given a privacy budget of $\epsilon^{O}$, each data provider perturbs these two values as follows:

(5) $$\begin{array}{l} \widetilde{\text{Avg}}(\widehat{R}) = \text{Avg}(\widehat{R}) + \text{Lap}\left(\frac{\Delta_{\text{Avg}(\widehat{R})}}{\epsilon^{O}/2}\right) \\ \widetilde{N^{Q}} = N^{Q} + \text{Lap}\left(\frac{1}{\epsilon^{O}/2}\right) \end{array}$$

where the sensitivity of $N^{Q}$ to the absence/presence of an individual is 1, and the sensitivity of $\text{Avg}(\widehat{R})$ is $\Delta_{\text{Avg}(\widehat{R})}$.

Theorem 5.1 (Sensitivity of estimator $\Delta_{\text{Avg}(\widehat{R})}$).

For any two neighbouring databases $T$, $T'$ the sensitivity of $\text{Avg}(\widehat{R})$ is defined as:

$$\Delta_{\text{Avg}(\widehat{R})} = \max\left(\frac{\Delta_{R}}{N^{min}}, \frac{1}{N^{min}+1}\right) \quad \text{where: } \Delta_{R} = 1 - \left(1 - \frac{1}{S}\right)^{|D|}$$

The proof is given in Appendix A.
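As a minimal sketch of the perturbation in Equation 5 combined with the sensitivity of Theorem 5.1, the following Python code splits $\epsilon^{O}$ evenly between the two released values. The function names are hypothetical, `S` and `d_size` stand for the $S$ and $|D|$ of the theorem, and the Laplace sampler (difference of two exponential draws) is an implementation choice of this example, not something prescribed by the paper.

```python
import random


def laplace(scale, rng=random):
    # The difference of two i.i.d. exponentials with mean `scale`
    # follows a Laplace(0, scale) distribution.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def delta_r(S, d_size):
    # Delta_R = 1 - (1 - 1/S)^|D|  (Theorem 5.1)
    return 1.0 - (1.0 - 1.0 / S) ** d_size


def delta_avg(S, d_size, n_min):
    # Delta_Avg(R^) = max(Delta_R / N^min, 1 / (N^min + 1))  (Theorem 5.1)
    return max(delta_r(S, d_size) / n_min, 1.0 / (n_min + 1))


def perturb_metadata(avg_r, n_q, eps_o, S, d_size, n_min, rng=random):
    """Return noisy (Avg(R^), N^Q), each released with budget eps_o / 2."""
    half = eps_o / 2.0
    noisy_avg = avg_r + laplace(delta_avg(S, d_size, n_min) / half, rng)
    noisy_n = n_q + laplace(1.0 / half, rng)   # sensitivity of N^Q is 1
    return noisy_avg, noisy_n


if __name__ == "__main__":
    print(perturb_metadata(0.3, 40, 1.0, S=10, d_size=2, n_min=5))
```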

With this perturbation, the collaboration between data providers for deciding the allocation does not reveal any sensitive information. So the optimization problem is formulated as follows:

(6) $$\begin{array}{ll} \textbf{maximize} & \sum_{i=0}^{|\mathbb{S}|} \widetilde{\text{Avg}}(\widehat{R_{i}}) \times s_{i} \\ \textbf{where} & \sum_{i=0}^{|\mathbb{S}|} s_{i} = sr \times \sum_{i=0}^{|\mathbb{S}|} \widetilde{N^{Q}_{i}} \\ \textbf{and} & sr \in\, ]0,1[ \text{ is the sampling rate} \\ \textbf{and} & s_{i} \in\, ]1, \widetilde{N^{Q}_{i}}[ \end{array}$$

The test $N^{Q} < N^{min}$ is performed after the allocation (collaboration) phase in order to encourage all data providers to participate. Otherwise, if a data provider does not participate in the allocation because locally approximating $Q$ is not possible, this may reveal information about the size of its data to other data providers.

5.3.2. Sampling phase

After the allocation phase, each data provider receives an allocation $s$: the number of clusters to process for the approximation of $Q$. Using the $\widehat{R}$ computed locally, the data provider computes the sampling probabilities for $C^{Q}$ and then performs unequal probability sampling to randomly select $s$ clusters. Since the sampling probabilities are computed based on the rows (individuals) in the database, the result of the sampling (choices) may leak information about the presence/absence of any individual. To guarantee DP, our system uses the Exponential Mechanism (EM) to select the $s$ clusters $C^{Q}_{S} \subset C^{Q}$ (Algorithm 2) while consuming a privacy budget of $\epsilon^{S}$.

Algorithm 2 EM_sampling
Input: $C^{Q}$: set of clusters, $\widehat{R}$: set of the $R$'s corresponding to $C^{Q}$, $s$: sample size, $\epsilon^{S}$: total budget
1: $P \leftarrow \text{get\_sampling\_probabilities}(\widehat{R})$ ▷ Equation 1
2: $P^{EM} \leftarrow [\,]$
3: $\epsilon^{s} \leftarrow \epsilon^{S} / s$
4: for $i \in [1, N^{Q}]$ do
5:     $P^{EM}[i] \leftarrow \exp\left(\frac{\epsilon^{s} \times P[i]}{2 \times \Delta p}\right)$
6: end for
7: $C^{Q}_{S} \leftarrow \text{random\_choice}(C^{Q}, P^{EM}, s)$
8: Return $C^{Q}_{S}, P$

The score of the $i$-th cluster $C_{i} \in C^{Q}$ is its own sampling probability $p_{i}$ (Algorithm 2 line 1), which means the scoring function $L$ of EM is defined by the computation in Equation 1. So to calibrate the noise (randomness) of EM, we must find the sensitivity of this function $L$ to the absence/presence of any individual in the database.

Consider two neighbouring databases $T$ and $T'$, where $T'$ is obtained by adding any random record (which represents an individual) to $T$ at any possible cluster. Given a range query $Q$, in order to measure $\Delta p_{i}$ (the sensitivity of $p_{i}$, which is the same as that of $L$) we assume the worst-case scenario for $T$ and $T'$: all clusters of $C^{Q}$ ($C^{Q} \subset T$) each have a record that corresponds to $Q$. In this case, their probabilities are the same: $p = \frac{1}{N^{Q}}$. In $T'$, one record that matches $Q$ is added to another cluster $C'$ outside of $C^{Q}$. Thus $C'^{Q} = C^{Q} \cup \{C'\}$ and $N'^{Q} = N^{Q} + 1$, and for $Q$ all the clusters have the same sampling probability: $p' = \frac{1}{N^{Q}+1}$. So $\Delta p$ can be computed as follows:

(7) $$\Delta p \leq \left|\frac{1}{N^{Q}} - \frac{1}{N^{Q}+1}\right| \implies \Delta p \leq \frac{1}{N^{Q} \times (N^{Q}+1)}$$

We notice that $\Delta p$ depends on the query $Q$. To find the global maximum value of $\Delta p$, we replace $N^{Q}$ by its minimum possible value $N^{min}$.
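As a quick numerical illustration of this step (the values $N^{min} = 10$ and $N^{Q} = 50$ are assumed purely for the example), the bound of Equation 7 decreases as $N^{Q}$ grows, so evaluating it at $N^{min}$ covers every admissible query:

```latex
\[
\left.\frac{1}{N^{Q}(N^{Q}+1)}\right|_{N^{Q}=50} = \frac{1}{2550} \approx 0.00039
\;\le\;
\left.\frac{1}{N^{Q}(N^{Q}+1)}\right|_{N^{Q}=N^{min}=10} = \frac{1}{110} \approx 0.0091 .
\]
```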

Theorem 5.2 (Sensitivity of sampling probability).

For any two neighbouring databases $T$, $T'$ the sensitivity of the sampling probability of any cluster $C$ is bounded by:

$$\Delta p = \max_{T,T'} \left\| p_{C}(T) - p_{C}(T') \right\|_{1} = \frac{1}{N^{min} \times (N^{min}+1)}$$

where $\left\|\cdot\right\|_{1}$ is the $L_{1}$ norm.

In Algorithm 2 line 5, this sensitivity $\Delta p$ is used for sampling with EM. To manage the total budget $\epsilon^{S}$ allocated to EM in order to safely make $s$ selections (Algorithm 2 line 7), we set $\epsilon^{s} = \frac{\epsilon^{S}}{s}$ as the budget of each random selection (Algorithm 2 line 3).
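The sampling step of Algorithm 2 can be sketched in Python as follows. Several points are assumptions of this example rather than statements of the paper: the sampling probabilities of Equation 1 are passed in precomputed, the $s$ selections are drawn without replacement, and the scores are shifted by their maximum before exponentiation for numerical stability (this leaves the normalised selection probabilities unchanged). The name `em_sampling` simply mirrors the algorithm title.

```python
import math
import random


def em_sampling(clusters, probs, s, eps_total, n_min, rng=random):
    """Select s clusters with the Exponential Mechanism (sketch of Algorithm 2)."""
    if not 0 < s <= len(clusters):
        raise ValueError("s must be between 1 and the number of clusters")
    delta_p = 1.0 / (n_min * (n_min + 1))      # sensitivity from Theorem 5.2
    eps_s = eps_total / s                       # budget of each selection
    scores = [eps_s * p / (2.0 * delta_p) for p in probs]

    remaining = list(range(len(clusters)))
    chosen = []
    for _ in range(s):
        # Shift by the current maximum so exp() cannot overflow.
        top = max(scores[i] for i in remaining)
        weights = [math.exp(scores[i] - top) for i in remaining]
        pick = rng.choices(remaining, weights=weights, k=1)[0]
        remaining.remove(pick)                  # assumption: no replacement
        chosen.append(clusters[pick])
    return chosen, probs


if __name__ == "__main__":
    picked, _ = em_sampling(["c0", "c1", "c2", "c3"], [0.4, 0.3, 0.2, 0.1],
                            s=2, eps_total=1.0, n_min=4)
    print(picked)
```

Note how a small $\Delta p$ (large $N^{min}$) amplifies the differences between scores, so the selection stays close to the original unequal probability sampling, while a small per-selection budget $\epsilon^{s}$ pushes it towards uniform sampling.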

5.3.3. Approximation phase

To obtain the final result from $C^{Q}_{S}$, each data provider uses the estimator $E$ defined in Equation 3. In order to release the final results securely with DP guarantees, well-calibrated noise is added to the final answer using the Laplace Mechanism. To apply the Laplace Mechanism, we need to find the sensitivity $\Delta_{E}$ of the estimator. Let us define $\mathbb{E}(C,Q,p) = \frac{Q(C)}{p}$. We can rewrite $E$ as follows:

(8) $$E(Q, C^{Q}_{S}) = \frac{1}{s}\sum_{i=1}^{s} \mathbb{E}(Q, C_{i}, p_{i}) \quad \text{where } s \text{ is the size of } C^{Q}_{S}$$

This implies that:

(9) $$\Delta_{E} = \frac{1}{s}\sum_{i=1}^{s} \Delta_{\mathbb{E}}$$

To find $\Delta_{E}$, we focus on finding $\Delta_{\mathbb{E}}$ and deduce $\Delta_{E}$ afterwards from this implication. Given that $\mathbb{E}(C,Q,p) = \frac{Q(C)}{p}$ is a fraction of two real values, this hints that its sensitivity might be unbounded, similarly to the Average operator (Near and Abuah, 2021). Upon further analysis (see Appendix B), we find that $\Delta_{\mathbb{E}}$ is unbounded, which implies that $\Delta_{E}$ is also unbounded.

Theorem 5.3 (Sensitivity of estimator $\mathbb{E}$).

For any two neighbouring databases $T$, $T'$ the sensitivity of the estimator $\mathbb{E}$ for any cluster $C$ and query $Q$ is unbounded:

$$\Delta_{\mathbb{E}} = \max_{T,T'} \left\| \mathbb{E}(Q,C) - \mathbb{E}(Q,C') \right\|_{1} \geq \frac{N \times S^{D}}{2} - 1$$

where $\left\|\cdot\right\|_{1}$ is the $L_{1}$ norm.

See Appendix B for a proof of Theorem 5.3.

Given that a global sensitivity does not exist, we resort to the Local Sensitivity (LS), which is measured based on the database instance $T$. For any database $T'$ neighbouring $T$, obtained by adding one row (one individual) that matches the query $Q$, we can distinguish four scenarios for a cluster $C \in C^{Q}$ (we focus on one cluster $C$ because we are looking for $\Delta_{\mathbb{E}}$) that might affect $\mathbb{E}$:

  • Scenario 1: Cluster $C$ did not receive the new row, but another cluster did.

  • Scenario 2: Cluster $C$ did receive the new row.

  • Scenario 3: Cluster $C$ did not receive the new row, but another cluster has been added to $C^{Q}$, such that $N'^{Q} = N^{Q} + 1$.

  • Scenario 4: Cluster $C$ did receive the new individual, but it only adds $+1$ to the $Measure$ attribute of an existing aggregate row.

Our aim is to find the upper bound of $LS_{\mathbb{E}}$, thus we must consider the distance that provides the largest sensitivity. An analysis of each of these scenarios (see Appendix B.2) showed that, under a certain condition, either scenario 1 or scenario 4 will yield the biggest distance. For a given cluster $C$, we can choose the dominant scenario (the one that yields the biggest $LS_{\mathbb{E}}$) between scenarios 1 and 4 without needing to compute either of them.

Theorem 5.4 (Dominant distance LS).

Neighbouring scenario 1 gives a bigger distance than scenario 4 iff:

$$Q(C) > \frac{\sum_{R \in \widehat{R}} R}{\Delta_{R}}$$

See Appendix B.2 for proof.

Since $LS_{\mathbb{E}}$ is computed based on $T$, it cannot be used directly to inject noise because the scale of the noise may reveal sensitive information about $T$ (Near and Abuah, 2021). To avoid such information leakage, we use the smooth sensitivity framework (Nissim et al., 2007) to find a safer upper bound $S\_LS_{\mathbb{E}}$ for the local sensitivity $LS_{\mathbb{E}}$. So we redefine our $LS_{\mathbb{E}}$ in terms of a distance $k$ between $T$ and $T'$:

  • Scenario 1: $LS^{k}_{\mathbb{E}} = k \times \frac{Q(C) \times \Delta_{R}}{R}$

  • Scenario 4: $LS^{k}_{\mathbb{E}} = k \times \frac{1}{p}$

See Appendix B.2 for proof.

The safe smooth upper bound $S\_LS_{\mathbb{E}}$ is defined as follows:

(10) $$S\_LS_{\mathbb{E}} = \max_{k=0,1,\ldots,n} \left\{ e^{-\beta k} \times LS^{k}_{\mathbb{E}} \right\}$$

where $\beta = \frac{\epsilon^{E}}{2 \times \ln(2/\delta)}$ and $(\epsilon^{E}, \delta)$ is the privacy budget allocated for releasing the final result.
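The computation of Equation 10 for a single cluster, together with the scenario choice of Theorem 5.4, can be sketched as below. This is an illustration under assumptions: the function signature is hypothetical, the quantities $Q(C)$, $R$, $p$ and $\sum_{R \in \widehat{R}} R$ are taken as precomputed inputs, and the largest distance `k_max` is supplied by the caller instead of being derived inside the function.

```python
import math


def smooth_ls(q_c, r, p, sum_r, delta_r, eps_e, delta, k_max):
    """Smooth upper bound S_LS of the local sensitivity for one cluster.

    q_c: Q(C), r: R of the cluster, p: its sampling probability,
    sum_r: sum of all R in R^, delta_r: Delta_R, (eps_e, delta): release budget.
    """
    beta = eps_e / (2.0 * math.log(2.0 / delta))
    # Theorem 5.4: scenario 1 dominates iff Q(C) > sum(R) / Delta_R.
    if q_c > sum_r / delta_r:
        unit = q_c * delta_r / r       # scenario 1: LS^k = k * Q(C) * Delta_R / R
    else:
        unit = 1.0 / p                 # scenario 4: LS^k = k * 1 / p
    # Equation 10: maximise e^{-beta k} * LS^k over k = 0, 1, ..., k_max.
    return max(math.exp(-beta * k) * k * unit for k in range(k_max + 1))


if __name__ == "__main__":
    print(smooth_ls(q_c=3, r=2.0, p=0.25, sum_r=10.0, delta_r=0.19,
                    eps_e=1.0, delta=1e-5, k_max=100))
```

Because only one scenario is evaluated per cluster and each term is a closed-form expression in $k$, the loop costs a handful of floating-point operations per step.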

Algorithm 3 Estimate_Q
Input: $Q$: query, $C^{Q}_{S}$: clusters, $(\epsilon^{E}, \delta)$: budget, $SMC$: bool
1: $result \leftarrow \text{approximate\_Q}(Q, C^{Q}_{S})$ ▷ Equation 3
2: $S\_LS \leftarrow [\,]$
3: for $i \in [1, N^{Q}_{S}]$ do
4:     $S\_LS[i] \leftarrow smooth\_LS(Q, C^{Q}_{S}[i], \epsilon^{E}, \delta)$ ▷ Equation 10
5: end for
6: $LS\_smooth \leftarrow average(S\_LS)$ ▷ Equation 9
7: if $SMC$ then
8:     $send\_secure(result, LS\_smooth)$
9: else
10:     $dp\_result \leftarrow result + Lap\left(\frac{2 \times LS\_smooth}{\epsilon^{E}}\right)$
11:     $send(dp\_result)$
12: end if

Based on the definitions we gave for $LS^{k}_{\mathbb{E}}$, the computational overhead of computing the smooth sensitivity for each cluster $C \in C^{Q}_{S}$ is negligible because: i) all the $R$'s and $p$'s are computed before this step and are reused in each iteration over $k$; ii) the maximum value of $k$ (steps) is bounded by $k = \frac{1}{1-e^{\beta}} + 1$ (see Appendix B.3 for proof), which guarantees that the process terminates; iii) Theorem 5.4 makes it possible to determine which scenario is dominant for any given cluster, so only one $S\_LS_{\mathbb{E}}$ is computed.

Algorithm 3 describes the process of estimating $Q$ over the subset of clusters $C^{Q}_{S}$. It starts in line 1 by estimating $Q$ according to Equation 3. Then it proceeds to compute the smooth sensitivity (Lines 2-6), where the function $smooth\_LS$ computes the smooth sensitivity $S\_LS_{\mathbb{E}}$ for each cluster $C \in C^{Q}_{S}$ as described in Equation 10. Two options are then available, depending on the setup chosen by the data providers. In the first, each data provider computes and sends a DP result to the aggregator (Algorithm 3, Lines 10–11), and the aggregator returns the sum to the user. In the second, the data providers share their estimations and computed sensitivities (Algorithm 3, Line 8) with the aggregator securely using SMC, and obliviously compute the sum of the estimations and the max sensitivity to perturb the final result with the Laplace Mechanism.
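The local (non-SMC) branch of Algorithm 3 can be sketched in Python as follows. The sketch assumes that the per-cluster query answers $Q(C_{i})$, sampling probabilities $p_{i}$ and smooth sensitivities are already available; the helper names `hh_estimate`, `dp_release` and `laplace` are hypothetical, and the Laplace sampler is an implementation choice of this example.

```python
import random


def laplace(scale, rng=random):
    # Difference of two i.i.d. exponentials with mean `scale` is Laplace(0, scale).
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def hh_estimate(answers, probs):
    """Equation 8: average of Q(C_i) / p_i over the sampled clusters."""
    s = len(answers)
    return sum(q / p for q, p in zip(answers, probs)) / s


def dp_release(result, smooth_sens, eps_e, rng=random):
    """Lines 6 and 10 of Algorithm 3: average the per-cluster smooth
    sensitivities (Equation 9), then add Laplace noise of scale
    2 * LS_smooth / eps_e to the estimate."""
    ls_smooth = sum(smooth_sens) / len(smooth_sens)
    return result + laplace(2.0 * ls_smooth / eps_e, rng)


if __name__ == "__main__":
    est = hh_estimate([10, 20], [0.5, 0.25])
    print(est, dp_release(est, [2.0, 4.0], eps_e=1.0))
```

In the SMC branch, `hh_estimate` would still run locally, but the noise would be injected once by the aggregation protocol rather than by each provider.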

5.4. Privacy accounting

In the online query answering setting under DP, the end user is limited by a total privacy budget of $(\xi, \psi)$. For each query $Q$, a budget $(\epsilon, \delta)$ is consumed in order to publish the answer, and the end user can interact with the system as long as the total budget $(\xi, \psi)$ is not consumed. In this section, we track the consumption of the privacy budget $\epsilon$ for each query.
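A minimal sketch of such a per-user budget tracker is given below. It assumes that the budgets of successive queries simply add up (basic sequential composition across queries), which is an assumption of this example; the class name `BudgetAccountant` and its methods are hypothetical.

```python
class BudgetAccountant:
    """Tracks how much of the total budget (xi, psi) a user has consumed."""

    def __init__(self, xi, psi):
        self.xi = xi
        self.psi = psi
        self.spent_eps = 0.0
        self.spent_delta = 0.0

    def can_answer(self, eps, delta):
        # A small tolerance absorbs floating-point rounding in the sums.
        return (self.spent_eps + eps <= self.xi + 1e-12
                and self.spent_delta + delta <= self.psi + 1e-12)

    def charge(self, eps, delta):
        """Consume (eps, delta) for one query, or refuse if the budget is spent."""
        if not self.can_answer(eps, delta):
            raise RuntimeError("total privacy budget exhausted")
        self.spent_eps += eps
        self.spent_delta += delta


if __name__ == "__main__":
    acc = BudgetAccountant(xi=1.0, psi=1e-4)
    acc.charge(0.5, 5e-5)
    print(acc.can_answer(0.5, 5e-5), acc.can_answer(0.6, 0.0))
```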

In our proposed protocol, the data providers do not share their data, and $Q$ is processed (data access and publishing) in parallel by each data provider. We can therefore track the consumption on one data provider and, based on the parallel composition property of DP, deduce the budget consumption for $Q$ on the full system. A data provider starts by publishing $N^{Q}$ and $\text{Avg}(\widehat{R})$ using the Laplace mechanism for the allocation phase, while consuming a total budget of $\epsilon^{O}$. Based on the post-processing property of DP, obtaining the sample size $s$ is DP. Afterwards, each data provider uses the Exponential Mechanism to sample a subset $C^{Q}_{S} \subset C^{Q}$ while consuming a budget of $\epsilon^{S}$. To publish an estimation of $Q$, each data provider uses the Laplace mechanism once more and consumes a budget of $\epsilon^{E}$. The final step does not in fact guarantee pure DP, since the smooth sensitivity has a failure probability $\delta$. Based on the sequential composition property of DP, the total budget is $(\epsilon = \epsilon^{O} + \epsilon^{S} + \epsilon^{E}, \delta)$. Given the parallel composition property, the budget consumption for $Q$ is $(\epsilon, \delta)$.

If the data providers use SMC to inject a single noise, then by the parallel composition property the data providers consume $\epsilon^{O}+\epsilon^{S}$ for the local computation. Afterwards, they collectively consume $\epsilon^{E}$ (once) for publishing the result. By the sequential composition property of DP, the budget consumption for $Q$ is $(\epsilon=\epsilon^{O}+\epsilon^{S}+\epsilon^{E},\delta)$.

Based on these results, a set of hyperparameters can be set in our system (by the database administrator, for example) to regulate how the budget $\epsilon$ is distributed across the steps of query processing.
Let $hp_{1}$, $hp_{2}$ and $hp_{3}$ be this set of hyperparameters (where $hp_{i}\in\,]0,1[$ and $hp_{1}+hp_{2}+hp_{3}=1$) such that: $\epsilon^{O}=hp_{1}\times\epsilon$, $\epsilon^{S}=hp_{2}\times\epsilon$ and $\epsilon^{E}=hp_{3}\times\epsilon$.
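The accounting described in this section can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation used in our system: the names `split_epsilon` and `BudgetAccountant` are hypothetical, and the per-query budgets $(\epsilon,\delta)$ are assumed to simply add up across queries (sequential composition) until the total budget $(\xi,\psi)$ is reached.

```python
from dataclasses import dataclass


def split_epsilon(epsilon, hp1, hp2, hp3):
    """Split a per-query epsilon into (eps_O, eps_S, eps_E)."""
    if not all(0.0 < hp < 1.0 for hp in (hp1, hp2, hp3)):
        raise ValueError("each hyperparameter must lie in ]0, 1[")
    if abs(hp1 + hp2 + hp3 - 1.0) > 1e-9:
        raise ValueError("hyperparameters must sum to 1")
    return hp1 * epsilon, hp2 * epsilon, hp3 * epsilon


@dataclass
class BudgetAccountant:
    """Tracks consumption of the total budget (xi, psi) of one end user."""
    xi: float
    psi: float
    spent_eps: float = 0.0
    spent_delta: float = 0.0

    def can_afford(self, epsilon, delta):
        # A small tolerance absorbs floating-point rounding.
        return (self.spent_eps + epsilon <= self.xi + 1e-12
                and self.spent_delta + delta <= self.psi + 1e-12)

    def charge(self, epsilon, delta):
        # Sequential composition: per-query budgets add up.
        if not self.can_afford(epsilon, delta):
            raise RuntimeError("privacy budget exhausted")
        self.spent_eps += epsilon
        self.spent_delta += delta
```

For example, `split_epsilon(1.0, 0.1, 0.1, 0.8)` returns the three step budgets for a query with $\epsilon=1$, and an accountant created with `BudgetAccountant(xi=2.0, psi=1e-2)` accepts two such queries with $\delta=10^{-3}$ before refusing a third.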

6. Evaluation

6.1. Setup

Datasets. We used two big datasets: (i) Adult (Becker and Kohavi, 1996) contains demographic and income information for individuals, with $15$ dimensions and $48\times 10^{3}$ records, synthetically scaled up to $4\times 10^{6}$ records. (ii) Amazon Review (Ni et al., 2019) contains reviews from Amazon clients across different product categories, with only three "range queryable" dimensions and $231\times 10^{6}$ records ($\sim 120$ Gb). We synthetically added three randomly populated dimensions and random records to reach $4\times 231\times 10^{6}$ records.

A count tensor with column Measure is created from each dataset, aggregating six dimensions of Adult and one dimension of Amazon Review.

Queries and Workloads. We generated random ranges for the queries and ran only those that lead to approximation ($N^{min}<N^{Q}$) on all data providers. A workload $(m,n)$ is a set of $m$ distinct queries with ranges over $n$ dimensions.
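A workload $(m,n)$ of this kind could be generated along the following lines. This is only a sketch: the function `random_workload`, the dimension names and their bounds are hypothetical, and the filter that keeps only queries with $N^{min}<N^{Q}$ on all data providers is not reproduced here because it depends on the data.

```python
import random


def random_workload(m, n, domains, seed=0):
    """Return m distinct range queries, each constraining n dimensions.

    domains maps a dimension name to its inclusive (low, high) bounds.
    A query is a tuple of (dimension, low, high) triples.
    """
    if n > len(domains):
        raise ValueError("n cannot exceed the number of dimensions")
    rng = random.Random(seed)
    workload = set()
    # Loops until m distinct queries exist, so m must not exceed the
    # number of distinct queries the domains allow.
    while len(workload) < m:
        dims = sorted(rng.sample(sorted(domains), n))
        query = []
        for dim in dims:
            low, high = domains[dim]
            a, b = sorted((rng.randint(low, high), rng.randint(low, high)))
            query.append((dim, a, b))
        workload.add(tuple(query))
    return sorted(workload)


# Hypothetical domains, for illustration only.
domains = {"age": (17, 90), "hours": (1, 99), "education": (1, 16)}
workload = random_workload(m=5, n=2, domains=domains)
```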

Metrics. An online query is useful if it has a low error rate and a low processing time. To measure the query error, we used $\text{Relative error}=\frac{|\text{answer}-\text{estimation}|}{\text{answer}}$. For performance in terms of response time, we used $\text{Speed-up}=\frac{\text{time of normal computation}}{\text{time of estimate computation}}$.
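The two metrics translate directly into code. The sketch below uses hypothetical function names and assumes that the exact answer is strictly positive, which is the case for non-empty COUNT queries and for SUM over a count measure.

```python
def relative_error(answer, estimation):
    """|answer - estimation| / answer, for a strictly positive exact answer."""
    if answer <= 0:
        raise ValueError("the exact answer must be strictly positive")
    return abs(answer - estimation) / answer


def speed_up(normal_time, estimate_time):
    """Ratio of the exact execution time to the approximate execution time."""
    if estimate_time <= 0:
        raise ValueError("the estimate time must be strictly positive")
    return normal_time / estimate_time
```

For instance, an estimation of 190 for an exact answer of 200 gives a relative error of 5%, and an approximate execution taking 1 second instead of 8 seconds gives a speed-up of 8.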

Configuration. In our experiments, we assumed one aggregator and four data providers, each data provider having its own database. The Adult and Amazon Review datasets are horizontally partitioned equally across the data providers.

Source code. Based on PostgreSQL (footnote 5: https://developers.google.com/optimization), our solution (footnote 6: https://github.com/AlaEddineLaouir/Federated-Range-Queries.git), coded in Python, uses the following libraries: (i) OrTools (footnote 7: https://developers.google.com/optimization) as solver; (ii) Pyro5 (footnote 8: https://pyro5.readthedocs.io/en/latest/index.html) as communication medium; and (iii) MPyC (footnote 9: https://mpyc.readthedocs.io/en/latest/mpyc.html) as SMC environment. Our implementation is a proof of concept in which the clusters of the original table are other, smaller tables.

Hyperparameters. In our experiments, the total privacy budget $(\epsilon,\delta)$ for each query is set with $\delta=10^{-3}$ and $\epsilon=1$ (unless other values are indicated for $\epsilon$). The budget $\epsilon$ is shared between the steps of our solution as follows: $\epsilon^{O}=0.1\times\epsilon$, $\epsilon^{S}=0.1\times\epsilon$ and $\epsilon^{E}=0.8\times\epsilon$. To get clusters of the same size, we set the cluster size $S$ to $1\%$ and $0.5\%$ of the total size $T^{a}$ of each data provider for Adult and Amazon Review, respectively.

Metadata space allocation. The metadata for the Amazon Review dataset was about 11 MB (56 KB/cluster). For the Adult dataset, it occupied 6.4 MB (64 KB/cluster).

Hardware (footnote 10: Grid5000, Grisou cluster, https://www.grid5000.fr/w/Nancy:Hardware). For each of the data providers and the aggregator, we allocated a dedicated server with the following configuration: 2 x Intel Xeon E5-2630 v3 (8 cores/CPU, x86_64), 128 GB of RAM, a 1.2 TB HDD, and a network with 1 Gbps + 4 x 10 Gbps (SR-IOV).

6.2. Dimension-based analysis

In these experiments, we evaluated the impact of the number of dimensions in queries on accuracy. To this end, we generated random workloads $(m,n)$ with $m=100$ distinct queries (SUM and COUNT) and dimension $n\in[2,7]$ for Adult and $n\in[2,5]$ for Amazon Review. We set the sampling rate to $5\%$ and $20\%$ for the Amazon Review and Adult datasets, respectively.

The results presented in Figure 4 show that our solution achieves high accuracy for COUNT and SUM queries. The relative error is less than $2.5\%$ (resp. $11\%$) on average for COUNT queries on Amazon Review (resp. Adult). For SUM queries, the error is less than $5\%$ (resp. $17\%$) on Amazon Review (resp. Adult). This difference is due to the size difference between the databases: in big tables, query results are larger (they contain more data) and are therefore less affected by the noise of the Laplace mechanism. Interestingly, the results also indicate that queries become more accurate as the number of dimensions decreases. Specifically, with workloads having only $2$ dimensions on both datasets, we reached an error close to $0\%$. This behavior matches our expectations because, in Equation 1, we approximate $R$ for each cluster, and this approximation gets closer to the exact $R$ as the number of dimensions decreases. We thus obtain more accurate sampling probabilities, which improves the estimation of the final result.

For the speed-up, the results in Figure 7 show that the higher the number of dimensions, the smaller the speed-up: it drops from approximately $8x$ to $6x$ as the number of dimensions increases from $2$ to $5$ on the Amazon Review dataset. This drop is attributed to the sampling probabilities approximation phase, where our algorithm looks up the preprocessed metadata: the higher the number of dimensions, the more metadata it needs to look up. However, this effect becomes negligible on larger databases, since even in these results the speed-up remains very significant.

Figure 4. Dimension-based analysis

6.3. Sampling rate-based analysis

In this analysis, we examined the effect of the sampling rate on query quality. For each database, we generated two random workloads for COUNT and SUM queries with $m=100$ and $n=4$. We varied the sampling rate between $5\%$ and $20\%$ for each experiment and measured the quality obtained in terms of accuracy and speed-up. From the results in Figure 5, we observe that a higher sampling rate provides slightly better accuracy, reaching a relative error of less than $1\%$ with a $20\%$ sampling rate for COUNT queries on the Amazon Review dataset.

Figure 5. Sampling rate-based analysis

Regarding the speed-up, we note that our solution reaches up to $7x$ compared to a normal execution (without approximation) on Amazon Review (with $4$-dimensional queries). Additionally, the speed-up gains on Amazon Review are $4x$ more significant than those on Adult, which indicates that our solution provides more speed-up for larger datasets. The results in Figure 5 also make the tradeoff between speed-up and accuracy noticeable: the larger the sampling rate, the smaller the speed-up, whereas accuracy improves with higher sampling rates. Based on the results of this experiment, the accuracy gains obtained with higher sampling rates are very costly in terms of speed-up. It is nevertheless up to the users (data analysts) to define the sampling rate according to their needs.

6.4. Privacy budget-based analysis

In these experiments, we analyzed the effect of the privacy budget $\epsilon$ on query quality. We generated two random workloads with $m=100$ and $n=4$ for COUNT and SUM queries and set the sampling rate to $5\%$ and $10\%$ for Amazon Review and Adult, respectively. We varied $\epsilon$ between $0.1$ and $1.3$ and captured the performance on each workload. From the results in Figure 6, we can immediately observe the typical trend of any DP mechanism (a larger $\epsilon$ leads to better accuracy).

Figure 6. Epsilon-based analysis
Figure 7. Impact of dimension and $\epsilon$ on speed-up

Interestingly, SUM queries provide better utility (lower relative error) than COUNT queries. This happens because SUM queries yield more substantial results (larger query responses) than COUNT queries, making them less affected by the noise added to the response. A similar observation applies when comparing results between the two databases, with workloads on Amazon Review preserving more accuracy than those on Adult. This is attributed to the fact that the Amazon Review dataset is much larger than Adult, causing queries to be less affected by the added noise. Based on this observation, we can expect that, as the database size increases, our solution will achieve good accuracy even with smaller values of $\epsilon$. Regarding speed-up, the results in Figure 7 show that the value of $\epsilon$ has no effect.
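The effect of the answer size on the relative impact of the noise can be illustrated with a short calculation. For Laplace noise with scale $b=\Delta/\epsilon$, the expected absolute noise is $b$, so the expected relative perturbation of an exact answer $A$ is $b/A$. The sketch below is a simplification: it assumes a fixed sensitivity $\Delta$ (our solution uses the smooth sensitivity) and ignores the sampling error; the function name and the numeric values are hypothetical.

```python
def expected_relative_noise(answer, sensitivity, epsilon):
    """Expected |Laplace noise| divided by the exact answer.

    For Laplace noise with scale b = sensitivity / epsilon, E|noise| = b.
    """
    if answer <= 0 or sensitivity <= 0 or epsilon <= 0:
        raise ValueError("all arguments must be positive")
    return (sensitivity / epsilon) / answer


# Larger answers and larger epsilon both reduce the relative perturbation.
for answer in (1_000, 100_000):
    for epsilon in (0.1, 1.3):
        print(answer, epsilon, expected_relative_noise(answer, 1.0, epsilon))
```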

6.5. SMC vs DP in terms of sharing results

To examine the performance of our SMC-based solution for sharing final results, we conducted experiments using the Adult dataset split across four data providers. We generated five random two-dimensional COUNT queries. Each query was repeated five times (with and without SMC), and we measured the speed-up and the range of noise added using the Laplace mechanism at each iteration.

Figure 8. SMC effect on speed-up and accuracy

The results in Figure 8 show, for each query, the speed-up and the range of noise sampled using the Laplace mechanism at each iteration for both solutions. We notice that using SMC to share only the sensitivity and the local result does not produce significant overhead, which is consistent with the simulation results in Figure 1. Concerning the injected noise, which affects the precision of the query result, the use of SMC yields a more restricted range of perturbation. In contrast, if each data provider perturbs its local data without SMC, two cases can occur: (i) the noises from the data providers cancel each other out, or (ii) the noises accumulate. In the first case, the sum of the noises is close to zero because some are positive and others negative, which helps improve accuracy. In the second case, the worst case in which most of the noises have the same sign, the accuracy of the results is greatly affected.
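The difference between the two options can be quantified with a small simulation. A Laplace variable with scale $b$ has variance $2b^{2}$, so the sum of $k$ independent draws has standard deviation $b\sqrt{2k}$, against $b\sqrt{2}$ for a single draw. The sketch below is illustrative only: it assumes that the same scale $b$ is used in both settings, which need not hold in our system where the sensitivities differ, and all function names are hypothetical.

```python
import math
import random
import statistics


def laplace(scale, rng):
    # The difference of two i.i.d. exponentials with mean `scale`
    # follows a Laplace(0, scale) distribution.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def theoretical_noise_std(scale, providers, single_noise):
    """Std of the total noise: one draw (SMC) or one draw per provider."""
    draws = 1 if single_noise else providers
    return scale * math.sqrt(2 * draws)


def simulated_noise_std(scale, providers, single_noise, trials=20000, seed=0):
    """Empirical std of the total noise over many simulated queries."""
    rng = random.Random(seed)
    draws = 1 if single_noise else providers
    totals = [sum(laplace(scale, rng) for _ in range(draws))
              for _ in range(trials)]
    return statistics.pstdev(totals)
```

With four data providers and equal scales, the total noise without SMC has twice the standard deviation of the single shared noise, which is in line with the wider perturbation range observed without SMC.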

Based on the experiment results, a user/data provider can choose the appropriate query execution process (with or without SMC) based on their needs, preferring accuracy over speed-up or vice versa.

6.6. Resilience to Learning-Based Attacks

DP prevents membership attacks revealing the presence/absence of an individual in the database. In (Cormode, 2010), the author introduced a simple attack that allows the disclosure of an individual’s sensitive attribute $SA$ based on anonymized data. This attack relies on training a Naive Bayes Classifier (NBC) using the results of COUNT queries from a noisy database; the classifier is then used to predict the value of $SA$ based on a given set of $QI$ (quasi-identifier) attribute values of an individual. In our data model, $SA$ corresponds to one of the dimensions $d_{SA}\in D$, and $QI$ is the subset $D_{QI}\subseteq D\setminus\{d_{SA}\}$. Given $V_{QI}=\{v_{1},...,v_{|D_{QI}|}\}$ for $D_{QI}$, an NBC attaches a probability to each possible value $y$ of $d_{SA}$ ($y\in|d_{SA}|$). The predicted value $\hat{y}$ is the one with the highest probability according to Bayes' theorem (Cormode, 2010):

$$\hat{y}=\underset{y\in|d_{SA}|}{\arg\max}\; P(y)\prod_{i=1}^{|D_{QI}|}P(v_{i}|y)/P(v_{i})$$
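The prediction rule above can be sketched as follows. The function and the toy probability tables are hypothetical and only illustrate the $\arg\max$; in the attack, these probabilities would be estimated from the (noisy) query answers.

```python
def predict_sensitive_value(prior, conditional, marginal, qi_values):
    """Return the value y maximising P(y) * prod_i P(v_i | y) / P(v_i).

    prior[y] = P(y)
    conditional[i][(v, y)] = P(v | y) for the i-th QI dimension
    marginal[i][v] = P(v) for the i-th QI dimension
    """
    best_y, best_score = None, float("-inf")
    for y, p_y in prior.items():
        score = p_y
        for i, v in enumerate(qi_values):
            score *= conditional[i][(v, y)] / marginal[i][v]
        if score > best_score:
            best_y, best_score = y, score
    return best_y


# Toy tables with one QI dimension and two sensitive values.
prior = {"a": 0.5, "b": 0.5}
conditional = [{("x", "a"): 0.9, ("x", "b"): 0.1,
                ("z", "a"): 0.1, ("z", "b"): 0.9}]
marginal = [{"x": 0.5, "z": 0.5}]
print(predict_sensitive_value(prior, conditional, marginal, ["x"]))  # a
```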

To make these predictions, the classifier goes through a training phase during which it learns the conditional probabilities using the COUNT(*) (or SUM(Measure)) queries issued by the attacker to the database. The learned probabilities are saved and later used to make predictions. The number of queries $nQueries$ needed is:

$$nQueries=1+||d_{SA}||+||d_{SA}||\times\sum_{d_{QI}\in D_{QI}}||d_{QI}||$$

which is used to compute the size of the database, $P(y)$ and $P(v|y)/P(v)$ for all values and dimensions. For instance, consider a table T with $10000$ rows where $|d_{SA}|=[20,..,60]$ is the dimension for the Age attribute. To compute $P(Age=25)$, we use the following COUNT query: (SELECT COUNT(*) FROM T WHERE 25 <= Age AND Age <= 25) / 10000. This huge number of queries can easily be issued to a database published using a DP algorithm with a fixed privacy budget (e.g. PrivBayes (Zhang et al., 2017)), from which the attacker can infer some knowledge (Cormode, 2010; Gkountouna et al., 2022).
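To give an idea of the magnitude of $nQueries$, the formula can be evaluated as follows. The helper names and the domain sizes of the quasi-identifiers are hypothetical; only the structure of the formula comes from the text.

```python
def n_queries(sa_domain_size, qi_domain_sizes):
    """1 query for the table size, ||d_SA|| queries for P(y), and
    ||d_SA|| * ||d_QI|| queries per QI dimension for the joint counts."""
    return 1 + sa_domain_size + sa_domain_size * sum(qi_domain_sizes)


def probability_from_count(count, table_size):
    """Estimate a probability as a COUNT answer divided by the table size."""
    return count / table_size


# A sensitive dimension with 100 values and three QI dimensions with
# (hypothetical) domain sizes 10, 20 and 30.
print(n_queries(100, [10, 20, 30]))         # 6101
print(probability_from_count(250, 10000))   # 0.025
```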

However, the database is not published in our system. As we showed in Section 5.4, the attacker has a limited budget $(\xi>0,\psi>0)$, from which each issued query consumes a privacy budget $(\epsilon>0,\delta>0)$ based on sequential composition (3.9). Since $nQueries$ can be very large, the per-query budget must be very small, $\epsilon=\xi/nQueries$ and $\delta=\psi/nQueries$, thus losing the utility of query answers. An alternative to sequential composition is advanced composition (Lohr, 2009; Kairouz et al., 2015), which allows the queries to have a greater budget $\epsilon$ without exceeding $\xi$. With advanced composition, the budget of each query is: $\epsilon=\xi/\left(2\times\sqrt{2\times nQueries\times\log(\frac{1}{\delta})}\right)$ and $\delta=\psi/nQueries$.
We notice that $\xi/\left(2\times\sqrt{2\times nQueries\times\log(\frac{1}{\delta})}\right)>\xi/nQueries$ whenever $nQueries>8\times\log(\frac{1}{\delta})$, which holds once $nQueries$ is large and means that queries have better utility.
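The two per-query budgets can be compared numerically. The sketch below uses hypothetical function names and assumes that $\log$ denotes the natural logarithm. Rearranging the inequality between the two budgets shows that advanced composition only gives the larger $\epsilon$ when $nQueries>8\times\log(\frac{1}{\delta})$, which the small-$n$ case in the example makes visible.

```python
import math


def per_query_budget_sequential(xi, psi, n):
    """Per-query (epsilon, delta) under sequential composition."""
    return xi / n, psi / n


def per_query_budget_advanced(xi, psi, n):
    """Per-query (epsilon, delta) under the advanced composition formula."""
    delta = psi / n
    epsilon = xi / (2 * math.sqrt(2 * n * math.log(1 / delta)))
    return epsilon, delta


# Hypothetical number of queries; xi and psi are within the ranges we test.
for n in (10, 6101):
    seq_eps, _ = per_query_budget_sequential(100, 1e-6, n)
    adv_eps, _ = per_query_budget_advanced(100, 1e-6, n)
    print(n, seq_eps, adv_eps, adv_eps > seq_eps)
```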

To evaluate the resilience of our system against this learning-based attack, we tested both sequential and advanced composition with the two allowed queries, COUNT and SUM. We also considered parallel composition, which allows multiple attackers to form a coalition in which each of them executes only one query (to maximize utility) and combines its answer with those of the other attackers to train the classifier. The ingredients of our experiments are as follows:

Setup: We used the Adult dataset with four data providers. We selected $3$ dimensions of our table to be $D_{QI}$ and $1$ dimension to be $d_{SA}$, where $||d_{SA}||=100$ (i.e. the number of classes for the NBC). We also set $\psi=10^{-6}$ and varied $\xi$ between $1$ and $100$, since there is no standard value (Lohr, 2009; Laud and Pankova, 2019).

Evaluation: To assess the quality of the learning attack, we measured the accuracy of the NBC in predicting the value of $SA$ for each row in the original table: $accuracy=\frac{\text{number of correct predictions}}{\text{total number of predictions}}$.

| Scenario | $\xi=1$ | $\xi=20$ | $\xi=50$ | $\xi=100$ |
|---|---|---|---|---|
| Sequential / COUNT | $<1\%$ | $<1\%$ | $<1\%$ | $<1\%$ |
| Sequential / SUM | $<1\%$ | $<1\%$ | $<1\%$ | $<1\%$ |
| Advanced / COUNT | $<1\%$ | $<1\%$ | $<1\%$ | $<1\%$ |
| Advanced / SUM | $<1\%$ | $<1\%$ | $<1\%$ | $<1\%$ |
| Coalition / COUNT | $<1\%$ | $<1\%$ | $<1\%$ | $<1\%$ |
| Coalition / SUM | $<1\%$ | $<1\%$ | $<1\%$ | $<1\%$ |

Table 1. Inference accuracy based on $\xi$

The results in Table 1 show that in all scenarios the accuracy is $<1\%$. Since the $SA$ we used had $100$ possible values, this means that the trained classifier achieves an accuracy similar to that of randomly assigning a value for $SA$ in each row. Three reasons can explain the failure of the learning-based attack: (i) our system is interactive (the database is not released) and the budget is limited, so it is difficult for a single attacker to obtain good accuracy over a large number of queries; (ii) query answers in our system are approximated with random sampling, which introduces some error; (iii) the smooth sensitivity has a considerable scale, and for queries that collect small values, accuracy can be lost even for large values of $\epsilon$.

Similar results were obtained when fixing $\xi=100$ and varying the number of dimensions in $D_{QI}$ over 1, 3, 5 and 8. This shows the resilience of our system in different settings.

7. Discussion

In this section, we discuss the constraints, limits and points of improvement that could be integrated into our solution. In order to approximate the sampling probabilities in Section 5.2, we assumed that the dimensions are independent and that there is no correlation between them. However, this assumption is not valid in some cases. For example, if an individual’s $Age$ is less than $25$, this implies with a high probability that he/she is still studying ($profession=student$). Likewise, if $Age>65$, then most likely $profession=retired$. When it comes to range queries, capturing and managing these dependencies is non-trivial, so we leave it for future work.

We also restricted the data providers to using the same value of $S$ in order to approximate $R$. Otherwise, we cannot compare the $\text{Avg}(\widehat{R})$ in the allocation phase (Section 5.3.1). To agree on the same $S$, each data provider $\mathbb{S}_{i}$ can share its true $S_{i}$ with the others, and they then all use the maximum $S_{i}$ (which guarantees that all the computed $R$ values are $\leq 1$). The value of $S_{i}$ itself is not sensitive, since it is usually a constant in a database system. If it is nevertheless deemed sensitive in a particular case, data providers can simply share a randomly chosen value $S^{\prime}_{i}$ such that $S_{i}\leq S^{\prime}_{i}\leq S^{m}_{i}$, where $S^{m}_{i}$ is an upper bound chosen by each data provider (e.g. $S^{m}_{i}=2\times S_{i}$).
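The agreement on a common $S$ can be sketched as follows. The function name is hypothetical, cluster sizes are assumed to be integers (numbers of records), and the upper bound is taken as $S^{m}_{i}=factor\times S_{i}$ with the same factor for every provider, which is only one possible choice.

```python
import random


def agree_on_cluster_size(true_sizes, hide=False, factor=2, rng=None):
    """Return the common S: the maximum of the announced sizes.

    With hide=True, each provider announces a random S'_i with
    S_i <= S'_i <= factor * S_i instead of its true S_i.
    """
    rng = rng or random.Random()
    if hide:
        announced = [rng.randint(s, factor * s) for s in true_sizes]
    else:
        announced = list(true_sizes)
    return max(announced)
```

Because every announced value is at least the true $S_{i}$, the agreed maximum is never smaller than the largest true cluster size, so the bound on the computed $R$ values is preserved in both modes.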

In our solution, we focused on protecting the intermediate results (summary information) and the final result from inference attacks through the use of Differential Privacy. However, we have not directly addressed the risks associated with side-channel attacks. Thanks to the collaboration method that we propose, we avoid certain risks mentioned in (Qiu et al., 2023), such as memory access patterns and communication volumes, since all data-based computations are performed locally at each data provider and the communication cost is constant and independent of the query. We leave further consideration of this aspect of the problem to dedicated work.

Our solution serves as the first building block towards a more comprehensive solution that handles more complex queries, such as GROUP-BY queries. Integrating such clauses in the SQL query is not so trivial, and adding noise to the final result will not be enough to guarantee privacy (Desfontaines et al., 2020). Other aggregations, such as average, standard deviation, and variance, can be derived from SUM and COUNT using the sequential composition of DP. However, to handle other aggregations (such as Min, Max and Mode), different estimators are required.
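As an illustration of deriving an aggregation from SUM and COUNT, an average can be obtained as the ratio of two private answers. Taking the ratio is post-processing, and the budgets of the two underlying queries add up under sequential composition. The function below is a hypothetical sketch that takes already-perturbed answers as input; it does not generate the noise itself.

```python
def private_average(noisy_sum, sum_budget, noisy_count, count_budget):
    """Derive AVG from private SUM and COUNT answers.

    Budgets are (epsilon, delta) pairs; the returned budget is their sum.
    """
    if noisy_count <= 0:
        raise ValueError("the noisy count must be positive to form a ratio")
    eps = sum_budget[0] + count_budget[0]
    delta = sum_budget[1] + count_budget[1]
    return noisy_sum / noisy_count, (eps, delta)


avg, budget = private_average(5000.0, (0.5, 1e-3), 100.0, (0.5, 1e-3))
print(avg, budget)
```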

Finally, during our evaluation, we built a proof of concept of our solution on PostgreSQL. It would be interesting to incorporate it directly into any DBMS, which would further improve our results.

8. Conclusion

In our study, we introduced a lightweight collaborative approach for online range query approximation in a federated environment. Our experimental results demonstrated the performance improvements our solution is capable of delivering, with processing times improved by up to 8x compared to plain-text execution, while ensuring end-to-end privacy with minimal loss of accuracy. Our solution uses cluster sampling and query estimation techniques that take into account data distribution to preserve query utility in terms of speed and accuracy. This work lays a solid foundation for future work to handle more complex queries while maintaining the same level of performance.

References

  • Abowd (2018) John M Abowd. 2018. The US Census Bureau adopts differential privacy. In Proceedings of the 24th ACM SIGKDD international conference on knowledge discovery & data mining. 2867–2867.
  • Acharya et al. (1999) Swarup Acharya, Phillip B Gibbons, Viswanath Poosala, and Sridhar Ramaswamy. 1999. The aqua approximate query answering system. In Proceedings of the 1999 ACM SIGMOD international conference on Management of data. 574–576.
  • Agarwal et al. (2013) Sameer Agarwal, Barzan Mozafari, Aurojit Panda, Henry Milner, Samuel Madden, and Ion Stoica. 2013. BlinkDB: queries with bounded errors and bounded response times on very large data. In Proceedings of the 8th ACM European conference on computer systems. 29–42.
  • Agrawal et al. (2006) Rakesh Agrawal, Dmitri Asonov, Murat Kantarcioglu, and Yaping Li. 2006. Sovereign joins. In 22nd International Conference on Data Engineering (ICDE’06). IEEE, 26–26.
  • Ahmadvand et al. (2019) Hossein Ahmadvand, Maziar Goudarzi, and Fouzhan Foroutan. 2019. Gapprox: using gallup approach for approximation in big data processing. Journal of Big Data 6 (2019), 1–24.
  • Bater et al. (2017) Johes Bater, Gregory Elliott, Craig Eggen, Satyender Goel, Abel N Kho, and Jennie Rogers. 2017. SMCQL: Secure Query Processing for Private Data Networks. Proc. VLDB Endow. 10, 6 (2017), 673–684.
  • Bater et al. (2018) Johes Bater, Xi He, William Ehrich, Ashwin Machanavajjhala, and Jennie Rogers. 2018. Shrinkwrap: efficient sql query processing in differentially private data federations. Proceedings of the VLDB Endowment 12, 3 (2018).
  • Bater et al. (2020) Johes Bater, Yongjoo Park, Xi He, Xiao Wang, and Jennie Rogers. 2020. Saqe: practical privacy-preserving approximate query processing for data federations. Proceedings of the VLDB Endowment 13, 12 (2020), 2691–2705.
  • Becker and Kohavi (1996) Barry Becker and Ronny Kohavi. 1996. Adult. UCI Machine Learning Repository. DOI: https://doi.org/10.24432/C5XW20.
  • Bittau et al. (2017) Andrea Bittau, Úlfar Erlingsson, Petros Maniatis, Ilya Mironov, Ananth Raghunathan, David Lie, Mitch Rudominer, Ushasree Kode, Julien Tinnes, and Bernhard Seefeld. 2017. Prochlo: Strong privacy for analytics in the crowd. In Proceedings of the 26th symposium on operating systems principles. 441–459.
  • Cao et al. (2021) Lei Cao, Dongqing Xiao, Yizhou Yan, Samuel Madden, and Guoliang Li. 2021. ATLANTIC: making database differentially private and faster with accuracy guarantee. (2021).
  • Chaudhuri et al. (2007) Surajit Chaudhuri, Gautam Das, and Vivek Narasayya. 2007. Optimized stratified sampling for approximate query processing. ACM Transactions on Database Systems (TODS) 32, 2 (2007), 9–es.
  • Cormode (2010) Graham Cormode. 2010. Individual privacy vs population privacy: Learning to attack anonymization. arXiv preprint arXiv:1011.2511 (2010).
  • Desfontaines et al. (2020) Damien Desfontaines, James Voss, Bryant Gipson, and Chinmoy Mandayam. 2020. Differentially private partition selection. arXiv preprint arXiv:2006.03684 (2020).
  • Dwork et al. (2014) Cynthia Dwork, Aaron Roth, et al. 2014. The algorithmic foundations of differential privacy. Foundations and Trends® in Theoretical Computer Science 9, 3–4 (2014), 211–407.
  • Erlingsson et al. (2014) Úlfar Erlingsson, Vasyl Pihur, and Aleksandra Korolova. 2014. Rappor: Randomized aggregatable privacy-preserving ordinal response. In Proceedings of the 2014 ACM SIGSAC conference on computer and communications security. 1054–1067.
  • Eskandarian and Zaharia (2017) Saba Eskandarian and Matei Zaharia. 2017. Oblidb: Oblivious query processing for secure databases. arXiv preprint arXiv:1710.00458 (2017).
  • Gkountouna et al. (2022) Olga Gkountouna, Katerina Doka, Mingqiang Xue, Jianneng Cao, and Panagiotis Karras. 2022. One-off disclosure control by heterogeneous generalization. In 31st USENIX Security Symposium (USENIX Security 22). 3363–3377.
  • Goiri et al. (2015) Inigo Goiri, Ricardo Bianchini, Santosh Nagarakatte, and Thu D Nguyen. 2015. Approxhadoop: Bringing approximations to mapreduce frameworks. In Proceedings of the Twentieth International Conference on Architectural Support for Programming Languages and Operating Systems. 383–397.
  • Haas and König (2004) Peter J Haas and Christian König. 2004. A bi-level bernoulli scheme for database sampling. In Proceedings of the 2004 ACM SIGMOD international conference on Management of data. 275–286.
  • Hellerstein et al. (1997) Joseph M. Hellerstein, Peter J. Haas, and Helen J. Wang. 1997. Online Aggregation. In SIGMOD 1997, Proceedings ACM SIGMOD International Conference on Management of Data, May 13-15, 1997, Tucson, Arizona, USA. 171–182.
  • Kairouz et al. (2015) Peter Kairouz, Sewoong Oh, and Pramod Viswanath. 2015. The composition theorem for differential privacy. In International conference on machine learning. PMLR, 1376–1385.
  • Laud and Pankova (2019) Peeter Laud and Alisa Pankova. 2019. Interpreting epsilon of differential privacy in terms of advantage in guessing or approximating sensitive attributes. arXiv preprint arXiv:1911.12777 (2019).
  • Li et al. (2016) Feifei Li, Bin Wu, Ke Yi, and Zhuoyue Zhao. 2016. Wander join: Online aggregation via random walks. In Proceedings of the 2016 International Conference on Management of Data. 615–629.
  • Liagouris et al. (2021) John Liagouris, Vasiliki Kalavri, Muhammad Faisal, and Mayank Varia. 2021. Secrecy: Secure collaborative analytics on secret-shared data. arXiv preprint arXiv:2102.01048 (2021).
  • Lohr (2009) Sharon L. Lohr. 2009. Sampling: Design and Analysis.
  • Near and Abuah (2021) Joseph P. Near and Chiké Abuah. 2021. Programming Differential Privacy. Vol. 1. https://programming-dp.com/
  • Ni et al. (2019) Jianmo Ni, Jiacheng Li, and Julian McAuley. 2019. Justifying recommendations using distantly-labeled reviews and fine-grained aspects. In Proceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing (EMNLP-IJCNLP). 188–197.
  • Nissim et al. (2007) Kobbi Nissim, Sofya Raskhodnikova, and Adam Smith. 2007. Smooth sensitivity and sampling in private data analysis. In Proceedings of the thirty-ninth annual ACM symposium on Theory of computing. 75–84.
  • Olken and Rotem (1986) Frank Olken and Doron Rotem. 1986. Simple random sampling from relational databases. (1986).
  • Olken and Rotem (1995) Frank Olken and Doron Rotem. 1995. Random sampling from databases: a survey. Statistics and Computing 5 (1995), 25–42.
  • Piatetsky-Shapiro and Connell (1984) Gregory Piatetsky-Shapiro and Charles Connell. 1984. Accurate estimation of the number of tuples satisfying a condition. ACM Sigmod Record 14, 2 (1984), 256–276.
  • Qin and Rusu (2014) Chengjie Qin and Florin Rusu. 2014. PF-OLA: a high-performance framework for parallel online aggregation. Distributed and Parallel Databases 32 (2014), 337–375.
  • Qiu et al. (2023) Lina Qiu, Georgios Kellaris, Nikos Mamoulis, Kobbi Nissim, and George Kollios. 2023. Doquet: Differentially Oblivious Range and Join Queries with Private Data Structures. Proceedings of the VLDB Endowment 16, 13 (2023), 4160–4173.
  • Song et al. (2018) Guangxuan Song, Wenwen Qu, Xiaojie Liu, and Xiaoling Wang. 2018. Approximate calculation of window aggregate functions via global random sample. Data Science and Engineering 3 (2018), 40–51.
  • Team et al. (2017) ADP Team et al. 2017. Learning with privacy at scale. Apple Mach. Learn. J 1, 8 (2017), 1–25.
  • Wang and Jajodia (2008) Lingyu Wang and Sushil Jajodia. 2008. Security in Data Warehouses and OLAP Systems. In Handbook of Database Security - Applications and Trends, Michael Gertz and Sushil Jajodia (Eds.). Springer, 191–212.
  • Wu et al. (2010) Sai Wu, Beng Chin Ooi, and Kian-Lee Tan. 2010. Continuous sampling for online aggregation over multiple queries. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of data. 651–662.
  • Xu et al. (2008) Fei Xu, Christopher M. Jermaine, and Alin Dobra. 2008. Confidence bounds for sampling-based group by estimates. ACM Trans. Database Syst. 33, 3 (2008), 16:1–16:44.
  • Zhang et al. (2017) Jun Zhang, Graham Cormode, Cecilia M Procopiuc, Divesh Srivastava, and Xiaokui Xiao. 2017. Privbayes: Private data release via bayesian networks. ACM Transactions on Database Systems (TODS) 42, 4 (2017), 1–41.
  • Zhang et al. (2016) Xuhong Zhang, Jun Wang, and Jiangling Yin. 2016. Sapprox: Enabling efficient and accurate approximations on sub-datasets with distribution-aware online sampling. Proceedings of the VLDB Endowment 10, 3 (2016), 109–120.
  • Zheng et al. (2017) Wenting Zheng, Ankur Dave, Jethro G Beekman, Raluca Ada Popa, Joseph E Gonzalez, and Ion Stoica. 2017. Opaque: An oblivious and encrypted distributed analytics platform. In 14th USENIX Symposium on Networked Systems Design and Implementation (NSDI 17). 283–298.

Appendix A Sensitivity of the summarised information

In order to obtain the allocation (sample size) $s$ based on the inter/intra data provider data distribution, each data provider communicates $Avg(\widehat{R_{i}})$ and $N^{Q}$. The sensitivity of $N^{Q}$ is straightforward: given a query $Q$ and any two neighbouring databases $T$ and $T^{\prime}$, $\Delta_{N^{Q}}=\left|N^{Q}-N^{\prime Q}\right|\leq 1$, because adding or removing an individual adds or removes at most one cluster $C$ from $C^{Q}$. For the sensitivity of $Avg(\widehat{R_{i}})$, we first need to consider the sensitivity of a single $R$.

A.1. Sensitivity of $R$

Given a range query $Q$ and two neighbouring databases $T$ (with cluster $C$) and $T^{\prime}$ (with cluster $C^{\prime}$), we consider the case where $C^{\prime}$ has an additional row for an extra individual. This implies:

$$
\begin{array}{l}
R=\prod^{d\in D^{Q}}R^{d}\text{ and }R^{\prime}=\prod^{d\in D^{Q}}\left(R^{d}+\frac{1}{S}\right)\\
\Delta_{R}=\left|R^{\prime}-R\right|=\prod^{d\in D^{Q}}\left(R^{d}+\frac{1}{S}\right)-\prod^{d\in D^{Q}}R^{d}
\end{array}
\tag{11}
$$

where $D^{Q}$ is the set of dimensions defining $Q$. In order to obtain the upper bound of $\Delta_{R}$, we consider $R^{\prime}=1$, which implies $R=(1-\frac{1}{S})^{\left|D^{Q}\right|}$:

$$
\Delta_{R}=1-\left(1-\frac{1}{S}\right)^{\left|D^{Q}\right|}
\tag{12}
$$

Since the values of $\left|D^{Q}\right|$ and $S$ are publicly known, there is no information leak when using $\Delta_{R}$ based on these values. The other possible neighbouring scenarios are: 1) $C^{\prime}$ has one row less, which gives the same result as the previous case. 2) $C^{\prime}$ has a new/lost individual that only affects the column "Measure" of a row by $\pm 1$, in which case $\Delta_{R}=0$. 3) A cluster $C$ that was not in $C^{Q}$ has an additional row in $C^{\prime}$ and is therefore in $C^{Q}$ (or vice versa); in this case $\Delta_{R}=\frac{1}{S^{\left|D^{Q}\right|}}$. We can prove that $1-(1-\frac{1}{S})^{\left|D^{Q}\right|}\geq\frac{1}{S^{\left|D^{Q}\right|}}$:

$$
\begin{array}{l}
1-\left(1-\frac{1}{S}\right)^{\left|D^{Q}\right|}=1-\left(1-\frac{\left|D^{Q}\right|}{S}+\frac{\left|D^{Q}\right|\times(\left|D^{Q}\right|-1)}{2\times S^{2}}-\dots\right)\\
1-\left(1-\frac{1}{S}\right)^{\left|D^{Q}\right|}=\frac{\left|D^{Q}\right|}{S}-\frac{\left|D^{Q}\right|\times(\left|D^{Q}\right|-1)}{2\times S^{2}}+\dots
\end{array}
\tag{13}
$$

Since $S\gg D$, we can assume $1-(1-\frac{1}{S})^{\left|D^{Q}\right|}\approx\frac{\left|D^{Q}\right|}{S}$. Then:

$$
1-\left(1-\frac{1}{S}\right)^{\left|D^{Q}\right|}>\frac{1}{S^{\left|D^{Q}\right|}}\Leftrightarrow\frac{\left|D^{Q}\right|}{S}\geq\frac{1}{S^{\left|D^{Q}\right|}}
\tag{14}
$$

which is always true ($S\gg 0$ and $\left|D^{Q}\right|\geq 1$).
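The bound of Equation (12) and the inequality of Equations (13) and (14) can be checked numerically. The following Python sketch does so with exact fractions. It assumes, for illustration only, that each per-dimension ratio $R^{d}$ is a multiple of $1/S$ with $R^{d}+1/S\leq 1$; the values $S=10$ and $|D^{Q}|=3$ and the function names are hypothetical and not taken from our implementation.

```python
from fractions import Fraction
from itertools import product
from typing import Sequence


def delta_r_bound(S: int, num_dims: int) -> Fraction:
    """Upper bound of Eq. (12): 1 - (1 - 1/S)^|D^Q|."""
    return 1 - (1 - Fraction(1, S)) ** num_dims


def delta_r_exact(ratios: Sequence[Fraction], S: int) -> Fraction:
    """Eq. (11): |R' - R| when every per-dimension ratio R^d grows by 1/S."""
    r = Fraction(1)
    r_prime = Fraction(1)
    for r_d in ratios:
        r *= r_d
        r_prime *= r_d + Fraction(1, S)
    return abs(r_prime - r)


# Illustrative values (assumptions): S = 10 and |D^Q| = 3.
S, NUM_DIMS = 10, 3
bound = delta_r_bound(S, NUM_DIMS)

# Assumed grid: R^d in {0, 1/S, ..., (S-1)/S}, so that R^d + 1/S <= 1.
grid = [Fraction(k, S) for k in range(S)]
worst = max(delta_r_exact(rs, S) for rs in product(grid, repeat=NUM_DIMS))

print(worst == bound)                       # is the largest change reached at R' = 1?
print(bound >= Fraction(1, S ** NUM_DIMS))  # inequality of Eq. (13)-(14)
print(bound <= Fraction(NUM_DIMS, S))       # first-order term |D^Q| / S as a ceiling
```

Exact rational arithmetic is used here instead of floating point so that the comparison between the enumerated worst case and the closed-form bound is not affected by rounding.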

A.2. Sensitivity of $Avg(\widehat{R})$

The average of the $R$'s, $\widehat{R}$, of a data provider's set of clusters $C^{Q}$ is computed as follows: $Avg(\widehat{R})=\frac{\sum^{R\in\widehat{R}}R}{N^{Q}}$. For any two neighbouring databases $T$ and $T^{\prime}$, there are two cases where $Avg(\widehat{R})$ is affected: 1) one of the clusters in $C^{\prime Q}$ has an additional row compared to its counterpart in $C^{Q}$; 2) $C^{\prime Q}$ has a new cluster due to the presence of an individual, thus $N^{\prime Q}=N^{Q}+1$. This gives:

$$
\Delta_{Avg(\widehat{R})}=\left\{\begin{array}{l}
\left|\frac{\sum^{R\in\widehat{R}}R}{N^{Q}}-\frac{\Delta_{R}+\sum^{R\in\widehat{R}}R}{N^{Q}}\right|=\frac{\Delta_{R}}{N^{Q}}\\
\left|\frac{\sum^{R\in\widehat{R}}R}{N^{Q}}-\frac{\frac{1}{S^{\left|D^{Q}\right|}}+\sum^{R\in\widehat{R}}R}{N^{Q}+1}\right|
\end{array}\right.
\tag{15}
$$

We can simplify the second case of $\Delta_{Avg(\widehat{R})}$ as follows:

$$
\begin{array}{l}
\left|\frac{\sum^{R\in\widehat{R}}R}{N^{Q}}-\frac{\frac{1}{S^{\left|D^{Q}\right|}}+\sum^{R\in\widehat{R}}R}{N^{Q}+1}\right|=\left|\frac{Avg(\widehat{R})}{N^{Q}+1}-\frac{\frac{1}{S^{\left|D^{Q}\right|}}}{N^{Q}+1}\right|\\
\text{which is: }\leq\frac{1-\frac{1}{S^{\left|D^{Q}\right|}}}{N^{Q}+1}\leq\frac{1}{N^{Q}+1}
\end{array}
\tag{16}
$$

$N^{Q}$ can be replaced by its smallest possible value $N^{min}$ to maximise $\Delta_{Avg(\widehat{R})}$:

$$
\Delta_{Avg(\widehat{R})}\leq\left\{\begin{array}{l}
\frac{\Delta_{R}}{N^{min}}\\
\frac{1}{N^{min}+1}
\end{array}\right.\implies\Delta_{Avg(\widehat{R})}=max\left(\frac{\Delta_{R}}{N^{min}},\frac{1}{N^{min}+1}\right)
\tag{18}
$$

Since $\Delta_{R}$ and $N^{min}$ are not sensitive information, they can be used to express $\Delta_{Avg(\widehat{R})}$.
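A minimal Python sketch of Equation (18) is given next, together with a check of the second case of Equation (15) on one example. The values $S=10$, $|D^{Q}|=3$ and $N^{min}=5$, the five $R$ values, and the function names are assumptions made for illustration; they do not come from our experiments.

```python
from fractions import Fraction
from typing import Sequence


def delta_r(S: int, num_dims: int) -> Fraction:
    """Eq. (12): 1 - (1 - 1/S)^|D^Q|."""
    return 1 - (1 - Fraction(1, S)) ** num_dims


def delta_avg(S: int, num_dims: int, n_min: int) -> Fraction:
    """Eq. (18): max(Delta_R / N^min, 1 / (N^min + 1))."""
    return max(delta_r(S, num_dims) / n_min, Fraction(1, n_min + 1))


def new_cluster_change(rs: Sequence[Fraction], S: int, num_dims: int) -> Fraction:
    """Second case of Eq. (15): a new cluster with R = 1 / S^|D^Q| joins C^Q."""
    n_q = len(rs)
    new_r = Fraction(1, S ** num_dims)
    return abs(sum(rs) / n_q - (new_r + sum(rs)) / (n_q + 1))


# Illustrative values (assumptions): S = 10, |D^Q| = 3, N^min = 5.
sensitivity = delta_avg(10, 3, 5)

# Hypothetical R values of five clusters, each between 0 and 1.
example_rs = [Fraction(1), Fraction(1, 2), Fraction(3, 10), Fraction(1), Fraction(9, 10)]
change = new_cluster_change(example_rs, 10, 3)
print(sensitivity, change, change <= sensitivity)
```

With these values the second term of the maximum, $\frac{1}{N^{min}+1}$, is the larger one, so the new-cluster case determines the sensitivity in this example.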

Appendix B Sensitivity of the estimator $\mathbb{E}$

For the estimator used to approximate the result of $Q$, we first give the bound on its global sensitivity, then we show how we bound its local sensitivity.

B.1. Global sensitivity of the estimator $\mathbb{E}$

Consider a query $Q$ and two neighbouring databases $T$ and $T^{\prime}$ containing clusters $C\in C^{Q}$ and $C^{\prime}\in C^{\prime Q}$, where $C^{\prime}$ has an additional row that covers $Q$. Both the sampling probability $p$ and $Q(C)$ are affected by this additional row, giving $p^{\prime}$ and $Q(C^{\prime})$. This implies:

$$
\begin{array}{l}
\Delta_{\mathbb{E}}=\left|\frac{Q(C)}{p}-\frac{Q(C^{\prime})}{p^{\prime}}\right|\text{ where }Q(C^{\prime})=Q(C)+1\\
\Delta_{\mathbb{E}}=\left|\frac{Q(C)}{p}-\frac{Q(C)+1}{p^{\prime}}\right|=\frac{1}{p\times p^{\prime}}\times\left|Q(C)\times(p^{\prime}-p)-p\right|
\end{array}
\tag{19}
$$

Since $p\times p^{\prime}$ is in the denominator, we can minimise it to obtain the upper bound of $\Delta_{\mathbb{E}}$. Let us consider the case where $C$ contains only one row that covers $Q$, so that $R_{C}=\frac{1}{S^{\left|D^{Q}\right|}}$, and the remaining clusters in $C^{Q}$ fully cover $Q$, so that their $R=1$. This implies:

$$
p = \frac{1/S^{|D^Q|}}{N^Q - 1 + 1/S^{|D^Q|}} = \frac{1}{(N^Q - 1)\times S^{|D^Q|} + 1}
\tag{20}
$$

In $T'$, $C'$ has an additional row, so $p'$ is:

$$
p' = \frac{2^{|D^Q|}/S^{|D^Q|}}{N^Q - 1 + 2^{|D^Q|}/S^{|D^Q|}} = \frac{2^{|D^Q|}}{(N^Q - 1)\times S^{|D^Q|} + 2^{|D^Q|}}
\tag{21}
$$

From equations (20) and (21):

$$
\left\{
\begin{array}{l}
\dfrac{1}{p\times p'} = \dfrac{\left((N^Q-1)\times S^{|D^Q|}+1\right)\times\left((N^Q-1)\times S^{|D^Q|}+2^{|D^Q|}\right)}{2^{|D^Q|}} \\[2ex]
p'-p = \dfrac{\left(2^{|D^Q|}-1\right)\times(N^Q-1)\times S^{|D^Q|}}{\left((N^Q-1)\times S^{|D^Q|}+1\right)\times\left((N^Q-1)\times S^{|D^Q|}+2^{|D^Q|}\right)}
\end{array}
\right.
\tag{22}
$$

Which implies:

$$
\begin{aligned}
\Delta_{\mathbb{E}} &= \frac{1}{2^{|D^Q|}}\times\left|2^{|D^Q|-1}\times(N^Q-1)\times S^{|D^Q|}-2^{|D^Q|}\right| \\
\Delta_{\mathbb{E}} &= \frac{(N^Q-1)\times S^{|D^Q|}}{2}-1
\end{aligned}
\tag{23}
$$

This result shows that the sensitivity of this statistical estimator is very large and unbounded, since it grows with both $N^Q$ and $S^{|D^Q|}$. If the estimator's result must be protected, an alternative is therefore required.
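The closed forms in equations (20) to (22) can be checked numerically with exact rational arithmetic. The following is a minimal sketch and not part of the original derivation; the function names and the sample values chosen for $N^Q$, $S$ and $|D^Q|$ are illustrative assumptions.

```python
from fractions import Fraction


def worst_case_probabilities(n_q: int, s: int, d: int) -> tuple[Fraction, Fraction]:
    """Return (p, p') from the unsimplified forms of equations (20) and (21).

    n_q stands for N^Q, s for S and d for |D^Q| (names chosen for this sketch).
    """
    r_one_row = Fraction(1, s**d)       # R_C when C holds a single row
    r_extra_row = Fraction(2**d, s**d)  # R_C' after one additional row, as in (21)
    p = r_one_row / (n_q - 1 + r_one_row)
    p_prime = r_extra_row / (n_q - 1 + r_extra_row)
    return p, p_prime


def closed_forms(n_q: int, s: int, d: int) -> tuple[Fraction, Fraction, Fraction, Fraction]:
    """Return the simplified forms of (20), (21) and both lines of (22)."""
    a = (n_q - 1) * s**d  # the recurring term (N^Q - 1) x S^{|D^Q|}
    b = 2**d              # the recurring term 2^{|D^Q|}
    p = Fraction(1, a + 1)
    p_prime = Fraction(b, a + b)
    inv_product = Fraction((a + 1) * (a + b), b)            # 1 / (p x p')
    difference = Fraction((b - 1) * a, (a + 1) * (a + b))   # p' - p
    return p, p_prime, inv_product, difference


if __name__ == "__main__":
    p, p_prime = worst_case_probabilities(n_q=5, s=10, d=2)
    cp, cp_prime, inv_product, difference = closed_forms(n_q=5, s=10, d=2)
    # Each comparison checks one simplification step of (20), (21) and (22).
    print(p == cp, p_prime == cp_prime)
    print(1 / (p * p_prime) == inv_product, p_prime - p == difference)
```

Using `Fraction` rather than floating point keeps the comparisons exact, so any mismatch would point to an algebraic slip rather than to rounding.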

B.2. Smooth sensitivity of the estimator $\mathbb{E}$

To compute the smooth upper bound of $LS_{\mathbb{E}}$, we considered four possible scenarios of neighbouring $T$ and $T'$:

  1. $LS_{\mathbb{E}}^{1}=\left|\frac{Q(C)}{p}-\frac{Q(C)}{p'}\right|$ and $p'=\frac{R}{\Delta_{R}+\sum^{R\in\widehat{R}}R}$: another cluster gained a row.

  2. $LS_{\mathbb{E}}^{2}=\left|\frac{Q(C)}{p}-\frac{Q(C)}{p'}\right|$ and $p'=\frac{R}{1/S^{|D^Q|}+\sum^{R\in\widehat{R}}R}$: a new cluster was added to $C^Q$.

  3. $LS_{\mathbb{E}}^{3}=\left|\frac{Q(C)}{p}-\frac{Q(C')}{p'}\right|$ and $p'=\frac{R+\Delta_{R}}{1/S^{|D^Q|}+\sum^{R\in\widehat{R}}R}$, $Q(C')=Q(C)+1$: the cluster gained a row.

  4. $LS_{\mathbb{E}}^{4}=\left|\frac{Q(C)}{p}-\frac{Q(C')}{p}\right|$ and $Q(C')=Q(C)+1$: the cluster gained an individual $\pm 1$ in a measure and not a new row.

Our goal is to find the largest of these distances. We can quickly see that $LS_{\mathbb{E}}^{1} > LS_{\mathbb{E}}^{2}$ since $\Delta_{R} > \frac{1}{S^{|D^Q|}}$. We also have $LS_{\mathbb{E}}^{4} > LS_{\mathbb{E}}^{3}$ because the following inequality always holds:

$$
\begin{aligned}
&\frac{Q(C')/p}{Q(C')/p'} = \frac{p'}{p} = \frac{R+\Delta_{R}}{R} > 1 \\
&\implies LS_{\mathbb{E}}^{4} > LS_{\mathbb{E}}^{3}
\end{aligned}
\tag{24}
$$

Between $LS_{\mathbb{E}}^{1}$ and $LS_{\mathbb{E}}^{4}$, we need to find which is the larger distance and under what conditions:

$$
\begin{aligned}
\frac{Q(C)/p'}{Q(C')/p} &= \frac{Q(C)\times\left(\Delta_{R}+\sum^{R\in\widehat{R}}R\right)}{R}\times\frac{R}{(Q(C)+1)\times\sum^{R\in\widehat{R}}R} \\
\frac{Q(C)/p'}{Q(C')/p} &= \frac{Q(C)}{Q(C)+1}\times\frac{\Delta_{R}+\sum^{R\in\widehat{R}}R}{\sum^{R\in\widehat{R}}R} \\
\frac{Q(C)/p'}{Q(C')/p} &> 1 \implies Q(C) > \frac{\sum^{R\in\widehat{R}}R}{\Delta_{R}}
\end{aligned}
\tag{25}
$$

In conclusion, only $LS_{\mathbb{E}}^{1}$ and $LS_{\mathbb{E}}^{4}$ are needed to compute the smooth sensitivity, and for each cluster exactly one of them dominates the other, depending on whether $Q(C) > \frac{\sum^{R\in\widehat{R}}R}{\Delta_{R}}$. Both distances can be simplified as follows:

$$
LS_{\mathbb{E}} = \left\{
\begin{array}{ll}
\dfrac{Q(C)\times\Delta_{R}}{R} & \text{from } LS_{\mathbb{E}}^{1} \\[2ex]
\dfrac{1}{p} & \text{from } LS_{\mathbb{E}}^{4}
\end{array}
\right.
\tag{26}
$$
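As a hedged illustration of how the dominating distance in equation (26) could be selected for a cluster, the sketch below evaluates both candidates and keeps the larger one. It assumes $p = R / \sum^{R\in\widehat{R}}R$, which is inferred from the form of $p'$ in scenario 1 and is not stated in this appendix; the function and parameter names are hypothetical.

```python
def local_sensitivity_candidates(
    q_c: float, r: float, delta_r: float, r_sum: float
) -> tuple[float, float]:
    """Return the two simplified distances of equation (26).

    q_c     : Q(C), the query result on the cluster
    r       : R of the cluster
    delta_r : Delta_R, the change in R caused by one additional row
    r_sum   : sum of R over all clusters in R-hat
    """
    p = r / r_sum               # ASSUMPTION: p = R / sum(R), not given in the text
    ls1 = q_c * delta_r / r     # first case of (26), from scenario 1
    ls4 = 1.0 / p               # second case of (26), from scenario 4
    return ls1, ls4


def dominating_distance(q_c: float, r: float, delta_r: float, r_sum: float) -> float:
    """Return the larger of the two candidate distances for this cluster."""
    ls1, ls4 = local_sensitivity_candidates(q_c, r, delta_r, r_sum)
    return max(ls1, ls4)


if __name__ == "__main__":
    # Illustrative values only: the threshold r_sum / delta_r is 32 here.
    for q_c in (10.0, 100.0):
        print(q_c, local_sensitivity_candidates(q_c, r=0.5, delta_r=0.125, r_sum=4.0))
```

Under the stated assumption, the first candidate exceeds the second exactly when $Q(C) \times \Delta_{R} > \sum^{R\in\widehat{R}}R$, which is the condition derived in equation (25).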

B.3. Bounding $k$ for the smooth sensitivity

Our distances $LS_{\mathbb{E}}^{k}$ are increasing functions of $k$, whereas $e^{-k\beta}$ decays exponentially, and we are looking for $\max_{k=0,1,\ldots} e^{-k\beta}\times LS_{\mathbb{E}}^{k}$. To find the stopping point for $k$, we need to find where this product starts decaying. In other words, we look for the $k$ such that:

$$
e^{-k\beta}\times LS_{\mathbb{E}}^{k} < e^{-(k-1)\beta}\times LS_{\mathbb{E}}^{k-1} \implies \frac{LS_{\mathbb{E}}^{k-1}}{LS_{\mathbb{E}}^{k}} > e^{-\beta}
\tag{27}
$$

For $LS_{\mathbb{E}}^{k}$ based on scenario 1:

$$
\frac{LS_{\mathbb{E}}^{k-1}}{LS_{\mathbb{E}}^{k}} = \frac{(k-1)\times Q(C)\times\Delta_{R}}{R}\times\frac{R}{k\times Q(C)\times\Delta_{R}} = \frac{k-1}{k}
\tag{28}
$$

For $LS_{\mathbb{E}}^{k}$ based on scenario 4:

$$
\frac{LS_{\mathbb{E}}^{k-1}}{LS_{\mathbb{E}}^{k}} = \frac{k-1}{p}\times\frac{p}{k} = \frac{k-1}{k}
\tag{29}
$$

So for both our distances, the smooth upper bound is reached when:

$$
\frac{k-1}{k} > e^{-\beta} \implies k > \frac{1}{1-e^{-\beta}}
\tag{30}
$$

where $\beta = \frac{\epsilon}{2\times\ln(2/\delta)}$. Given the $(\epsilon,\delta)$ budget set for the estimator, this yields an exact upper bound on $k$, which shows that the process terminates and does not run indefinitely.
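To make the stopping rule concrete, the following sketch computes $\beta$ from an $(\epsilon,\delta)$ budget, derives the bound on $k$ from equation (30), and takes the maximum of $e^{-k\beta}\times LS_{\mathbb{E}}^{k}$ over the resulting finite range. It assumes, following equations (28) and (29), that the distance at step $k$ is $k$ times a base distance (called `base_ls` here); the function names and the sample budget are illustrative and not taken from the paper.

```python
import math


def smoothing_beta(epsilon: float, delta: float) -> float:
    """beta = epsilon / (2 * ln(2 / delta)), as given after equation (30)."""
    if epsilon <= 0 or not 0 < delta < 1:
        raise ValueError("expected epsilon > 0 and 0 < delta < 1")
    return epsilon / (2.0 * math.log(2.0 / delta))


def k_stop(beta: float) -> int:
    """Smallest integer k with k > 1 / (1 - e^{-beta}), from equation (30)."""
    # -expm1(-beta) equals 1 - e^{-beta} and stays accurate for small beta.
    return math.floor(1.0 / -math.expm1(-beta)) + 1


def smooth_upper_bound(base_ls: float, beta: float) -> float:
    """Maximum of e^{-k*beta} * k * base_ls for k = 0, 1, ..., k_stop(beta).

    ASSUMPTION: the distance at step k is k * base_ls, as in (28) and (29).
    """
    last_k = k_stop(beta)
    return max(math.exp(-k * beta) * k * base_ls for k in range(last_k + 1))


if __name__ == "__main__":
    beta = smoothing_beta(epsilon=1.0, delta=1e-5)
    print("beta =", beta)
    print("k stops at", k_stop(beta))
    print("smooth bound for base_ls = 3.0:", smooth_upper_bound(3.0, beta))
```

Because the product only decreases once $k$ exceeds $\frac{1}{1-e^{-\beta}}$, scanning beyond `k_stop(beta)` cannot change the maximum, so the loop is finite for any valid budget.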