Private Approximate Query over Horizontal Data Federation
Abstract.
In many real-world scenarios, multiple data providers need to collaboratively perform analysis of their private data. The challenges of these applications, especially at the big data scale, are time and resource efficiency as well as end-to-end privacy with minimal loss of accuracy. Existing approaches rely primarily on cryptography, which improves privacy, but at the expense of query response time. However, current big data analytics frameworks require fast and accurate responses to large-scale queries, making cryptography-based solutions less suitable. In this work, we address the problem of combining Approximate Query Processing (AQP) and Differential Privacy (DP) in a private federated environment answering range queries on horizontally partitioned multidimensional data. We propose a new approach that uses a data distribution-aware online sampling technique to accelerate the execution of range queries and ensure end-to-end data privacy during and after analysis with minimal loss in accuracy. Through empirical evaluation, we show that our solution can provide up to times faster processing than the basic non-secure solution while maintaining accuracy, formal privacy guarantees and resilience to learning-based attacks.
1. Introduction
The extensive reliance of individuals on software solutions in daily and professional life has led to an exponential growth of data collected by companies, corporations, government organisations, and even hospitals. These vast mines of data, if carefully and efficiently analysed, can provide valuable insights that guide decision-making and business development. In large-scale studies and research, the analysis must be conducted on several data sources to obtain meaningful conclusions. An example of such a case is during a pandemic, where many hospitals jointly conduct studies to have a global view of the problem.
One of the most commonly used tools to analyse and explore these huge volumes of data is OLAP tasks, where various aggregation queries (SUM, COUNT, etc.) can be issued to learn existing patterns and trends within the data. These aggregation queries may seem simple, but they are very time-consuming in big databases. The analysis of data from multiple data providers comes with two main challenges: privacy and resource/time efficiency.
The privacy issue arises from the fact that this data is personal and sensitive to individuals, and sharing it with other parties can be very harmful. Many regulations and restrictions like GDPR are imposed by governments on how to process and share such sensitive data.
In the case of a federated environment, where a joint study requires the collaboration of many data providers, data sharing is highly restricted. Each data provider must ensure the security and privacy of the data collected from their users during and after the analysis.
To satisfy the requirement of end-to-end privacy, many solutions have been proposed in the literature, and most of them rely on cryptography to ensure there is no data leakage during the exchange and query evaluation. Secure multiparty computation (SMC) solutions (Bater et al., 2018, 2020) appear to be prominent in federated environments. Others use oblivious operations (Bater et al., 2017) or secure hardware (Zheng et al., 2017; Eskandarian and Zaharia, 2017; Qiu et al., 2023) so that during query evaluation, each data provider can maintain the confidentiality of their data. Additionally, for securing the end result of any OLAP query, Differential Privacy (DP) (Dwork et al., 2014) is generally considered the gold standard by government and private institutions (Team et al., 2017; Abowd, 2018; Bittau et al., 2017; Erlingsson et al., 2014). Due to its strong formal confidentiality guarantees, DP allows individuals to deny their participation in the database. These query evaluation solutions in a federated environment meet end-to-end security and privacy requirements. However, what they have in common is their reliance on encryption, which causes a huge processing time overhead; for time-sensitive tasks, utility is measured by both accuracy and speed. They certainly address the privacy issue, but they are time and resource consuming.
The issue of reducing query response time has been widely addressed in the literature through Approximate Query Processing (AQP). Existing AQP methods can be classified into two types: online approximation and offline synopsis creation. In online approximation, there are Online Aggregation based solutions (Hellerstein et al., 1997; Li et al., 2016; Qin and Rusu, 2014) that continuously provide a fast and reliable approximation of the query, and other solutions that apply online sampling to reduce the processed data and obtain an approximation from a sample (Zhang et al., 2016; Goiri et al., 2015; Song et al., 2018). In offline synopsis creation, views are generated offline using query workloads and/or data statistics (Acharya et al., 1999; Agarwal et al., 2013; Chaudhuri et al., 2007).
In this area of research, the main focus is on efficiency, but privacy has not been considered.
In our work, we address the challenge of answering OLAP aggregation range queries in a federated environment, while preserving end-to-end privacy and improving resource and time consumption for query processing.
Our solution relies heavily on differential privacy to secure collaboration and end results, and ensure no information leaks.
To speed up queries, we implement a cluster-based sampling method using a well-known statistical estimator that provides accurate estimates for range queries (such as SUM and COUNT) while processing minimal data portions.
While existing systems ensure either privacy or speedup for query approximation, to the best of our knowledge, our solution is the first to offer speedup over plain-text execution with end-to-end privacy in a federated environment. Our main contributions can be listed as follows:
(1) Definition of a lightweight collaboration method that determines optimal sampling decisions for data providers to maximize accuracy without needing access to their full datasets or information leakage.
(2) Introduction of data distribution-aware cluster sampling method with DP guarantees for individual privacy.
(3) Meticulous integration of DP at every step with minimal loss of precision.
(4) Extensive experimentation to empirically validate the performance of our approach in terms of accuracy and time efficiency.
(5) Extensive experimentation to ensure the resilience of our system against learning-based attacks.
Roadmap. The paper is structured as follows: Section 2 reviews existing works. Section 3 introduces the notions used throughout our paper. Section 4 gives a detailed description of the problem solved by our approach. Section 5 presents our proposed solution in detail. The extensive evaluation of our approach is given in Section 6. In Section 7, we discuss the limitations/extensions of our solution, and we conclude in Section 8 by outlining future work.
2. Related Works
Due to the increasing size and distribution of databases, querying and exploring such vast volumes for analytical purposes, quickly and without revealing sensitive information, has become a challenge. Here, we describe the state-of-the-art related to our work.
Approximate Query Processing (AQP). As the quality of a query is based on its accuracy and response time, especially for time-sensitive tasks like OLAP (Wang and Jajodia, 2008) and Business Intelligence (BI), approximating the query offers the best way to strike a balance between these two quality factors.
In the early s, (Hellerstein et al., 1997) proposed a new interactive method for query processing that provides a quick initial answer with a certain error, refining it as processing continues. Other works followed in this direction (Xu et al., 2008; Li et al., 2016; Wu et al., 2010; Qin and Rusu, 2014), each enhancing specific aspects of the method by adding support for GROUP BY or by proposing parallel and distributed versions. Another research direction focuses on processing a small subset of the original data, thereby reducing query run-time. In (Olken and Rotem, 1986, 1995; Piatetsky-Shapiro and Connell, 1984; Song et al., 2018), uniform row-level random sampling is applied online before query processing. Although row-level sampling may improve processing time for complex queries, it can introduce overhead and slow down queries that require a full table scan (Haas and König, 2004) (e.g. Bernoulli sampling). To avoid such overhead, the solutions from (Acharya et al., 1999; Agarwal et al., 2013; Chaudhuri et al., 2007) create the samples offline. Cluster sampling, also referred to as page-sampling (Haas and König, 2004), is utilized to speed up aggregation queries in big databases. Methods in (Goiri et al., 2015; Zhang et al., 2016; Ahmadvand et al., 2019) use this sampling in the context of the Hadoop Map-Reduce framework (https://hadoop.apache.org), as it proves to be fast and I/O efficient compared to row-level sampling.
Federated query answering. Data is often distributed across multiple locations (e.g. data providers like hospitals and companies) and the collaboration among all parties is necessary to answer range aggregation queries. But for privacy and security reasons, each data provider cannot disclose their data to third parties.
Some solutions rely on secure hardware modules (i.e. enclaves), in which all sensitive code and data are processed. Methods in (Agrawal et al., 2006; Zheng et al., 2017; Eskandarian and Zaharia, 2017) focus on aggregation queries in this setting, and (Zheng et al., 2017; Eskandarian and Zaharia, 2017) use Intel’s SGX for secure processing. These solutions are generally efficient, but their reliance on trusted hardware and their vulnerability to side-channel attacks constitute a limitation. Recently, in (Qiu et al., 2023), the notion of Differential Obliviousness was used to mitigate the risk of side-channel attacks.
Other recent works presented Secure Multiparty Computation (SMC) query processing engines (Bater et al., 2017, 2018, 2020). These engines enable data providers to respond to OLAP queries securely by joining data with end-to-end privacy. Differential Privacy (DP) is used to perturb the final results, thereby mitigating any inference attacks based on the results. While these solutions incur computational overhead, (Bater et al., 2020) introduced online random sampling to improve secure computing performance by reducing the size of shared data for query processing. In (Cao et al., 2021), sampling is performed offline to create a synopsis to further improve performance. Another solution (Liagouris et al., 2021) focused on reducing the cost of SMC operations, thus obtaining a significant improvement in performance. All of these SMC (or enclave)-based protocols are encryption-based, which prevents them from outperforming plain-text query execution. Even with the significant improvements introduced in the past years, they are still expensive for real-time queries on real-world big tables (Liagouris et al., 2021).
To highlight the scale of this problem, we performed a simulation (https://github.com/AlaEddineLaouir/Federated-Range-Queries.git) using a synthetic Adult dataset (Becker and Kohavi, 1996) horizontally distributed over 4 data providers as a federated environment. We ran a set of random range queries, which are the type of queries we focus on. For the query processing, we considered two solutions using SMC: (i) data providers sharing the rows and collectively evaluating the query; and (ii) evaluating the query locally, sharing only the local results, and computing the final result.
We measured the time required to share the rows/results in SMC. The results in Figure 1 show that sharing only local results incurs an insignificant overhead of seconds. On average, this is less than times the time required for row sharing in SMC. Additionally, the cost of sharing only results remains constant and independent of the dataset, whereas the cost of sharing rows will increase with larger tables.
In our work, we propose a framework to approximate query processing in a federated environment, enabling accelerated query execution compared to plain text execution while ensuring end-to-end Differential Privacy guarantees.
3. Preliminaries
In this section, we give the notation and briefly explain the notions used throughout the paper.
Data model. In a tabular database defined over a set of dimensions (or attributes) , each individual is a row with values on each dimension. We assume that each dimension is associated with a domain containing discrete and totally ordered values, the size of the domain is . For performance purposes during online analytics tasks, the table is transformed into a multidimensional data (or a count tensor) of dimensions , which has an attribute Measure storing the number of aggregated rows of . Figure 2 illustrates how to construct a count tensor from table by aggregating dimension Service. For simplicity, we use term “table” for “tabular data” and “count tensor”.
Queries. To analyze and extract insights from these tables, the analyst can issue aggregation queries, helping to explore the data and gain a general understanding of patterns and trends. In this work, we consider a range query defined as:
SELECT Aggregation FROM Table WHERE Range, where:
• Aggregation is COUNT(*) or SUM(Measure).
• Range is a set of intervals on each dimension in Table
, such that for every value .
In our work, we focus on COUNT and SUM queries because they are used in several analytics applications. For instance, in a big database aggregating per-stock order data for the NASDAQ exchange, these queries are typically used to analyze order data from past days. Additionally, aggregations, such as average, standard deviation, and variance, can be derived from COUNT and SUM.
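To make the query model concrete, the following minimal sketch evaluates a range query on a small dense count tensor. It assumes a NumPy array in which each cell stores the Measure, inclusive index ranges per dimension, and that COUNT(*) counts the non-empty cells of the tensor; the function name `range_query` and the toy data are illustrative and not part of the original system.

```python
import numpy as np

def range_query(tensor, ranges, aggregation="SUM"):
    """Evaluate a COUNT or SUM range query on a dense count tensor.

    tensor: ndarray whose cells hold the Measure (rows aggregated per cell).
    ranges: one inclusive (low, high) index pair per dimension.
    """
    # One slice per dimension; +1 because the query intervals are inclusive.
    region = tensor[tuple(slice(lo, hi + 1) for lo, hi in ranges)]
    if aggregation == "SUM":
        return int(region.sum())               # SUM(Measure)
    if aggregation == "COUNT":
        # Assumption: only non-empty cells are stored as rows of the tensor.
        return int(np.count_nonzero(region))   # COUNT(*)
    raise ValueError(f"unsupported aggregation: {aggregation}")

# Toy 2-D tensor with made-up values.
tensor = np.array([[2, 0, 1],
                   [0, 3, 0],
                   [4, 1, 0]])
query = [(0, 1), (0, 2)]  # first two indices of dim 0, all of dim 1
print(range_query(tensor, query, "SUM"))    # 6
print(range_query(tensor, query, "COUNT"))  # 3
```

An average over the same range can then be derived as SUM divided by COUNT, in line with the remark above.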
Query Approximation and Sampling. The goal of query approximation is generally to speed up execution at the expense of answering the query exactly, while preserving answer accuracy as much as possible. Online sampling is employed for time-sensitive tasks to reduce the overhead of evaluating queries on large databases. Note that in this case, the sampling differs from one query to another. In statistical terms, random sampling is essentially the process of selecting a subpopulation from the total population where a sampling rate dictates the size of . This subpopulation contains sufficiently representative individuals and properties, capturing various characteristics of such that the analysis conducted on can be generalized to . All random sampling techniques can be categorized based on three main features:
• Granularity: sampling elements are individuals or a bulk/cluster of individuals.
• Uniformity: elements are sampled with equal/unequal probabilities.
• Replacement: sampling elements can be chosen multiple times or only once.
Nowadays, modern systems typically split/store a big table into a set of smaller, manageable entities where each entity has a maximum size . The entity could be Table pages (https://www.postgresql.org/docs/current/storage-page-layout.html), HDFS file Blocks (https://hadoop.apache.org/docs/r1.2.1/hdfs_design.html), etc.
In this paper, we call these storage entities Clusters and we assume that our tables are already stored as a set of clusters. Given this storage format, sampling on databases can be done at two levels: Row/Cluster level (Haas and König, 2004).
In tabular databases with range queries, it is particularly challenging to find an online sampling algorithm that offers speed-up while maintaining accuracy.
Data providers. For many real-world use cases, multiple organizations or institutions, called data providers, publish access to their databases for joint analysis. Let be the set of data providers. In this work, we assume that a large table is horizontally distributed over such that all data providers share the same schema (i.e. a set of dimensions) of but each contains different rows. All data providers use clusters of the same size to store their local tables. More importantly, for privacy reasons, data providers collaborate on joint analyses without revealing their data.
Differential Privacy (DP). A privacy model that provides formal guarantees of indistinguishability such that the query results do not yield much information about the presence or absence of any particular individual. Consequently, it hides information about which of the neighbouring tables (Dwork et al., 2014) was used to answer the query.
Definition 3.1 (Neighbouring Tables (Dwork et al., 2014)).
Two tables $T$ and $T'$ are neighbouring if we can obtain one of them by inserting at most a row into the other.
We use $d(T, T')$ to represent the distance between two tables $T$ and $T'$ and we say that two tables are neighbouring if their distance is $1$ or less.
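Definition 3.2 ($(\epsilon, \delta)$-Differential Privacy (Dwork et al., 2014)).
A mechanism $\mathcal{M}$ satisfies $(\epsilon, \delta)$-Differential Privacy (or $(\epsilon, \delta)$-DP) if, for any two neighboring tables $T$, $T'$ and for any possible output $O$ of $\mathcal{M}$:
$$\Pr[\mathcal{M}(T) \in O] \le e^{\epsilon} \, \Pr[\mathcal{M}(T') \in O] + \delta$$
where $\delta$ represents the failure probability. We refer to $\epsilon$ as the privacy budget.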
In practice, $\mathcal{M}$ is a randomized algorithm, which has many possible outputs under the same input. It is well known that DP is used to answer specific queries on databases. Let $f$ be a query on a table $T$ whose answer is a number. The global sensitivity of $f$ is the maximum amount by which the output of $f$ changes over all pairs of neighboring tables.
Definition 3.3 (Global Sensitivity (Dwork et al., 2014)).
For any two neighboring tables $T$ and $T'$, the global sensitivity of function $f$ is:
$$\Delta f = \max_{T, T'} \lVert f(T) - f(T') \rVert_1$$
where $\lVert \cdot \rVert_1$ is the $\ell_1$ norm.
For instance, if $f$ is a COUNT range query then $\Delta f$ is $1$.
The Laplace Mechanism is a randomized mechanism for enforcing $\epsilon$-DP (or $(\epsilon, 0)$-DP, referred to as pure DP), which adds calibrated noise to the output of a function $f$ based on its global sensitivity $\Delta f$.
Definition 3.4 (Laplace Mechanism (Dwork et al., 2014)).
The Laplace Mechanism adds noise to $f(T)$ as:
$$\mathcal{M}(T) = f(T) + \mathrm{Lap}\left(\frac{\Delta f}{\epsilon}\right)$$
where $\Delta f$ is the global sensitivity of $f$, and $\mathrm{Lap}(b)$ denotes sampling from the Laplace distribution with center $0$ and scale $b$.
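As an illustration of the Laplace Mechanism, here is a minimal sketch using NumPy's random generator. The function name `laplace_mechanism`, the seed and the example values (a COUNT answer of 1000, sensitivity 1, privacy budget 0.5) are assumptions chosen for the example only.

```python
import numpy as np

def laplace_mechanism(true_answer, sensitivity, epsilon, rng):
    """Return true_answer plus Laplace noise with scale sensitivity / epsilon."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    if sensitivity < 0:
        raise ValueError("sensitivity must be non-negative")
    scale = sensitivity / epsilon
    # Laplace distribution centred at 0 with the calibrated scale.
    return true_answer + rng.laplace(loc=0.0, scale=scale)

rng = np.random.default_rng(seed=7)  # fixed seed only to make the demo repeatable
noisy = [laplace_mechanism(1000, sensitivity=1.0, epsilon=0.5, rng=rng)
         for _ in range(20000)]
# The noise is zero-mean, so the average of many releases stays near 1000.
print(float(np.mean(noisy)))
```

A smaller privacy budget yields a larger scale and therefore noisier answers; in a real deployment each release would be charged against the analyst's budget rather than repeated freely as in this demo.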
Unlike the Laplace Mechanism, which is used to release noisy numerical values, the Exponential Mechanism can be used for biased selection of elements from a set based on a scoring function while preserving $\epsilon$-DP (Dwork et al., 2014).
Definition 3.5 (Exponential Mechanism (Dwork et al., 2014)).
Given a set of elements $R$ and a scoring function $u$, the Exponential Mechanism randomly selects $r \in R$ with the probability of the element being proportional to:
$$\exp\left(\frac{\epsilon \, u(T, r)}{2 \Delta u}\right)$$
where $\Delta u$ is the sensitivity of $u$.
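The selection rule of the Exponential Mechanism can be sketched as follows. The helper name `exponential_mechanism`, the candidate labels and the scores are hypothetical; the scores stand for the values of the scoring function on the private table.

```python
import numpy as np

def exponential_mechanism(candidates, scores, sensitivity, epsilon, rng):
    """Pick one candidate with probability proportional to
    exp(epsilon * score / (2 * sensitivity))."""
    scores = np.asarray(scores, dtype=float)
    logits = epsilon * scores / (2.0 * sensitivity)
    logits -= logits.max()            # shift for numerical stability only
    probs = np.exp(logits)
    probs /= probs.sum()              # normalise into a distribution
    idx = rng.choice(len(candidates), p=probs)
    return candidates[idx], probs

rng = np.random.default_rng(seed=3)
chosen, probs = exponential_mechanism(
    candidates=["c1", "c2", "c3"],    # e.g. cluster identifiers
    scores=[0.1, 0.5, 0.9],           # higher score = more desirable
    sensitivity=1.0,
    epsilon=2.0,
    rng=rng,
)
print(chosen, probs.round(3))
```

Subtracting the maximum logit does not change the resulting probabilities; it only avoids overflow when scores or the privacy budget are large.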
Local and Smooth Sensitivity. In many applications of DP, the global sensitivity cannot be bounded. In this case, there is an alternative definition of sensitivity called local sensitivity, where the maximum difference between the query’s results is based on a fixed database and any database neighbouring to it:
Definition 3.6 (Local Sensitivity (Nissim et al., 2007)).
Given a database $T$ and $T'$ as any of its possible neighbouring tables, the local sensitivity of function $f$ is:
$$LS_f(T) = \max_{T' : d(T, T') \le 1} \lVert f(T) - f(T') \rVert_1$$
where $\lVert \cdot \rVert_1$ is the $\ell_1$ norm.
The local sensitivity is often much less than the global sensitivity because it is based on a specific instance of the data $T$. This also makes it unsafe to use, as it can leak information about the table $T$ on which it is based. Nissim et al. (2007) suggest the use of a smoothing function that finds a safe upper bound for $LS_f(T)$ and can be used to calibrate the randomness (noise) without any risk. These functions usually require that the local sensitivity be computed at any arbitrary distance $k$ from $T$.
Definition 3.7 (Local Sensitivity at Distance $k$ (Nissim et al., 2007)).
Given a table $T$, the local sensitivity of function $f$ at distance $k$ is:
$$LS_f^{(k)}(T) = \max_{T' : d(T, T') \le k} \; \max_{T'' : d(T', T'') \le 1} \lVert f(T') - f(T'') \rVert_1$$
where $\lVert \cdot \rVert_1$ is the $\ell_1$ norm.
A safe approximate upper bound of $LS_f(T)$, $S_f(T)$, which is insensitive to small variations of data can be obtained by the smooth sensitivity framework (Nissim et al., 2007).
Definition 3.8 (Smooth Sensitivity Framework (Nissim et al., 2007)).
$$S_f(T) = \max_{k \ge 0} \; e^{-\beta k} \, LS_f^{(k)}(T)$$
where $\beta = \frac{\epsilon}{2 \ln(2 / \delta)}$.
After a number of iterations, this upper bound can be used to calibrate noise for the Laplace mechanism to ensure $(\epsilon, \delta)$-DP.
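The iterative computation of the smooth upper bound can be sketched as below, following the smooth sensitivity framework of Nissim et al. (2007), in which the bound is the maximum over distances k of exp(-beta * k) times the local sensitivity at distance k. The function name and the list of local sensitivities are made up for illustration; how the local sensitivity at each distance is obtained depends on the query and is not shown.

```python
import math

def smooth_upper_bound(local_sens_at_distance, beta):
    """Max over k of exp(-beta * k) * LS^(k), where the list index is k."""
    return max(math.exp(-beta * k) * ls
               for k, ls in enumerate(local_sens_at_distance))

# Hypothetical local sensitivities at distances k = 0, 1, 2.
ls_values = [1.0, 2.0, 10.0]
print(smooth_upper_bound(ls_values, beta=1.0))   # dominated by k = 2
print(smooth_upper_bound(ls_values, beta=5.0))   # dominated by k = 0
```

A larger beta discounts distant tables more aggressively, so the bound moves closer to the local sensitivity of the actual table.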
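DP Properties. Combining several DP mechanisms is possible, and the privacy accounting is managed using the sequential and the parallel composition properties of DP. Let $\mathcal{M}_1, \dots, \mathcal{M}_m$ be mechanisms satisfying $\epsilon_1, \dots, \epsilon_m$-DP, respectively.
Theorem 3.9 (Sequential Composition (Dwork et al., 2014)).
Applying $\mathcal{M}_1, \dots, \mathcal{M}_m$ sequentially satisfies $\left(\sum_{i=1}^{m} \epsilon_i\right)$-DP.
Theorem 3.10 (Parallel Composition (Dwork et al., 2014)).
A mechanism that applies $\mathcal{M}_1, \dots, \mathcal{M}_m$ on disjoint parts of the data satisfies $\max(\epsilon_1, \dots, \epsilon_m)$-DP.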
The post-processing property states that it is safe to execute any function on the output of a DP mechanism.
Theorem 3.11 (Post-Processing (Dwork et al., 2014)).
For any $\epsilon$-DP mechanism $\mathcal{M}$ and any function $g$, $g(\mathcal{M}(T))$ satisfies $\epsilon$-DP.
In the context of online query answering, each query consumes to secure the results. In order to manage/limit the information released to the analyst, a total budget is given which will be consumed by queries such that and . The analyst can continue sending queries until their total budget is consumed.
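The budget accounting described above can be sketched with a small tracker that applies sequential composition for successive queries and parallel composition for mechanisms run on disjoint data. The class `PrivacyBudget` and its method names are hypothetical and only illustrate the bookkeeping.

```python
class PrivacyBudget:
    """Tracks how much of an analyst's total privacy budget is consumed."""

    def __init__(self, total_epsilon):
        self.total = float(total_epsilon)
        self.spent = 0.0

    def charge(self, epsilon):
        """Sequential composition: budgets of successive queries add up."""
        if epsilon <= 0:
            raise ValueError("each query must consume a positive budget")
        if self.spent + epsilon > self.total + 1e-12:
            raise RuntimeError("privacy budget exhausted")
        self.spent += epsilon
        return self.total - self.spent   # remaining budget

    def charge_parallel(self, epsilons):
        """Parallel composition: mechanisms on disjoint data cost the maximum."""
        return self.charge(max(epsilons))

budget = PrivacyBudget(total_epsilon=1.0)
budget.charge(0.4)                      # first query
budget.charge_parallel([0.1, 0.3])      # mechanisms on disjoint partitions
print(round(budget.total - budget.spent, 6))  # 0.3
```

Once the remaining budget is too small for a requested query, `charge` raises an error, which corresponds to refusing further queries from that analyst.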
Secure Multiparty Computation (SMC). SMC refers to cryptographic protocols that enable a set of independent parties to collaboratively evaluate a query without revealing their private inputs to each other. It also allows them to avoid trusting a third party with the union of their data for query evaluation. However, this safety assurance comes at the cost of resources and processing time. Using SMC is several times slower than insecure alternatives.
4. Problem Statement
Given a federated system in which data providers pool their private data for analysis querying. Consider a private table (as in Figure 2) which is horizontally partitioned among data providers as tables , , . Each data provider wants to keep the individual tuples of their local table confidential and only the schema of is public. Suppose an end user sends the following range query :
SELECT COUNT(*) FROM Table WHERE 20 <= Age <= 40
where is performed on the union of tables stored at the data providers, . However, even though may seem very simple at first glance, the big data associated with makes very complex and time-consuming.
To solve the problem of slow query response time, we can resort to Approximate Query Processing (AQP) to find a trade-off between accuracy and speed of results via approximation. One very straightforward technique of AQP is to perform random sampling, given a sampling rate , to obtain a set of tuples from . For example, an end user can request an answer for based only on of the entire . Even for a single table , to obtain a good approximation of , the sampled tuples must contain meaningful data in the ranges of . Random sampling can be done at the row or cluster level. Although cluster-level sampling is faster than row-level sampling, both have linear performance with respect to sampling rate. The larger the sample, the more accurate and slower the result, and vice versa.
Consider is stored as a set of clusters. To get an accurate estimate of when processing a few parts of the data, we use a statistical estimator (Lohr, 2009). To do this, we need to consider the distribution of rows between all clusters. It should be noted that the assumption of a uniform distribution of rows among all clusters is rarely valid in real databases. Indeed, the rows generally follow a skewed distribution. In contrast, unequal probability cluster sampling is more effective at providing better estimates, where the probability of a cluster being sampled is based on the data distribution for .
Assume that each partition of is stored using clusters. How to apply the unequal probability cluster sampling in our federated context? Note that each cluster within each data provider should have a specific probability of being sampled to estimate , taking into account all other clusters (even those from other data providers). As a result, capturing the inter/intra data distribution will bias the sampling toward clusters or data providers that hold most of the data related to . We refer to this sampling as global sampling.
The other solution is local sampling, where each data provider computes the sampling probabilities for its clusters (without considering other data providers). In this sampling, the sample size is distributed uniformly on data providers, so it does not require a collaboration between data providers. This lack of global data distribution awareness makes this solution less appealing than global sampling.
To apply global data distribution-aware sampling and approximation, data providers must provide appropriate information about their data to quickly and accurately estimate . The optimal solution to capture the data distribution in this context is achieved if data providers have access to each other’s data and sampling probabilities are computed collectively. This collaboration will lead to an overhead in processing time. The challenge is then to define the summarized and small pieces of information that data providers can share and be sufficient to capture the data distribution while producing negligible overhead. Once this global data distribution is captured, each data provider can locally sample clusters, estimate the query, and send its result. All results from data providers will be added together and the final result will be returned to the end user.
Another dimension of our problem concerns privacy and data protection. In the federated context, the end-to-end privacy property must be guaranteed. This essentially ensures that data is protected (i) during and after query execution, (ii) for intermediate results during collaboration, and (iii) for the final response. Differential Privacy (DP) is a widely accepted privacy model, typically applied to query results to prevent any inference about the presence or absence of individuals. As for the intermediate results produced during collaboration between data providers, they must also be protected, with each data provider seeking to prevent any leakage of information on its table. Even if the exchange is limited to summarized (aggregated) information, there will be no privacy guarantee. Thus, DP can also be used to publish intermediate results between data providers.
An alternative solution to DP is the use of Secure Multiparty Computation (SMC) to implement collaboration between data providers. This solution has two major drawbacks. First, if data providers use the summary information for sampling in SMC, query approximation (which includes running the query on each cluster) must also be done in SMC, because the sampling is based on sensitive information and its results may disclose information to other data providers. Second, SMC relies heavily on cryptography, which significantly reduces the utility of the query in terms of processing time, thereby defeating the purpose of approximation.
In this work, we aim to provide fast and accurate responses to range queries in a federated setup while preserving end-to-end privacy. The challenges we address are: defining a lightweight sampling algorithm considering data distribution for query approximation in a federated environment and carefully applying Differential Privacy to ensure end-to-end privacy with minimal loss of query accuracy.
5. Our solution
5.1. Overview
In our proposal, we combine DP with lightweight SMC to protect intermediate results when collaborating between data providers. This allows us to obtain significantly better performance in terms of speed-up and achieve end-to-end privacy, while maintaining high utility answers for online range queries. To achieve these goals, we propose an efficient and lightweight collaboration method, allowing data providers to decide how many samples to extract from each, guided by the summary information shared during this collaboration. To integrate knowledge of the data distribution into our sampling and approximation steps, we use the probability proportional to size (pps) method (Lohr, 2009). Here, the probability of including (or sampling) a cluster is determined by the proportion of rows in falling within the ranges of the query . Computing is expensive and requires similar overhead as running the query. To minimize the processing time of , we will approximate each of any cluster using lightweight metadata associated with .
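To illustrate probability proportional to size sampling, the sketch below uses the Hansen-Hurwitz estimator, the standard estimator for unequal probability sampling with replacement. Choosing this particular estimator is an assumption made for the example, and the DP steps are omitted. The function name and the cluster totals are made up; in practice only the sampled clusters would actually be evaluated.

```python
import numpy as np

def pps_estimate(cluster_totals, probs, sample_size, rng):
    """Hansen-Hurwitz estimate of the population total.

    cluster_totals: exact query answer on each cluster (known here for the demo).
    probs: per-cluster sampling probabilities, summing to 1.
    """
    cluster_totals = np.asarray(cluster_totals, dtype=float)
    probs = np.asarray(probs, dtype=float)
    # Draw clusters with replacement according to their probabilities.
    drawn = rng.choice(len(cluster_totals), size=sample_size, replace=True, p=probs)
    # Each draw gives an unbiased estimate total_c / prob_c; average them.
    return float(np.mean(cluster_totals[drawn] / probs[drawn]))

rng = np.random.default_rng(seed=1)
totals = [10, 30, 60]                  # true answer over all clusters is 100
ideal_probs = [0.1, 0.3, 0.6]          # exactly proportional to the totals
rough_probs = [0.2, 0.3, 0.5]          # only approximately proportional
print(pps_estimate(totals, ideal_probs, sample_size=5, rng=rng))
print(pps_estimate(totals, rough_probs, sample_size=5, rng=rng))
```

When the probabilities are exactly proportional to the per-cluster answers, every draw yields the true total, so the estimate has no variance. With approximate proportions the estimate stays unbiased but varies from sample to sample, which is why a good approximation of the proportions matters.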
Our solution has two main phases: offline data preprocessing and online query answering. In the offline data preprocessing phase, each data provider constructs global and individual metadata for its clusters. This metadata makes query approximation easier without imposing a significant overhead in terms of processing time. All data providers agree on the same maximum cluster size (more details are given in Section 7) before initiating the system. The size may not reflect the actual size of their clusters, but it would be used to calculate the of each cluster. The offline phase and metadata creation are detailed in Section 5.2, and Figure 3 (b) shows the general architecture of our system with each data provider as well as its metadata.
Once preprocessing is complete for all data providers, the system goes online. In the online query response phase, the end user interacts with an aggregator by sending their query and desired sampling rate and receives a secure response in return. The aggregator manages the rest of the exchanges with the data providers. The query lifecycle (see Figure 3 (a)) as well as the collaboration (exchange of summary data) are described as follows:
(1) First, the aggregator sends the query to the data providers. Each data provider performs two tasks: i) identify the set of clusters covering such that , ii) compute the proportion of rows for each . The data provider uses previously stored metadata to avoid overhead when performing these two tasks.
(2) Each data provider securely (using DP) sends to the aggregator the summarized data needed for collaboration. The number of clusters and average of proportions where .
(3) The aggregator computes and sends the best allocation (sample size ) for each data provider while respecting the total sample size given by .
(4) Each data provider tests the condition in order to compute “regularly” without approximation. The is a threshold set by each data provider to trigger the approximation only if the query is significantly large (more details about are given in Section 5.2).
(5) If the previous condition does not hold, each data provider randomly and securely with DP samples , where .
(6) After sampling, each data provider estimates over locally and then securely sends the result to the aggregator with DP guarantees.
(7) Alternatively, data providers may use SMC to share their local estimations and “sensitivities”. Then, the aggregator obliviously sums the estimations and applies DP using the maximum sensitivity before safely releasing the final result.
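The allocation step of the lifecycle above can be pictured with the following sketch. It is only one plausible rule, stated here as an assumption: each provider receives a share of the total sample proportional to its number of covering clusters multiplied by its average proportion, with largest-remainder rounding. The actual allocation is the one described in Section 5.3.1. The inputs stand for the (already DP-protected) summaries, and the function name is hypothetical.

```python
def allocate_samples(cluster_counts, avg_proportions, sampling_rate):
    """Split the total sample size among providers.

    Assumed rule: share proportional to count_j * avg_proportion_j.
    """
    total_sample = round(sampling_rate * sum(cluster_counts))
    weights = [n * p for n, p in zip(cluster_counts, avg_proportions)]
    weight_sum = sum(weights)
    if weight_sum == 0 or total_sample == 0:
        return [0] * len(cluster_counts)
    ideal = [total_sample * w / weight_sum for w in weights]
    # Start from the integer parts, never exceeding a provider's cluster count.
    alloc = [min(int(x), n) for x, n in zip(ideal, cluster_counts)]
    leftovers = total_sample - sum(alloc)
    # Hand out the remaining units by decreasing fractional part.
    order = sorted(range(len(ideal)), key=lambda j: ideal[j] - int(ideal[j]),
                   reverse=True)
    while leftovers > 0:
        progressed = False
        for j in order:
            if leftovers > 0 and alloc[j] < cluster_counts[j]:
                alloc[j] += 1
                leftovers -= 1
                progressed = True
        if not progressed:   # every provider is at capacity
            break
    return alloc

# Three providers with 40, 10 and 50 covering clusters (made-up numbers).
print(allocate_samples([40, 10, 50], [0.5, 0.1, 0.2], sampling_rate=0.2))
# [13, 1, 6]
```

Under this rule, a provider holding many clusters with a high share of matching rows receives most of the sample, which mirrors the intent of biasing the sampling toward providers that hold most of the data related to the query.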
In Section 5.2, we will focus on the approximation via cluster sampling and the metadata created offline. Afterward, Section 5.3 will be dedicated to the second phase of our solution. In Section 5.3.1, we will describe the allocation step and how it preserves the same semantics as the naive (sharing all data) method of collaboration by keeping the sampling data distribution aware without an overhead. In Section 5.3.2, we will present the privacy-preserving sampling used by each data provider locally to create . In Section 5.3.3, we detail how to obtain a calibrated DP noise for the end result obtained by using a statistical estimator. Finally, in Section 5.4, we explain how the privacy budget for each query is managed and consumed.
5.2. Query Approximation and sampling
As previously mentioned in Section 5.1, our unequal probability sampling is based on the proportion of rows in cluster that corresponds to . Computing the exact for each cluster is as costly as evaluating the query itself, rendering the approximation useless. Inspired by (Zhang et al., 2016), we will only approximate to avoid an overhead in response time. Given a query defined by a set of ranges: on each dimension, we assume that the dimensions are not correlated (independent). We will compute the sub-proportions on each dimension as follows:
The proportion is computed based on the proportions and of records whose dimension values are and , respectively. Based on the assumption of independence between dimensions, can be obtained as follows:
(1) |
where is the number of clusters covering . The approximated can then be used to obtain the sampling probabilities for the cluster as shown in Equation 1. Even this approximation requires many computations, which may cause an overhead similar to that of computing the exact . To bypass this limitation, we associate each cluster with a set of metadata that accelerates these computations for any given query (see Algorithm 1).
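The independence assumption can be illustrated with a minimal sketch. It assumes the per-dimension sub-proportions of each cluster are already available, and it normalises the approximated proportions by their sum to obtain sampling probabilities; this normalisation and all names (`approx_proportion`, `sampling_probabilities`) are assumptions made for illustration and may differ from the exact form of Equation 1.

```python
from math import prod


def approx_proportion(sub_proportions):
    """Approximate a cluster's matching proportion as the product of its
    per-dimension sub-proportions (dimensions assumed independent)."""
    return prod(sub_proportions)


def sampling_probabilities(cluster_sub_proportions):
    """Turn approximated proportions into sampling probabilities.

    Normalising by the sum is an assumption of this sketch.
    """
    approx = {c: approx_proportion(p) for c, p in cluster_sub_proportions.items()}
    total = sum(approx.values())
    if total == 0:
        # No cluster appears to match the query: fall back to uniform sampling.
        return {c: 1.0 / len(approx) for c in approx}
    return {c: a / total for c, a in approx.items()}


# Hypothetical sub-proportions for a two-dimensional query over three clusters.
probs = sampling_probabilities(
    {"c1": [0.5, 0.4], "c2": [0.2, 0.5], "c3": [1.0, 0.1]}
)
```

In this toy input, cluster `c1` holds half of the approximated matching mass and is therefore twice as likely to be drawn as `c2` or `c3`.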
For each cluster and for each distinct value of dimension in (Lines 5 and 6 of Algorithm 1), is stored in the dedicated meta file for the cluster where the entry is in the form (Line 8 of Algorithm 1). These metadata will be used by each data provider to quickly access precomputed proportions that correspond to the range of a given , thus significantly reducing the overhead in the online phase. To further improve performance, Algorithm 1 stores additional global metadata about the clusters, Clusters_metas, enabling the system to easily identify the clusters that correspond to before even computing the proportions. In a dedicated global file Clusters_metas, for each dimension in cluster , Algorithm 1 (Lines 11 and 13) stores (), the minimum (maximum) value of in . Based on these metadata in Clusters_metas, the system is able to focus only on a small subset of the database that actually contains rows matching instead of , thus reducing the processing time of . The set is defined as follows:
(2) |
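The pruning step based on the per-cluster minimum and maximum values can be sketched as an interval-overlap test. The layout of `clusters_metas` below (a dictionary mapping each cluster to per-dimension `(min, max)` pairs) is a hypothetical stand-in for the Clusters_metas file, not its actual format.

```python
def overlapping_clusters(clusters_metas, query):
    """Return the ids of clusters whose [min, max] interval intersects the
    query range on every queried dimension."""
    selected = []
    for cluster_id, dims in clusters_metas.items():
        if all(
            dims[d][0] <= hi and lo <= dims[d][1]
            for d, (lo, hi) in query.items()
        ):
            selected.append(cluster_id)
    return selected


# Hypothetical metadata: per cluster, the (min, max) value of each dimension.
clusters_metas = {
    "c1": {"age": (18, 30), "hours": (10, 40)},
    "c2": {"age": (31, 50), "hours": (20, 60)},
    "c3": {"age": (25, 45), "hours": (45, 80)},
}
query = {"age": (28, 35), "hours": (30, 42)}
selected = overlapping_clusters(clusters_metas, query)
```

Here `c3` is discarded without touching its rows because its `hours` interval starts above the queried range.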
Since we are able to identify the clusters concerned by , it only makes sense to approximate only when is bigger than a certain threshold . This threshold can be set independently by each data provider based on the size of the clusters, the processing time required for a single cluster, and the hardware and software infrastructure. The cost of storing these metadata is negligible compared to the actual table and clusters. We used the same data structure as (Zhang et al., 2016), which is very efficient. In Section 6, we show the space needed for each database.
Once the sampling is applied according to the probability computed using Equation 1, the Hansen-Hurwitz estimator (Lohr, 2009) is used to obtain the final estimation of . The estimation is done as follows:
(3) |
where is the sampling probability of the cluster and is the query execution result on the cluster
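As an illustration, the textbook form of the Hansen-Hurwitz estimator for sampling with replacement averages the per-cluster results divided by their sampling probabilities. The sketch below assumes this standard form (Lohr, 2009); the function and argument names are illustrative.

```python
def hansen_hurwitz(sampled_results, sampled_probs):
    """Hansen-Hurwitz estimate of a population total.

    sampled_results[i] is the query result on the i-th sampled cluster and
    sampled_probs[i] is the probability with which that cluster was drawn
    (sampling with replacement).
    """
    n = len(sampled_results)
    if n == 0:
        raise ValueError("at least one sampled cluster is required")
    return sum(y / p for y, p in zip(sampled_results, sampled_probs)) / n


# Two sampled clusters: results 100 and 40, drawn with probabilities 0.5 and 0.2.
estimate = hansen_hurwitz([100, 40], [0.5, 0.2])
```

Each term `y / p` is an unbiased estimate of the total on its own, so averaging over the sample keeps the estimator unbiased while reducing its variance.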
5.3. Federated protocol
In this section, we will review all the steps of online query approximation and how we were able to carefully integrate DP into each step.
5.3.1. Allocation phase
In this step, the data providers need to jointly decide the number of clusters to be sampled from each one of them based on the distribution (’s) of data related to . So upon receiving the query, each data provider identifies and computes the for each using the metadata stored locally. Then each one sends to the aggregator its and , where is the set of ’s of the clusters in and stands for Average. indicates the number of clusters within that data provider that overlap with , while shows the average proportion of rows within those clusters that correspond to . Based on this information, we obtain an aggregated (summary) view of the data distribution of records corresponding to in each data provider. Using these insights, the aggregator finds the best sample size for data provider using an optimization problem given in Equation 4 that aims to assign a bigger allocation to the data provider with the most data related to .
(4) |
In Equation 4, the data provider that holds the most data related to (has the larger ) gets a larger allocation, thus sampling more clusters to approximate locally. This reflects the same behaviour as the original collaboration method (described in Section 4): sampling probabilities are computed globally and the clusters of the data provider with the larger are more likely to be sampled than others (higher probabilities, Equation 3). So with our collaboration method, we are able to reproduce similar results and behaviour. It is important to highlight that comparing the from each data provider is only possible because we require all data providers to use the same in order to compute the proportions during the metadata creation phase.
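The intent of the allocation (more sampled clusters for the provider holding more relevant data) can be illustrated with a simple proportional rule. This is a stand-in and not the optimization problem of Equation 4: it weights each provider by the product of its cluster count and average proportion and rounds with the largest-remainder method. All names are hypothetical.

```python
def proportional_allocation(summaries, total_sample_size):
    """Split total_sample_size across providers proportionally to
    count * average proportion (a stand-in for Equation 4)."""
    weights = [count * avg for count, avg in summaries]
    total = sum(weights)
    if total <= 0:
        # No provider reports relevant data: split as evenly as possible.
        base, extra = divmod(total_sample_size, len(summaries))
        return [base + (1 if i < extra else 0) for i in range(len(summaries))]
    quotas = [total_sample_size * w / total for w in weights]
    alloc = [int(q) for q in quotas]
    # Hand the remaining units to the largest fractional remainders.
    leftover = total_sample_size - sum(alloc)
    order = sorted(range(len(quotas)), key=lambda i: quotas[i] - alloc[i], reverse=True)
    for i in order[:leftover]:
        alloc[i] += 1
    return alloc


# Four providers, each reporting (number of overlapping clusters, average proportion).
allocation = proportional_allocation(
    [(40, 0.5), (16, 0.25), (48, 0.125), (40, 0.25)], total_sample_size=8
)
```

The first provider holds half of the weighted mass and receives half of the sample, mirroring the behaviour described for the allocation step.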
To solve the problem in Equation 4, each data provider shares the and . Both are sensitive pieces of information that may reveal insights about the individuals within the database. Even if the optimisation in Equation 4 is done over encrypted data, the released allocation might give a data provider insights about the other data providers. To secure the release of this information, each data provider uses the Laplace mechanism to ensure formal guarantees of privacy. Given a privacy budget of , each data provider perturbs these two values as follows:
(5) |
where the sensitivity of to the absence/presence of an individual is 1, and the sensitivity of , is .
Theorem 5.1 (Sensitivity of estimator ).
For any two neighbouring databases , the sensitivity of is defined as:
The proof is given in appendix A.
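A minimal sketch of this perturbation step, using only the Python standard library, is given below. It assumes the allocation budget is split evenly between the two released values and takes the sensitivity of the average as a parameter, since its closed form is given by the theorem above; both choices and all names are assumptions made for illustration.

```python
import random


def laplace_noise(scale, rng):
    """Draw Laplace(0, scale) noise as the difference of two independent
    exponential variables with mean `scale`."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def perturb_summary(count, avg, epsilon, avg_sensitivity, rng):
    """Release a noisy (count, average) pair with the Laplace mechanism.

    Assumption of this sketch: epsilon is split evenly between both values.
    """
    eps_each = epsilon / 2.0
    noisy_count = count + laplace_noise(1.0 / eps_each, rng)  # count sensitivity is 1
    noisy_avg = avg + laplace_noise(avg_sensitivity / eps_each, rng)
    return noisy_count, noisy_avg


rng = random.Random(42)
noisy_count, noisy_avg = perturb_summary(
    count=100, avg=0.3, epsilon=1.0, avg_sensitivity=0.01, rng=rng
)
```

Because the noise is centred on zero, the released values are unbiased, and their spread shrinks as the budget grows.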
With this perturbation, the collaboration between data providers for deciding the allocation does not reveal any sensitive information. So the optimization problem is formulated as follows:
(6) |
The test of comes after the allocation (collaboration) phase in order to encourage all data providers to participate. Otherwise, if a data provider does not participate in allocation because locally approximating is not possible, this may reveal information about the size of its data to other data providers.
5.3.2. Sampling phase
After the allocation phase, each data provider receives an allocation : the number of clusters to process for the approximation. Using the computed locally, the data provider computes the sampling probabilities for and then performs unequal probability sampling to randomly select clusters. Since the sampling probabilities are computed based on the rows (individuals) in the database, the result of the sampling (choices) may leak information about the presence/absence of any individual. To guarantee DP, our system uses the Exponential Mechanism (EM) to select the clusters (Algorithm 2) while consuming privacy budget.
The score of the cluster is its own sampling probability (Algorithm 2 line 1), which means the scoring function of EM is defined by the computation in Equation 1. So to calibrate the noise (randomness) of EM, we must find the sensitivity of this function to the absence/presence of any individual in the database.
Consider two neighbouring databases and , where is obtained by adding any random record (which represents an individual) to at any possible cluster. Given a range query , in order to measure (sensitivity of , which is the same as ) we assume the worst case scenario for and : all clusters of ) each have a record that corresponds to . In this case, their probabilities are the same: . In , one record is added to another cluster outside of that matches . Thus and , and for all the clusters have the same sampling probability: . So the can be computed as follows:
(7) |
We notice that is dependent on the query . To find the global maximum value for , we replace the by its minimum possible value .
Theorem 5.2 (Sensitivity of sampling probability).
For any two neighbouring databases , the sensitivity of the sampling probability of any cluster is bounded by :
where is the norm.
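The selection step can be sketched with the standard form of the Exponential Mechanism, where a candidate with score u is drawn with probability proportional to exp(ε·u / (2Δ)). The sketch below is not Algorithm 2: it assumes clusters are drawn with replacement and that the sampling budget is split evenly across the draws, and it takes the sensitivity Δ as a parameter.

```python
import math
import random


def exponential_mechanism_sample(scores, k, epsilon, sensitivity, rng):
    """Draw k cluster ids with replacement using the Exponential Mechanism.

    scores maps a cluster id to its sampling probability (the score).
    Assumption of this sketch: each draw consumes epsilon / k.
    """
    eps_draw = epsilon / k
    ids = list(scores)
    max_score = max(scores.values())
    # Subtracting the maximum score leaves the distribution unchanged and
    # avoids overflow in exp().
    weights = [
        math.exp(eps_draw * (scores[c] - max_score) / (2.0 * sensitivity))
        for c in ids
    ]
    return rng.choices(ids, weights=weights, k=k)


rng = random.Random(7)
sample = exponential_mechanism_sample(
    {"c1": 0.5, "c2": 0.25, "c3": 0.25}, k=3, epsilon=1.0, sensitivity=0.5, rng=rng
)
```

With a small budget the draws are close to uniform, while a large budget concentrates them on the clusters with the highest sampling probabilities.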
5.3.3. Approximation phase
To obtain the final result from , each data provider uses the estimator defined in Equation 3. In order to release the final results securely with DP guarantees, well-calibrated noise is added to the final answer using the Laplace Mechanism. To apply the Laplace Mechanism, we need to find the sensitivity of the estimator. Let us define . We can rewrite as follows:
(8) |
which implies that:
(9) |
To find , we will focus on finding , and deduce afterwards based on this implication. Given that is a fraction of two real values, it gives a hint that its sensitivity might be unbounded similarly to operator (Near and Abuah, 2021). Upon further analysis (see appendix B), we find that is unbounded, which implies is also unbounded.
Theorem 5.3 (Sensitivity of estimator ).
For any two neighbouring databases , the sensitivity of the estimator for any cluster and query is unbounded:
where is the norm.
Given that a global sensitivity does not exist, we resort to the Local Sensitivity (LS) which is measured based on the database instance . For any database neighbouring to obtained by adding 1 row (one individual) that matches the query , we can distinguish four scenarios for a cluster (we focus on one cluster because we are looking for ) that might affect :
- Scenario 1: Cluster did not receive the new row, but another cluster did.
- Scenario 2: Cluster did receive the new row.
- Scenario 3: Cluster did not receive the new row but another cluster has been added to , such that .
- Scenario 4: Cluster did receive the new individual, but it only adds to the attribute of an existing aggregate row.
Our aim is to find the upper bound of , thus we must consider the distance that provides the largest sensitivity. An analysis of each of these scenarios (see Appendix B.2) showed that under a certain condition, either scenario 1 or scenario 4 will yield the biggest distance. For a given cluster , we can choose the Dominant scenario (which will yield the biggest ) between scenarios 1 and 4 without needing to compute any of them.
Theorem 5.4 (Dominant distance LS).
The neighbouring scenario 1 will give a bigger distance than scenario 4 iff:
See Appendix B.2 for proof.
Since the is computed based on , it cannot be used directly to inject noise because the scale of the noise may reveal sensitive information about (Near and Abuah, 2021). To avoid such information leakage, we will use the smooth sensitivity framework (Nissim et al., 2007) for finding a safer upper bound for the local sensitivity . So we redefine our in terms of a distance between and :
- Scenario 1:
- Scenario 4:
See Appendix B.2 for proof.
The safe smooth upper bound is defined as follows:
(10) |
and is the privacy budget allocated for releasing the final result.
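For illustration, the general smooth sensitivity construction of (Nissim et al., 2007) can be sketched as the maximum over distances k of exp(-βk) times the local sensitivity at distance k. The sketch below uses the common calibration β = ε / (2 ln(2/δ)) with Laplace noise of scale 2S/ε, as presented in (Near and Abuah, 2021); the callable `ls_at_distance` and the bound `max_distance` are hypothetical placeholders for the scenario-specific expressions and bound derived in this section.

```python
import math
import random


def smooth_sensitivity(ls_at_distance, max_distance, epsilon, delta):
    """Smooth upper bound: max over k of exp(-beta * k) * LS at distance k."""
    beta = epsilon / (2.0 * math.log(2.0 / delta))
    return max(
        math.exp(-beta * k) * ls_at_distance(k) for k in range(max_distance + 1)
    )


def release(value, smooth_sens, epsilon, rng):
    """Add Laplace noise of scale 2 * S / epsilon to the value."""
    scale = 2.0 * smooth_sens / epsilon
    if scale == 0:
        return float(value)
    return value + rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def toy_local_sensitivity(k):
    """Hypothetical local sensitivity that grows linearly with the distance."""
    return 10.0 + 2.0 * k


s_smooth = smooth_sensitivity(toy_local_sensitivity, max_distance=200, epsilon=1.0, delta=1e-5)
noisy_result = release(1234.0, s_smooth, epsilon=1.0, rng=random.Random(3))
```

The exponential discount is what makes the loop cheap in practice: once the discounted local sensitivity starts decreasing, larger distances can no longer raise the maximum.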
Based on the definitions we gave for , the computational overhead of computing the smooth sensitivity for each cluster is negligible because: i) all the ’s and ’s are computed before this step and are reused for each iteration over ; ii) the maximum value of (steps) is also bounded by (see Appendix B.3 for proof), which guarantees that the process will terminate; iii) Theorem 5.4 makes it possible to determine which scenario is dominant for any given cluster, so only one needs to be computed.
Algorithm 3 describes the process of estimating over the subset of cluster . It starts in line 1 by estimating according to Equation 3. Then it proceeds to compute the smooth sensitivity (Lines 2-6), where the function is responsible for computing the smooth sensitivity for each cluster as described in Equation 10. Depending on the setup chosen by the data providers, there are two options. In the first, each data provider computes and sends a DP result to the aggregator (Algorithm 3, Lines 10–11), and the aggregator returns the sum to the user. In the second, data providers share their estimations and computed sensitivities (Algorithm 3, Line 8) with the aggregator securely using SMC, and obliviously compute the sum of the estimations and the maximum sensitivity to perturb the final result with the Laplace Mechanism.
5.4. Privacy accounting
In the online query answering setting under DP, the end user is limited by a total privacy budget of . For each query , a budget is consumed in order to publish the answer, and the end user can interact with the system as long as the total budget is not consumed. In this section, we will track the privacy budget consumption for each query.
In our proposed protocol, the data providers do not share their data, and is processed (data access and publishing) in parallel by each data provider. We can therefore track the consumption on one data provider and, based on the parallel composition property of DP, deduce the budget consumption for on the full system. A data provider starts by publishing the and using the Laplace mechanism for the allocation phase, while consuming a total budget of . Based on the post-processing property of DP, obtaining the sample size is DP. Afterwards, each data provider uses the Exponential Mechanism to sample a subset while consuming a budget of . To publish an estimation of , each data provider uses the Laplace mechanism once more, and consumes a budget of . The final step does not in fact guarantee pure DP, since the smooth sensitivity has a failure probability. Based on the sequential composition property of DP, the total budget is: . Given the parallel composition property, the budget consumption for is .
In case the data providers used SMC to inject a single noise, based on parallel composition property we deduce that data providers consumed for the local computation. Afterwards they collectively consumed (once) for publishing the result. By the sequential composition property of DP, the budget consumption for is .
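The accounting above can be sketched with the two composition rules: budgets add up under sequential composition, and the maximum is taken under parallel composition over disjoint data. The function names and the three-way split per provider are illustrative assumptions; the failure probability of the smooth sensitivity step is not tracked in this sketch.

```python
def sequential(*epsilons):
    """Sequential composition: budgets spent on the same data add up."""
    return sum(epsilons)


def parallel(epsilons):
    """Parallel composition: on disjoint data, the largest budget counts."""
    return max(epsilons)


def query_budget_local_noise(providers):
    """Each provider spends (allocation, sampling, estimation) budgets and
    perturbs its own result before sending it."""
    return parallel([sequential(*p) for p in providers])


def query_budget_smc(providers_local, eps_publish):
    """Providers spend (allocation, sampling) budgets locally; a single noise
    is then injected once for publishing the result."""
    return sequential(parallel([sequential(*p) for p in providers_local]), eps_publish)


# Four providers using the same hypothetical split of a per-query budget of 1.0.
local_total = query_budget_local_noise([(0.25, 0.25, 0.5)] * 4)
smc_total = query_budget_smc([(0.25, 0.25)] * 4, eps_publish=0.5)
```

Under this split both setups consume the same per-query budget; they differ only in where the final noise is injected.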
Based on these results, a set of hyperparameters can be set in our system (by database admin for example) that regulates the budget distribution at each step of the query processing.
Let be this set of hyperparameters (where and ) such that: , and .
6. Evaluation
6.1. Setup
Datasets. We used two big datasets: (i) Adult (Becker and Kohavi, 1996) contains demographic and income information for individuals with dimensions and records, synthetically scaled up to records. (ii) Amazon Review (Ni et al., 2019) is about reviews from Amazon clients across different product categories, with only three “range queryable” dimensions and records ( GB). We synthetically added three randomly populated dimensions and random records to reach records.
A count tensor with column Measure is created from each dataset, aggregating six dimensions of Adult and one dimension of Amazon Review.
Queries and Workloads. We generated random ranges for the queries and ran only those that lead to the approximation () on all data providers. A workload is a set of distinct queries with ranges over dimensions.
Metrics. An online query is useful if it has a low error rate and low processing time. To measure the query error, we used . For performance in terms of response time, we used: Speed-UP= .
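As a point of reference, the two metrics can be sketched with their usual definitions: relative error as the absolute deviation divided by the true answer, and speed-up as the ratio between the processing time without approximation and the processing time with it. These definitions are assumed here for illustration.

```python
def relative_error(estimate, truth):
    """Relative error of an approximate answer with respect to the exact one."""
    if truth == 0:
        raise ValueError("relative error is undefined for a zero true answer")
    return abs(estimate - truth) / abs(truth)


def speed_up(baseline_seconds, approx_seconds):
    """Ratio of the exact processing time to the approximate processing time."""
    return baseline_seconds / approx_seconds


error = relative_error(estimate=95, truth=100)
gain = speed_up(baseline_seconds=8.0, approx_seconds=2.0)
```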
Configuration. In our experiments, we assumed that there are one aggregator and four data providers and that each data provider has its own database. Datasets Adult and Amazon Review are horizontally partitioned equally across data providers.
Source code. Based on PostgreSQL, our solution (https://github.com/AlaEddineLaouir/Federated-Range-Queries.git), coded in Python, uses the libraries: (i) OrTools (https://developers.google.com/optimization) as solver; (ii) Pyro5 (https://pyro5.readthedocs.io/en/latest/index.html) as communication medium; and (iii) MPyC (https://mpyc.readthedocs.io/en/latest/mpyc.html) as SMC environment. Our implementation is a proof-of-concept in which the clusters of the original table are other smaller tables.
Hyperparameters. In our experiments, the total privacy budget for each query is set with and (unless other values are indicated for ). The budget is shared between each step of our solution as follows: , and . To get clusters of the same size, we set the cluster size to and of the total size of each data provider for Adult and Amazon Review, respectively.
Metadata space allocation. The metadata for Amazon Review dataset was about 11 MB (56 KB/cluster). As for Adult dataset, it occupied 6.4 MB (64 KB/cluster).
Hardware (Grid5000: Grisou cluster, https://www.grid5000.fr/w/Nancy:Hardware). For each of the data providers and the aggregator, we allocated a dedicated server with the following configuration: X Intel Xeon E v cores/CPU x, RAM GB and TB HDD, and a network with Gbps + x Gbps (SR‑IOV).
6.2. Dimension-based analysis
In these experiments, we evaluated the impact of the number of dimensions in queries on accuracy.
To this end, we generated random workloads with distinct queries (SUM and COUNT) and dimension for Adult and for Amazon Review.
For the sampling rate, we set it to and for Amazon Review and Adult datasets, respectively.
The results presented in Figure 4 show that our solution achieves very high accuracy for COUNT and SUM queries. The relative error is less than (resp. ) on average for COUNT queries on Amazon Review (resp. Adult).
As for SUM queries, the error is less than (resp. ) on Amazon Review (resp. Adult).
This performance difference is due to the size difference between the databases. In big tables, query results are larger (contain more data), therefore less affected by Laplace Mechanism noise.
Interestingly, the results also indicate that queries become more accurate as the number of dimensions decreases. Specifically, with workloads having only dimensions on both datasets, we reached an error close to .
This observed behavior corresponds to our expectations: in Equation 1, we approximate of each cluster, and the accuracy of this approximation improves as the number of dimensions decreases, bringing the approximation closer to the exact . Thus, we have more accurate sampling probabilities, which affect the estimation of the final result.
For the speedup, the results in Figure 7 show that the higher the number of dimensions, the less speedup is gained. From the results in Figure 7, the speedup drops from approximately to as the number of dimensions increases from to on Amazon Review dataset. This drop is attributed to the sampling probabilities approximation phase, where our algorithm looks up the preprocessed metadata. The higher the number of dimensions, the more metadata it needs to look up. However, this effect becomes negligible on larger databases: even in these results, the speedup remains very significant.
6.3. Sampling rate-based analysis
In this analysis, we examined the effect of sampling rate on query quality. For each database, we generated two random workloads for COUNT and SUM queries of and .
We varied the sampling rate between and for each experiment and measured the quality obtained in terms of accuracy and speed-up. From the results in Figure 5, we observe that a higher sampling rate provides slightly better accuracy: reaching a relative error of less than with a sampling rate for COUNT queries on Amazon Review dataset.
Regarding the speed-up, we note that our solution reaches up to a compared to a normal execution (without approximation) on Amazon Review (with dimensional queries). Additionally, the speed-up gains on Amazon Review are more significant than those on Adult. This result indicates that our solution provides a greater speed-up for larger datasets. Also based on the results in Figure 5, the tradeoff between speed-up and accuracy is noticeable: the higher the sampling rate, the smaller the speed-up, whereas accuracy improves with higher sampling rates. Based on the results shown in this experiment, accuracy gains obtained with higher sampling rates are very costly in terms of speed-up. It is therefore up to the users (data analysts) to define the sampling rate according to their needs.
6.4. Privacy budget-based analysis
In these experiments, we analyzed the effect of the privacy budget on query quality. We generated two random workloads of and for COUNT and SUM queries and set the sampling rate to and for Amazon Review and Adult, respectively.
We varied between and and captured the performance on each workload. From the results in Figure 6, we can immediately observe the typical trend of any DP mechanism (larger leads to better accuracy).
Interestingly, SUM queries are able to provide better utility (lower relative error) than COUNT queries. This happens because SUM queries yield more substantial results (larger query responses) than COUNT queries, making them less affected by noise added to the response.
A similar observation applies when comparing results between the two databases, with workloads on Amazon Review preserving more accuracy than those on Adult. This is attributed to the fact that the Amazon Review dataset is much larger than Adult, causing queries to be less affected by the added noise.
Based on this observation, we can predict that as the database size increases, our solution will be able to preserve accuracy with smaller values for . Regarding speed-up, the results in Figure 7 show that levels have no effect.
6.5. SMC vs DP in terms of sharing results
To examine the performance of our SMC-based solution to share final results, we conducted experiments using the Adult dataset split across four data providers.
We generated five random two-dimensional COUNT queries. Each query was repeated five times (with and without SMC) and we measured the speed-up and the range of noise added using the Laplace mechanism at each iteration.
The results in Figure 8 show, for each query, the range of noise sampled using the Laplace mechanism for both solutions at each iteration and speed-up. We notice in Figure 8 that using SMC to share only the sensitivity and the local result does not produce significant overhead, which corresponds to the simulation results in Figure 1. Concerning the injected noise, which affects the precision of the query result, the use of SMC allows a more restricted range of perturbation. Meanwhile, if each data provider perturbs its local data without SMC, there could be two cases: (i) the noises from the data providers cancel each other out, or (ii) the noise accumulates. In the first case, the sum of noises is close to zero because some are positive and others negative, which will help improve accuracy. In the second case, which represents the worst case where most of the noise is positive or negative, the accuracy of the results will be greatly affected.
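The difference between the two setups can be quantified with the variance of the total injected noise, using the fact that a Laplace variable of scale b has variance 2b² and that the variances of independent noises add up. The sketch below uses a simplified noise scale of sensitivity/ε for both setups; the function names and sensitivities are illustrative assumptions.

```python
def laplace_variance(scale):
    """Variance of a Laplace distribution with the given scale."""
    return 2.0 * scale ** 2


def variance_local_noise(sensitivities, epsilon):
    """Each provider adds its own noise: the variances of the noises add up."""
    return sum(laplace_variance(s / epsilon) for s in sensitivities)


def variance_single_noise(sensitivities, epsilon):
    """With SMC, one noise calibrated to the maximum sensitivity is added."""
    return laplace_variance(max(sensitivities) / epsilon)


# Hypothetical sensitivities computed by four data providers.
sensitivities = [4.0, 2.0, 1.0, 1.0]
var_local = variance_local_noise(sensitivities, epsilon=1.0)
var_smc = variance_single_noise(sensitivities, epsilon=1.0)
```

Since the sum of squared scales is never smaller than the largest squared scale, the single-noise setup never has a larger variance under this simplification, which is consistent with the narrower perturbation range observed with SMC.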
Based on the experiment results, a user/data provider can choose the appropriate query execution process (with or without SMC) based on their needs, preferring accuracy over speed-up or vice versa.
6.6. Resilience to Learning-Based Attacks
DP prevents membership attacks revealing the presence/absence of an individual in the database. In (Cormode, 2010), the author introduced a simple attack that allows the disclosure of an individual’s sensitive attribute based on anonymized data.
This attack relies on training a Naive Bayes Classifier (NBC) using the results of COUNT queries from a noisy database, and this classifier will be used to predict the value of based on a given set of (quasi-identifiers) attribute values of an individual.
In our data model, corresponds to one of the dimensions , and is the subset . Given for , an NBC attaches a probability to each possible value of (). The predicted value is the one with the highest probability according to Bayes Theorem (Cormode, 2010):
To make these predictions, the classifier goes through a training phase during which it learns the conditional probabilities using the queries COUNT(*) (or SUM(Measure)) issued by the attacker to the database.
The learned probabilities are saved and later used to make predictions.
The number of queries needed is:
which is used to compute the size of the database, and for all values and dimensions.
For instance, consider a table T with 10000 rows, where is the dimension for the Age attribute. To compute , we use the following COUNT query:
(SELECT COUNT(*) FROM T WHERE Age >= 25 AND Age <= 25) / 10000.
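The prediction step of such an attack can be sketched as follows. The sketch assumes the attacker has already collected (possibly noisy) counts for each value of the sensitive dimension and for each combination of a quasi-identifier value with a sensitive value; the dictionary layout, the clamping of negative noisy counts to zero, and all names are assumptions made for illustration.

```python
def nbc_predict(prior_counts, cond_counts, observed):
    """Return the sensitive value with the highest Naive Bayes score.

    prior_counts[s] is the count of rows with sensitive value s.
    cond_counts[(dim, value, s)] is the count of rows with dim = value
    and sensitive value s.
    observed maps each quasi-identifier dimension to the individual's value.
    """
    total = sum(max(c, 0.0) for c in prior_counts.values())
    best_value, best_score = None, -1.0
    for s, count_s in prior_counts.items():
        count_s = max(count_s, 0.0)  # noisy counts may be negative
        if total == 0 or count_s == 0:
            continue
        score = count_s / total  # estimate of P(s)
        for dim, value in observed.items():
            joint = max(cond_counts.get((dim, value, s), 0.0), 0.0)
            score *= joint / count_s  # estimate of P(dim = value | s)
        if score > best_score:
            best_value, best_score = s, score
    return best_value


prior_counts = {"<=50K": 70, ">50K": 30}
cond_counts = {("age", 25, "<=50K"): 14, ("age", 25, ">50K"): 1}
prediction = nbc_predict(prior_counts, cond_counts, {"age": 25})
```

The quality of the prediction depends directly on how accurate the collected counts are, which is why the budget consumed per query matters for the attacker.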
This huge number of queries can be easily issued to a published database using a DP algorithm with a fixed privacy budget (e.g. PrivBayes (Zhang et al., 2017)), and from which the attacker can infer some knowledge (Cormode, 2010; Gkountouna et al., 2022).
However, the database is not published in our system. As we showed in Section 5.4 the attacker has a limited budget , from which each issued query consumes a privacy budget based on a sequential composition 3.9. Since can be very large, must be very small and , thus losing the utility of query answers. An alternative to sequential composition is Advanced composition (Lohr, 2009; Kairouz et al., 2015), which allows the queries to have a greater budget without exceeding . With the advanced composition, the budget of each query is: . We notice that , which means queries have better utility.
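The gap between the two composition rules can be illustrated numerically. The sketch below uses the textbook advanced composition bound, in which k mechanisms that are each ε-DP compose to a total of ε·sqrt(2k·ln(1/δ')) + k·ε·(exp(ε) − 1), and inverts it by bisection to find the largest per-query budget that fits a given total. This is an illustration under the textbook bound and may differ from the closed form used in this section; `delta_slack` denotes the δ' term of the theorem.

```python
import math


def advanced_total(eps, k, delta_slack):
    """Total epsilon of k eps-DP queries under advanced composition."""
    return eps * math.sqrt(2.0 * k * math.log(1.0 / delta_slack)) + k * eps * math.expm1(eps)


def per_query_sequential(eps_total, k):
    """Per-query budget under sequential composition."""
    return eps_total / k


def per_query_advanced(eps_total, k, delta_slack, iterations=100):
    """Largest per-query budget whose advanced composition total stays
    within eps_total, found by bisection."""
    low, high = 0.0, eps_total
    for _ in range(iterations):
        mid = (low + high) / 2.0
        if advanced_total(mid, k, delta_slack) <= eps_total:
            low = mid
        else:
            high = mid
    return low


k = 10000
eps_sequential = per_query_sequential(1.0, k)
eps_advanced = per_query_advanced(1.0, k, delta_slack=1e-6)
```

For a large number of queries the advanced bound allows a noticeably larger per-query budget, whereas for a handful of queries sequential composition can be the better choice.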
To evaluate the resilience of our system against this learning-based attack, we tested both sequential and advanced compositions and the two allowed queries COUNT and SUM.
We also considered parallel composition which allows multiple attackers to create a coalition, where each of them executes only one query (to maximize utility) and combines it with those of other attackers to train the classifier. The ingredients of our experiments are as follows:
Setup: We used Adult dataset with four data providers. We selected dimensions of our table to be and dimension to be where 100 (i.e. the number of classes for NBC). We also set and we varied between and since there is no standard value (Lohr, 2009; Laud and Pankova, 2019).
Evaluation: To assess the quality of the learning attack, we measured the accuracy of the NBC in predicting the value of for each row in the original table .
Sequential / COUNT | ||||
Sequential / SUM | ||||
Advanced / COUNT | ||||
Advanced / SUM | ||||
Coalition / COUNT | ||||
Coalition / SUM |
The results in Table 1 show that in all scenarios the accuracy is . Since the we used had possible values, this means that the trained classifier achieves an accuracy similar to that of randomly assigning a value for in each row. Three reasons can be put forward to explain the failure of the learning-based attack:
i) our system is interactive (the database is not released) and the budget is limited, thus it is difficult to have good accuracy for large numbers of queries by a single attacker;
ii) query answers in our system are approximated with random sampling, which will introduce some error;
iii) the smooth sensitivity has a considerable scale, and in the case of queries that collect small values, the accuracy can be lost even for large values of .
Similar results were obtained when fixing the and changing the number of dimensions in from 1, 3, 5 to 8. This shows the resilience of our system in different settings.
7. Discussion
In this section, we discuss the constraints, limits and points of improvement that could be integrated into our solution. In order to approximate the sampling probabilities in Section 5.2, we assumed that the dimensions are independent and that there is no correlation between them. However, this assumption is not valid in some cases. For example, if an individual’s is less than , this implies with a high probability that he/she is still studying (). Likewise, if , the attribute . When it comes to range queries, capturing and managing these dependencies is non-trivial; so we will leave it for future work.
We also restricted the data providers to using the same value of in order to approximate the . Otherwise, we cannot compare the in the allocation phase (Section 5.3.1). To agree on the same , each data provider can share their true with the others, and they will then use the maximum (which will guarantee that all the computed are ). The value of itself is not sensitive since it is usually a constant in a database system. But if this is deemed sensitive in a particular case, then data providers can simply share a randomly chosen value such that: where is an upper bound chosen by each data provider (e.g. ).
In our solution, we focused on protecting the intermediate (summary information) and final result from inference attacks with the use of Differential Privacy. However, we have not directly addressed the risks associated with side-channel attacks. It is easy to see that thanks to the collaboration method that we propose, we manage to avoid certain risks mentioned in (Qiu et al., 2023), such as: memory access models and communication volumes since all data-based computations are performed locally at each data provider and the communication cost is constant and independent of the query. But we have postponed further consideration of this aspect of the problem to dedicated work.
Our solution serves as the first building block towards a more comprehensive solution that handles more complex queries, such as GROUP-BY queries. Integrating such clauses in the SQL query is not trivial, and adding noise to the final result will not be enough to guarantee privacy (Desfontaines et al., 2020). Other aggregations, such as average, standard deviation, and variance, can be derived from SUM and COUNT using the sequential composition of DP. However, to handle other aggregations (such as Min, Max and Mode), different estimators are required.
Finally, during our evaluation, we built a proof of concept of our solution on PostgreSQL. It would be interesting to incorporate it directly into any DBMS, which would further improve our results.
8. Conclusion
In our study, we introduced a lightweight collaborative approach for online range query approximation in a federated environment. Our experimental results demonstrated the performance improvements our solution is capable of delivering, with processing times improved by up to 8x compared to plain-text execution, while ensuring end-to-end privacy with minimal loss of accuracy. Our solution uses cluster sampling and query estimation techniques that take into account data distribution to preserve query utility in terms of speed and accuracy. This work lays a solid foundation for future work to handle more complex queries while maintaining the same level of performance.
References
- Abowd (2018) John M Abowd. 2018. The US Census Bureau adopts differential privacy. In Proceedings of the 24th ACM SIGKDD international conference on knowledge discovery & data mining. 2867–2867.
- Acharya et al. (1999) Swarup Acharya, Phillip B Gibbons, Viswanath Poosala, and Sridhar Ramaswamy. 1999. The aqua approximate query answering system. In Proceedings of the 1999 ACM SIGMOD international conference on Management of data. 574–576.
- Agarwal et al. (2013) Sameer Agarwal, Barzan Mozafari, Aurojit Panda, Henry Milner, Samuel Madden, and Ion Stoica. 2013. BlinkDB: queries with bounded errors and bounded response times on very large data. In Proceedings of the 8th ACM European conference on computer systems. 29–42.
- Agrawal et al. (2006) Rakesh Agrawal, Dmitri Asonov, Murat Kantarcioglu, and Yaping Li. 2006. Sovereign joins. In 22nd International Conference on Data Engineering (ICDE’06). IEEE, 26–26.
- Ahmadvand et al. (2019) Hossein Ahmadvand, Maziar Goudarzi, and Fouzhan Foroutan. 2019. Gapprox: using gallup approach for approximation in big data processing. Journal of Big Data 6 (2019), 1–24.
- Bater et al. (2017) Johes Bater, Gregory Elliott, Craig Eggen, Satyender Goel, Abel N Kho, and Jennie Rogers. 2017. SMCQL: Secure Query Processing for Private Data Networks. Proc. VLDB Endow. 10, 6 (2017), 673–684.
- Bater et al. (2018) Johes Bater, Xi He, William Ehrich, Ashwin Machanavajjhala, and Jennie Rogers. 2018. Shrinkwrap: efficient sql query processing in differentially private data federations. Proceedings of the VLDB Endowment 12, 3 (2018).
- Bater et al. (2020) Johes Bater, Yongjoo Park, Xi He, Xiao Wang, and Jennie Rogers. 2020. Saqe: practical privacy-preserving approximate query processing for data federations. Proceedings of the VLDB Endowment 13, 12 (2020), 2691–2705.
- Becker and Kohavi (1996) Barry Becker and Ronny Kohavi. 1996. Adult. UCI Machine Learning Repository. DOI: https://doi.org/10.24432/C5XW20.
- Bittau et al. (2017) Andrea Bittau, Úlfar Erlingsson, Petros Maniatis, Ilya Mironov, Ananth Raghunathan, David Lie, Mitch Rudominer, Ushasree Kode, Julien Tinnes, and Bernhard Seefeld. 2017. Prochlo: Strong privacy for analytics in the crowd. In Proceedings of the 26th symposium on operating systems principles. 441–459.
- Cao et al. (2021) Lei Cao, Dongqing Xiao, Yizhou Yan, Samuel Madden, and Guoliang Li. 2021. ATLANTIC: making database differentially private and faster with accuracy guarantee. (2021).
- Chaudhuri et al. (2007) Surajit Chaudhuri, Gautam Das, and Vivek Narasayya. 2007. Optimized stratified sampling for approximate query processing. ACM Transactions on Database Systems (TODS) 32, 2 (2007), 9–es.
- Cormode (2010) Graham Cormode. 2010. Individual privacy vs population privacy: Learning to attack anonymization. arXiv preprint arXiv:1011.2511 (2010).
- Desfontaines et al. (2020) Damien Desfontaines, James Voss, Bryant Gipson, and Chinmoy Mandayam. 2020. Differentially private partition selection. arXiv preprint arXiv:2006.03684 (2020).
- Dwork et al. (2014) Cynthia Dwork, Aaron Roth, et al. 2014. The algorithmic foundations of differential privacy. Foundations and Trends® in Theoretical Computer Science 9, 3–4 (2014), 211–407.
- Erlingsson et al. (2014) Úlfar Erlingsson, Vasyl Pihur, and Aleksandra Korolova. 2014. Rappor: Randomized aggregatable privacy-preserving ordinal response. In Proceedings of the 2014 ACM SIGSAC conference on computer and communications security. 1054–1067.
- Eskandarian and Zaharia (2017) Saba Eskandarian and Matei Zaharia. 2017. Oblidb: Oblivious query processing for secure databases. arXiv preprint arXiv:1710.00458 (2017).
- Gkountouna et al. (2022) Olga Gkountouna, Katerina Doka, Mingqiang Xue, Jianneng Cao, and Panagiotis Karras. 2022. One-off disclosure control by heterogeneous generalization. In 31st USENIX Security Symposium (USENIX Security 22). 3363–3377.
- Goiri et al. (2015) Inigo Goiri, Ricardo Bianchini, Santosh Nagarakatte, and Thu D Nguyen. 2015. Approxhadoop: Bringing approximations to mapreduce frameworks. In Proceedings of the Twentieth International Conference on Architectural Support for Programming Languages and Operating Systems. 383–397.
- Haas and König (2004) Peter J Haas and Christian König. 2004. A bi-level bernoulli scheme for database sampling. In Proceedings of the 2004 ACM SIGMOD international conference on Management of data. 275–286.
- Hellerstein et al. (1997) Joseph M. Hellerstein, Peter J. Haas, and Helen J. Wang. 1997. Online Aggregation. In SIGMOD 1997, Proceedings ACM SIGMOD International Conference on Management of Data, May 13-15, 1997, Tucson, Arizona, USA. 171–182.
- Kairouz et al. (2015) Peter Kairouz, Sewoong Oh, and Pramod Viswanath. 2015. The composition theorem for differential privacy. In International conference on machine learning. PMLR, 1376–1385.
- Laud and Pankova (2019) Peeter Laud and Alisa Pankova. 2019. Interpreting epsilon of differential privacy in terms of advantage in guessing or approximating sensitive attributes. arXiv preprint arXiv:1911.12777 (2019).
- Li et al. (2016) Feifei Li, Bin Wu, Ke Yi, and Zhuoyue Zhao. 2016. Wander join: Online aggregation via random walks. In Proceedings of the 2016 International Conference on Management of Data. 615–629.
- Liagouris et al. (2021) John Liagouris, Vasiliki Kalavri, Muhammad Faisal, and Mayank Varia. 2021. Secrecy: Secure collaborative analytics on secret-shared data. arXiv preprint arXiv:2102.01048 (2021).
- Lohr (2009) Sharon L. Lohr. 2009. Sampling: Design and Analysis.
- Near and Abuah (2021) Joseph P. Near and Chiké Abuah. 2021. Programming Differential Privacy. Vol. 1. https://programming-dp.com/
- Ni et al. (2019) Jianmo Ni, Jiacheng Li, and Julian McAuley. 2019. Justifying recommendations using distantly-labeled reviews and fine-grained aspects. In Proceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing (EMNLP-IJCNLP). 188–197.
- Nissim et al. (2007) Kobbi Nissim, Sofya Raskhodnikova, and Adam Smith. 2007. Smooth sensitivity and sampling in private data analysis. In Proceedings of the thirty-ninth annual ACM symposium on Theory of computing. 75–84.
- Olken and Rotem (1986) Frank Olken and Doron Rotem. 1986. Simple random sampling from relational databases. (1986).
- Olken and Rotem (1995) Frank Olken and Doron Rotem. 1995. Random sampling from databases: a survey. Statistics and Computing 5 (1995), 25–42.
- Piatetsky-Shapiro and Connell (1984) Gregory Piatetsky-Shapiro and Charles Connell. 1984. Accurate estimation of the number of tuples satisfying a condition. ACM Sigmod Record 14, 2 (1984), 256–276.
- Qin and Rusu (2014) Chengjie Qin and Florin Rusu. 2014. PF-OLA: a high-performance framework for parallel online aggregation. Distributed and Parallel Databases 32 (2014), 337–375.
- Qiu et al. (2023) Lina Qiu, Georgios Kellaris, Nikos Mamoulis, Kobbi Nissim, and George Kollios. 2023. Doquet: Differentially Oblivious Range and Join Queries with Private Data Structures. Proceedings of the VLDB Endowment 16, 13 (2023), 4160–4173.
- Song et al. (2018) Guangxuan Song, Wenwen Qu, Xiaojie Liu, and Xiaoling Wang. 2018. Approximate calculation of window aggregate functions via global random sample. Data Science and Engineering 3 (2018), 40–51.
- Team et al. (2017) ADP Team et al. 2017. Learning with privacy at scale. Apple Mach. Learn. J 1, 8 (2017), 1–25.
- Wang and Jajodia (2008) Lingyu Wang and Sushil Jajodia. 2008. Security in Data Warehouses and OLAP Systems. In Handbook of Database Security - Applications and Trends, Michael Gertz and Sushil Jajodia (Eds.). Springer, 191–212.
- Wu et al. (2010) Sai Wu, Beng Chin Ooi, and Kian-Lee Tan. 2010. Continuous sampling for online aggregation over multiple queries. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of data. 651–662.
- Xu et al. (2008) Fei Xu, Christopher M. Jermaine, and Alin Dobra. 2008. Confidence bounds for sampling-based group by estimates. ACM Trans. Database Syst. 33, 3 (2008), 16:1–16:44.
- Zhang et al. (2017) Jun Zhang, Graham Cormode, Cecilia M Procopiuc, Divesh Srivastava, and Xiaokui Xiao. 2017. Privbayes: Private data release via bayesian networks. ACM Transactions on Database Systems (TODS) 42, 4 (2017), 1–41.
- Zhang et al. (2016) Xuhong Zhang, Jun Wang, and Jiangling Yin. 2016. Sapprox: Enabling efficient and accurate approximations on sub-datasets with distribution-aware online sampling. Proceedings of the VLDB Endowment 10, 3 (2016), 109–120.
- Zheng et al. (2017) Wenting Zheng, Ankur Dave, Jethro G Beekman, Raluca Ada Popa, Joseph E Gonzalez, and Ion Stoica. 2017. Opaque: An oblivious and encrypted distributed analytics platform. In 14th USENIX Symposium on Networked Systems Design and Implementation (NSDI 17). 283–298.
Appendix A Sensitivity summarised information
To obtain the allocation (sample size) based on the inter/intra data provider data distribution, each data provider communicates and . The sensitivity of the is straightforward: given a query and any two neighbouring databases and : . Adding or removing an individual adds or removes at most one cluster from . For the sensitivity of , we first need to consider the sensitivity of a single .
A.1. Sensitivity
Given a range query and two neighbouring databases (with cluster ) and (with cluster ), we consider the case where has an additional row for an extra individual. This implies:
(11) |
where is the set of dimensions defining . To obtain the upper bound of , we consider which implies:
(12) |
Since the values of and are publicly known, there is no information leak when using based on these values. The other possible scenarios of neighbouring are: 1) has one row less, which gives the same result as the previous case. 2) has a new/lost individual that only affects the "Measure" column of a row by 1, then . 3) A cluster that was not in in has an additional row and is in (or vice versa); in this case . We can prove that:
(13) |
Since , we can assume . Then:
(14) |
Which is always true ()
A.2. Sensitivity
The average of , , over a data provider’s set of clusters is computed as follows: . For any two neighbouring databases and , there are two cases in which the is affected: 1) one of the clusters in has an additional row compared to its counterpart in ; 2) has a new cluster due to the presence of an individual, thus . This gives:
(15) |
We can simplify the second as follows:
(16) |
can be replaced by its smallest possible value to maximise : We can simplify the second as follows:
(17) |
can be replaced by its smallest possible value to maximise :
(18) |
Since and are not sensitive information, they can be used to express the .
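As a hedged illustration of how a sensitivity bound of this kind is typically consumed, the following minimal Python sketch perturbs a summary statistic with Laplace noise before it leaves a data provider. The function name `laplace_release`, its parameters, the example values, and the choice of the Laplace mechanism itself are assumptions made for illustration; they are not taken from the protocol described in this paper.

```python
import random


def laplace_release(true_value, sensitivity, epsilon, rng=None):
    """Return true_value perturbed with Laplace(0, sensitivity / epsilon) noise.

    Illustrative helper (hypothetical name): `sensitivity` is any bound such as
    the ones derived in this appendix, `epsilon` is the privacy budget.
    """
    if sensitivity <= 0 or epsilon <= 0:
        raise ValueError("sensitivity and epsilon must be positive")
    rng = rng or random.Random()
    scale = sensitivity / epsilon
    # The difference of two i.i.d. exponential variables with mean `scale`
    # follows a Laplace distribution with location 0 and scale `scale`.
    noise = rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)
    return true_value + noise


# Hypothetical example: a cluster count of 12 whose sensitivity is assumed to be 1,
# since one individual adds or removes at most one cluster.
noisy_count = laplace_release(true_value=12, sensitivity=1.0, epsilon=0.5,
                              rng=random.Random(0))
```

The noise scale grows linearly with the sensitivity bound, which is why expressing the bound through non-sensitive quantities matters: the scale itself can then be shared without leaking information.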
Appendix B Sensitivity estimator
For the estimator used to approximate the result of , we first give the bound on its global sensitivity, then show how we bound its local sensitivity.
B.1. Global sensitivity of the estimator
Consider a query and two neighbouring databases and containing clusters and , where has an additional row that covers . Thus both the sampling probability and are affected by this additional row, and we have and , which implies:
(19) |
Since the is in the denominator, we can minimise it to obtain the . Consider the case where contains only one row that covers and the remaining clusters in fully cover , so their which implies:
(20) |
In , has an additional row so is:
(21) |
From equations (20) and (21) :
(22) |
Which implies:
(23) |
This result shows that the sensitivity of this statistical estimator is very large and unbounded; if its result is to be protected, an alternative is required.
B.2. Smooth sensitivity of the estimator
To compute the smooth upper bound of the , we consider four possible scenarios of neighbouring and :
(1) and : another cluster gained a row.
(2) and : a new cluster was added to .
(3) and : the cluster gained a row.
(4) and : the cluster gained an individual that adds 1 to a measure rather than a new row.
Our goal is to find the largest of these distances. We can quickly notice that since . We also have because the following constraint is always true:
(24) |
Between and , we need to determine which distance is larger and under what conditions:
(25) |
In conclusion, only and need to be used to compute the smooth sensitivity, and for each cluster only one of them dominates the other, depending on whether . Both distances can be simplified as follows:
(26) |
B.3. Bounding k for the smooth sensitivity
Since our distances are ascending functions and is an exponential decay function, and since we are looking for the , finding the stopping point of amounts to finding where this product starts decaying. In other words, we find the such that:
(27) |
For based on scenario 1:
(28) |
For based on scenario 4:
(29) |
So for both our distances, the smooth upper bound is reached when:
(30) |
where . Based on the budget we set for the estimator, we obtain the exact upper bound for the . This shows that our process terminates and does not run indefinitely.
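The stopping argument above can be sketched in Python. This is a minimal illustration under stated assumptions, not the paper's implementation: `smooth_upper_bound`, `distance_at` and `beta` are hypothetical names, the distance is assumed to be a non-decreasing function of k, and the product of that distance with the exponential decay is assumed to rise and then fall (which holds, for instance, for a distance that grows linearly in k). Under these assumptions the search can stop at the first k where the product decreases.

```python
import math
from typing import Callable, Tuple


def smooth_upper_bound(distance_at: Callable[[int], float],
                       beta: float,
                       k_max: int = 1_000_000) -> Tuple[float, int]:
    """Return (max_k exp(-beta * k) * distance_at(k), maximising k).

    Assumes the product first increases and then decreases in k, so the
    search stops at the first decrease instead of scanning every k.
    """
    if beta <= 0:
        raise ValueError("beta must be positive")
    best_value, best_k = distance_at(0), 0
    for k in range(1, k_max + 1):
        value = math.exp(-beta * k) * distance_at(k)
        if value < best_value:
            break  # the product started decaying: the maximum was at best_k
        best_value, best_k = value, k
    return best_value, best_k


# Hypothetical linear distance 1 + k with beta = 0.1.
value, k_star = smooth_upper_bound(lambda k: 1.0 + k, beta=0.1)
```

For a linear distance a + b·k, the continuous maximiser of (a + b·k)·exp(-beta·k) is k = 1/beta - a/b, so the loop runs for a number of steps governed by 1/beta, consistent with the claim that the process terminates.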