Open Access. Published by De Gruyter, August 18, 2022, under the CC BY 4.0 license.

Integer polynomial recovery from outputs and its application to cryptanalysis of a protocol for secure sorting

  • Srinivas Vivek, Shyam Murthy, and Deepak Kumaraswamy

Abstract

We investigate the problem of recovering integer inputs (up to an affine scaling) given only the integer outputs of a monotonic integer polynomial. Given n integer outputs of a degree-d monotonic integer polynomial whose coefficients and inputs are integers within known bounds, with n ≫ d, we give an algorithm to recover the polynomial and the integer inputs (up to an affine scaling). A heuristic expected time complexity analysis of our method shows that it is exponential in the size of the degree of the polynomial but polynomial in the size of the polynomial coefficients. We conduct experiments with real-world data as well as randomly chosen parameters and demonstrate the effectiveness of our algorithm over a wide range of parameters. The apparent hardness of recovering the input data from only the polynomial evaluations at specific integer points served as the basis of security of a recent protocol proposed by Kesarwani et al. for secure k-nearest neighbor computation on encrypted data that involves secure sorting. The protocol uses the outputs of a randomly chosen monotonic integer polynomial to hide its inputs, revealing only the ordering of the input data. Using our integer polynomial recovery algorithm, we show that we can recover the polynomial and the inputs within a few seconds, thereby demonstrating an attack on the protocol of Kesarwani et al.

MSC 2010: 68P10

1 Introduction

The problem of polynomial reconstruction, posed in different flavours, has received good attention in the past [1,2]. Guruswami and Sudan [1] presented a list decoding algorithm for Reed–Solomon codes, which is closely related to polynomial reconstruction over a field, and Naor and Pinkas [2] gave a new cryptographic primitive based on intractability assumptions related to the noisy polynomial reconstruction problem. Given n points (i.e., input/output pairs) on a degree-d polynomial with n > d, a well-known technique is Lagrange interpolation, which uses (d + 1) points to output the unique polynomial of degree at most d that fits the points. This problem also occurs in the context of decoding other error-correcting codes in general, with many proposed techniques to recover polynomials even when a sufficiently small fraction of the input–output pairs are error prone [3,4].
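As a concrete baseline for this classical setting, where both the inputs and the outputs are known, the following sketch performs exact Lagrange interpolation over the rationals. The function name and the toy polynomial are illustrative, not taken from the protocol analyzed later.

```python
from fractions import Fraction

def lagrange_interpolate(points):
    """Coefficients (low-degree first) of the unique polynomial of degree
    <= len(points) - 1 passing through the given (x, y) points."""
    n = len(points)
    coeffs = [Fraction(0)] * n
    for i, (xi, yi) in enumerate(points):
        basis, denom = [Fraction(1)], Fraction(1)
        for j, (xj, _) in enumerate(points):
            if j == i:
                continue
            denom *= xi - xj
            new = [Fraction(0)] * (len(basis) + 1)
            for k, c in enumerate(basis):      # multiply basis by (x - xj)
                new[k] -= c * xj
                new[k + 1] += c
            basis = new
        for k, c in enumerate(basis):
            coeffs[k] += yi * c / denom
    return coeffs

# Recover p(x) = 3 + 2x + 5x^2 from its values at 3 points
print(lagrange_interpolate([(0, 3), (1, 10), (2, 27)]))   # [3, 2, 5] (as Fractions)
```

Exact rational arithmetic (`Fraction`) matters here: integrality of the resulting coefficients is used later as a filter, which floating-point interpolation would destroy.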

Section 2 discusses some of the works available on the problem of polynomial reconstruction. We would like to emphasize that, to the best of our knowledge, in all the previous works both the inputs to and the outputs of the polynomials are given. In the present setting, we study the recovery of monotonic polynomials with nonnegative integer coefficients (sampled from a certain interval) when only the integer outputs are provided and the inputs are not (we only know that they are integers within a given range, and the degree of the polynomial). The goal is to recover the inputs. The motivation comes from cryptanalysis in a setting used to perform computation on encrypted data, where the computation involves homomorphic evaluation of a monotonic integer polynomial at specific points such that the ordering of these points is preserved to enable sorting. It is easy to see that, for polynomials over the real or the complex numbers, there will typically be infinitely many polynomials satisfying the aforementioned recovery problem. Hence, it seems intuitive to expect the same for integer polynomials, and this apparent hardness of recovering the original polynomial has found applications in areas such as privacy-preserving machine learning to hide the input data [5]. Formally, the problem is stated as follows.

Let N be the set of natural numbers (including 0), Z be the set of integers, Z + be the set of positive integers, and Q be the set of rational numbers.

Problem statement 1

Polynomial reconstruction given only outputs:

Let p(x) = a_0 + a_1 x + a_2 x^2 + ⋯ + a_d x^d ∈ N[x] (deg(p(x)) = d and p(x) is monotonic). The coefficients a_i, 0 ≤ i ≤ d, are independent and uniformly random in the range [1, 2^α − 1], α ∈ N.

Given only n integer outputs y_1 = p(x_1), y_2 = p(x_2), …, y_n = p(x_n) evaluated at n distinct values with n ≫ d, recover the x_i, or their affine shifts, on which p(x) is evaluated, where x_i ∈ N, x_i ∈ [0, 2^β − 1], and β ∈ N. The values of α, β, and d are known.
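A toy instance of this problem can be generated as follows; the parameter sizes are illustrative only (the protocol studied later uses much larger α, β, and d), and all names are ours:

```python
import random

random.seed(1)
alpha, beta, d, n = 8, 6, 2, 8                    # toy sizes for illustration
# Coefficients a_i uniform in [1, 2^alpha - 1], as in the problem statement
coeffs = [random.randrange(1, 2**alpha) for _ in range(d + 1)]
# Distinct integer inputs in [0, 2^beta - 1]
xs = sorted(random.sample(range(2**beta), n))

def p(x):
    return sum(a * x**i for i, a in enumerate(coeffs))

ys = [p(x) for x in xs]
# The adversary observes only ys (plus alpha, beta, d) -- never coeffs or xs --
# and must recover xs up to an affine shift.
print(ys)
```

Since all coefficients are positive, p is strictly increasing on nonnegative integers, so the observed outputs reveal the ordering of the hidden inputs but, at first sight, nothing more.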

This article is organized as follows. Section 2 discusses some of the works related to polynomial reconstruction. Section 3 introduces the problem setting in detail with the help of a motivating application followed by contributions of the article. Section 4 describes our algorithm for polynomial recovery. Section 5 discusses the experiments and results obtained. Section 6 gives a variation of the original protocol and a possible attack, followed by a section on conclusion and future work.

2 Related work on polynomial reconstruction

Polynomial reconstruction has applications in coding theory in the context of decoding received codewords. A block error-correcting code C is a collection of strings called codewords, all of which have the same length, over some finite alphabet Σ [1]. It is defined such that any pair of codewords in the range of C differ in at least d locations out of the block length n. The decoding problem for such an error-correcting code is the problem of finding a codeword in Σ^n that is at distance at most e (the error bound) from a received codeword R ∈ Σ^n.

In the Reed–Solomon code, the alphabet Σ is a finite field F, and the message specifies a univariate polynomial of degree d. The codewords are evaluations of the polynomial at n distinct points chosen from F [6].

Definition 1

Polynomial reconstruction [1, p. 3]: Given integers k, t, and n points (x_i, y_i), 1 ≤ i ≤ n, where x_i, y_i ∈ F, output all univariate polynomials p of degree at most k such that y_i = p(x_i) for at least t values of i ∈ [1, n].

For the Reed–Solomon family of codes, the decoding problem reduces to the polynomial reconstruction problem [1]. List decoding formalizes the notion of error correction.

Definition 2

List decoding problem for a code C [6, p. 3]: Given a received word r ∈ Σ^n and an error bound e, output a list of all codewords c_1, …, c_m ∈ C that differ from r in at most e places.

Hence, the list decoding problem for (generalized) Reed–Solomon codes is a variant of the polynomial reconstruction problem over a field F.

Definition 3

List decoding for Reed–Solomon codes [6, p. 13]: Given integers e, k, d, and n pairs (x_1, r_1), …, (x_n, r_n), find all degree-(k − 1) polynomials p such that p(x_i) = r_i for at least n − e values of i.

Sudan [6] gives an algorithm that works for t ≥ √(2dn), whereas the classical algorithm of Berlekamp [3] solves for t ≥ (n + d)/2. Guruswami and Sudan [1] gave a solution for t ≥ √(dn) for the same problem.

Gopalan et al. [7] studied the problem of polynomial reconstruction for low-degree multivariate polynomials over F_2, where, given a set of points x ∈ {0, 1}^n and their evaluations f(x) at each of those points, the goal is to find a degree-d polynomial that agrees with f on a 1 − ε fraction of the points. They show that it is NP-hard to find a polynomial that agrees with f on more than a 1 − 2^{−d} + δ fraction of the points for any ε, δ > 0.

Naor and Pinkas [2] describe the noisy polynomial reconstruction problem, whose description closely resembles that of the list decoding problem for Reed–Solomon codes; the two problems are closely related.

Definition 4

Noisy polynomial reconstruction [2, p. 8]: Given integers k and t, and n points (x_i, y_i), 1 ≤ i ≤ n, where x_i, y_i ∈ F, output any univariate polynomial p of degree at most k such that p(x_i) = y_i for at least t values of i ∈ [1, n].

Kiayias and Yung [8] presented a short survey on cryptography based on polynomial reconstruction and described the cryptographic primitive “oblivious polynomial evaluation (OPE)” of Naor and Pinkas [2]. OPE is based on the intractability assumption of noisy polynomial reconstruction. In OPE, Bob has a polynomial P and lets Alice compute P(x), where x is private to Alice, in such a way that Bob does not learn anything about x and Alice does not gain any additional information about P. Bleichenbacher and Nguyen [9] presented new methods to solve noisy polynomial interpolation using lattice-based methods.

Definition 5

Noisy polynomial interpolation [9, p. 2]: Let p be a k-degree polynomial over F. Given n > k + 1 sets S_1, …, S_n and n distinct elements x_1, …, x_n ∈ F such that S_i = {y_{i,j}}_{1 ≤ j ≤ m} contains m − 1 random elements and p(x_i), recover the polynomial p, provided that the solution is unique.

We note that in all the scenarios presented earlier, both the evaluation points (inputs) and the corresponding outputs are available for polynomial reconstruction, and so these techniques seemingly do not apply to our problem.

3 Motivating application

In this section, we describe the background leading to the problem setting, adversarial model, and contributions of the article.

We look at the client-server setting, wherein a client has a large amount of data on which some computation is to be performed, and the server has the computational power to perform the needed operations. One such operation is to obtain the k-nearest neighbors, described further in the following section. To protect privacy, a client may choose to encrypt its data, thus limiting the server’s ability to perform operations on it. Encrypting data using homomorphic schemes is one way to allow computation on encrypted data, but at the cost of efficiency. A number of works have been proposed in the literature that aim to improve efficiency while allowing computation on encrypted data. We look at the cryptanalysis of one such work and, in the process, at the problem of integer polynomial reconstruction from only outputs, the details of which are the raison d’être of this article.

3.1 k -Nearest Neighbor protocol

The k-NN algorithm is a basic method used in data mining, machine learning, and pattern recognition for classification [10,11]. k-NN is used to search, as per a given metric, for the k neighbors closest to a given Δ-tuple query point in a database containing n Δ-tuples. In the privacy-preserving version of the k-NN algorithm, the query point is an encrypted Δ-tuple and the database contains n encrypted Δ-tuples. Many efficient solutions have been proposed for determining k-NN on encrypted data [5,12,13,14], to name a few. Among these, [5] is a solution in a noncolluding federated two-cloud setting, described further in Section 3.3.

3.2 Computation on encrypted data

Homomorphic encryption schemes enable computation on encrypted data such that the result of the computation is available only after decryption. Current fully/somewhat homomorphic encryption (F/SHE) schemes [15,16,17] come at a price in terms of large ciphertexts and are inefficient when circuits of high multiplicative depth, such as search and sort operations, are involved; hence, they are impractical even when handling a few hundred data elements [18,19,20]. One way to overcome this is a solution that employs two servers: Server A, where the encrypted data are stored, and Server B, which has the keys needed to decrypt and can perform efficient plaintext computations. Server A does some basic operations on the encrypted data, followed by a transformation that “obscures” it before sharing it with Server B; Server B then decrypts the received information, performs operations efficiently on the plaintext data, re-encrypts the results, and shares them with Server A, which gives the processed data back to the end-user client. In these models, it needs to be ensured that Servers A and B do not collude with one another. Moreover, the protocols between them should ensure that Server A obscures the data in such a way that Server B is provided only as much data as required for the computation, and in a form from which Server B cannot glean anything more. There are many examples of noncolluding multiserver solutions in the literature deployed in disparate settings; we present a couple of example scenarios. Aono et al. [21] considered a two-cloud model to compute logistic regression securely, and Hardy et al. [22] used a three-cloud federated setting for learning on vertically partitioned data, both examples in a machine learning environment; pRide [23] and PSRide [24] are two solutions in a ride-hailing environment that use two-cloud models.

3.3 k -NN protocol over encrypted data [5]

At EDBT 2018, Kesarwani et al. [5] presented a new secure k-NN protocol in the two-party honest-but-curious federated cloud model for computing k-NN on encrypted data. Their solution uses a semantically secure (levelled) FHE (LFHE) scheme [25] to compute squared Euclidean distances (EDs) directly on encrypted data. To compute the ranking among the distances, the distances are suitably transformed to preserve their order. A federated noncolluding public cloud holding the secret key performs the comparison using the transformed data. Since this cloud has access only to transformed results of computations performed on plaintext data, the article claims that the public cloud does not learn anything useful about the original database, the results, or the query. The federated cloud setting consisting of the two Servers A and B is depicted in Figure 1.

Figure 1: k-NN setting from [5].

The authors claim to provide an asymptotically faster solution than the current state of the art. They implement their protocol and run experiments on two real-world datasets from the UCI machine learning repository [26]: one from the healthcare domain and another from the financial domain, with a relatively large number of dimensions (32 and 23, respectively) and containing 858 and 30,000 data points, respectively; both contain personally identifiable sensitive information such as gender, age, and marital status. They state that the data were preprocessed to remove negative integer values, and they run their experiments for k = 2, 8, and 16. While the datatype in each dimension is not explicitly mentioned in their article, by examining the datasets we find that the values are nonnegative integers of size 16 bits.

Their article makes the following security guarantee regarding the leakage profile for Server B, given in ref. [5, Theorem 4.2] and reproduced here for reference.

Theorem 4.2. Secure k -NN guarantee: Server B : Our secure k -NN protocol leaks no information to Server B except the number of nearest neighbors k to be returned by the protocol and the number of equidistant points in the database with respect to a given query point q .

It may be noted that though the protocol of [5] has been described specifically in the context of securely evaluating k -NN, their technique of transforming through a random monotonic polynomial has applications in many settings where sorting of SHE or LFHE encrypted data is needed. We remark here that if sorting is the only functionality required, then order-preserving or order-revealing encryption schemes would suffice for the purpose [27,28,29].

3.4 Problem setting

We refer to Figure 1, where the data owner outsources his/her database in encrypted form, using a semantically secure homomorphic encryption scheme, for storage on Server A. Each data item in the database is of dimension Δ. Server A does not have access to any secret keys and operates only on encrypted data. Server A provides storage for the database and provides services on the encrypted database homomorphically. One of these services is the computation of the k-NN of a given query point q, of dimension Δ, received from an end-user or client. Server A homomorphically computes the squared EDs between the query point and each of the n data points in the database. Because a single squared-ED computation is of multiplicative depth 1, it can be efficiently evaluated using an LFHE scheme. Since Server A does not possess the decryption key, it will not be able to efficiently uncover the underlying plaintext information of either the query point or the data points.

Computing the ED is of low multiplicative depth, but sorting is not. Hence, Server A transforms the distances while preserving their order. It picks a monotonic polynomial p(x) of degree d of the form a_0 + a_1 x + a_2 x^2 + ⋯ + a_d x^d for some chosen d ∈ N, where each of the integer coefficients a_i is picked uniformly at random and independently in the range [1, 2^α − 1], for example [1, 2^32 − 1], as presented in ref. [5, Section 3.4]. This polynomial is then evaluated homomorphically on each of the EDs, and the output ciphertexts are re-ordered using a permutation σ picked uniformly at random, before being sent to Server B for sorting.

Server B possesses the decryption key, with which it decrypts the values received from Server A and sorts them. As the decrypted values are outputs of a random polynomial, the original distances as computed by Server A are “hidden” from Server B. Server B then sends the indices of the k-NNs to Server A, which applies σ^{−1} to the received ordering of the indices and forwards the result to the client. We recall the security guarantee with respect to Server B given in [5, Theorem 4.2] in Section 3.3 and ask whether Server B can glean more information about the polynomial chosen by Server A and, eventually, the plaintext integer inputs used by Server A for its homomorphic polynomial evaluation.
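On plaintext values, Server A's masking step can be sketched as follows; the real protocol evaluates the polynomial homomorphically on ciphertexts, and the function name and toy parameters here are our own:

```python
import random

def transform_distances(dists, d, alpha, seed=None):
    """Plaintext sketch of Server A's masking: evaluate a random monotonic
    integer polynomial at each squared distance, then permute the results.
    (The actual protocol performs the evaluation under LFHE.)"""
    rng = random.Random(seed)
    coeffs = [rng.randrange(1, 2**alpha) for _ in range(d + 1)]
    masked = [sum(a * x**i for i, a in enumerate(coeffs)) for x in dists]
    perm = list(range(len(dists)))
    rng.shuffle(perm)
    return [masked[j] for j in perm], perm

dists = [5, 17, 2, 9]                      # toy squared EDs
out, perm = transform_distances(dists, d=3, alpha=8, seed=0)
# Sorting `out` induces the same order as sorting the underlying distances,
# which is exactly what Server B needs -- and all it is meant to learn.
```

Because every coefficient is positive, the polynomial is strictly increasing on nonnegative integers, so the masked values preserve the ranking of the distances through the permutation.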

Remark: The plaintext data points and the query point are encoded as tuples of integers, since in the context of F/SHE schemes, fixed-point values too are (exactly) encoded using essentially the scaled-integer representation [30]. The only exception is the CKKS FHE scheme [31] that natively supports floating-point arithmetic but was not considered in ref. [5]. However, it may be noted that a successful key recovery attack on CKKS was published in EUROCRYPT 2021 [32]. We leave it for future work to extend our attack to also include any implementation based on the CKKS scheme.

3.5 Adversarial model

Servers A and B are assumed to be honest but curious. Each will perform the computations correctly but is keen to learn more about the distances between the query point and the data set points. Server A has access only to encrypted data. The semantic security of the underlying encryption scheme will guarantee the protocol is secure against Server A .

Using the decryption key, Server B decrypts the permuted results of computation performed by Server A . After decryption, the Server B would observe only the outputs of the polynomial evaluation (and not the input squared distances). That is, it only sees the values on the L.H.S. of the following set of equations:

(1)  p(x_1) = a_0 + a_1 x_1 + a_2 x_1^2 + ⋯ + a_d x_1^d
     p(x_2) = a_0 + a_1 x_2 + a_2 x_2^2 + ⋯ + a_d x_2^d
     ⋮
     p(x_n) = a_0 + a_1 x_n + a_2 x_n^2 + ⋯ + a_d x_n^d.

It is assumed that Server B, seen here as the adversary, knows the degree d, which is usually small since homomorphic evaluation of polynomials in encrypted form is efficient only for small degrees. It also knows the range [1, 2^α − 1] for the unknown coefficients a_i and the range [0, 2^β − 1] for the unknown inputs x_i. For our attack, we need not know the exact values of these three parameters; an upper bound on each suffices. Also note that all the parameters above take nonnegative integer values.

As noted earlier, Server A homomorphically evaluates the polynomial p(x) at n (squared ED) integer values x_1, …, x_n, and we can assume without loss of generality that 0 ≤ x_1 ≤ x_2 ≤ ⋯ ≤ x_n. Since p(x) is monotonic, the ordering of the p(x_i) is identical to the ordering of the x_i. If a_0 ≤ y_1, then for any given tuple of coefficients (a_0, a_1, …, a_d), there will be a set of positive real roots (χ_1, χ_2, …, χ_d) to (1). Hence, the authors of ref. [5] seem to argue that if the range for the a_i is large enough, then it will be infeasible to search for all possible polynomials satisfying (1). The authors claim that the probability that Server B successfully recovers the coefficients a_i, followed by the x_i, is approximately 1/2^{α(d+1)}, which is negligible when α is large. Referring to the example given in ref. [5, Section 4.2], for α = 16 and d = 9, this probability is approximately 2^{−160}, which is negligible. Hence, the protocol leaks only a negligible amount of information, as claimed in ref. [5], about either the a_i or the x_i to Server B, and nothing else other than the ordering of the x_i's. We note here that the x_i values may never be uniquely recovered from (1) with probability 1, since p(x + c) is an equivalent polynomial satisfying the equations on the shifted inputs x_i − c for c ∈ Z, and there may be many values of c such that 0 ≤ x_i − c < 2^β. Hence, the inputs and the polynomial may only be recovered up to an affine scaling. Other nonlinear transformations may also result in an equivalent solution; for instance, p(x^2) (evaluated at the square roots of the x_i) can be a potential solution when all the x_i are perfect squares, and p(√x) when p(x) contains only even powers. But these possibilities will likely be significantly small when n ≫ d and the coefficients a_i and/or the inputs x_i come from a random source with sufficiently high entropy, which indeed is the scenario in ref. [5]. Please refer to Section 4.10, where we describe the uniqueness of the obtained solution.
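The affine-shift ambiguity can be checked directly: shifting the polynomial and the inputs in opposite directions leaves the observed outputs unchanged. The helper names below are ours, and the small polynomial is an arbitrary example:

```python
def eval_poly(coeffs, x):                     # coeffs low-degree first
    return sum(a * x**i for i, a in enumerate(coeffs))

def shift_poly(coeffs, c):
    """Coefficients of q(x) = p(x + c), computed via Horner's scheme:
    q <- q * (x + c) + a, folding in coefficients from the top down."""
    q = []
    for a in reversed(coeffs):
        new = [0] * (len(q) + 1)
        for i, qi in enumerate(q):            # q <- q * (x + c)
            new[i] += qi * c
            new[i + 1] += qi
        new[0] += a
        q = new
    return q

p = [7, 3, 2]                                 # p(x) = 7 + 3x + 2x^2
xs, c = [4, 9, 11], 2
q = shift_poly(p, c)                          # q(x) = p(x + 2) = 21 + 11x + 2x^2
# q evaluated at the shifted inputs xs - c reproduces the outputs of p at xs:
assert [eval_poly(q, x - c) for x in xs] == [eval_poly(p, x) for x in xs]
```

So an adversary observing only the outputs cannot distinguish (p, x_i) from (p(x + c), x_i − c), which is precisely why recovery is only possible up to an affine shift.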

3.6 Our contribution

We give an algorithm (see Algorithm 1 on p. 36) for the earlier defined polynomial recovery problem, where the goal is to recover the integer inputs (up to an affine scaling) by observing only the outputs of a monotonic polynomial with positive integer coefficients, assuming the number of evaluation points is much larger than the degree of the polynomial. Sections 4.1–4.7 explain the core idea of our proposed algorithm. Our algorithm has a heuristic expected time complexity that is exponential in the size of the inputs (β) and the degree (d) of the chosen polynomial, but polynomial in the size of the coefficients (α). As derived in equation (11) on p. 11, the heuristic expected time complexity of our algorithm is Õ(d^2 β 2^β (α + dβ) + n d^3 (α + dβ)^d). We note that in many real-world scenarios the inputs are, or can be encoded as, integers of 16- or 32-bit length, and our method runs efficiently for inputs of such size. Also note that in SHE applications d is required to be small as well. Since there can be many solutions to the aforementioned polynomial reconstruction problem, we output one solution that satisfies all the output points (it is also possible to enumerate all the solutions). But, as discussed earlier, when the number of output values is far larger than the degree, and the coefficients and/or inputs are sufficiently random, the number of equivalent solutions will likely be small. Indeed, we show in Section 4.10 that when the polynomial coefficients are chosen uniformly at random, the original polynomial along with the inputs is recovered by our algorithm with high probability. Our algorithm extends to recovering any integer polynomial (not necessarily monotonic) and any input range (not necessarily [0, 2^β − 1]), as long as the ranges of the coefficients and inputs are known a priori.

The aforementioned results invalidate the security claim in ref. [5, Theorem 4.2] regarding the leakage profile for Server B. In particular, Server B will be able to learn the squares of the EDs between the query point and the data set points. It may not be able to tell the exact distance to a given point, due to the random re-ordering, but it will be able to know all such values. Such information can potentially help the adversary narrow down further if it has access to extra information about the underlying data set or query point. For the concrete parameters suggested in ref. [5, Section 4.2], i.e., β = 16 and d = 9, we can recover the inputs for α = 16 in a few seconds (see Table 1 on p. 29). We then extend further up to d = 96, α = 128, and β = 48 and show that we can recover the polynomial and the inputs efficiently (see Table 2 on p. 28). Finally, in Section 6 we investigate another variant of the protocol of [5] where the (homomorphically) transformed polynomial outputs are perturbed by noise while maintaining monotonicity. In this case, our previously mentioned attack will not work, but we show, in a straightforward way, that it is still possible to recover the ratios of the inputs.

Table 1

Run times for polynomial reconstruction for a real-world data with degree 9 polynomial

α (bits) β (bits) Time to recover polynomial and x 1 (s)
16 24 3.3
20 24 3.5
24 24 3.6
32 24 3.9
Table 2

Run times for polynomial reconstruction for different polynomial degrees and β values

α (bits) β (bits) Degree Time to recover polynomial and x 1 (s)
128 24 12 10
128 24 24 42
128 24 48 293
128 24 96 2,299
128 28 12 129
128 28 24 748
128 28 48 5,084
128 28 96 38,810
128 32 12 7,382
128 32 24 35,387
128 64 12 12,190
128 64 24 48,758

A preliminary version of our work appeared as part of the 17th IMA International Conference on Cryptography and Coding (IMACC 2019) [33]. We have made a number of improvements to our algorithm (this article contains more than 50% new material compared to the earlier version), and we have run our experiments with much larger bounds. We list below some of the important changes and enhancements that are part of this submission:

  1. The heuristic expected running time of our earlier polynomial recovery algorithm (found in the conference proceedings) was Õ(n 2^β (α + dβ)^d), where n is the number of polynomial integer outputs, the parameters d and α denote the degree and the size of the coefficients of the original polynomial, and β denotes the size of the original inputs. This earlier algorithm used a brute-force approach, iterating over the input space to try and find a suitable polynomial in the second step.

  2. In the current version, we have improved the algorithm significantly, and the heuristic expected running time of the polynomial recovery algorithm is now Õ(d^2 β 2^β (α + dβ) + n d^3 (α + dβ)^d). This is made possible by avoiding the brute-force search mentioned earlier.

  3. In addition to running tests with parameters that were claimed secure in the article by Kesarwani et al. mentioned earlier, we show that our algorithm can recover the polynomial and the inputs for larger bounds on the size of the inputs, the size of the coefficients, and the degree of the polynomial. In the earlier version, input values were in the range [0, 2^24 − 1], polynomials were of degree up to 9, and the coefficient space was [0, 2^32 − 1]. In this version, the ranges of the parameters have been greatly increased: input values are now in [0, 2^64 − 1], polynomials are of degree up to 96, and the coefficient space is now [0, 2^128 − 1].

  4. When the coefficients are chosen uniformly at random, we analyze and show that the recovered polynomial and input values are, with very high probability, the same as those chosen initially.

The earlier version of our polynomial recovery algorithm has also found application in the privacy-preserving ride-hailing service (RHS) scenario [34], to recover driver locations that are homomorphically transformed by a service provider as a means to protect drivers’ privacy.

4 Recovering integer inputs given only integer polynomial outputs

We give here a high-level overview of the steps we follow to recover all the x_i. The key idea is to dramatically reduce the search space for the x_i by using the fact that we are looking only for nonnegative integer roots of the polynomial, not arbitrary nonnegative real roots.

  1. The n integer polynomial outputs y_i are the input to the algorithm.

  2. Obtain the integer divisors of y_i − y_1 that are less than 2^β, for 1 < i ≤ n. One of these divisors will be x_i − x_1, which is to be determined.

  3. Using the divisors, isolate the possible values of δ_i = x_i − x_1, which constitute the guesses for the differences of the x inputs.

  4. Using x_1 as the unknown parameter, construct a “candidate” degree-d Lagrange polynomial L(x) in x using the inputs (x_1, x_2 = x_1 + δ_2, …, x_{d+1} = x_1 + δ_{d+1}) and the outputs (y_1, y_2, …, y_{d+1}), respectively, as follows:

    (2) L(x) = L_0(x_1) + L_1(x_1) x + L_2(x_1) x^2 + ⋯ + L_d(x_1) x^d,

    whose coefficients L_i(⋅) are in turn polynomials in the unknown x_1.

  5. It is known that the coefficients of p(x) are nonnegative integers less than 2^α, and so are those of L(x). Find one or more candidates γ for x_1 such that 0 ≤ L_i(x_1 = γ) ≤ 2^α − 1 for all i = 0, …, d.

  6. When one such γ is obtained, use the differences δ_i to find all the other inputs x_i. The coefficients of p(x) that correspond to γ are immediately obtained as L_i(x_1 = γ).

  7. Using the p(x) thus obtained, use the remaining (n − d − 1) outputs to verify correctness, by computing for each such output the roots of the polynomial and checking that they include an integer in the range [0, 2^β − 1]. If these verifications succeed, the algorithm outputs the satisfying polynomial.
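For toy parameter sizes, the steps above can be sketched end to end as follows. For simplicity, we replace the symbolic analysis of the Lagrange coefficients L_i(x_1) in step 5 with a direct scan over candidate values of x_1, and the root computation of step 7 with a scan over the input range; both scans are feasible only because β is tiny here. All names and the toy instance are ours, not from Algorithm 1.

```python
from fractions import Fraction
from itertools import product

def interpolate(pts):
    """Coefficients (low-degree first) of the unique polynomial through pts."""
    coeffs = [Fraction(0)] * len(pts)
    for i, (xi, yi) in enumerate(pts):
        basis, denom = [Fraction(1)], Fraction(1)
        for j, (xj, _) in enumerate(pts):
            if j == i:
                continue
            denom *= xi - xj
            new = [Fraction(0)] * (len(basis) + 1)
            for k, c in enumerate(basis):       # multiply basis by (x - xj)
                new[k] -= c * xj
                new[k + 1] += c
            basis = new
        for k, c in enumerate(basis):
            coeffs[k] += yi * c / denom
    return coeffs

def recover(ys, alpha, beta, d):
    ys = sorted(ys)
    # Steps 2-3: divisors of y_i - y_1 below 2^beta are candidates for x_i - x_1.
    cand = [[t for t in range(1, 2**beta) if (ys[i] - ys[0]) % t == 0]
            for i in range(1, d + 1)]
    for deltas in product(*cand):
        if len(set(deltas)) != len(deltas):     # inputs must be distinct
            continue
        # Steps 4-5 (simplified): scan candidate values gamma for x_1.
        for gamma in range(2**beta - max(deltas)):
            pts = [(gamma, ys[0])] + [(gamma + dl, y)
                                      for dl, y in zip(deltas, ys[1:d + 1])]
            cs = interpolate(pts)
            # Coefficients must be integers in [0, 2^alpha - 1].
            if not all(c.denominator == 1 and 0 <= c <= 2**alpha - 1 for c in cs):
                continue
            # Step 7 (simplified): every output needs an integer preimage in range.
            image = {sum(int(c) * x**k for k, c in enumerate(cs))
                     for x in range(2**beta)}
            if all(y in image for y in ys):
                return [int(c) for c in cs], gamma
    return None

# Outputs of p(x) = 5 + 3x + 2x^2 at xs = [1, 4, 7, 12, 20, 30] (alpha=6, beta=5)
ys = [10, 49, 124, 329, 865, 1895]
print(recover(ys, alpha=6, beta=5, d=2))   # ([10, 7, 2], 0): the shift p(x + 1)
```

Note that the recovered answer is an affine shift of the original: 10 + 7x + 2x^2 equals p(x + 1), evaluated at inputs one less than the true xs, exactly as the earlier remarks predict.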

Remarks:

  1. The polynomial p(x) output by the algorithm is just one possible solution satisfying the given y_i values, since, for example, the inputs (x_i + c) with c ∈ Z give the same y_i outputs for the polynomial p(x − c).

  2. In Section 4.10, we give an analysis on how many such values of c can exist when coefficients of p ( x ) are chosen at random.

  3. While our method works for arbitrary values of α , β , and d , we demonstrate the practicality of our solution by running experiments with α up to 128, β up to 64, d up to 96, as described in detail in Section 5.

  4. While n is typically bigger, n = 2 d was sufficient for polynomial recovery in our experiments.

The procedure is given as Algorithm 1 in Appendix A, and each of the steps is described in detail in the following sections.

4.1 Divisors from y -differences

Summary of this step: given the n outputs y_i = p(x_i) in nondecreasing order, with y_1 the smallest, compute the y-differences y_{i,j} = y_i − y_j for i > j, and then find the divisors of y_{i,j}, one of which equals x_i − x_j. The differences of the y values are stored in Matrix Y, and the corresponding divisors are stored in Matrix D at the end of this step.

Refer to the procedure GuessTheDifference from Algorithm 1.

Lemma 1

Let p(x) ∈ N[x] be a degree-d polynomial, y_i = p(x_i), y_j = p(x_j), and x_i ≠ x_j; then (x_i − x_j) | (y_i − y_j).

Proof

Let y_i = a_d x_i^d + ⋯ + a_1 x_i + a_0 and y_j = a_d x_j^d + ⋯ + a_1 x_j + a_0, where a_k, 0 ≤ k ≤ d, are the coefficients of p(x).

Then, y_i − y_j = a_d (x_i^d − x_j^d) + ⋯ + a_1 (x_i − x_j). Since (x_i − x_j) divides (x_i^k − x_j^k) for every k ≥ 1, it follows that (x_i − x_j) | (y_i − y_j).□
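Lemma 1 can be sanity-checked empirically; the polynomial and parameter sizes below are arbitrary choices of ours:

```python
import random

# Empirical check of Lemma 1: (x_i - x_j) divides p(x_i) - p(x_j) for an
# integer-coefficient polynomial, because (x_i - x_j) | (x_i^k - x_j^k).
random.seed(7)
coeffs = [random.randrange(1, 2**16) for _ in range(10)]      # random degree-9 p
p = lambda x: sum(a * x**k for k, a in enumerate(coeffs))
for _ in range(100):
    xi, xj = random.sample(range(2**24), 2)                   # distinct inputs
    assert (p(xi) - p(xj)) % (xi - xj) == 0
```

This divisibility is what makes the attack possible at all: it turns the unknown input differences into divisors of quantities the adversary can compute directly from the observed outputs.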

Lemma 2

Let p(x) ∈ N[x] be a degree-d polynomial, y_i = p(x_i), y_j = p(x_j) with y_i > y_j, and x_i < 2^β for all i; then 0 < x_i − x_j < 2^β.

Proof

By our choice of ordering of the polynomial outputs, y_i > y_j, and p(x) is monotonically increasing since all its coefficients are nonnegative integers; hence x_i − x_j > 0. Each x_i is upper bounded by 2^β, hence x_i − x_j < 2^β.□

Let y_{i,j} = y_i − y_j. From Lemmas 1 and 2, (x_i − x_j) < 2^β is a divisor of y_{i,j}, and as part of this step, we find all positive divisors of y_{i,j} that are < 2^β. For β ≤ 28, we use the sieve method to obtain factors by dividing y_{i,j} by successive primes and forming the positive divisors < 2^β. When β > 28, since we know that a small factor will exist (in our case, the range of the x_i is known and the factors of interest are differences of the x_i), we use the elliptic-curve factorization method (ECM) [35] to find the required factors.
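
To make Lemmas 1 and 2 concrete, here is a minimal Python sketch (the toy polynomial and β = 4 are illustrative choices; the paper's implementation uses SageMath sieving and ECM rather than trial division):

```python
def divisors_below(n, bound):
    """All positive divisors of n strictly below bound, by trial division."""
    divs = set()
    i = 1
    while i * i <= n:
        if n % i == 0:
            if i < bound:
                divs.add(i)
            if n // i < bound:
                divs.add(n // i)
        i += 1
    return sorted(divs)

# Toy instance: monotonic p(x) = 3x^2 + 5x + 7 evaluated at x = 2, 5, 11.
p = lambda x: 3 * x * x + 5 * x + 7
xs = [2, 5, 11]
ys = [p(x) for x in xs]
beta = 4

# Divisor sets of the y-differences; by Lemmas 1 and 2, each set must
# contain the corresponding x-difference, which is positive and < 2^beta.
D = {(i, j): divisors_below(ys[i] - ys[j], 2 ** beta)
     for i in range(len(xs)) for j in range(i)}

assert all((xs[i] - xs[j]) in divs for (i, j), divs in D.items())
```

The assertion at the end is exactly the guarantee of Lemma 1: the unknown input difference always hides among the small divisors of the output difference.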

Time complexity for finding factors. In the expected case, ECM is a subexponential algorithm with complexity O(e^{(√2 + o(1))√(ln p · ln ln p)}) to find a single factor, where p is the smallest factor. We first use the sieve method to obtain all small factors ≤ 2^28 and then invoke the SageMath ECM.find_factor() on Q, where Q is the quotient obtained by dividing y_{i,j} by all the small factors with their multiplicities, to obtain the next possible factor f. If f < 2^β, we continue the process of invoking ECM.find_factor() on Q/f as long as we obtain a factor < 2^β or until we hit a timeout. The timeout value is set empirically such that the aforementioned function is able to find "small" factors (< 2^β) well within the timeout.

Remark. There is a possibility of losing factors due to our timeout being insufficient. Section 5 presents results on the loss of factors for different timeout values. We stress, however, that we must enumerate all the factors of the differences that are less than 2^β if we want to eliminate any risk of missing a satisfying polynomial.

Let D_{i,j} be the set of all divisors of y_{i,j} that are < 2^β. This set constitutes the guesses for the differences of the (yet to be determined) values x_i. It turns out that for many values of y_{i,j} there may be too many divisors below 2^β, so we need to sample a larger number of output values (i.e., larger n) and carefully pick d of the y_{i,j} such that the number of elements in each D_{i,j} is a small positive number (say, ψ), whereby the search space of guesses becomes feasible to enumerate. We also store the corresponding y_{i,j} values in Matrix Y, to be used later in the Lagrange interpolation step.

Remark. In our experiments as well as in our analysis, we consider ψ = O(α + dβ), since the number of divisors of an integer N is bounded by N^{O(1/log log N)}, and on average it is log N [36].

Y =
[ 0            0            ⋯   0
  y_{2,1}      0            ⋯   0
  y_{3,1}      y_{3,2}      ⋯   0
  ⋮            ⋮                 ⋮
  y_{d+1,1}    y_{d+1,2}    ⋯   y_{d+1,d} ].

The output of this step is Matrix D of divisor sets:

D =
[ 0            0            ⋯   0
  D_{2,1}      0            ⋯   0
  D_{3,1}      D_{3,2}      ⋯   0
  ⋮            ⋮                 ⋮
  D_{d+1,1}    D_{d+1,2}    ⋯   D_{d+1,d} ].

Time complexity. The worst-case time complexity of this step (in terms of bit operations) to find the factors less than 2^β is

(3) Õ(d² (α + dβ) β 2^β),

since the size of each y_{i,j} is O(α + dβ) bits and there are O(d²) divisor sets D_{i,j}.

4.2 Consistency check on the guessed differences

Summary of this step: The key idea in this step is to eliminate as many divisors as possible before performing Lagrange interpolation in the following step.

Refer to the procedure CheckConsistencyInColumn from Algorithm 1 in Appendix A.

Definition 6

Consistent divisor: Given the matrix of divisors D and f ∈ D_{i,k}, if there exists g ∈ D_{j,k} such that (f − g) ∈ D_{i,j} for all j, 1 < j < i, and k, 1 < k < i − 1, then f and g are termed consistent divisors of their respective sets.

Remark. If x_i − x_k is a divisor of y_i − y_k, and x_j − x_k is a divisor of y_j − y_k, then x_i − x_j must be a divisor of y_i − y_j.

The output of the previous step is the matrix D. Each element D_{i,j} ∈ D is a list of divisors of y_{i,j}, and we know that exactly one of them (as yet unknown) equals x_i − x_j. We create sets of consistent divisors for each D_{i,j}, 2 ≤ i ≤ d + 1 and j < i. All elements that are not part of the consistent set are set equal to 0.

For each nonzero element in D_{2,1}, pick its corresponding "consistent" element δ_{i−1} from D_{i,1}, 2 < i ≤ d + 1, to form a tuple of d + 1 elements (δ_0 = 0, δ_1, …, δ_d). Store each tuple in an array of tuples C. The output of this step is the array C.

Remark. In the beginning of Section 4, we use the notation δ_i for the differences x_i − x_1, but in this section, we use the same notation for "consistent" elements. This is because they represent the same quantity: a "consistent" element is a divisor of a y-output difference that is also a candidate for the (yet to be identified) x-input difference.
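
A minimal sketch of the consistency filter on toy divisor sets (all values below are illustrative): the true difference tuple always survives the filter, while inconsistent pairs are discarded.

```python
# Toy divisor sets for a 3-input instance.
# True differences: x2 - x1 = 3 and x3 - x1 = 9, hence x3 - x2 = 6.
D21 = {1, 2, 3, 6, 13}             # divisors of y2 - y1 below 2^beta
D31 = {1, 2, 3, 4, 6, 9, 11, 12}   # divisors of y3 - y1 below 2^beta
D32 = {1, 2, 3, 6}                 # divisors of y3 - y2 below 2^beta

# g in D21 and f in D31 are consistent only if their difference f - g
# is itself a divisor of y3 - y2 (the Remark above).
consistent = [(g, f) for f in D31 for g in D21 if f - g in D32]

assert (3, 9) in consistent                      # the true tuple survives
assert len(consistent) < len(D21) * len(D31)     # inconsistent pairs are pruned
```

The filter does not identify the true tuple uniquely; it only shrinks the set of tuples that must later be tried with Lagrange interpolation.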

Time complexity. The worst-case time complexity of this step is

(4) Õ(d³ ψ² β),

as O(d²) elements, each having O(ψ) factors of size O(β) bits, are compared across O(d) columns.

4.3 Recursive consistency check

Summary of this step: In Section 4.2, we obtain the consistent sets by examining the divisors of differences of the polynomial outputs, and from them a set of tuples that constitute the guessed differences of the inputs. In this step, the differences are obtained in a different manner, as described later in this section. The two sets of guessed differences are compared, enabling us to converge faster on the set of x-differences.

Remark. In the worst case, we need to run this step ψ^d times, so we enable this procedure only for experiments with small polynomial degree. In practice, this step quickly reduces the number of inconsistent tuples in Array C, thus reducing the overall running time of the algorithm.

Refer to the procedure RecursiveConsistencyCheck from Algorithm 1 in Appendix A.

This step takes the tuple array C as input. The d elements of each tuple of Array C, for example, C[i][1] to C[i][d] of the i-th tuple, are factors of y-output differences, as obtained in the previous step. By using the d elements of the first tuple, we can rewrite the first column of Matrix Y as follows:

y_{2,1} = y_2 − y_1 = (x_2 − x_1) · [·]_{2,1}
y_{3,1} = y_3 − y_1 = (x_3 − x_1) · [·]_{3,1}
⋮
y_{d+1,1} = y_{d+1} − y_1 = (x_{d+1} − x_1) · [·]_{d+1,1}.

We start with Column m = 1 of Matrix Y and, for ease of notation, rename the y_i outputs as z_{i,0} for 1 ≤ i ≤ d + 1; in other words, y_1 is renamed z_{1,0}, y_2 is renamed z_{2,0}, and so on. In Step 1 of this procedure, we compute the quotients [·]_{2,1} = (z_{2,0} − z_{1,0})/(x_2 − x_1), [·]_{3,1} = (z_{3,0} − z_{1,0})/(x_3 − x_1), and so on. Generalizing this, in step m, we compute the quotients [·]_{i,m} = z_{i,m}, for (m + 1) ≤ i ≤ (d + 1), as follows:

z_{i,m} = (z_{i,m−1} − z_{m,m−1}) / (x_i − x_m).

Next, we find the differences z_{i,m} − z_{m+1,m} for (m + 1) < i ≤ (d + 1), and then find the divisors (< 2^β) of each of these differences. We call this divisor set DN_{i,m+1}. In Matrix D, we set D_{i,m+1} = DN_{i,m+1} ∩ D_{i,m+1}. We then build the consistent set for Column m similar to the way described in Section 4.2. Finally, we pick the nonzero element in every row of Column m to form a tuple τ of (d − m) elements, which forms the guessed difference divisors for the next recursive step. The recursion terminates within d + 1 steps, as given in Lemma 3. In each step, the computed τ is used as input to the subsequent call.

Lemma 3

Given p(x) ∈ ℕ[x], a degree-d polynomial, the recursive consistency check procedure terminates the recursion for a single tuple of guessed differences in at most d steps.

Proof

In step 1 of the procedure, the y_i outputs are those of p(x), a degree-d polynomial. By taking differences of the y_i outputs, the coefficient a_0 is eliminated. The input tuple of guessed differences is then factored out from the y_i differences, leaving integer quotients. In step 2, by taking differences of these quotients, the next coefficient, a_1, is eliminated, and the process is repeated in each subsequent step. Thus, in step i, the coefficient a_{i−1} is eliminated. Finally, in step d:

  1. the coefficient a_{d−1} is eliminated,

  2. the guessed differences at this step are factored out, leaving the quotient [·]_{d+1,d} = a_d, the coefficient of x^d of p(x), and

  3. the algorithm terminates for the input tuple of guessed differences and returns a_d.

During any recursive step, if a noninteger quotient is obtained or if no nonzero element remains in the consistent set, then the recursion is terminated, returning 0.□

As shown in Lemma 3, for a single tuple of guessed differences, this step terminates in at most d steps. If no a_d is returned, then the tuple in Array C is discarded, and the procedure is repeated using the next tuple in Array C.

When recursion completes successfully, we obtain

z_{d+1,d} = (z_{d+1,d−1} − z_{d,d−1}) / (x_{d+1} − x_d).

We obtain z_{d+1,d} = a_d as given in Lemma 3. We note that a tuple in Array C whose values have successive differences matching those of the original x_i values will successfully go through all the d + 1 steps until an integer a_d is obtained.
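
The recursion of Lemma 3 can be sketched on exact toy data (the polynomial 3x² + 5x + 7 below is an illustrative choice, not from the paper's experiments): the final quotient equals the leading coefficient, while a wrong guessed difference surfaces as a noninteger quotient.

```python
def leading_coefficient(xs, ys):
    """Run the recursion z_{i,m} = (z_{i,m-1} - z_{m,m-1}) / (x_i - x_m) on a
    guessed input tuple xs. For exact data, the final quotient is the leading
    coefficient a_d (Lemma 3); a wrong guess produces a noninteger quotient,
    reported here as None."""
    z = list(ys)
    n = len(xs)
    for m in range(n - 1):
        for i in range(m + 1, n):   # only z[i] and the fixed z[m] are read
            num, den = z[i] - z[m], xs[i] - xs[m]
            if num % den:
                return None         # noninteger quotient: reject this tuple
            z[i] = num // den
    return z[n - 1]

# Outputs of p(x) = 3x^2 + 5x + 7 at x = 2, 5, 11 (toy instance).
ys = [29, 107, 425]
assert leading_coefficient([2, 5, 11], ys) == 3     # correct differences -> a_d
assert leading_coefficient([2, 4, 11], ys) is None  # wrong guess is rejected
```

This is how the procedure prunes inconsistent tuples early, before the more expensive Lagrange interpolation step.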

Time complexity. In this step, factors are obtained for O(d²) elements, followed by running the consistency check on d − 1 columns. For one tuple, the worst-case time complexity of this step, based on (3) and (4), is

(5) Õ(d² (α + dβ) β 2^β + d³ ψ² β).

4.4 Choosing optimal divisors sets

Summary of this step: In Section 4.2, we enumerate over ψ^d many tuples of divisors/guesses. We provide here a way to choose the divisor sets such that the enumeration complexity is as small as possible.

Definition 7

Output difference graph: A complete undirected graph G = (V, E) with |V| = n and E ⊆ V × V, where the number of elements in D_{i,j} is the cost of the edge between nodes V_i and V_j.

We refer to equation (1), which gives the polynomial outputs. As explained in Section 4.1, we compute the differences of the polynomial outputs and form the Matrix D of divisor sets (divisors less than 2^β) of the output differences. We need to find a set of d elements D_{i,j} of Matrix D such that Π |D_{i,j}| is minimum; in other words, the product of the numbers of guesses is minimum. Graph G is the output difference graph of Matrix D. Finding the minimum Π |D_{i,j}| is akin to finding the d-minimum spanning tree (i.e., a minimum-weight tree with d edges only) in G, where the weight of the tree is the product of its edge weights. The requirement that the subgraph be a tree comes from the linear-independence requirement on the corresponding set of equations. Essentially, we transform the problem of finding a small search space of divisors into the problem of finding a d-MST of least cost (in terms of divisor product) across all divisor sets of Matrix D while satisfying the linear-independence condition. It is shown in ref. [37] that the d-MST problem is NP-hard for points in the Euclidean plane. The same article provides an approximation algorithm to find a d-MST [37, Theorem 1.2], reproduced here for reference.

Theorem 1.2. There is a polynomial-time algorithm that, given an undirected graph G on n nodes with nonnegative weights on its edges and a positive integer k ≤ n, constructs a tree spanning at least k nodes, of weight at most 2√k times that of a minimum-weight tree spanning any k nodes.

Note that this approximation algorithm also works for multiplication of edge weights (weights greater than 1), since taking logarithms turns the product of edge weights into a sum of edge weights. By using this algorithm, we can select d < n nodes having close to the minimum enumeration complexity, making the search space for guessing the differences (x_i − x_j), x_i ≠ x_j, feasible. From the d-MST so obtained, we can then find the set of divisors (x_i − x_j) that are consistent, as explained in Section 4.2, and continue with finding the polynomial coefficients using Lagrange interpolation as described in Algorithm 1. However, we did not implement this optimization in our code, as the concrete running time was already small enough; for larger instances, this optimization will be useful.
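
The logarithm transform can be sanity-checked on a toy weight set (the weights below are illustrative divisor-set sizes): minimizing a product of positive weights and minimizing the sum of their logarithms select the same subset.

```python
import itertools
import math

# Edge weights are divisor-set sizes (all > 1); illustrative values.
weights = [3, 7, 2, 9, 4]
triples = list(itertools.combinations(weights, 3))

# Minimizing the product is equivalent to minimizing the sum of logs,
# so additive k-MST machinery applies to multiplicative tree weights.
by_product = min(triples, key=math.prod)
by_logsum = min(triples, key=lambda t: sum(map(math.log, t)))

assert by_product == by_logsum
```

Here subsets stand in for spanning trees; the same equivalence holds edge-by-edge in the d-MST instance.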

4.5 Lagrange interpolation to recover all coefficients

Summary of this step: Use Lagrange interpolation to construct a degree-d polynomial using the guessed difference set.

Refer to the procedure ApplyLagrangeInterpolation from Algorithm 1 in Appendix A.

Each tuple (δ_0 = 0, δ_1, …, δ_d) in the Array C, obtained as described in Section 4.2, presents possible candidates for the differences (x_i − x_1), 1 ≤ i ≤ d + 1. Optionally, we could use the method described in Section 4.3 to narrow down the possible candidates to only those tuples for which an integer a_d coefficient is obtained.

Since δ_i represents a difference with respect to some unknown value x_1, we treat x_1 as an unknown parameter and add it to each of the d differences to obtain (x_1, x_2, …, x_{d+1}) = (x_1, x_1 + δ_1, …, x_1 + δ_d). The Lagrange polynomial of degree d is obtained using these values as the inputs x, expressed in terms of x_1, and the corresponding output values come from the y set as given in Section 4.1. We now have a degree-d polynomial L in x with parameter x_1, as shown in equation (6).

(6) L(x) = Σ_{k=1}^{d+1} [ Π_{j=1, j≠k}^{d+1} (x − x_j) / Π_{j=1, j≠k}^{d+1} (x_k − x_j) ] · y_k.

Lemma 4

Given the degree-d polynomial L(x) with parameter x_1 as shown in equation (6), it can be written as a polynomial in x whose coefficients are polynomials in x_1.

Proof

(1) The denominator of each of the summands in equation (6) is in ℤ, since it is a product of integers only and does not contain the parameter x_1.

(2) The numerator of each of the summands in equation (6) is a product of d degree-1 integer binomials. We know that such a product has the form (x + p_1)(x + p_2)⋯(x + p_d) = x^d + (Σ_i p_i) x^{d−1} + (Σ_{i<j} p_i p_j) x^{d−2} + ⋯ + Π_i p_i.

Now, since each of the p_i = x_1 + δ_i, it follows that the coefficient of x^i is a degree-(d − i) polynomial in x_1, and Π_{i=1}^{d} p_i is a degree-d polynomial in x_1.

From statements (1) and (2), L(x) = L_0(x_1) + L_1(x_1) x + L_2(x_1) x² + ⋯ + L_d(x_1) x^d, where each L_i(x_1) ∈ ℚ[x_1].□

From Lemma 4, each L_i is a degree-(d − i) polynomial in x_1, 0 ≤ i ≤ d, and L_d(x_1) = a_d is a constant polynomial. Thus, we have recovered a_d, the coefficient of x^d in p(x).

We know that the coefficients of p(x) are positive integers less than 2^α and that p(x) is evaluated at integers in the range [0, 2^β − 1]. Our next objective is to output an integer x_1 ∈ [0, 2^β − 1] such that, for i = 1, 2, …, d, x_{i+1} = x_1 + δ_i ≤ 2^β − 1 and, for i = 0, …, d − 1, L_i(x_1) is an integer with 0 ≤ L_i(x_1) ≤ 2^α − 1, as described in Section 4.6. If no such x_1 exists, then we repeat this process using the next tuple in C.
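
Lemma 4's claim that the leading coefficient L_d(x_1) = a_d is constant in x_1 can be checked with exact rational arithmetic (the toy data below and the helper `lagrange_coeffs` are illustrative, not the paper's code):

```python
from fractions import Fraction

def lagrange_coeffs(xs, ys):
    """Coefficients (constant term first) of the unique interpolating
    polynomial through the points (xs[k], ys[k]), computed exactly."""
    d = len(xs) - 1
    coeffs = [Fraction(0)] * (d + 1)
    for k in range(d + 1):
        basis = [Fraction(1)]          # running product prod_{j != k} (x - x_j)
        denom = Fraction(1)
        for j in range(d + 1):
            if j == k:
                continue
            nxt = [Fraction(0)] * (len(basis) + 1)
            for i, c in enumerate(basis):
                nxt[i] -= xs[j] * c    # multiply basis by (x - x_j)
                nxt[i + 1] += c
            basis = nxt
            denom *= xs[k] - xs[j]
        for i in range(d + 1):
            coeffs[i] += ys[k] * basis[i] / denom
    return coeffs

ys = [29, 107, 425]        # outputs of p(x) = 3x^2 + 5x + 7 at x = 2, 5, 11
deltas = [0, 3, 9]         # guessed differences x_i - x_1
for x1 in (0, 2, 5):       # x_1 treated as a free parameter
    c = lagrange_coeffs([x1 + t for t in deltas], ys)
    assert c[-1] == 3      # leading coefficient a_d is independent of x_1
```

Shifting the whole input tuple only changes the lower-order coefficients, which is precisely why the remaining coefficients must be constrained as polynomials in x_1.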

Time complexity. The worst-case time complexity of computing the Lagrange interpolation polynomial for one tuple is:

(7) Õ(d² (α + dβ)²).

4.6 Obtaining x_1 by finding roots of the polynomials L_i(x_1)

Summary of this step: Given α, β, and polynomials L_0, L_1, …, L_{d−1} (each L_i ∈ ℚ[x_1] and of degree d_i = d − i), find x ∈ ℤ satisfying the condition: 0 ≤ x ≤ 2^β − 1 and 0 ≤ L_i(x) ≤ 2^α − 1 for all i = 0, …, d − 1.

Refer to the procedure RecoverPossibleXValuesAndPoly from Algorithm 1 in Appendix A.

It is easy to see that the following algorithm always outputs one such x if it exists (if no integer satisfies the required condition, then this algorithm returns failure).

  1. For each i = 0, …, d − 1, find the ordered set S_i = {[a_k, b_k] : 0 ≤ a_k ≤ b_k ≤ 2^β − 1} of disjoint intervals on the real line such that every integer m satisfying 0 ≤ m ≤ 2^β − 1 and 0 ≤ L_i(m) ≤ 2^α − 1 is present in some interval [a_k, b_k] ∈ S_i.

  2. If there exists an integer x in the range [0, 2^β − 1] that is present in some interval from each S_i, output it; else return failure.

Step 1. Finding the ordered set S_i of intervals for each L_i:

Compute the set R_{i,0} of all the (approximations to the) real roots of L_i(x) = 0 that lie in the range [0, 2^β − 1]. Similarly, let R_{i,α} contain all the real roots of L_i(x) = 2^α − 1 in the range [0, 2^β − 1]. Let R_i = R_{i,0} ∪ R_{i,α} be an ordered set with its elements in increasing order (as they appear on the real line).

Intuition. We first provide an intuition for our algorithm by looking at one particular case. Let a, b (with a ≤ b) be two consecutive elements of the ordered set R_i = R_{i,0} ∪ R_{i,α}. For now, assume that a ∈ R_{i,0} and b ∈ R_{i,0}. Then, the following lemma easily follows.

Lemma 5

There are only two possibilities for the values taken by L_i(x) for x ∈ (a, b): either 0 ≤ L_i(x) ≤ 2^α − 1 for all a < x < b, or L_i(x) < 0 for all a < x < b.

Proof

Note that L_i is continuous everywhere in its domain. Since a, b are consecutive elements in R_i, graphically, the curve L_i does not intersect the lines y = 0 and y = 2^α − 1 at any point x between a and b. Since L_i(a) = L_i(b) = 0, there are only two possibilities: the curve either lies entirely between y = 0 and y = 2^α − 1 or entirely below y = 0.□

To determine which of the aforementioned possibilities occurs, it is enough to compute L_i(c) at any c between a and b. If 0 < L_i(c) < 2^α − 1, the interval [a, b] contains all integers between a and b satisfying 0 < L_i(·) < 2^α − 1; add this interval to S_i. (In our algorithm, we take the midpoint c = (a + b)/2.)

Similarly, it is straightforward to analyze the other cases: a ∈ R_{i,0} and b ∈ R_{i,α}, and so on. When we consider all such consecutive pairs a, b in R_i, we end up with a set of intervals S_i such that every integer m between 0 and 2^β − 1 that satisfies 0 < L_i(m) < 2^α − 1 is present in some interval of S_i.

Main idea. Here, we extend the aforementioned arguments and describe the algorithm for finding S_i. Let a be the first element in R_i. If 0 ∉ R_i, then for all 0 < x < a, there are three possibilities: L_i(x) is entirely above y = 2^α − 1 (when a ∈ R_{i,α}), entirely below y = 0 (when a ∈ R_{i,0}), or entirely between y = 0 and y = 2^α − 1. Therefore, add [0, a] to S_i if 0 < L_i(a/2) < 2^α − 1 and 0 ∉ R_i.

Next, for every pair of consecutive elements a, b in the (sorted) enumeration of R_i, there are four cases based on whether they belong to R_{i,0} or R_{i,α}.

  1. a ∈ R_{i,0} and b ∈ R_{i,0}: If 0 < L_i((a + b)/2) < 2^α − 1, then add [a, b] to S_i. If L_i((a + b)/2) < 0 and a (resp. b) is an integer, we still include it, since L_i(a) = 0 and L_i(b) = 0; in this case, add [a, a] (resp. [b, b]) to S_i.

  2. a ∈ R_{i,0} and b ∈ R_{i,α}: Now the only possibility is that 0 < L_i(x) < 2^α − 1 for all a < x < b (i.e., L_i is nondecreasing in this region). This is because L_i(a) = 0, L_i(b) = 2^α − 1, and L_i(c) cannot be 0 or 2^α − 1 for any a < c < b. So, add [a, b] to S_i.

  3. a ∈ R_{i,α} and b ∈ R_{i,0}: Similar to the previous case, we once again have 0 < L_i(x) < 2^α − 1 for all a < x < b, and L_i is nonincreasing in this region. Add [a, b] to S_i.

  4. a ∈ R_{i,α} and b ∈ R_{i,α}: For each a < x < b, L_i(x) does not intersect y = 0 or y = 2^α − 1. Since L_i(a) = L_i(b) = 2^α − 1, this means that L_i is entirely above or entirely below y = 2^α − 1. So, add [a, b] to S_i if 0 < L_i((a + b)/2) < 2^α − 1; otherwise, if a (resp. b) is an integer, add [a, a] (resp. [b, b]) to S_i.

Let b be the last element in R_i. If 2^β − 1 ∉ R_i, then for all b < x < 2^β − 1, there are three possibilities: L_i(x) is entirely above y = 2^α − 1 (when b ∈ R_{i,α}), entirely below y = 0 (when b ∈ R_{i,0}), or between y = 0 and y = 2^α − 1. Add [b, 2^β − 1] to S_i if 0 < L_i((b + 2^β − 1)/2) < 2^α − 1 and 2^β − 1 ∉ R_i.

To ensure that S_i consists of disjoint intervals, combine consecutive elements [a, b], [c, d] ∈ S_i into a single interval [a, d] if b = c. Since intervals were added to S_i in sorted order, after combining them, the disjoint intervals in S_i are ordered as they appear on the real line. This property will be useful when the intersection of intervals is computed in the next step.
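
The interval construction can be sketched for a toy quadratic L_i (all concrete numbers below are illustrative choices, not the paper's parameters); closed-form quadratic roots stand in for the paper's numerical root finding:

```python
import math

ALPHA_MAX = 15   # stands in for 2^alpha - 1 (toy value)
BETA_MAX = 10    # stands in for 2^beta - 1 (toy value)
L = lambda x: (x - 2) * (x - 7)   # toy L_i(x) = x^2 - 9x + 14

def quad_roots(a, b, c, lo, hi):
    """Real roots of ax^2 + bx + c = 0 inside [lo, hi] (closed form)."""
    disc = b * b - 4 * a * c
    if disc < 0:
        return []
    r = math.sqrt(disc)
    return sorted(x for x in ((-b - r) / (2 * a), (-b + r) / (2 * a))
                  if lo <= x <= hi)

R0 = quad_roots(1, -9, 14, 0, BETA_MAX)              # roots of L(x) = 0
Ra = quad_roots(1, -9, 14 - ALPHA_MAX, 0, BETA_MAX)  # roots of L(x) = ALPHA_MAX
pts = [0.0] + sorted(R0 + Ra) + [float(BETA_MAX)]

# Midpoint test between consecutive break points, as described above:
# keep an interval iff L at its midpoint lies inside the admissible band.
S = [(a, b) for a, b in zip(pts, pts[1:])
     if a < b and 0 <= L((a + b) / 2) <= ALPHA_MAX]

# Every admissible integer in the domain is covered by some interval of S.
valid = [m for m in range(BETA_MAX + 1) if 0 <= L(m) <= ALPHA_MAX]
assert all(any(a <= m <= b for a, b in S) for m in valid)
```

For this toy L, the sweep keeps exactly the two bands where the curve lies between y = 0 and y = ALPHA_MAX, mirroring cases 1 to 4 above.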

Time complexity. For every L_i, the equations L_i(x) = 0 and L_i(x) = 2^α − 1 can each have at most d_i solutions. All real roots of these equations can be computed in time Õ(d_i³ τ) [38,39], where τ is the minimum number of bits required to represent all coefficients of L_i; in our case, τ = Õ(α + dβ). The size of R_i is at most 2d_i, and sorting it takes time Õ(d_i). Adding intervals to S_i for every pair of consecutive elements in R_i (and ensuring that they are disjoint) takes Õ(d_i). We perform this for all polynomials L_0, L_1, …, L_{d−1}, leading to a time complexity of

(8) Õ(Σ_{i=0}^{d−1} (d_i + d_i³ τ)) = Õ(d⁴ (α + dβ)).

Remark. When finding the real roots of L_i(x) = 0 and L_i(x) = 2^α − 1, it is sufficient to compute approximations of these roots up to a certain number of decimal places. For the values of d, α, and β that we consider in our experiments in Section 5.1 (refer to Table 2), obtaining roots accurate up to 30 decimal places is sufficient for our algorithm to correctly output the required integer.

Step 2. Finding an x belonging to some interval in each S_i:

When we refer to an interval I = [a, b] (with start point a = I.start and end point b = I.end), it is understood that a ≤ b. A number x ∈ I if a ≤ x ≤ b. Two intervals [a, b], [c, d] intersect if max(a, c) ≤ min(b, d), in which case the interval denoting their intersection is [max(a, c), min(b, d)]. The simultaneous intersection of k intervals I_1, I_2, …, I_k can be computed iteratively: let X store the required result, initially set to I_1; for i = 2, …, k, set X to the intersection of the intervals X and I_i.

Let S_{i,j} be the j-th element/interval in the (ordered) set S_i. Our objective is to identify potential intervals in each S_i that contain a common integer x ∈ [0, 2^β − 1]. In other words, find an integer x ∈ [0, 2^β − 1] and indices i_0, i_1, …, i_{d−1} such that x ∈ S_{0,i_0}, x ∈ S_{1,i_1}, …, x ∈ S_{d−1,i_{d−1}}. A recursive approach can be used. Of S_0, S_1, …, S_{d−1}, let S_k be the set whose first interval has the smallest end point, in other words, S_{k,1}.end = min_{i=0,…,d−1} {S_{i,1}.end}. Let I be the interval representing the intersection of S_{0,1}, S_{1,1}, …, S_{d−1,1}. We have two possibilities:

  1. I is empty: This implies that S_{k,1} will never contribute toward the required common element in future iterations of the recursion, since it ends first among all intervals from S_0, S_1, …, S_{d−1} on the real line. So, remove S_{k,1} from S_k and recurse on the new sets S_0, …, S_{d−1}.

  2. I is nonempty: If I contains an integer x in the range [0, 2^β − 1], output it, since x is common to all of S_{0,1}, S_{1,1}, …, S_{d−1,1}. Otherwise, remove S_{k,1} from S_k as before and recurse.

The total size |S_0| + |S_1| + ⋯ + |S_{d−1}| decreases by one in each iteration. The recursion terminates by default when any S_i becomes empty, since the required integer must be present in some interval of each S_i. Therefore, once any S_i becomes empty, no such x exists and the algorithm returns failure.
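
The interval sweep of Step 2 can be sketched as follows (toy intervals, illustrative values; the loop mirrors the drop-the-earliest-ending-interval recursion above):

```python
import math

def common_integer(sets):
    """Find an integer lying in some interval of every ordered interval set,
    by intersecting the first intervals and dropping the earliest-ending one."""
    sets = [list(s) for s in sets]
    while all(sets):
        lo = max(s[0][0] for s in sets)
        hi = min(s[0][1] for s in sets)
        if lo <= hi and math.ceil(lo) <= hi:
            return math.ceil(lo)        # integer common to all first intervals
        # the interval that ends first can never contribute later; drop it
        k = min(range(len(sets)), key=lambda i: sets[i][0][1])
        sets[k].pop(0)
    return None                         # some set ran empty: no such integer

S0 = [(0, 2), (7, 9.1)]
S1 = [(1.5, 4), (8.2, 10)]
assert common_integer([S0, S1]) == 2
assert common_integer([[(0, 0.5)], [(0.6, 0.9)]]) is None
```

Because each iteration removes one interval, the loop runs at most |S_0| + ⋯ + |S_{d−1}| times, matching the complexity analysis below.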

Time complexity. The number of intervals in each S_i is O(d_i), since the size of R_i is bounded by 2d_i. In the worst case, when no integer satisfies the desired condition, the recursion runs for |S_0| + |S_1| + ⋯ + |S_{d−1}| = O(Σ_{i=0}^{d−1} d_i) steps. Finding the iterative intersection of S_{0,1}, S_{1,1}, …, S_{d−1,1} in each step can be done in time O(d). The worst-case time complexity is

(9) O(d · Σ_{i=0}^{d−1} d_i) = Õ(d³).

4.7 Verification of the polynomial against the remaining outputs

Once a candidate polynomial is obtained as described in Section 4.6, we use the remaining (n − d − 1) outputs to verify correctness by computing the roots of the polynomial and checking whether they are integers that lie in the range [0, 2^β − 1]. If these verifications succeed, we say that the algorithm has output a satisfying polynomial.
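
Since p(x) is monotonic over nonnegative inputs, this verification can be sketched with binary search in place of numerical root computation (a simplification of the paper's procedure; the toy polynomial below is illustrative):

```python
def invert_monotone(coeffs, y, hi):
    """Binary-search the integer x in [0, hi] with p(x) = y for a monotonic
    integer polynomial (coeffs: constant term first); None if y is missed."""
    p = lambda x: sum(c * x ** i for i, c in enumerate(coeffs))
    lo = 0
    while lo <= hi:
        mid = (lo + hi) // 2
        v = p(mid)
        if v == y:
            return mid
        if v < y:
            lo = mid + 1
        else:
            hi = mid - 1
    return None

coeffs = [7, 5, 3]   # candidate p(x) = 3x^2 + 5x + 7 (toy instance)
beta = 8
assert invert_monotone(coeffs, 425, 2 ** beta - 1) == 11    # output maps back to an integer
assert invert_monotone(coeffs, 426, 2 ** beta - 1) is None  # stray value rejects the candidate
```

A candidate polynomial is accepted only if every remaining output inverts to an integer input within the allowed range.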

Time complexity. We need to compute the d roots of O(n) polynomials, where the size of each coefficient is O(α + dβ). The worst-case time complexity of this step is

(10) Õ(n d³ (α + dβ)).

4.8 Correctness of Algorithm 1

Theorem 1

Given n evaluations of a degree-d monotonic integer polynomial p(x), our algorithm will output a polynomial p′(x) ∈ ℕ[x] that satisfies the bound requirements on the inputs and polynomial coefficients as stated in Problem Statement 1.

From Lemmas 1 and 2, (x_i − x_j) < 2^β is a positive divisor of (y_i − y_j). As described in Section 4.1, we obtain all divisors < 2^β of the y-differences, one of which is the respective (yet to be determined) x-difference. Next, by forming the consistent set as described in Section 4.2, we try to eliminate divisors that are "unlikely" to be x-differences. Valid x-differences are guaranteed to be in the consistent set, since given two guessed differences (x_i − x_1) and (x_j − x_1), the quantity (x_i − x_1) − (x_j − x_1) = (x_i − x_j) is another guessed difference that should be present in the divisor set of (y_i − y_j). Tuples of guessed differences are formed by picking one consistent element from each row of Column 1 of Matrix D, as given in Section 4.1. One of these tuples will be the correct set of x-differences. We then use each tuple to obtain a candidate polynomial using Lagrange interpolation as described in Section 4.5, then find x_1 as described in Section 4.6, followed by verification against all outputs. At the end of Algorithm 1, we isolate the set of guessed differences with respect to x_1 and output a polynomial p′(x) such that p′(x_i) = y_i for the recovered inputs x_i, 1 ≤ i ≤ n. Hence, as long as we are successful in obtaining all divisors of the y-differences, our algorithm will output a polynomial satisfying all requirements.

4.9 Consolidated time complexity of the polynomial reconstruction algorithm

Theorem 2

Our algorithm for the reconstruction of a degree-d monotonic integer polynomial, when given only the outputs of the polynomial evaluated at a sufficient number of inputs as stated in Problem Statement 1, has a worst-case time complexity of Õ(d² (α + dβ) β 2^β + n d³ (α + dβ) ψ^d).

For each of the steps of our polynomial reconstruction algorithm described earlier, we provided either a worst-case or a (heuristic) expected-case analysis of the running time. However, a tight, rigorous analysis of some steps appears difficult, as they deal with, for instance, the distribution of the divisors of polynomial outputs evaluated at arbitrarily chosen inputs. Hence, we could only provide a heuristic bound on the expected running time.

By using (3) and (4), and considering that (7), (8), (9), and (10) need to be executed ψ^d times in the worst case, we obtain the total heuristic expected running time of Algorithm 1 (in terms of bit operations) as follows:

(11) Õ(d² (α + dβ) β 2^β + n d³ (α + dβ) ψ^d).

Remark. We do not consider the time complexity of Section 4.3 in the consolidated time analysis in (11), since it is not enabled for experiments involving large parameters. Equation (5) in Section 4.3 gives the time complexity of the RecursiveConsistencyCheck procedure for one tuple. When it is included as part of Algorithm 1, it needs to be executed ψ^d times in the worst case, making the total heuristic expected running time of Algorithm 1 as follows:

(12) Õ(ψ^d (d² (α + dβ) β 2^β + d³ ψ² β)).

4.10 On the uniqueness of the satisfying polynomial when coefficients are uniform randomly sampled

We recall that after executing Algorithm 1, we recover a satisfying polynomial and the corresponding input values x_i that satisfy the given output values y_i. As remarked earlier, it may be possible to recover the x_i values only up to an affine scaling. This is because the outputs of p(x) evaluated at the x_i are identical to those of p(x + c) (for c ∈ ℤ) evaluated at x_i − c, respectively.

We consider here the case when the coefficients a_i (0 ≤ i ≤ d) of p(x) = a_0 + a_1 x + a_2 x² + ⋯ + a_d x^d are picked uniformly at random and independently from the range [1, 2^α − 1]. We show below, within certain constraints, that the value of x_1 obtained as described in Section 4.6, along with the guessed differences that result in a polynomial satisfying all conditions, is, with high probability, the original integer input used to compute the polynomial outputs. Let x_p denote the value of x_1 obtained after the execution of Step 2 of Section 4.6. We examine below the coefficients that a satisfying polynomial would need when evaluated at integer values less than and greater than x_p, namely, (x_p − c) for 0 < c < x_p and (x_p + c) for 0 < c < 2^β − x_p.

4.10.1 Evaluation at (x_p − c)

Theorem 3

Given p(x) ∈ ℕ[x] with p(x) = a_0 + a_1 x + a_2 x² + ⋯ + a_d x^d and Y = {p(x_p), p(x_p + δ_1), p(x_p + δ_2), …}, the probability that at least one coefficient of a satisfying polynomial must go beyond the coefficient range [1, 2^α − 1] in order to obtain Y when evaluated at x_p − c, x_p + δ_1 − c, x_p + δ_2 − c, …, where c ∈ ℤ⁺, is at least 1 − 1/(dc).

Proof

The polynomial p(x) evaluated at (x_p − c) is

p(x_p − c) = a_d (x_p − c)^d + a_{d−1} (x_p − c)^{d−1} + ⋯ + a_0
           = a_d x_p^d + (a_{d−1} − dc·a_d) x_p^{d−1} + ⋯ + Σ_{i=0}^{d} (−1)^i a_i c^i.

Let T_{d−1} be the coefficient of x_p^{d−1} in this expansion. Then T_{d−1} > 0 only when a_{d−1} > dc·a_d.

We look at the probability that a_{d−1} must lie beyond the coefficient range [1, 2^α − 1] when a_{d−1} > dc·a_d: P(a_{d−1} ≥ 2^α) ≥ P(dc·a_d ≥ 2^α) = P(a_d ≥ 2^α/(dc)). Since the coefficients are picked uniformly at random from the range [1, 2^α − 1],

P(a_{d−1} ≥ 2^α) ≥ ((2^α − 1) − 2^α/(dc)) / (2^α − 2) ≈ 1 − 1/(dc).□

From Theorem 3, we see that even for polynomials of small degree, say d = 10, for T_{d−1} to be positive we need a_{d−1} ≥ 2^α with high probability, which contradicts our assumption that all the coefficients of p(x) lie within the given coefficient range.

4.10.2 Evaluation at (x_p + c)

Theorem 4

Given p(x) ∈ ℕ[x] with p(x) = a_0 + a_1 x + a_2 x² + ⋯ + a_d x^d and Y = {p(x_p), p(x_p + δ_1), p(x_p + δ_2), …}, the probability that at least one coefficient of a satisfying polynomial must go beyond the coefficient range [1, 2^α − 1] in order to obtain Y when evaluated at x_p + c, x_p + δ_1 + c, x_p + δ_2 + c, …, where c ∈ ℤ⁺, is at least 1 − 1/(dc).

Proof

The polynomial p(x) evaluated at (x_p + c) is

p(x_p + c) = a_d (x_p + c)^d + a_{d−1} (x_p + c)^{d−1} + ⋯ + a_0
           = a_d x_p^d + (dc·a_d + a_{d−1}) x_p^{d−1} + ⋯ + Σ_{i=0}^{d} a_i c^i.

Let T_{d−1} be the coefficient of x_p^{d−1} in this expansion:

T_{d−1} = dc·a_d + a_{d−1} ⟹ T_{d−1} ≥ dc·a_d.

Hence, P(T_{d−1} ≥ 2^α) ≥ P(dc·a_d ≥ 2^α) = P(a_d ≥ 2^α/(dc)). Since the coefficients are picked uniformly at random from the range [1, 2^α − 1],

P(T_{d−1} ≥ 2^α) ≥ ((2^α − 1) − 2^α/(dc)) / (2^α − 2) ≈ 1 − 1/(dc).□

From Theorem 4, we find that P(T_{d−1} ≥ 2^α) will be high even for polynomials of small degree, say d = 10.
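
The 1 − 1/(dc) bound of Theorems 3 and 4 can be verified exhaustively for a small toy α (the parameters below are illustrative, not the paper's experimental values):

```python
# Exhaustive check of the Theorem 3/4 bound: a uniformly random
# a_d in [1, 2^alpha - 1] satisfies d*c*a_d >= 2^alpha (forcing the
# shifted polynomial's x^{d-1} coefficient out of range) with
# probability at least 1 - 1/(d*c).
alpha, d, c = 16, 10, 1
total = 2 ** alpha - 1
hits = sum(1 for a in range(1, 2 ** alpha) if d * c * a >= 2 ** alpha)
assert hits / total >= 1 - 1 / (d * c)
```

The fraction of leading coefficients that force an out-of-range companion coefficient already exceeds 0.9 for d = 10 and c = 1, matching the remark above.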

Remark. In our experiments, we have been able to verify that the recovered polynomial matched the one used to obtain the initial set of outputs.

5 Experiments and results

SageMath 8.6 [40] was used to implement the procedures described in Section 4. Our source code is available at [41]. All our experiments were run on a Lenovo ThinkStation P920 workstation with a 2.2 GHz Intel Xeon processor with 20 cores; however, our implementation is not multithreaded and hence does not explicitly exploit the available parallelism. We classify our experiments into two sets depending on how the input x_i values were sampled: either uniformly at random or from a real dataset. The coefficients a_i were always chosen uniformly at random from the underlying set. The goal is to recover the inputs and the polynomial. The second set uses data from the UCI Machine Learning Repository [26], a real-world dataset obtained from a hospital in Caracas, Venezuela. For this case, we chose the degree of the polynomial d = 9, as in ref. [5].

5.1 Experiments with random values

In our experiments with random values, the polynomial coefficients a_i were chosen uniformly at random in the range [1, 2^α − 1], with α = 128. We ran a number of experiments for different values of β and polynomial degrees. For β up to 28, where we use the sieve method to obtain factors < 2^β, we ran experiments for polynomials of degrees d = 12, 24, 48, and 96. For degrees d = 12 and 24, we also ran the procedure RecursiveConsistencyCheck described in Section 4.3. For β values of 32 and 64, where we first use sieving to find factors < 2^28 followed by the ECM factoring method for larger factors < 2^β, we ran experiments for polynomials of degrees 12 and 24. We chose the x_i values uniformly at random from [0, 2^β − 1] and computed n = 2d polynomial outputs. Table 2 presents the time taken for our algorithm (Algorithm 1) to run to completion. In each trial, our algorithm successfully recovered the input polynomial and the input values. The recovery time is given in seconds and is averaged over 5 runs in each case. As described in Section 4.1, we use the SageMath function ECM.find_factor() to find factors. This likely returns the next smallest factor (prime in most cases). Since we need all factors < 2^β, we call this function iteratively. At each invocation, we start a timer with an empirically chosen timeout whose value is likely sufficient to obtain the biggest prime factor < 2^β; the iteration ends when we hit the timeout. Table 3 presents the timeout values and the percentage of runs in which a suitable polynomial was not recovered, over about 200 trials for different values of β. It is to be noted that there is a direct correlation between failure of polynomial recovery and the timeout occurring before all factors < 2^β were obtained by ECM.find_factor(). Based on this, we set a timeout value of 120 seconds in all our experiments for β ≤ 64.
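The basic operation in this step is collecting every divisor of an output difference that lies below 2^β. The following is a simplified stand-in for that step (our own sketch using plain trial division; the paper's implementation uses sieving plus ECM for large β):

```python
def divisors_below(n, bound):
    """Return all positive divisors of n that are < bound, via trial division."""
    small = set()
    i = 1
    while i * i <= n:
        if n % i == 0:
            if i < bound:
                small.add(i)           # the small divisor of the pair
            if n // i < bound:
                small.add(n // i)      # its cofactor, if also below the bound
        i += 1
    return sorted(small)

# Example: the divisors of a difference p(x_i) - p(x_j) below 2^beta
# form one entry D_{i,j} of the divisor matrix.
print(divisors_below(5040, 2**5))
```

Trial division suffices only for small β; this is precisely why the sieve is swapped for ECM once β grows past 28 bits.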

Table 3

Percentage of tests where a suitable polynomial was not recovered in about 200 trials where ECM.find_factor() was used to obtain factors. Timeout indicates how long ECM.find_factor() was allowed to run for each factor before aborting

β (bits) Timeout (s) Percentage of tests where a suitable polynomial was not recovered
64 5 95
64 10 85
64 30 10
64 60 0
64 120 0
80 60 25
80 120 10
80 180 8
96 60 40
96 120 37
96 180 38

In the experiments, n polynomial output values were input to our algorithm. The choice of n = 2d was based on observations from the experiments. For the instances using the parameters mentioned in ref. [5], we could bound the number of divisors to fewer than 32, thereby making the search space smaller than 32^10. We then used the divisor set and (d + 1) polynomial outputs to compute a candidate polynomial using Lagrange interpolation, which we then successfully verified against the remaining (n − d − 1) output values. We note that our search space is significantly smaller than the estimate of 2^160 in ref. [5]. For instances with larger parameters, we could bound the number of divisors to fewer than 200. In the procedure CheckConsistencyInColumn described in Section 4.2, as soon as any divisor turns out not to belong to the consistent set, it is discarded by setting it to 0, thereby reducing the number of divisors to be checked. It also appears that many further optimizations could be made to reduce the search space.
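The interpolate-then-verify step can be sketched as follows (our own illustration with a toy polynomial and hypothetical helper names, not the paper's SageMath code): interpolate a candidate polynomial through d + 1 recovered (x, y) points using exact rational arithmetic, then check it against the remaining outputs.

```python
from fractions import Fraction

def lagrange_coeffs(points):
    """Return coefficients [c_0, ..., c_d] of the interpolating polynomial."""
    d = len(points) - 1
    coeffs = [Fraction(0)] * (d + 1)
    for i, (xi, yi) in enumerate(points):
        # Build the i-th Lagrange basis polynomial as a low-to-high coefficient list.
        basis = [Fraction(1)]
        denom = Fraction(1)
        for j, (xj, _) in enumerate(points):
            if j != i:
                basis = [Fraction(0)] + basis          # multiply by x
                for k in range(len(basis) - 1):
                    basis[k] -= xj * basis[k + 1]      # ... then subtract xj * (old basis)
                denom *= xi - xj
        for k in range(len(basis)):
            coeffs[k] += yi * basis[k] / denom
    return coeffs

def evaluate(coeffs, x):
    return sum(c * x**k for k, c in enumerate(coeffs))

p = lambda x: 3 + 5*x + 2*x**3                 # toy degree-3 "secret" polynomial
xs = [1, 2, 4, 7]                              # candidate recovered inputs (d + 1 of them)
cand = lagrange_coeffs([(x, p(x)) for x in xs])
# Verify the candidate against extra outputs, as done with the other n - d - 1 values.
assert all(evaluate(cand, x) == p(x) for x in [9, 13, 21])
```

Exact rational arithmetic matters here: a candidate divisor tuple is rejected precisely when interpolation yields non-integer coefficients or the verification equalities fail.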

5.2 Experiments with real-world data

We used the cervical cancer (risk factors) dataset, one of the datasets used in ref. [5], available from the UCI Machine Learning Repository [26]. This dataset consists of information pertaining to 858 patients, each with 32 attributes comprising demographic information, habits, and historic medical records. The dataset had a few missing values due to privacy concerns; these were set to 0. Values with a fractional part were rounded to the nearest integer. β = 24 was sufficient to encode this data. We repeated the experiment with different random degree d = 9 polynomials and were able to recover the original polynomial successfully each time. We also tested with 16-, 20-, 24-, and 32-bit values of α and tabulated the time taken by SageMath to compute the polynomial in each case. Execution time is given in seconds and is averaged over five runs in each case.

Our results invalidate the security claims in ref. [5, Theorem 4.2] regarding the leakage profile for Server B. We use the parameters suggested in ref. [5, Section 4.2], i.e., d = 9, α = 16, and the (squared plaintext) distances in the range [0, 2^β − 1], where β = 24. For these parameters, with only n = 20 output values, we could recover the coefficients of the polynomial in a few seconds, as given in Table 1.

Because of the random re-ordering of the distances, Server B will not learn the exact distance of the query point to a specified point (say, the i-th point in the original order). Nevertheless, in many real-world scenarios the dataset is publicly available, and this, perhaps together with other auxiliary information, may potentially be used in combination with our results to leak information about the query point.

6 Attack on the secure k -NN protocol in the noisy setting

In this section, we give another attack on the protocol of ref. [5] for the case where one tries to defeat our attack from Section 4 by perturbing the polynomial outputs with noisy error terms. This modified protocol is not mentioned in ref. [5], but we consider it here for completeness.

In the original solution given in ref. [5], to hide the ED values, Server A chooses a monotonic polynomial, homomorphically evaluates this polynomial on its computed distances, and permutes the order before sending the results to Server B. If, instead of sending these (encrypted) polynomial outputs as is, they are perturbed with some noise such that the ordering is still maintained, our attack in Section 4 can no longer recover the polynomial or the inputs, as the attack relies on the exact differences of the polynomial outputs. It is easy to see that the error value can only be as large as the sum of all the coefficients except the constant term. Let p(x) = a_0 + a_1·x + ⋯ + a_d·x^d be the chosen monotonic polynomial; then p(0) = a_0, p(1) = a_0 + a_1 + ⋯ + a_d, and the maximum value of the added noise must be less than (p(1) − p(0)) so as to maintain the original ordering of the polynomial outputs, meaning the perturbation error may only be chosen from the range [0, a_1 + ⋯ + a_d). This safe choice of the error term is necessary because the polynomial output values are encrypted, and hence it is not possible for Server A to inspect a value and choose the error term accordingly. The range of perturbation error terms still depends on the size of the coefficient space, which can potentially be very large (unlike the plaintext space, which is assumed to be small).
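The bound above is easy to sanity-check. The following sketch (our own illustration, not part of ref. [5]) draws noise from [0, a_1 + ⋯ + a_d) and confirms that the output ordering of a monotonic integer polynomial on distinct non-negative integer inputs is unchanged:

```python
import random

random.seed(1)  # reproducible example

coeffs = [7, 3, 5, 2]                      # a_0..a_3, all positive => monotonic on x >= 0
p = lambda x: sum(a * x**k for k, a in enumerate(coeffs))
noise_bound = sum(coeffs[1:])              # p(1) - p(0) = a_1 + ... + a_d

xs = sorted(random.sample(range(1000), 50))            # distinct integer inputs
noisy = [p(x) + random.randrange(noise_bound) for x in xs]
assert noisy == sorted(noisy)              # ordering of the outputs is preserved
```

The key fact is that for distinct non-negative integers x < y, p(y) − p(x) ≥ p(1) − p(0), so any error below that gap cannot swap two outputs.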

Theorem 5

Given a set of noisy outputs of a degree-d monotonic integer polynomial, where the outputs are perturbed with some random noise from the range [0, Σ_{k=1}^{d} a_k), Server B will be able to leak ratios of the inputs.

Proof

Let two of the values that Server B obtains after decryption be F(x_i) = p(x_i) + e_i and F(x_j) = p(x_j) + e_j, where e_i and e_j are the random error terms such that 0 ≤ e_i, e_j < Σ_{k=1}^{d} a_k, with 1 ≤ a_k < 2^α and F(x_i), F(x_j) > 0.

Consider the ratio F(x_i)/F(x_j) with 0 ≤ x_j ≤ x_i < 2^β. Bounding the error terms by Σ_{k=1}^{d} a_k,

(13) F(x_i)/F(x_j) ≈ (Σ_{k=0}^{d} a_k + a_1·x_i + ⋯ + a_d·x_i^d) / (Σ_{k=0}^{d} a_k + a_1·x_j + ⋯ + a_d·x_j^d).

When x_i and x_j are sufficiently large, we obtain that the ratio in (13) is approximately close to (x_i/x_j)^d.□

The attack presented in Section 4 will not work in this new setting, where the outputs are perturbed using bounded random noise, because the attack relies on the exact differences of the polynomial outputs. In the noisy setting, Theorem 5 shows that it is still possible to leak ratios of the inputs to Server B, by taking the d-th root of F(x_i)/F(x_j) ≈ (x_i/x_j)^d, although recovering the exact values (even up to an affine scaling) may be challenging. Still, a lot more information about the inputs is leaked than just a single bit. Note also that if the error terms e_k were not significantly smaller than the leading terms (which, fortunately, is not the case), then we would not be able to recover the ratios.
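The leakage in Theorem 5 can be illustrated numerically. In this sketch (our own toy instance, using the paper's parameter shapes but arbitrary seeded values), the d-th root of the ratio of two noisy outputs closely approximates the true input ratio x_i/x_j = 4:

```python
import random

random.seed(2)  # reproducible toy instance

d, alpha = 9, 16
coeffs = [random.randint(1, 2**alpha - 1) for _ in range(d + 1)]   # a_0..a_d
p = lambda x: sum(a * x**k for k, a in enumerate(coeffs))
noise_bound = sum(coeffs[1:])              # errors drawn from [0, a_1 + ... + a_d)

xi, xj = 4_000_000, 1_000_000
Fi = p(xi) + random.randrange(noise_bound)
Fj = p(xj) + random.randrange(noise_bound)

leaked = (Fi / Fj) ** (1 / d)              # Server B's estimate of x_i / x_j
print(leaked)                               # close to the true ratio 4
```

The approximation error shrinks as the inputs grow, since the lower-order terms and the bounded noise become negligible next to a_d·x^d.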

7 Conclusion and future work

In this article, we give an attack on the protocol of ref. [5] for securely computing k-NN on SHE encrypted data. This attack is based on our algorithm for integer polynomial reconstruction given only the integer outputs. While it is not possible, using only the outputs, to accurately determine the coefficients or the inputs, we show that we can feasibly recover the inputs (up to an affine scaling) when the number of outputs is much larger than the degree of the polynomial. Our experiments were conducted both on uniformly randomly selected values and on a real-world dataset. Since many datasets are available in the public domain, it may be possible for an adversary to derive more information about the exact values using our method together with other available metadata.

Our method for polynomial reconstruction runs in time exponential in the plaintext size β and in the degree d of the chosen polynomial. In many real-world scenarios, both these parameters will be small. Future work can look at a lower-bound analysis of the time required for this polynomial reconstruction problem, and at relaxing the practical limits on α and β. The requirement that the integer polynomial be monotonic was needed by ref. [5] because the polynomial outputs must follow the input ordering to allow sorting of the inputs. Relaxing this requirement from the viewpoint of polynomial recovery can be investigated as part of future work. Finally, an FHE solution that can perform efficient sorting and searching on large datasets would eliminate the need for service providers to be entrusted with decryption keys, thereby providing a more secure cloud computation environment.

Acknowledgments

We thank Debdeep Mukhopadhyay and Sikhar Patranabis for helpful discussions. We also thank V. N. Muralidhara for pointing out the literature on the d -MST problem. We thank the anonymous reviewer for the invaluable comments and suggestions, which helped us improve the manuscript.

  1. Funding information: This work was funded partly by Sonata Software Limited, Bengaluru, India, partly by the INSPIRE Faculty Award (DST, Govt. of India) and the Infosys Foundation Career Development Chair Professorship grant for Srinivas Vivek and partly by the Centre for Internet of Ethical Things (IIIT Bangalore).

  2. Conflict of interest: The authors state no conflict of interest.

Appendix

A Polynomial reconstruction algorithm

Algorithm 1: Integer polynomial reconstruction from only the integer outputs
Procedure Main({p(x_1), …, p(x_n)}, ψ):
  Inputs: p(x_i), the polynomial outputs in ascending order; ψ, the bound on the number of divisors
  Outputs: all recovered polynomials p(x) and the corresponding x_1
  D = GuessTheDifference({p(x_1), …, p(x_n)}, ψ)
  D = CheckConsistencyInColumn(D)
  C, tuple_count = FormConsistentTuple(1)
  for i = 1 to tuple_count do
    * Optional loop; RecursiveConsistencyCheck() is enabled when executing for small polynomial degrees *
    discardFlag = RecursiveConsistencyCheck(1, C[i])
    if discardFlag is TRUE then
      discard C[i]; tuple_count −= 1
    end
  end
  χ[] ← column 1 of matrix Y
  for i = 1 to tuple_count do
    L = ApplyLagrangeInterpolation(C[i], χ)
    x_1, p = RecoverPossibleXValuesAndPoly(L)
    if x_1 is not failure then
      compute roots of p(x) = y_i for (d + 1) < i ≤ n   * for the other (n − d − 1) outputs *
      if no positive integer root < 2^β exists then
        discard the current tuple and move to the next one
      end
      output polynomial p and x_1
    end
  end
  return

Procedure GuessTheDifference({p(x_1), …, p(x_n)}, ψ):
  count = 0
  for i = 2 to n do
    y_val = p(x_i)
    for j = 1 to (i − 1) do
      y_{i,j} = y_val − p(x_j)   * p(x_i) > p(x_j) *
      use the sieve method to obtain all (positive) divisors of y_{i,j} that are < 2^β
      D_{i,j} ← set of all divisors of y_{i,j} < 2^β
      if |D_{i,j}| > ψ then
        discard the current (y_i, D_i) and pick the next output
        i += 1; j = 1; y_val = p(x_i)
        continue   * restart the inner loop *
      end
      count += 1
      if count == d + 1 then
        return D
      end
    end
  end
  return D   * matrix of divisor sets *

Procedure CheckConsistencyInColumn(matrix D):
  for col = 1 to (d − 1) do
    top_row = col + 1
    for i = (top_row + 1) to (d + 1) do
      forall f ≠ 0 in D_{i,col} do
        c_flag = 0
        for j = top_row to (i − 1) do
          forall g ≠ 0 in D_{j,col} do
            if (f − g) ∈ D_{i,col+1} then c_flag = 1
          end
        end
        if c_flag == 0 then set f = 0 in D_{i,col}
      end
    end
    * Check consistency of the topmost non-zero element in the column *
    forall g ≠ 0 in D_{top_row,col} do
      for i = (top_row + 1) to (d + 1) do
        c_flag = 0
        forall f ≠ 0 in D_{i,col} do
          if (f − g) ∈ D_{i,col+1} then c_flag = 1
        end
        if c_flag == 0 then set g = 0 in D_{top_row,col}
      end
    end
  end
  return D   * matrix of divisors with consistent elements in each column *
Procedure FormConsistentTuple(i: column id of matrix D):
  tuple_count = 0
  initialize array C[][] to zero
  * Examine elements in column i of matrix D *
  forall non-zero elements e in D_{i+1,i} do
    tuple_count += 1
    append e to C[tuple_count]
    for j = (i + 1) to (d + 1) do
      find the element f consistent with e in D_{j,i}
      append f to C[tuple_count]
    end
  end
  return C, tuple_count

Procedure RecursiveConsistencyCheck(Col_id, divisor tuple τ):
  ret = FALSE
  if Col_id < degree d then
    y_val = Y[Col_id + 1][Col_id]
    for i = (Col_id + 1) to d do
      New_y = (Y[i][Col_id] − y_val) / τ[i]
      Div_new ← divisors of New_y that are < 2^β
      Div_int = Div_new ∩ D_{i,Col_id}
      D_{i,Col_id} ← Div_int
      Y[i][Col_id] = New_y
      τ, tuple_count = FormConsistentTuple(i)
      if tuple_count == 0 then
        * Tuple τ leads to an inconsistent state and can be discarded *
        return TRUE
      end
    end
    Col_id += 1
    ret = RecursiveConsistencyCheck(Col_id, τ)
    return ret
  else
    return ret
  end
Procedure ApplyLagrangeInterpolation(CT: tuple of d + 1 differences, χ):
  declare x_1 as a variable
  for i = 0 to d do
    * Express each difference in terms of x_1 *
    δ_i = x_1 + CT[i + 1] − CT[1]
  end
  δ = [δ_0, δ_1, …, δ_d]
  apply Lagrange interpolation on (δ, χ) to get a degree-d polynomial L in x; the coefficients L_0, L_1, …, L_d of L are polynomials in x_1, such that the coefficient of x is a degree-d polynomial in x_1 and that of x^d is a constant integer
  return L

Procedure RecoverPossibleXValuesAndPoly(L):
  * The coefficients L_0, …, L_d of L are polynomials in x_1 *
  for i = 0 to (d − 1) do
    R_{i,0} ← all real roots of L_i(x) = 0 in [0, 2^β − 1]
    R_{i,α} ← all real roots of L_i(x) = 2^α − 1 in [0, 2^β − 1]
    sort the elements of R_i = R_{i,0} ∪ R_{i,α}
    S_i = ∅; let a be the first element of R_i
    if 0 ∉ R_i and 0 < L_i(a/2) < 2^α − 1 then S_i = S_i ∪ {[0, a]}
    for every pair of consecutive elements a, b in R_i do
      if a is an integer then S_i = S_i ∪ {[a, a]}
      if a ∈ R_{i,0} and b ∈ R_{i,0} and 0 < L_i((a + b)/2) < 2^α − 1 then S_i = S_i ∪ {[a, b]}
      if a ∈ R_{i,0} and b ∈ R_{i,α} then S_i = S_i ∪ {[a, b]}
      if a ∈ R_{i,α} and b ∈ R_{i,0} then S_i = S_i ∪ {[a, b]}
      if a ∈ R_{i,α} and b ∈ R_{i,α} and 0 < L_i((a + b)/2) < 2^α − 1 then S_i = S_i ∪ {[a, b]}
      if b is an integer then S_i = S_i ∪ {[b, b]}
    end
    let b be the last element of R_i
    if 2^β − 1 ∉ R_i and 0 < L_i((b + 2^β − 1)/2) < 2^α − 1 then S_i = S_i ∪ {[b, 2^β − 1]}
  end
  while every S_i is non-empty do
    let S_{i,j} denote the j-th interval in S_i
    find k such that S_k has the least end-point in its first interval
    compute I = S_{0,1} ∩ S_{1,1} ∩ ⋯ ∩ S_{d−1,1}
    if I is non-empty and I contains an integer x in [0, 2^β − 1] then
      * Found x *
      compute the polynomial p whose coefficients are L_0(x), …, L_d(x)
      return x, p
    else
      remove S_{k,1} from S_k
    end
  end
  return failure

References

[1] Guruswami V, Sudan M. Improved decoding of Reed–Solomon and algebraic-geometry codes. IEEE Trans Inform Theory. 1999;45(6):1757–67. doi: 10.1109/18.782097.

[2] Naor M, Pinkas B. Oblivious transfer and polynomial evaluation. In: Vitter JS, Larmore LL, Leighton FT, editors. Proceedings of the Thirty-First Annual ACM Symposium on Theory of Computing. Atlanta, Georgia, USA: ACM; 1999. p. 245–54. doi: 10.1145/301250.301312.

[3] Berlekamp E. Algebraic coding theory. Vol. 8. New York: McGraw-Hill; 1968.

[4] Goldreich O, Rubinfeld R, Sudan M. Learning polynomials with queries: the highly noisy case. SIAM J Discrete Math. 2000;13(4):535–70. doi: 10.1109/SFCS.1995.492485.

[5] Kesarwani M, Kaul A, Naldurg P, Patranabis S, Singh G, Mehta S, et al. Efficient secure k-nearest neighbors over encrypted data. In: Proceedings of the 21st International Conference on Extending Database Technology, EDBT 2018. Vienna, Austria; March 26–29, 2018. p. 564–75.

[6] Sudan M. List decoding: algorithms and applications. SIGACT News. 2000;31(1):16–27. doi: 10.1007/3-540-44929-9_3.

[7] Gopalan P, Khot S, Saket R. Hardness of reconstructing multivariate polynomials over finite fields. In: 48th Annual IEEE Symposium on Foundations of Computer Science (FOCS’07). 2007. p. 349–59. doi: 10.1109/FOCS.2007.37.

[8] Kiayias A, Yung M. Directions in polynomial reconstruction based cryptography. IEICE Trans Fundamentals Electronics Commun Comput Sci. 2004;87:978–85.

[9] Bleichenbacher D, Nguyen PQ. Noisy polynomial interpolation and noisy Chinese remaindering. In: Preneel B, editor. Advances in Cryptology – EUROCRYPT 2000. Berlin, Heidelberg: Springer; 2000. p. 53–69. doi: 10.1007/3-540-45539-6_4.

[10] Kordos M, Blachnik M, Strzempa D. Do we need whatever more than k-NN? In: Rutkowski L, Scherer R, Tadeusiewicz R, Zadeh LA, Zurada JM, editors. Artificial Intelligence and Soft Computing. Berlin, Heidelberg: Springer; 2010. p. 414–21. doi: 10.1007/978-3-642-13208-7_52.

[11] Pujianto U, Wibawa AP, Akbar MI. K-nearest neighbor (k-NN) based missing data imputation. In: 2019 5th International Conference on Science in Information Technology (ICSITech). IEEE; 2019. p. 83–88.

[12] Choi S, Ghinita G, Lim H-S, Bertino E. Secure kNN query processing in untrusted cloud environments. IEEE Trans Knowledge Data Eng. 2014;26:2818–31. doi: 10.1109/TKDE.2014.2302434.

[13] Songhori EM, Hussain SU, Sadeghi A-R, Koushanfar F. Compacting privacy-preserving k-nearest neighbor search using logic synthesis. In: 2015 52nd ACM/EDAC/IEEE Design Automation Conference (DAC). 2015. p. 1–6. doi: 10.1145/2744769.2744808.

[14] Elmehdwi Y, Samanthula BK, Jiang W. Secure k-nearest neighbor query over encrypted data in outsourced environments. In: IEEE 30th International Conference on Data Engineering (ICDE), Chicago, IL, USA, March 31–April 4, 2014. p. 664–75. doi: 10.1109/ICDE.2014.6816690.

[15] Gentry C. Fully homomorphic encryption using ideal lattices. In: Mitzenmacher M, editor. 41st Annual ACM Symposium on Theory of Computing. Bethesda, MD, USA: ACM Press; 2009. p. 169–78. doi: 10.1145/1536414.1536440.

[16] Gentry C, Sahai A, Waters B. Homomorphic encryption from learning with errors: conceptually-simpler, asymptotically-faster, attribute-based. In: Canetti R, Garay JA, editors. Advances in Cryptology – CRYPTO 2013, Part I. Vol. 8042 of Lecture Notes in Computer Science. Heidelberg: Springer; 2013. p. 75–92. doi: 10.1007/978-3-642-40041-4_5.

[17] Chillotti I, Gama N, Georgieva M, Izabachène M. TFHE: fast fully homomorphic encryption over the torus. J Cryptol. 2020;33(1):34–91. doi: 10.1007/s00145-019-09319-x.

[18] Çetin GS, Doröz Y, Sunar B, Savas E. Depth optimized efficient homomorphic sorting. In: Lauter KE, Rodríguez-Henríquez F, editors. Progress in Cryptology – LATINCRYPT 2015. Vol. 9230 of Lecture Notes in Computer Science. Heidelberg: Springer; 2015. p. 61–80. doi: 10.1007/978-3-319-22174-8_4.

[19] Çetin GS, Sunar B. Homomorphic rank sort using surrogate polynomials. In: Lange T, Dunkelman O, editors. Progress in Cryptology – LATINCRYPT 2017. Cham: Springer; 2019. p. 311–26. doi: 10.1007/978-3-030-25283-0_17.

[20] Chatterjee A, Sengupta I. Searching and sorting of fully homomorphic encrypted data on cloud. IACR Cryptol ePrint Arch. 2015;2015:981.

[21] Aono Y, Hayashi T, Phong LT, Wang L. Scalable and secure logistic regression via homomorphic encryption. In: Bertino E, Sandhu R, Pretschner A, editors. Proceedings of the Sixth ACM Conference on Data and Application Security and Privacy, CODASPY 2016. New Orleans, LA, USA: ACM; 2016. p. 142–4. doi: 10.1145/2857705.2857731.

[22] Hardy S, Henecka W, Ivey-Law H, Nock R, Patrini G, Smith G, et al. Private federated learning on vertically partitioned data via entity resolution and additively homomorphic encryption. 2017. arXiv, abs/171110677.

[23] Luo Y, Jia X, Fu S, Xu M. pRide: privacy-preserving ride matching over road networks for online ride-hailing service. IEEE Trans Inform Forensics Security. 2019;14(7):1791–802. doi: 10.1109/TIFS.2018.2885282.

[24] Yu H, Jia X, Zhang H, Yu X, Shu J. PSRide: privacy-preserving shared ride matching for online ride hailing systems. IEEE Trans Dependable Secure Comput. 2021;18(3):1425–40. doi: 10.1109/TDSC.2019.2931295.

[25] Brakerski Z, Gentry C, Vaikuntanathan V. (Leveled) fully homomorphic encryption without bootstrapping. ACM Trans Comput Theory. 2014;6(3):1–36. doi: 10.1145/2090236.2090262.

[26] Dua D, Graff C. UCI Machine Learning Repository. 2017. https://archive.ics.uci.edu/ml/index.php.

[27] Boldyreva A, Chenette N, Lee Y, O’Neill A. Order-preserving symmetric encryption. In: Joux A, editor. Advances in Cryptology – EUROCRYPT 2009. Berlin, Heidelberg: Springer; 2009. p. 224–41. doi: 10.1007/978-3-642-01001-9_13.

[28] Boneh D, Lewi K, Raykova M, Sahai A, Zhandry M, Zimmerman J. Semantically secure order-revealing encryption: multi-input functional encryption without obfuscation. In: Oswald E, Fischlin M, editors. Advances in Cryptology – EUROCRYPT 2015, Part II. Vol. 9057 of Lecture Notes in Computer Science. Heidelberg: Springer; 2015. p. 563–94. doi: 10.1007/978-3-662-46803-6_19.

[29] Lewi K, Wu DJ. Order-revealing encryption: new constructions, applications, and lower bounds. In: Weippl ER, Katzenbeisser S, Kruegel C, Myers AC, Halevi S, editors. ACM CCS 2016: 23rd Conference on Computer and Communications Security. Vienna, Austria: ACM Press; 2016. p. 1167–78. doi: 10.1145/2976749.2978376.

[30] Costache A, Smart NP, Vivek S, Waller A. Fixed-point arithmetic in SHE schemes. In: Avanzi R, Heys HM, editors. SAC 2016: 23rd Annual International Workshop on Selected Areas in Cryptography. Vol. 10532 of Lecture Notes in Computer Science. Heidelberg: Springer; 2016. p. 401–22. doi: 10.1007/978-3-319-69453-5_22.

[31] Cheon JH, Kim A, Kim M, Song YS. Homomorphic encryption for arithmetic of approximate numbers. In: Takagi T, Peyrin T, editors. Advances in Cryptology – ASIACRYPT 2017, Part I. Vol. 10624 of Lecture Notes in Computer Science. Springer; 2017. p. 409–37. doi: 10.1007/978-3-319-70694-8_15.

[32] Li B, Micciancio D. On the security of homomorphic encryption on approximate numbers. In: Canteaut A, Standaert F-X, editors. Advances in Cryptology – EUROCRYPT 2021. Cham: Springer; 2021. p. 648–77. doi: 10.1007/978-3-030-77870-5_23.

[33] Murthy S, Vivek S. Cryptanalysis of a protocol for efficient sorting on SHE encrypted data. In: Albrecht M, editor. Cryptography and Coding – 17th IMA International Conference, IMACC 2019. Vol. 11929 of Lecture Notes in Computer Science. Oxford, UK: Springer; 2019. p. 278–94. doi: 10.1007/978-3-030-35199-1_14.

[34] Kumaraswamy D, Murthy S, Vivek S. Revisiting driver anonymity in ORide. In: Selected Areas in Cryptography: 28th International Conference, Virtual Event, September 29–October 1, 2021, Revised Selected Papers. Berlin, Heidelberg: Springer; 2021. p. 25–46. doi: 10.1007/978-3-030-99277-4_2.

[35] Sage 9.1 Reference Manual. The elliptic curve factorization method. 2019. https://doc.sagemath.org/html/en/reference/interfaces/sage/interfaces/ecm.html. Last accessed: June 01, 2020.

[36] Tao T. Blog: the divisor bound. 2008. https://terrytao.wordpress.com/2008/09/23/the-divisor-bound/. Last accessed: July 19, 2021.

[37] Ravi R, Sundaram R, Marathe MV, Rosenkrantz DJ, Ravi SS. Spanning trees – short or small. SIAM J Discrete Math. 1996;9(2):178–200. doi: 10.1137/S0895480194266331.

[38] Sagraloff M. When Newton meets Descartes: a simple and fast algorithm to isolate the real roots of a polynomial. In: Proceedings of the 37th International Symposium on Symbolic and Algebraic Computation (ISSAC ’12). New York, NY, USA; 2012. p. 297–304. doi: 10.1145/2442829.2442872.

[39] Kobel A, Rouillier F, Sagraloff M. Computing real roots of real polynomials … and now for real! In: Proceedings of the ACM International Symposium on Symbolic and Algebraic Computation (ISSAC ’16). New York, NY, USA; 2016. p. 303–10. doi: 10.1145/2930889.2930937.

[40] Stein W. Sage Mathematics Software (Version 8.6). The Sage Development Team. 2019. http://www.sagemath.org.

[41] GitHub. SAGE code for polynomial recovery. 2019. https://github.com/shyamsmurthy/knn_polynomial_recovery. Last accessed: March 21, 2021.

Received: 2021-12-17
Revised: 2022-04-04
Accepted: 2022-07-06
Published Online: 2022-08-18

© 2022 Srinivas Vivek et al., published by De Gruyter

This work is licensed under the Creative Commons Attribution 4.0 International License.
