-
Rigidity matroids and linear algebraic matroids with applications to matrix completion and tensor codes
Authors:
Joshua Brakensiek,
Manik Dhar,
Jiyang Gao,
Sivakanth Gopi,
Matt Larson
Abstract:
We establish a connection between problems studied in rigidity theory and matroids arising from linear algebraic constructions like tensor products and symmetric products. A special case of this correspondence identifies the problem of giving a description of the correctable erasure patterns in a maximally recoverable tensor code with the problem of describing bipartite rigid graphs or low-rank completable matrix patterns. Additionally, we relate dependencies among symmetric products of generic vectors to graph rigidity and symmetric matrix completion. With an eye toward applications to computer science, we study the dependency of these matroids on the characteristic by giving new combinatorial descriptions in several cases, including the first description of the correctable patterns in an (m, n, a=2, b=2) maximally recoverable tensor code.
Submitted 1 May, 2024;
originally announced May 2024.
-
Differentially Private Synthetic Data via Foundation Model APIs 2: Text
Authors:
Chulin Xie,
Zinan Lin,
Arturs Backurs,
Sivakanth Gopi,
Da Yu,
Huseyin A Inan,
Harsha Nori,
Haotian Jiang,
Huishuai Zhang,
Yin Tat Lee,
Bo Li,
Sergey Yekhanin
Abstract:
Text data has become extremely valuable due to the emergence of machine learning algorithms that learn from it. A lot of high-quality text data generated in the real world is private and therefore cannot be shared or used freely due to privacy concerns. Generating synthetic replicas of private text data with a formal privacy guarantee, i.e., differential privacy (DP), offers a promising and scalable solution. However, existing methods necessitate DP finetuning of large language models (LLMs) on private data to generate DP synthetic data. This approach is not viable for proprietary LLMs (e.g., GPT-3.5) and also demands considerable computational resources for open-source LLMs. Lin et al. (2024) recently introduced the Private Evolution (PE) algorithm to generate DP synthetic images with only API access to diffusion models. In this work, we propose an augmented PE algorithm, named Aug-PE, that applies to the complex setting of text. We use API access to an LLM and generate DP synthetic text without any model training. We conduct comprehensive experiments on three benchmark datasets. Our results demonstrate that Aug-PE produces DP synthetic text that yields competitive utility with the SOTA DP finetuning baselines. This underscores the feasibility of relying solely on API access of LLMs to produce high-quality DP synthetic texts, thereby facilitating more accessible routes to privacy-preserving LLM applications. Our code and data are available at https://github.com/AI-secure/aug-pe.
Submitted 23 July, 2024; v1 submitted 4 March, 2024;
originally announced March 2024.
-
AG Codes Achieve List-decoding Capacity over Constant-sized Fields
Authors:
Joshua Brakensiek,
Manik Dhar,
Sivakanth Gopi,
Zihan Zhang
Abstract:
The recently-emerging field of higher order MDS codes has sought to unify a number of concepts in coding theory. Such areas captured by higher order MDS codes include maximally recoverable (MR) tensor codes, codes with optimal list-decoding guarantees, and codes with constrained generator matrices (as in the GM-MDS theorem).
By proving these equivalences, Brakensiek-Gopi-Makam showed the existence of optimally list-decodable Reed-Solomon codes over exponential sized fields. Building on this, recent breakthroughs by Guo-Zhang and Alrabiah-Guruswami-Li have shown that randomly punctured Reed-Solomon codes achieve list-decoding capacity (which is a relaxation of optimal list-decodability) over linear size fields. We extend these works by developing a formal theory of relaxed higher order MDS codes. In particular, we show that there are two inequivalent relaxations which we call lower and upper relaxations. The lower relaxation is equivalent to relaxed optimal list-decodable codes and the upper relaxation is equivalent to relaxed MR tensor codes with a single parity check per column.
We then generalize the techniques of GZ and AGL to show that both these relaxations can be constructed over constant size fields by randomly puncturing suitable algebraic-geometric codes. For this, we crucially use the generalized GM-MDS theorem for polynomial codes recently proved by Brakensiek-Dhar-Gopi. We obtain the following corollaries from our main result. First, randomly punctured AG codes of rate $R$ achieve list-decoding capacity with list size $O(1/ε)$ and field size $\exp(O(1/ε^2))$. Prior to this work, AG codes were not even known to achieve list-decoding capacity. Second, by randomly puncturing AG codes, we can construct relaxed MR tensor codes with a single parity check per column over constant-sized fields, whereas (non-relaxed) MR tensor codes require exponential field size.
Submitted 12 August, 2024; v1 submitted 19 October, 2023;
originally announced October 2023.
-
Generalized GM-MDS: Polynomial Codes are Higher Order MDS
Authors:
Joshua Brakensiek,
Manik Dhar,
Sivakanth Gopi
Abstract:
The GM-MDS theorem, conjectured by Dau-Song-Dong-Yuen and proved by Lovett and Yildiz-Hassibi, shows that the generator matrices of Reed-Solomon codes can attain every possible configuration of zeros for an MDS code. The recently emerging theory of higher order MDS codes has connected the GM-MDS theorem to other important properties of Reed-Solomon codes, including showing that Reed-Solomon codes can achieve list decoding capacity, even over fields of size linear in the message length.
A few works have extended the GM-MDS theorem to other families of codes, including Gabidulin and skew polynomial codes. In this paper, we generalize all these previous results by showing that the GM-MDS theorem applies to any polynomial code, i.e., a code where the columns of the generator matrix are obtained by evaluating linearly independent polynomials at different points. We also show that the GM-MDS theorem applies to dual codes of such polynomial codes, which is non-trivial since the dual of a polynomial code may not be a polynomial code. More generally, we show that the GM-MDS theorem also holds for algebraic codes (and their duals) where columns of the generator matrix are chosen to be points on some irreducible variety which is not contained in a hyperplane through the origin. Our generalization has applications to constructing capacity-achieving list-decodable codes as shown in a follow-up work by Brakensiek-Dhar-Gopi-Zhang, where it is proved that randomly punctured algebraic-geometric (AG) codes achieve list-decoding capacity over constant-sized fields.
Submitted 12 August, 2024; v1 submitted 19 October, 2023;
originally announced October 2023.
-
Privacy-Preserving In-Context Learning with Differentially Private Few-Shot Generation
Authors:
Xinyu Tang,
Richard Shin,
Huseyin A. Inan,
Andre Manoel,
Fatemehsadat Mireshghallah,
Zinan Lin,
Sivakanth Gopi,
Janardhan Kulkarni,
Robert Sim
Abstract:
We study the problem of in-context learning (ICL) with large language models (LLMs) on private datasets. This scenario poses privacy risks, as LLMs may leak or regurgitate the private examples demonstrated in the prompt. We propose a novel algorithm that generates synthetic few-shot demonstrations from the private dataset with formal differential privacy (DP) guarantees, and show empirically that it can achieve effective ICL. We conduct extensive experiments on standard benchmarks and compare our algorithm with non-private ICL and zero-shot solutions. Our results demonstrate that our algorithm can achieve competitive performance with strong privacy levels. These results open up new possibilities for ICL with privacy protection for a broad range of applications.
Submitted 27 January, 2024; v1 submitted 20 September, 2023;
originally announced September 2023.
-
Textbooks Are All You Need
Authors:
Suriya Gunasekar,
Yi Zhang,
Jyoti Aneja,
Caio César Teodoro Mendes,
Allie Del Giorno,
Sivakanth Gopi,
Mojan Javaheripi,
Piero Kauffmann,
Gustavo de Rosa,
Olli Saarikivi,
Adil Salim,
Shital Shah,
Harkirat Singh Behl,
Xin Wang,
Sébastien Bubeck,
Ronen Eldan,
Adam Tauman Kalai,
Yin Tat Lee,
Yuanzhi Li
Abstract:
We introduce phi-1, a new large language model for code, with significantly smaller size than competing models: phi-1 is a Transformer-based model with 1.3B parameters, trained for 4 days on 8 A100s, using a selection of "textbook quality" data from the web (6B tokens) and synthetically generated textbooks and exercises with GPT-3.5 (1B tokens). Despite this small scale, phi-1 attains pass@1 accuracy 50.6% on HumanEval and 55.5% on MBPP. It also displays surprising emergent properties compared to phi-1-base, our model before our finetuning stage on a dataset of coding exercises, and phi-1-small, a smaller model with 350M parameters trained with the same pipeline as phi-1 that still achieves 45% on HumanEval.
Submitted 2 October, 2023; v1 submitted 20 June, 2023;
originally announced June 2023.
-
Differentially Private Synthetic Data via Foundation Model APIs 1: Images
Authors:
Zinan Lin,
Sivakanth Gopi,
Janardhan Kulkarni,
Harsha Nori,
Sergey Yekhanin
Abstract:
Generating differentially private (DP) synthetic data that closely resembles the original private data is a scalable way to mitigate privacy concerns in the current data-driven world. In contrast to current practices that train customized models for this task, we aim to generate DP Synthetic Data via APIs (DPSDA), where we treat foundation models as blackboxes and only utilize their inference APIs. Such API-based, training-free approaches are easier to deploy as exemplified by the recent surge in the number of API-based apps. These approaches can also leverage the power of large foundation models which are only accessible via their inference APIs. However, this comes with greater challenges due to strictly more restrictive model access and the need to protect privacy from the API provider.
In this paper, we present a new framework called Private Evolution (PE) to solve this problem and show its initial promise on synthetic images. Surprisingly, PE can match or even outperform state-of-the-art (SOTA) methods without any model training. For example, on CIFAR10 (with ImageNet as the public data), we achieve FID <= 7.9 with privacy cost ε = 0.67, significantly improving the previous SOTA from ε = 32. We further demonstrate the promise of applying PE on large foundation models such as Stable Diffusion to tackle challenging private datasets with a small number of high-resolution images. The code and data are released at https://github.com/microsoft/DPSDA.
Submitted 29 February, 2024; v1 submitted 24 May, 2023;
originally announced May 2023.
-
Selective Pre-training for Private Fine-tuning
Authors:
Da Yu,
Sivakanth Gopi,
Janardhan Kulkarni,
Zinan Lin,
Saurabh Naik,
Tomasz Lukasz Religa,
Jian Yin,
Huishuai Zhang
Abstract:
Text prediction models, when used in applications like email clients or word processors, must protect user data privacy and adhere to model size constraints. These constraints are crucial to meet memory and inference time requirements, as well as to reduce inference costs. Building small, fast, and private domain-specific language models is a thriving area of research. In this work, we show that a careful pre-training on a \emph{subset} of the public dataset that is guided by the private dataset is crucial to train small language models with differential privacy. On standard benchmarks, small models trained with our new framework achieve state-of-the-art performance. In addition to performance improvements, our results demonstrate that smaller models, through careful pre-training and private fine-tuning, can match the performance of much larger models that do not have access to private data. This underscores the potential of private learning for model compression and enhanced efficiency.
Submitted 2 July, 2024; v1 submitted 23 May, 2023;
originally announced May 2023.
-
Algorithmic Aspects of the Log-Laplace Transform and a Non-Euclidean Proximal Sampler
Authors:
Sivakanth Gopi,
Yin Tat Lee,
Daogao Liu,
Ruoqi Shen,
Kevin Tian
Abstract:
The development of efficient sampling algorithms catering to non-Euclidean geometries has been a challenging endeavor, as discretization techniques which succeed in the Euclidean setting do not readily carry over to more general settings. We develop a non-Euclidean analog of the recent proximal sampler of [LST21], which naturally induces regularization by an object known as the log-Laplace transform (LLT) of a density. We prove new mathematical properties (with an algorithmic flavor) of the LLT, such as strong convexity-smoothness duality and an isoperimetric inequality, which are used to prove a mixing time on our proximal sampler matching [LST21] under a warm start. As our main application, we show our warm-started sampler improves the value oracle complexity of differentially private convex optimization in $\ell_p$ and Schatten-$p$ norms for $p \in [1, 2]$ to match the Euclidean setting [GLL22], while retaining state-of-the-art excess risk bounds [GLLST23]. We find our investigation of the LLT to be a promising proof-of-concept of its utility as a tool for designing samplers, and outline directions for future exploration.
Submitted 22 February, 2023; v1 submitted 12 February, 2023;
originally announced February 2023.
-
A construction of Maximally Recoverable LRCs for small number of local groups
Authors:
Manik Dhar,
Sivakanth Gopi
Abstract:
Maximally Recoverable Local Reconstruction Codes (LRCs) are codes designed for distributed storage to provide maximum resilience to failures for a given amount of storage redundancy and locality. An $(n,r,h,a,g)$-MR LRC has $n$ coordinates divided into $g$ local groups of size $r=n/g$, where each local group has `$a$' local parity checks and there are an additional `$h$' global parity checks. Such a code can correct `$a$' erasures in each local group and any $h$ additional erasures. Constructions of MR LRCs over small fields are desirable since field size determines the encoding and decoding efficiency in practice. In this work, we give a new construction of $(n,r,h,a,g)$-MR-LRCs over fields of size $q=O(n)^{h+(g-1)a-\lceil h/g\rceil}$ which generalizes a construction of Hu and Yekhanin (ISIT 2016). This improves upon the state of the art when there are a small number of local groups, which is true in practical deployments of MR LRCs.
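To make the parameters above concrete, the following minimal sketch evaluates the exponent in the stated field-size bound $q=O(n)^{h+(g-1)a-\lceil h/g\rceil}$ and the local group size $r=n/g$. The function names are hypothetical and chosen only for illustration; the constant hidden in $O(n)$ is not specified in the abstract, so only the exponent is computed.

```python
def mr_lrc_field_exponent(h: int, a: int, g: int) -> int:
    """Exponent e in the bound q = O(n)^e, with e = h + (g-1)*a - ceil(h/g).

    h: number of global parity checks
    a: number of local parity checks per local group
    g: number of local groups
    """
    if g < 1 or h < 0 or a < 0:
        raise ValueError("need g >= 1 and h, a >= 0")
    ceil_h_over_g = -(-h // g)  # integer ceiling, avoids floating point
    return h + (g - 1) * a - ceil_h_over_g


def local_group_size(n: int, g: int) -> int:
    """Size r = n/g of each local group; n must be divisible by g."""
    if g < 1 or n % g != 0:
        raise ValueError("n must be a positive multiple of g")
    return n // g


# Illustrative parameters (not taken from the paper):
# n = 60 coordinates, g = 3 local groups, a = 1, h = 2.
print(local_group_size(60, 3))         # r = 20
print(mr_lrc_field_exponent(2, 1, 3))  # 2 + 2*1 - 1 = 3
```

For a fixed $h$ and $a$, the exponent grows linearly in $g$, which is consistent with the abstract's remark that the bound is most useful when the number of local groups is small.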
Submitted 11 May, 2023; v1 submitted 22 December, 2022;
originally announced December 2022.
-
Improved Field Size Bounds for Higher Order MDS Codes
Authors:
Joshua Brakensiek,
Manik Dhar,
Sivakanth Gopi
Abstract:
Higher order MDS codes are an interesting generalization of MDS codes recently introduced by Brakensiek, Gopi and Makam (IEEE Trans. Inf. Theory 2022). In later works, they were shown to be intimately connected to optimally list-decodable codes and maximally recoverable tensor codes. Therefore, finding (explicit) constructions of higher order MDS codes over small fields is an important open problem. Higher order MDS codes are denoted by $\operatorname{MDS}(\ell)$, where $\ell$ denotes the order of generality; $\operatorname{MDS}(2)$ codes are equivalent to the usual MDS codes. The best prior lower bound on the field size of an $(n,k)$-$\operatorname{MDS}(\ell)$ code is $Ω_\ell(n^{\ell-1})$, whereas the best known (non-explicit) upper bound is $O_\ell(n^{k(\ell-1)})$, which is exponential in the dimension.
In this work, we nearly close this exponential gap between upper and lower bounds. We show that an $(n,k)$-$\operatorname{MDS}(3)$ code requires a field of size $Ω_k(n^{k-1})$, which is close to the known upper bound. Using the connection between higher order MDS codes and optimally list-decodable codes, we show that even for a list size of 2, a code which meets the optimal list-decoding Singleton bound requires exponential field size; this resolves an open question from Shangguan and Tamo (STOC 2020 / SIAM J. on Computing 2023).
We also give explicit constructions of $(n,k)$-$\operatorname{MDS}(\ell)$ codes over fields of size $n^{(\ell k)^{O(\ell k)}}$. The smallest non-trivial case where we still do not have optimal constructions is $(n,3)$-$\operatorname{MDS}(3)$. In this case, the known lower bound on the field size is $Ω(n^2)$ and the best known upper bounds are $O(n^5)$ for a non-explicit construction and $O(n^{32})$ for an explicit construction. In this paper, we give an explicit construction over fields of size $O(n^3)$ which comes very close to being optimal.
Submitted 21 August, 2024; v1 submitted 21 December, 2022;
originally announced December 2022.
-
Private Convex Optimization in General Norms
Authors:
Sivakanth Gopi,
Yin Tat Lee,
Daogao Liu,
Ruoqi Shen,
Kevin Tian
Abstract:
We propose a new framework for differentially private optimization of convex functions which are Lipschitz in an arbitrary norm $\|\cdot\|$. Our algorithms are based on a regularized exponential mechanism which samples from the density $\propto \exp(-k(F+μr))$ where $F$ is the empirical loss and $r$ is a regularizer which is strongly convex with respect to $\|\cdot\|$, generalizing a recent work of [Gopi, Lee, Liu '22] to non-Euclidean settings. We show that this mechanism satisfies Gaussian differential privacy and solves both DP-ERM (empirical risk minimization) and DP-SCO (stochastic convex optimization) by using localization tools from convex geometry. Our framework is the first to apply to private convex optimization in general normed spaces and directly recovers non-private SCO rates achieved by mirror descent as the privacy parameter $ε\to \infty$. As applications, for Lipschitz optimization in $\ell_p$ norms for all $p \in (1, 2)$, we obtain the first optimal privacy-utility tradeoffs; for $p = 1$, we improve tradeoffs obtained by the recent works [Asi, Feldman, Koren, Talwar '21, Bassily, Guzman, Nandi '21] by at least a logarithmic factor. Our $\ell_p$ norm and Schatten-$p$ norm optimization frameworks are complemented with polynomial-time samplers whose query complexity we explicitly bound.
Submitted 10 November, 2022; v1 submitted 17 July, 2022;
originally announced July 2022.
-
Generic Reed-Solomon Codes Achieve List-decoding Capacity
Authors:
Joshua Brakensiek,
Sivakanth Gopi,
Visu Makam
Abstract:
In a recent paper, Brakensiek, Gopi and Makam introduced higher order MDS codes as a generalization of MDS codes. An order-$\ell$ MDS code, denoted by $\operatorname{MDS}(\ell)$, has the property that any $\ell$ subspaces formed from columns of its generator matrix intersect as minimally as possible. An independent work by Roth defined a different notion of higher order MDS codes as those achieving a generalized Singleton bound for list-decoding. In this work, we show that these two notions of higher order MDS codes are (nearly) equivalent.
We also show that generic Reed-Solomon codes are $\operatorname{MDS}(\ell)$ for all $\ell$, relying crucially on the GM-MDS theorem which shows that generator matrices of generic Reed-Solomon codes achieve any possible zero pattern. As a corollary, this implies that generic Reed-Solomon codes achieve list decoding capacity. More concretely, we show that, with high probability, a random Reed-Solomon code of rate $R$ over an exponentially large field is list decodable from radius $1-R-ε$ with list size at most $\frac{1-R-ε}ε$, resolving a conjecture of Shangguan and Tamo.
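As a small worked example of the list-size bound quoted above, the sketch below evaluates $\frac{1-R-ε}{ε}$ exactly using rational arithmetic. The helper name `list_size_bound` is hypothetical, and the parameter values are illustrative rather than taken from the paper.

```python
from fractions import Fraction


def list_size_bound(rate, eps):
    """Upper bound (1 - R - eps) / eps on the list size at radius 1 - R - eps.

    `rate` and `eps` may be Fractions or strings such as "1/2".
    """
    R = Fraction(rate)
    e = Fraction(eps)
    if not (R > 0 and e > 0 and R + e < 1):
        raise ValueError("need R > 0, eps > 0 and R + eps < 1")
    return (1 - R - e) / e


# Rate 1/2 and eps = 1/10: decoding radius 1 - 1/2 - 1/10 = 2/5,
# and the bound on the list size is (2/5) / (1/10) = 4.
print(list_size_bound("1/2", "1/10"))
```

The bound scales like $1/ε$ for a fixed rate, which matches the $O(1/ε)$ list sizes mentioned for capacity-achieving codes elsewhere in this listing.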
Submitted 28 August, 2024; v1 submitted 10 June, 2022;
originally announced June 2022.
-
Private Convex Optimization via Exponential Mechanism
Authors:
Sivakanth Gopi,
Yin Tat Lee,
Daogao Liu
Abstract:
In this paper, we study private optimization problems for non-smooth convex functions $F(x)=\mathbb{E}_i f_i(x)$ on $\mathbb{R}^d$. We show that modifying the exponential mechanism by adding an $\ell_2^2$ regularizer to $F(x)$ and sampling from $π(x)\propto \exp(-k(F(x)+μ\|x\|_2^2/2))$ recovers both the known optimal empirical risk and population loss under $(ε,δ)$-DP. Furthermore, we show how to implement this mechanism using $\widetilde{O}(n \min(d, n))$ queries to $f_i(x)$ for the DP-SCO where $n$ is the number of samples/users and $d$ is the ambient dimension. We also give a (nearly) matching lower bound $\widetilde{Ω}(n \min(d, n))$ on the number of evaluation queries.
Our results utilize the following tools that are of independent interest: (1) We prove Gaussian Differential Privacy (GDP) of the exponential mechanism if the loss function is strongly convex and the perturbation is Lipschitz. Our privacy bound is \emph{optimal} as it includes the privacy of Gaussian mechanism as a special case and is proved using the isoperimetric inequality for strongly log-concave measures. (2) We show how to sample from $\exp(-F(x)-μ\|x\|^2_2/2)$ for $G$-Lipschitz $F$ with $η$ error in total variation (TV) distance using $\widetilde{O}((G^2/μ) \log^2(d/η))$ unbiased queries to $F(x)$. This is the first sampler whose query complexity has \emph{polylogarithmic dependence} on both dimension $d$ and accuracy $η$.
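To illustrate the shape of the regularized density $π(x)\propto \exp(-k(F(x)+μ\|x\|_2^2/2))$, here is a one-dimensional sketch that discretizes it on a grid and draws a sample. This is only an illustration under strong assumptions: the grid discretization, the choice $f_i(x)=|x-d_i|$, and all parameter values are hypothetical, and this is not the sampler described in the abstract. It makes no privacy claim on its own.

```python
import math
import random


def regularized_density_on_grid(data, k, mu, lo=-3.0, hi=3.0, steps=601):
    """Discretize pi(x) ∝ exp(-k * (F(x) + mu * x^2 / 2)) on [lo, hi].

    F(x) is the average of the non-smooth convex losses |x - d_i|.
    Returns the grid points and the normalized probabilities.
    """
    xs = [lo + (hi - lo) * i / (steps - 1) for i in range(steps)]

    def F(x):
        return sum(abs(x - d) for d in data) / len(data)

    log_w = [-k * (F(x) + mu * x * x / 2) for x in xs]
    shift = max(log_w)  # subtract the max for numerical stability
    w = [math.exp(v - shift) for v in log_w]
    total = sum(w)
    return xs, [v / total for v in w]


def sample_from_grid(xs, probs, rng):
    """Draw one grid point according to the discretized density."""
    return rng.choices(xs, weights=probs, k=1)[0]


data = [0.0, 1.0, 2.0]
xs, probs = regularized_density_on_grid(data, k=10.0, mu=0.1)
rng = random.Random(0)
draw = sample_from_grid(xs, probs, rng)
mode = xs[probs.index(max(probs))]
print(mode, draw)  # the mode sits at the regularized minimizer (the median here)
```

Larger $k$ concentrates the density around the minimizer of $F(x)+μ x^2/2$, while smaller $k$ spreads it out; the abstract's results concern how this trade-off yields privacy and utility guarantees.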
Submitted 28 July, 2022; v1 submitted 1 March, 2022;
originally announced March 2022.
-
Differentially Private Fine-tuning of Language Models
Authors:
Da Yu,
Saurabh Naik,
Arturs Backurs,
Sivakanth Gopi,
Huseyin A. Inan,
Gautam Kamath,
Janardhan Kulkarni,
Yin Tat Lee,
Andre Manoel,
Lukas Wutschitz,
Sergey Yekhanin,
Huishuai Zhang
Abstract:
We give simpler, sparser, and faster algorithms for differentially private fine-tuning of large-scale pre-trained language models, which achieve the state-of-the-art privacy versus utility tradeoffs on many standard NLP tasks. We propose a meta-framework for this problem, inspired by the recent success of highly parameter-efficient methods for fine-tuning. Our experiments show that differentially private adaptations of these approaches outperform previous private algorithms in three important dimensions: utility, privacy, and the computational and memory cost of private training. On many commonly studied datasets, the utility of private models approaches that of non-private models. For example, on the MNLI dataset we achieve an accuracy of $87.8\%$ using RoBERTa-Large and $83.5\%$ using RoBERTa-Base with a privacy budget of $ε= 6.7$. In comparison, absent privacy constraints, RoBERTa-Large achieves an accuracy of $90.2\%$. Our findings are similar for natural language generation tasks. When privately fine-tuned on DART, GPT-2-Small, GPT-2-Medium, GPT-2-Large, and GPT-2-XL achieve BLEU scores of 38.5, 42.0, 43.1, and 43.8, respectively (privacy budget of $ε= 6.8$, $δ= 10^{-5}$), whereas the non-private baseline is $48.1$. All our experiments suggest that larger models are better suited for private fine-tuning: while they are well known to achieve superior accuracy non-privately, we find that they also better maintain their accuracy when privacy is introduced.
Submitted 14 July, 2022; v1 submitted 13 October, 2021;
originally announced October 2021.
-
Differentially Private n-gram Extraction
Authors:
Kunho Kim,
Sivakanth Gopi,
Janardhan Kulkarni,
Sergey Yekhanin
Abstract:
We revisit the problem of $n$-gram extraction in the differential privacy setting. In this problem, given a corpus of private text data, the goal is to release as many $n$-grams as possible while preserving user level privacy. Extracting $n$-grams is a fundamental subroutine in many NLP applications such as sentence completion, response generation for emails etc. The problem also arises in other applications such as sequence mining, and is a generalization of recently studied differentially private set union (DPSU). In this paper, we develop a new differentially private algorithm for this problem which, in our experiments, significantly outperforms the state-of-the-art. Our improvements stem from combining recent advances in DPSU, privacy accounting, and new heuristics for pruning in the tree-based approach initiated by Chen et al. (2012).
Submitted 5 August, 2021;
originally announced August 2021.
-
Lower Bounds for Maximally Recoverable Tensor Code and Higher Order MDS Codes
Authors:
Joshua Brakensiek,
Sivakanth Gopi,
Visu Makam
Abstract:
An $(m,n,a,b)$-tensor code consists of $m\times n$ matrices whose columns satisfy `$a$' parity checks and rows satisfy `$b$' parity checks (i.e., a tensor code is the tensor product of a column code and row code). Tensor codes are useful in distributed storage because a single erasure can be corrected quickly either by reading its row or column. Maximally Recoverable (MR) Tensor Codes, introduced by Gopalan et al., are tensor codes which can correct every erasure pattern that is information theoretically possible to correct. The main questions about MR Tensor Codes are characterizing which erasure patterns are correctable and obtaining explicit constructions over small fields.
In this paper, we study the important special case when $a=1$, i.e., the columns satisfy a single parity check equation. We introduce the notion of higher order MDS codes (MDS$(\ell)$ codes) which is an interesting generalization of the well-known MDS codes, where $\ell$ captures the order of genericity of points in a low-dimensional space. We then prove that a tensor code with $a=1$ is MR iff the row code is an MDS$(m)$ code. We then show that MDS$(m)$ codes satisfy some weak duality. Using this characterization and duality, we prove that $(m,n,a=1,b)$-MR tensor codes require fields of size $q=Ω_{m,b}(n^{\min\{b,m\}-1})$. Our lower bound also extends to the setting of $a>1$. We also give a deterministic polynomial time algorithm to check if a given erasure pattern is correctable by the MR tensor code (when $a=1$).
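As a small illustration of the definition above, the following sketch builds the simplest tensor code, with $a=b=1$ over $\mathbb{F}_2$ (every row and every column has even parity), and recovers a single erased entry by reading only its row or only its column. The function names and the choice of field are assumptions made for illustration; the paper itself concerns general parity checks over larger fields.

```python
from typing import List


def encode(data: List[List[int]]) -> List[List[int]]:
    """Extend an (m-1) x (n-1) bit matrix to an m x n codeword in which
    every row and every column has even parity (a = b = 1 over GF(2))."""
    rows = [r + [sum(r) % 2] for r in data]             # append a row-parity bit
    parity_row = [sum(col) % 2 for col in zip(*rows)]   # append a column-parity row
    return rows + [parity_row]


def recover_from_row(word: List[List[int]], i: int, j: int) -> int:
    """Recover the erased entry (i, j) from the other entries of row i."""
    return sum(v for k, v in enumerate(word[i]) if k != j) % 2


def recover_from_column(word: List[List[int]], i: int, j: int) -> int:
    """Recover the erased entry (i, j) from the other entries of column j."""
    return sum(row[j] for k, row in enumerate(word) if k != i) % 2
```

Either recovery function touches a single row or column, which is the fast single-erasure repair the abstract refers to.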
Submitted 2 December, 2022; v1 submitted 22 July, 2021;
originally announced July 2021.
-
Trellis BMA: Coded Trace Reconstruction on IDS Channels for DNA Storage
Authors:
Sundara Rajan Srinivasavaradhan,
Sivakanth Gopi,
Henry D. Pfister,
Sergey Yekhanin
Abstract:
Sequencing a DNA strand, as part of the read process in DNA storage, produces multiple noisy copies which can be combined to produce better estimates of the original strand; this is called trace reconstruction. One can reduce the error rate further by introducing redundancy in the write sequence and this is called coded trace reconstruction. In this paper, we model the DNA storage channel as an insertion-deletion-substitution (IDS) channel and design both encoding schemes and low-complexity decoding algorithms for coded trace reconstruction.
We introduce Trellis BMA, a new reconstruction algorithm whose complexity is linear in the number of traces, and compare its performance to previous algorithms. Our results show that it reduces the error rate on both simulated and experimental data. The performance comparisons in this paper are based on a new dataset of traces that will be publicly released with the paper. Our hope is that this dataset will enable research progress by allowing objective comparisons between candidate algorithms.
Submitted 20 August, 2024; v1 submitted 13 July, 2021;
originally announced July 2021.
-
Numerical Composition of Differential Privacy
Authors:
Sivakanth Gopi,
Yin Tat Lee,
Lukas Wutschitz
Abstract:
We give a fast algorithm to optimally compose privacy guarantees of differentially private (DP) algorithms to arbitrary accuracy. Our method uses privacy loss random variables to quantify the privacy loss of DP algorithms. The running time and memory needed for our algorithm to approximate the privacy curve of a DP algorithm composed with itself $k$ times are $\tilde{O}(\sqrt{k})$. This improves over the best prior method by Koskela et al. (2020), which requires $\tilde{Ω}(k^{1.5})$ running time. We demonstrate the utility of our algorithm by accurately computing the privacy loss of the DP-SGD algorithm of Abadi et al. (2016) and showing that our algorithm speeds up the privacy computations by a few orders of magnitude compared to prior work, while maintaining similar accuracy.
Submitted 26 October, 2021; v1 submitted 5 June, 2021;
originally announced June 2021.
-
Fast and Memory Efficient Differentially Private-SGD via JL Projections
Authors:
Zhiqi Bu,
Sivakanth Gopi,
Janardhan Kulkarni,
Yin Tat Lee,
Judy Hanwen Shen,
Uthaipon Tantipongpipat
Abstract:
Differentially Private-SGD (DP-SGD) of Abadi et al. (2016) and its variations are the only known algorithms for private training of large scale neural networks. This algorithm requires computation of per-sample gradient norms, which is extremely slow and memory intensive in practice. In this paper, we present a new framework to design differentially private optimizers called DP-SGD-JL and DP-Adam-JL. Our approach uses Johnson-Lindenstrauss (JL) projections to quickly approximate the per-sample gradient norms without exactly computing them, thus making the training time and memory requirements of our optimizers closer to those of their non-DP versions.
Unlike previous attempts to make DP-SGD faster, which work only on a subset of network architectures or use compiler techniques, we propose an algorithmic solution that works for any network in a black-box manner; this is the main contribution of this paper. To illustrate this, on the IMDb dataset, we train a Recurrent Neural Network (RNN) to achieve a good privacy-vs-accuracy tradeoff, while being significantly faster than DP-SGD and with a memory footprint similar to that of non-private SGD. The privacy analysis of our algorithms is more involved than that of DP-SGD; we use the recently proposed f-DP framework of Dong et al. (2019) to prove privacy.
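The idea of approximating a gradient norm from random projections can be sketched as follows. This is a minimal illustration, assuming the gradient is available as an explicit list of floats; it only shows the JL-style estimator $\|g\|^2 \approx \frac{1}{r}\sum_{j=1}^{r} \langle g, v_j\rangle^2$ for standard Gaussian $v_j$, not how the optimizers in the paper obtain the projections. The function name and parameters are hypothetical.

```python
import math
import random
from typing import Sequence


def jl_norm_estimate(grad: Sequence[float], num_projections: int,
                     rng: random.Random) -> float:
    """Estimate the Euclidean norm of `grad` from random Gaussian projections.

    For v with i.i.d. N(0, 1) entries, E[<grad, v>^2] = ||grad||^2, so the
    average of squared projections is an unbiased estimate of ||grad||^2.
    """
    total = 0.0
    for _ in range(num_projections):
        dot = sum(g * rng.gauss(0.0, 1.0) for g in grad)  # one random projection
        total += dot * dot
    return math.sqrt(total / num_projections)
```

More projections give a tighter estimate at a higher cost, so `num_projections` trades accuracy against speed.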
Submitted 5 February, 2021;
originally announced February 2021.
-
Improved Maximally Recoverable LRCs using Skew Polynomials
Authors:
Sivakanth Gopi,
Venkatesan Guruswami
Abstract:
An $(n,r,h,a,q)$-Local Reconstruction Code (LRC) is a linear code over $\mathbb{F}_q$ of length $n$, whose codeword symbols are partitioned into $n/r$ local groups each of size $r$. Each local group satisfies `$a$' local parity checks to recover from `$a$' erasures in that local group and there are further $h$ global parity checks to provide fault tolerance from more global erasure patterns. Such an LRC is Maximally Recoverable (MR), if it offers the best blend of locality and global erasure resilience -- namely it can correct all erasure patterns whose recovery is information-theoretically feasible given the locality structure (these are precisely patterns with up to `$a$' erasures in each local group and an additional $h$ erasures anywhere in the codeword).
Random constructions can easily show the existence of MR LRCs over very large fields, but a major algebraic challenge is to construct MR LRCs, or even show their existence, over smaller fields, as well as understand inherent lower bounds on their field size. We give an explicit construction of $(n,r,h,a,q)$-MR LRCs with field size $q$ bounded by $\left(O\left(\max\{r,n/r\}\right)\right)^{\min\{h,r-a\}}$. This improves upon known constructions in many relevant parameter ranges. Moreover, it matches the lower bound from Gopi et al. (2020) in an interesting range of parameters where $r=Θ(\sqrt{n})$, $r-a=Θ(\sqrt{n})$ and $h$ is a fixed constant with $h\le a+2$, achieving the optimal field size of $Θ_{h}(n^{h/2}).$
Our construction is based on the theory of skew polynomials. We believe skew polynomials should have further applications in coding and complexity theory; as a small illustration we show how to capture algebraic results underlying list decoding folded Reed-Solomon and multiplicity codes in a unified way within this theory.
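The erasure patterns that an MR LRC must correct, as described in the abstract (up to `$a$' erasures in each local group plus $h$ additional erasures anywhere), can be checked combinatorially. The sketch below assumes, purely for illustration, that local group $i$ consists of positions $ir,\ldots,(i+1)r-1$; the function name is hypothetical.

```python
from collections import Counter
from typing import Iterable


def is_mr_correctable(erased: Iterable[int], n: int, r: int, h: int, a: int) -> bool:
    """Return True if the erased positions (0-based) form a pattern with at
    most `a` erasures per local group plus at most `h` further erasures."""
    if n % r != 0:
        raise ValueError("r must divide n")
    positions = set(erased)
    if any(p < 0 or p >= n for p in positions):
        raise ValueError("erased position out of range")
    # Assumed layout: group i holds positions i*r, ..., (i+1)*r - 1.
    per_group = Counter(p // r for p in positions)
    # Erasures beyond the `a` that each group handles locally must be
    # covered by the h global parity checks.
    excess = sum(max(0, count - a) for count in per_group.values())
    return excess <= h
```

For example, with $n=12$, $r=4$, $a=1$, $h=2$, erasing three symbols of one group is allowed (one local, two global), while erasing all four is not.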
Submitted 18 May, 2022; v1 submitted 14 December, 2020;
originally announced December 2020.
-
Differentially Private Set Union
Authors:
Sivakanth Gopi,
Pankaj Gulhane,
Janardhan Kulkarni,
Judy Hanwen Shen,
Milad Shokouhi,
Sergey Yekhanin
Abstract:
We study the basic operation of set union in the global model of differential privacy. In this problem, we are given a universe $U$ of items, possibly of infinite size, and a database $D$ of users. Each user $i$ contributes a subset $W_i \subseteq U$ of items. We want an ($ε$,$δ$)-differentially private algorithm which outputs a subset $S \subset \cup_i W_i$ such that the size of $S$ is as large as possible. The problem arises in countless real world applications; it is particularly ubiquitous in natural language processing (NLP) applications as vocabulary extraction. For example, discovering words, sentences, $n$-grams etc., from private text data belonging to users is an instance of the set union problem.
Known algorithms for this problem proceed by collecting a subset of items from each user, taking the union of such subsets, and disclosing the items whose noisy counts fall above a certain threshold. Crucially, in the above process, the contribution of each individual user is always independent of the items held by other users, resulting in a wasteful aggregation process, where some item counts happen to be far above the threshold. We deviate from the above paradigm by allowing users to contribute their items in a $\textit{dependent fashion}$, guided by a $\textit{policy}$. In this new setting, ensuring privacy is significantly more delicate. We prove that any policy which has certain $\textit{contractive}$ properties results in a differentially private algorithm. We design two new algorithms, one using Laplace noise and the other Gaussian noise, as specific instances of policies satisfying the contractive properties. Our experiments show that the new algorithms significantly outperform previously known mechanisms for the problem.
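The baseline paradigm described at the start of this paragraph (each user contributes a bounded subset independently, counts are perturbed, and items above a threshold are released) can be sketched as below. This is an illustration of that baseline only, not of the policy-based algorithms the paper proposes. The function names are hypothetical, and `noise_scale` and `threshold` are left as inputs: values that actually give an ($ε$,$δ$) guarantee must come from a privacy analysis that is not reproduced here.

```python
import random
from collections import Counter
from typing import Iterable, List, Set


def laplace(scale: float, rng: random.Random) -> float:
    """Sample Laplace(0, scale) as a difference of two exponentials.
    scale == 0 disables noise (no privacy; only useful for testing)."""
    if scale == 0:
        return 0.0
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def threshold_set_union(user_items: Iterable[Iterable[str]], max_items: int,
                        noise_scale: float, threshold: float,
                        rng: random.Random) -> Set[str]:
    """Release items whose noisy count exceeds `threshold`."""
    counts: Counter = Counter()
    for items in user_items:
        # Each user picks at most `max_items` of their own items without
        # looking at anyone else's data (the "independent" paradigm).
        chosen: List[str] = sorted(set(items))
        rng.shuffle(chosen)
        for item in chosen[:max_items]:
            counts[item] += 1
    return {item for item, c in counts.items()
            if c + laplace(noise_scale, rng) > threshold}
```

Popular items end up with counts far above the threshold in this scheme, which is the waste that the dependent, policy-guided contributions aim to avoid.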
Submitted 6 April, 2022; v1 submitted 22 February, 2020;
originally announced February 2020.
-
Locally Private Hypothesis Selection
Authors:
Sivakanth Gopi,
Gautam Kamath,
Janardhan Kulkarni,
Aleksandar Nikolov,
Zhiwei Steven Wu,
Huanyu Zhang
Abstract:
We initiate the study of hypothesis selection under local differential privacy. Given samples from an unknown probability distribution $p$ and a set of $k$ probability distributions $\mathcal{Q}$, we aim to output, under the constraints of $\varepsilon$-local differential privacy, a distribution from $\mathcal{Q}$ whose total variation distance to $p$ is comparable to the best such distribution. This is a generalization of the classic problem of $k$-wise simple hypothesis testing, which corresponds to when $p \in \mathcal{Q}$, and we wish to identify $p$. Absent privacy constraints, this problem requires $O(\log k)$ samples from $p$, and it was recently shown that the same complexity is achievable under (central) differential privacy. However, the naive approach to this problem under local differential privacy would require $\tilde O(k^2)$ samples.
We first show that the constraint of local differential privacy incurs an exponential increase in cost: any algorithm for this problem requires at least $Ω(k)$ samples. Second, for the special case of $k$-wise simple hypothesis testing, we provide a non-interactive algorithm which nearly matches this bound, requiring $\tilde O(k)$ samples. Finally, we provide sequentially interactive algorithms for the general case, requiring $\tilde O(k)$ samples and only $O(\log \log k)$ rounds of interactivity. Our algorithms are achieved through a reduction to maximum selection with adversarial comparators, a problem of independent interest for which we initiate study in the parallel setting. For this problem, we provide a family of algorithms for each number of allowed rounds of interaction $t$, as well as lower bounds showing that they are near-optimal for every $t$. Notably, our algorithms result in exponential improvements on the round complexity of previous methods.
Submitted 19 June, 2020; v1 submitted 21 February, 2020;
originally announced February 2020.
-
CSPs with Global Modular Constraints: Algorithms and Hardness via Polynomial Representations
Authors:
Joshua Brakensiek,
Sivakanth Gopi,
Venkatesan Guruswami
Abstract:
We study the complexity of Boolean constraint satisfaction problems (CSPs) when the assignment must have Hamming weight in some congruence class modulo M, for various choices of the modulus M. Due to the known classification of tractable Boolean CSPs, this mainly reduces to the study of three cases: 2-SAT, HORN-SAT, and LIN-2 (linear equations mod 2). We classify the moduli M for which these respective problems are polynomial time solvable, and when they are not (assuming the ETH). Our study reveals that this modular constraint lends a surprising richness to these classic, well-studied problems, with interesting broader connections to complexity theory and coding theory. The HORN-SAT case is connected to the covering complexity of polynomials representing the NAND function mod M. The LIN-2 case is tied to the sparsity of polynomials representing the OR function mod M, which in turn has connections to modular weight distribution properties of linear codes and locally decodable codes. In both cases, the analysis of our algorithm as well as the hardness reduction rely on these polynomial representations, highlighting an interesting algebraic common ground between hard cases for our algorithms and the gadgets which show hardness. These new complexity measures of polynomial representations merit further study.
The inspiration for our study comes from a recent work by Nägele, Sudakov, and Zenklusen on submodular minimization with a global congruence constraint. Our algorithm for HORN-SAT has strong similarities to their algorithm, and in particular identical kinds of set systems arise in both cases. Our connection to polynomial representations leads to a simpler analysis of such set systems, and also sheds light on (but does not resolve) the complexity of submodular minimization with a congruency requirement modulo a composite M.
Submitted 12 February, 2019;
originally announced February 2019.
-
Spanoids - an abstraction of spanning structures, and a barrier for LCCs
Authors:
Zeev Dvir,
Sivakanth Gopi,
Yuzhou Gu,
Avi Wigderson
Abstract:
We introduce a simple logical inference structure we call a $\textsf{spanoid}$ (generalizing the notion of a matroid), which captures well-studied problems in several areas. These include combinatorial geometry, algebra (arrangements of hypersurfaces and ideals), statistical physics (bootstrap percolation) and coding theory. We initiate a thorough investigation of spanoids, from computational and structural viewpoints, focusing on parameters relevant to the applications areas above and, in particular, to questions regarding Locally Correctable Codes (LCCs).
One central parameter we study is the $\textsf{rank}$ of a spanoid, extending the rank of a matroid and related to the dimension of codes. This leads to one main application of our work, establishing the first known barrier to improving the nearly 20-year-old bound of Katz-Trevisan (KT) on the dimension of LCCs. On the one hand, we prove that the KT bound (and its more recent refinements) holds for the much more general setting of spanoid rank. On the other hand, we show that there exist (random) spanoids whose rank matches these bounds. Thus, to significantly improve the known bounds one must step out of the spanoid framework.
Another parameter we explore is the $\textsf{functional rank}$ of a spanoid, which captures the possibility of turning a given spanoid into an actual code. The question of the relationship between rank and functional rank is one of the main questions we raise as it may reveal new avenues for constructing new LCCs (perhaps even matching the KT bound). As a first step, we develop an entropy relaxation of functional rank to create a small constant gap and amplify it by tensoring to construct a spanoid whose functional rank is smaller than rank by a polynomial factor. This is evidence that the entropy method we develop can prove polynomially better bounds than KT-type methods on the dimension of LCCs.
Submitted 20 November, 2018; v1 submitted 27 September, 2018;
originally announced September 2018.
-
Gaussian width bounds with applications to arithmetic progressions in random settings
Authors:
Jop Briët,
Sivakanth Gopi
Abstract:
Motivated by problems on random differences in Szemerédi's theorem and on large deviations for arithmetic progressions in random sets, we prove upper bounds on the Gaussian width of point sets that are formed by the image of the $n$-dimensional Boolean hypercube under a mapping $ψ:\mathbb{R}^n\to\mathbb{R}^k$, where each coordinate is a constant-degree multilinear polynomial with 0-1 coefficients. We show the following applications of our bounds. Let $[\mathbb{Z}/N\mathbb{Z}]_p$ be the random subset of $\mathbb{Z}/N\mathbb{Z}$ containing each element independently with probability $p$.
$\bullet$ A set $D\subseteq \mathbb{Z}/N\mathbb{Z}$ is $\ell$-intersective if any dense subset of $\mathbb{Z}/N\mathbb{Z}$ contains a proper $(\ell+1)$-term arithmetic progression with common difference in $D$. Our main result implies that $[\mathbb{Z}/N\mathbb{Z}]_p$ is $\ell$-intersective with probability $1 - o(1)$ provided $p \geq ω(N^{-β_\ell}\log N)$ for $β_\ell = (\lceil(\ell+1)/2\rceil)^{-1}$. This gives a polynomial improvement for all $\ell \ge 3$ of a previous bound due to Frantzikinakis, Lesigne and Wierdl, and reproves more directly the same improvement shown recently by the authors and Dvir.
$\bullet$ Let $X_k$ be the number of $k$-term arithmetic progressions in $[\mathbb{Z}/N\mathbb{Z}]_p$ and consider the large deviation rate $ρ_k(δ) = \log\Pr[X_k \geq (1+δ)\mathbb{E}X_k]$. We give quadratic improvements of the best-known range of $p$ for which a highly precise estimate of $ρ_k(δ)$ due to Bhattacharya, Ganguly, Shao and Zhao is valid for all odd $k \geq 5$.
We also discuss connections with error correcting codes (locally decodable codes) and the Banach-space notion of type for injective tensor products of $\ell_p$-spaces.
Submitted 18 October, 2018; v1 submitted 15 November, 2017;
originally announced November 2017.
-
Maximally Recoverable LRCs: A field size lower bound and constructions for few heavy parities
Authors:
Sivakanth Gopi,
Venkatesan Guruswami,
Sergey Yekhanin
Abstract:
The explosion in the volumes of data being stored online has resulted in distributed storage systems transitioning to erasure coding based schemes. Local Reconstruction Codes (LRCs) have emerged as the codes of choice for these applications. These codes can correct a small number of erasures by accessing only a small number of remaining coordinates. An $(n,r,h,a,q)$-LRC is a linear code over $\mathbb{F}_q$ of length $n$, whose codeword symbols are partitioned into $g=n/r$ local groups each of size $r$. It has $h$ global parity checks and each local group has $a$ local parity checks. Such an LRC is Maximally Recoverable (MR), if it corrects all erasure patterns which are information-theoretically correctable under the stipulated structure of local and global parity checks.
We show the first non-trivial lower bounds on the field size required for MR LRCs. When $a,h$ are constant and the number of local groups $g \ge h$, while $r$ may grow with $n$, our lower bound simplifies to $q\ge Ω_{a,h}\left(n\cdot r^{\min\{a,h-2\}}\right).$ No superlinear (in $n$) lower bounds were known prior to this work for any setting of parameters.
MR LRCs deployed in practice have a small number of global parities, typically $h=2,3$. We complement our lower bounds by giving constructions with small field size for $h\le 3$. For $h=2$, we give a linear field size construction. We also show a surprising application of elliptic curves and arithmetic progression free sets in the construction of MR LRCs.
Submitted 15 November, 2018; v1 submitted 27 October, 2017;
originally announced October 2017.
-
Lower bounds for 2-query LCCs over large alphabet
Authors:
Arnab Bhattacharyya,
Sivakanth Gopi,
Avishay Tal
Abstract:
A locally correctable code (LCC) is an error correcting code that allows correction of any arbitrary coordinate of a corrupted codeword by querying only a few coordinates. We show that any zero-error $2$-query locally correctable code $\mathcal{C}: \{0,1\}^k \to Σ^n$ that can correct a constant fraction of corrupted symbols must have $n \geq \exp(k/\log|Σ|)$. We say that an LCC is zero-error if there exists a non-adaptive corrector algorithm that succeeds with probability $1$ when the input is an uncorrupted codeword. All known constructions of LCCs are zero-error.
Our result is tight up to constant factors in the exponent. The only previous lower bound on the length of 2-query LCCs over large alphabet was $Ω\left((k/\log|Σ|)^2\right)$ due to Katz and Trevisan (STOC 2000). Our bound implies that zero-error LCCs cannot yield $2$-server private information retrieval (PIR) schemes with sub-polynomial communication. Since there exists a $2$-server PIR scheme with sub-polynomial communication (STOC 2015) based on a zero-error $2$-query locally decodable code (LDC), we also obtain a separation between LDCs and LCCs over large alphabet.
For our proof of the result, we need a new decomposition lemma for directed graphs that may be of independent interest. Given a dense directed graph $G$, our decomposition uses the directed version of Szemerédi regularity lemma due to Alon and Shapira (STOC 2003) to partition almost all of $G$ into a constant number of subgraphs which are either edge-expanding or empty.
Submitted 28 April, 2017; v1 submitted 21 November, 2016;
originally announced November 2016.
-
Outlaw distributions and locally decodable codes
Authors:
Jop Briët,
Zeev Dvir,
Sivakanth Gopi
Abstract:
Locally decodable codes (LDCs) are error correcting codes that allow for decoding of a single message bit using a small number of queries to a corrupted encoding. Despite decades of study, the optimal trade-off between query complexity and codeword length is far from understood. In this work, we give a new characterization of LDCs using distributions over Boolean functions whose expectation is hard to approximate (in~$L_\infty$~norm) with a small number of samples. We coin the term `outlaw distributions' for such distributions since they `defy' the Law of Large Numbers. We show that the existence of outlaw distributions over sufficiently `smooth' functions implies the existence of constant query LDCs and vice versa. We give several candidates for outlaw distributions over smooth functions coming from finite field incidence geometry, additive combinatorics and from hypergraph (non)expanders.
We also prove a useful lemma showing that (smooth) LDCs which are only required to work on average over a random message and a random message index can be turned into true LDCs at the cost of only constant factors in the parameters.
Submitted 26 June, 2017; v1 submitted 20 September, 2016;
originally announced September 2016.
-
Competitive analysis of the top-K ranking problem
Authors:
Xi Chen,
Sivakanth Gopi,
Jieming Mao,
Jon Schneider
Abstract:
Motivated by applications in recommender systems, web search, social choice and crowdsourcing, we consider the problem of identifying the set of top $K$ items from noisy pairwise comparisons. In our setting, we are non-actively given $r$ pairwise comparisons between each pair of $n$ items, where each comparison has noise constrained by a very general noise model called the strong stochastic transitivity (SST) model. We analyze the competitive ratio of algorithms for the top-$K$ problem. In particular, we present a linear time algorithm for the top-$K$ problem which has a competitive ratio of $\tilde{O}(\sqrt{n})$; i.e., to solve any instance of top-$K$, our algorithm needs at most $\tilde{O}(\sqrt{n})$ times as many samples as the best possible algorithm for that instance (in contrast, all previously known algorithms for the top-$K$ problem have competitive ratios of $\tilde{Ω}(n)$ or worse). We further show that this is tight: any algorithm for the top-$K$ problem has competitive ratio at least $\tilde{Ω}(\sqrt{n})$.
Submitted 12 May, 2016;
originally announced May 2016.
-
Lower bounds for constant query affine-invariant LCCs and LTCs
Authors:
Arnab Bhattacharyya,
Sivakanth Gopi
Abstract:
Affine-invariant codes are codes whose coordinates form a vector space over a finite field and which are invariant under affine transformations of the coordinate space. They form a natural, well-studied class of codes; they include popular codes such as Reed-Muller and Reed-Solomon. A particularly appealing feature of affine-invariant codes is that they seem well-suited to admit local correctors and testers.
In this work, we give lower bounds on the length of locally correctable and locally testable affine-invariant codes with constant query complexity. We show that if a code $\mathcal{C} \subset Σ^{\mathbb{K}^n}$ is an $r$-query locally correctable code (LCC), where $\mathbb{K}$ is a finite field and $Σ$ is a finite alphabet, then the number of codewords in $\mathcal{C}$ is at most $\exp(O_{\mathbb{K}, r, |Σ|}(n^{r-1}))$. Also, we show that if $\mathcal{C} \subset Σ^{\mathbb{K}^n}$ is an $r$-query locally testable code (LTC), then the number of codewords in $\mathcal{C}$ is at most $\exp(O_{\mathbb{K}, r, |Σ|}(n^{r-2}))$. The dependence on $n$ in these bounds is tight for constant-query LCCs/LTCs, since Guo, Kopparty and Sudan (ITCS `13) construct affine-invariant codes via lifting that have the same asymptotic tradeoffs. Note that our result holds for non-linear codes, whereas previously, Ben-Sasson and Sudan (RANDOM `11) assumed linearity to derive similar results.
Our analysis uses higher-order Fourier analysis. In particular, we show that the codewords corresponding to an affine-invariant LCC/LTC must be far from each other with respect to the Gowers norm of an appropriate order. This then allows us to bound the number of codewords, using known decomposition theorems which approximate any bounded function in terms of a finite number of low-degree non-classical polynomials, up to a small error in the Gowers norm.
Submitted 23 November, 2015;
originally announced November 2015.
-
On the number of rich lines in truly high dimensional sets
Authors:
Zeev Dvir,
Sivakanth Gopi
Abstract:
We prove a new upper bound on the number of $r$-rich lines (lines with at least $r$ points) in a `truly' $d$-dimensional configuration of points $v_1,\ldots,v_n \in \mathbb{C}^d$. More formally, we show that, if the number of $r$-rich lines is significantly larger than $n^2/r^d$ then there must exist a large subset of the points contained in a hyperplane. We conjecture that the factor $r^d$ can be replaced with a tight $r^{d+1}$. If true, this would generalize the classic Szemerédi-Trotter theorem which gives a bound of $n^2/r^3$ on the number of $r$-rich lines in a planar configuration. This conjecture was shown to hold in $\mathbb{R}^3$ in the seminal work of Guth and Katz \cite{GK10} and was also recently proved over $\mathbb{R}^4$ (under some additional restrictions) \cite{SS14}. For the special case of arithmetic progressions ($r$ collinear points that are evenly distanced) we give a bound that is tight up to low order terms, showing that a $d$-dimensional grid achieves the largest number of $r$-term progressions.
The main ingredient in the proof is a new method to find a low-degree polynomial that vanishes on many of the rich lines. Unlike previous applications of the polynomial method, we do not find this polynomial by interpolation. The starting observation is that the degree $r-2$ Veronese embedding takes $r$ collinear points to $r$ linearly dependent images. Hence, each collinear $r$-tuple of points gives us a dependent $r$-tuple of images. We then use the design-matrix method of \cite{BDWY12} to convert these `local' linear dependencies into a global one, showing that all the images lie in a hyperplane. This then translates into a low-degree polynomial vanishing on the original set.
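The starting observation can be checked numerically on a small instance. The sketch below is only an illustration, not code from the paper: the helper names `veronese` and `rank`, the choice $r = 4$, $d = 3$, and the sample points are all assumptions made for the example. It evaluates every monomial of degree at most $r-2$ at each point (the degree $r-2$ Veronese embedding in homogenized coordinates) and compares the rank of the images of $r$ collinear points with that of $r$ points in general position, using exact rational arithmetic.

```python
from fractions import Fraction
from itertools import combinations_with_replacement


def veronese(point, degree):
    """Evaluate all monomials of total degree <= degree at `point`.

    The point is homogenized by prepending a 1, so the homogeneous
    monomials of exact degree `degree` in (1, x_1, ..., x_d) are the
    monomials of degree <= `degree` in (x_1, ..., x_d).
    """
    coords = [Fraction(1)] + [Fraction(x) for x in point]
    image = []
    for combo in combinations_with_replacement(range(len(coords)), degree):
        value = Fraction(1)
        for idx in combo:
            value *= coords[idx]
        image.append(value)
    return image


def rank(rows):
    """Rank of a matrix over the rationals via Gaussian elimination."""
    rows = [list(r) for r in rows]
    rk = 0
    for col in range(len(rows[0])):
        pivot = next((i for i in range(rk, len(rows)) if rows[i][col] != 0), None)
        if pivot is None:
            continue
        rows[rk], rows[pivot] = rows[pivot], rows[rk]
        for i in range(rk + 1, len(rows)):
            factor = rows[i][col] / rows[rk][col]
            if factor != 0:
                rows[i] = [a - factor * b for a, b in zip(rows[i], rows[rk])]
        rk += 1
        if rk == len(rows):
            break
    return rk


r = 4  # illustrative choice: four points, degree r - 2 = 2 embedding

# Four collinear points in dimension 3: base + t * direction for t = 0..3.
base, direction = (1, 2, 3), (2, -1, 5)
collinear = [tuple(b + t * v for b, v in zip(base, direction)) for t in range(r)]

# Four points in general position, for comparison.
generic = [(0, 0, 0), (1, 0, 0), (0, 1, 0), (0, 0, 1)]

collinear_rank = rank([veronese(p, r - 2) for p in collinear])
generic_rank = rank([veronese(p, r - 2) for p in generic])

print("rank of images of collinear points:", collinear_rank)
print("rank of images of generic points:", generic_rank)
```

Restricted to a line, every monomial of degree at most $r-2$ becomes a univariate polynomial of degree at most $r-2$ in the line parameter, so the $r$ images span a space of dimension at most $r-1$ and must be linearly dependent, while the images of points in general position need not be.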
Submitted 2 December, 2014;
originally announced December 2014.
-
2-Server PIR with sub-polynomial communication
Authors:
Zeev Dvir,
Sivakanth Gopi
Abstract:
A 2-server Private Information Retrieval (PIR) scheme allows a user to retrieve the $i$th bit of an $n$-bit database replicated among two servers (which do not communicate) while not revealing any information about $i$ to either server. In this work we construct a 1-round 2-server PIR with total communication cost $n^{O({\sqrt{\log\log n/\log n}})}$. This improves over the currently known 2-server protocols which require $O(n^{1/3})$ communication and matches the communication cost of known 3-server PIR schemes. Our improvement comes from reducing the number of servers in existing protocols, based on Matching Vector Codes, from 3 or 4 servers to 2. This is achieved by viewing these protocols in an algebraic way (using polynomial interpolation) and extending them using partial derivatives.
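To make the definition of 2-server PIR concrete, here is a minimal sketch of the classic XOR-based scheme with linear communication. It is not the construction of this paper, which relies on Matching Vector Codes and partial derivatives to reach sub-polynomial communication; the function names `make_queries`, `server_answer`, and `retrieve` are hypothetical and chosen for the illustration. The user sends a uniformly random subset of indices to one server and the same subset with index $i$ flipped to the other; each query on its own is uniformly distributed, so neither server learns anything about $i$.

```python
import secrets


def make_queries(n, i):
    """Return two n-bit selection vectors that differ only at position i."""
    q1 = [secrets.randbelow(2) for _ in range(n)]  # uniformly random subset
    q2 = list(q1)
    q2[i] ^= 1  # flip membership of index i
    return q1, q2


def server_answer(database, query):
    """Each server returns the XOR of the database bits selected by the query."""
    answer = 0
    for bit, selected in zip(database, query):
        answer ^= bit & selected
    return answer


def retrieve(database, i):
    """Simulate the user: query both (non-communicating) servers and combine."""
    q1, q2 = make_queries(len(database), i)
    a1 = server_answer(database, q1)  # computed by server 1
    a2 = server_answer(database, q2)  # computed by server 2
    return a1 ^ a2  # all bits cancel except the bit at position i


database = [1, 0, 1, 1, 0, 0, 1, 0]
recovered = [retrieve(database, i) for i in range(len(database))]
print(recovered == database)
```

This toy scheme sends $n$ bits to each server, so it only illustrates the privacy and correctness requirements, not the communication bound stated in the abstract.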
Submitted 24 July, 2014;
originally announced July 2014.