Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone

Authors: Marah Abdin, Jyoti Aneja, Hany Awadalla, Ahmed Awadallah, Ammar Ahmad Awan, Nguyen Bach, Amit Bahree, Arash Bakhtiari, Jianmin Bao, Harkirat Behl, Alon Benhaim, Misha Bilenko, Johan Bjorck, Sébastien Bubeck, Martin Cai, Qin Cai, Vishrav Chaudhary, Dong Chen, Dongdong Chen, Weizhu Chen, Yen-Chun Chen, Yi-Ling Chen, Hao Cheng, Parul Chopra, Xiyang Dai, et al. (104 additional authors not shown)

Abstract: We introduce phi-3-mini, a 3.8 billion parameter language model trained on 3.3 trillion tokens, whose overall performance, as measured by both academic benchmarks and internal testing, rivals that of models such as Mixtral 8x7B and GPT-3.5 (e.g., phi-3-mini achieves 69% on MMLU and 8.38 on MT-bench), despite being small enough to be deployed on a phone. Our training dataset is a scaled-up version of the one used for phi-2, composed of heavily filtered publicly available web data and synthetic data. The model is also further aligned for robustness, safety, and chat format. We also provide parameter-scaling results with 7B and 14B models trained for 4.8T tokens, called phi-3-small and phi-3-medium, both significantly more capable than phi-3-mini (e.g., respectively 75% and 78% on MMLU, and 8.7 and 8.9 on MT-bench). To enhance multilingual, multimodal, and long-context capabilities, we introduce three models in the phi-3.5 series: phi-3.5-mini, phi-3.5-MoE, and phi-3.5-Vision. The phi-3.5-MoE, a 16 x 3.8B MoE model with 6.6 billion active parameters, achieves superior performance in language reasoning, math, and code tasks compared to other open-source models of similar scale, such as Llama 3.1 and the Mixtral series, and is on par with Gemini-1.5-Flash and GPT-4o-mini. Meanwhile, phi-3.5-Vision, a 4.2 billion parameter model derived from phi-3.5-mini, excels in reasoning tasks and is adept at handling both single-image and text prompts, as well as multi-image and text prompts.

Submitted 30 August, 2024; v1 submitted 22 April, 2024; originally announced April 2024.

Comments: 24 pages

arXiv:2404.01833 [pdf, other]

Great, Now Write an Article About That: The Crescendo Multi-Turn LLM Jailbreak Attack

Authors: Mark Russinovich, Ahmed Salem, Ronen Eldan

Abstract: Large Language Models (LLMs) have risen significantly in popularity and are increasingly being adopted across multiple applications. These LLMs are heavily aligned to resist engaging in illegal or unethical topics as a means to avoid contributing to responsible AI harms. However, a recent line of attacks, known as "jailbreaks", seeks to overcome this alignment. Intuitively, jailbreak attacks aim to narrow the gap between what the model can do and what it is willing to do. In this paper, we introduce a novel jailbreak attack called Crescendo. Unlike existing jailbreak methods, Crescendo is a multi-turn jailbreak that interacts with the model in a seemingly benign manner. It begins with a general prompt or question about the task at hand and then gradually escalates the dialogue by referencing the model's replies, progressively leading to a successful jailbreak. We evaluate Crescendo on various public systems, including ChatGPT, Gemini Pro, Gemini-Ultra, LLaMA-2 70b Chat, and Anthropic Chat. Our results demonstrate the strong efficacy of Crescendo, with it achieving high attack success rates across all evaluated models and tasks. Furthermore, we introduce Crescendomation, a tool that automates the Crescendo attack, and our evaluation showcases its effectiveness against state-of-the-art models.

Submitted 2 April, 2024; originally announced April 2024.

arXiv:2312.09241 [pdf, other]

TinyGSM: achieving >80% on GSM8k with small language models

Authors: Bingbin Liu, Sebastien Bubeck, Ronen Eldan, Janardhan Kulkarni, Yuanzhi Li, Anh Nguyen, Rachel Ward, Yi Zhang

Abstract: Small-scale models offer various computational advantages, and yet the extent to which size is critical for problem-solving abilities remains an open question. Specifically for solving grade school math, the smallest model size so far required to break the 80% barrier on the GSM8K benchmark remains 34B. Our work studies how high-quality datasets may be the key for small language models to acquire mathematical reasoning. We introduce TinyGSM, a synthetic dataset of 12.3M grade school math problems paired with Python solutions, generated fully by GPT-3.5. After finetuning on TinyGSM, we find that a duo of a 1.3B generation model and a 1.3B verifier model can achieve 81.5% accuracy, outperforming existing models that are orders of magnitude larger. This also rivals the performance of the GPT-3.5 "teacher" model (77.4%), from which our model's training data is generated. Our approach is simple and has two key components: 1) the high-quality dataset TinyGSM, 2) the use of a verifier, which selects the final outputs from multiple candidate generations.

Submitted 14 December, 2023; originally announced December 2023.

arXiv:2311.14737 [pdf, other]

Positional Description Matters for Transformers Arithmetic

Authors: Ruoqi Shen, Sébastien Bubeck, Ronen Eldan, Yin Tat Lee, Yuanzhi Li, Yi Zhang

Abstract: Transformers, central to the successes in modern Natural Language Processing, often falter on arithmetic tasks despite their vast capabilities, which paradoxically include remarkable coding abilities. We observe that a crucial challenge is their naive reliance on positional information to solve arithmetic problems with a small number of digits, leading to poor performance on larger numbers. Herein, we delve deeper into the role of positional encoding, and propose several ways to fix the issue, either by modifying the positional encoding directly, or by modifying the representation of the arithmetic task to leverage standard positional encoding differently. We investigate the value of these modifications for three tasks: (i) classical multiplication, (ii) length extrapolation in addition, and (iii) addition in natural language context. For (i) we train a small model on a small dataset (100M parameters and 300k samples) with remarkable aptitude in (direct, no scratchpad) 15-digit multiplication and essentially perfect up to 12 digits, while usual training in this context would give a model failing at 4-digit multiplication. In the experiments on addition, we use a mere 120k samples to demonstrate: for (ii) extrapolation from 10 digits to testing on 12-digit numbers while usual training would have no extrapolation, and for (iii) almost perfect accuracy up to 5 digits while usual training would be correct only up to 3 digits (which is essentially memorization with a training set of 120k samples).

Submitted 21 November, 2023; originally announced November 2023.

Comments: 18 pages

arXiv:2311.06171 [pdf, ps, other]

Fast relaxation of the random field Ising dynamics

Authors: Ahmed El Alaoui, Ronen Eldan, Reza Gheissari, Arianna Piana

Abstract: We study the convergence properties of Glauber dynamics for the random field Ising model (RFIM) with ferromagnetic interactions on finite domains of $\mathbb{Z}^d$, $d \ge 2$. Of particular interest is the Griffiths phase where correlations decay exponentially fast in expectation over the quenched disorder, but there exist arbitrarily large islands of weak fields where low-temperature behavior is observed. Our results are twofold: 1. Under weak spatial mixing (boundary-to-bulk exponential decay of correlations) in expectation, we show that the dynamics satisfy a weak Poincaré inequality (equivalent to large-set expansion), implying algebraic relaxation to equilibrium over timescales polynomial in the volume $N$ of the domain, and polynomial time mixing from a warm start. From this we construct a polynomial-time approximate sampling algorithm based on running Glauber dynamics over an increasing sequence of approximations of the domain. 2. Under strong spatial mixing (exponential decay of correlations even near boundary pinnings) in expectation, we prove a full Poincaré inequality, implying exponential relaxation to equilibrium and $N^{o(1)}$-mixing time. By way of example, both weak and strong spatial mixing hold at any temperature, provided the external fields are strong enough. Our proofs combine a stochastic localization technique, which has the effect of increasing the variance of the field, with a field-dependent coarse graining, which controls the resulting sub-critical percolation process of sites with weak fields.

Submitted 10 November, 2023; originally announced November 2023.

Comments: 37 pages

arXiv:2310.02238 [pdf, ps, other]

Who's Harry Potter? Approximate Unlearning in LLMs

Authors: Ronen Eldan, Mark Russinovich

Abstract: Large language models (LLMs) are trained on massive internet corpora that often contain copyrighted content. This poses legal and ethical challenges for the developers and users of these models, as well as the original authors and publishers. In this paper, we propose a novel technique for unlearning a subset of the training data from an LLM, without having to retrain it from scratch. We evaluate our technique on the task of unlearning the Harry Potter books from the Llama2-7b model (a generative language model recently open-sourced by Meta). While the model took over 184K GPU-hours to pretrain, we show that in about 1 GPU hour of finetuning, we effectively erase the model's ability to generate or recall Harry Potter-related content, while its performance on common benchmarks (such as Winogrande, Hellaswag, arc, boolq and piqa) remains almost unaffected. We make our fine-tuned model publicly available on HuggingFace for community evaluation. To the best of our knowledge, this is the first paper to present an effective technique for unlearning in generative language models. Our technique consists of three main components: First, we use a reinforced model that is further trained on the target data to identify the tokens that are most related to the unlearning target, by comparing its logits with those of a baseline model. Second, we replace idiosyncratic expressions in the target data with generic counterparts, and leverage the model's own predictions to generate alternative labels for every token. These labels aim to approximate the next-token predictions of a model that has not been trained on the target data. Third, we finetune the model on these alternative labels, which effectively erases the original text from the model's memory whenever it is prompted with its context.

Submitted 4 October, 2023; v1 submitted 3 October, 2023; originally announced October 2023.

arXiv:2309.05463 [pdf, other]

Textbooks Are All You Need II: phi-1.5 technical report

Authors: Yuanzhi Li, Sébastien Bubeck, Ronen Eldan, Allie Del Giorno, Suriya Gunasekar, Yin Tat Lee

Abstract: We continue the investigation into the power of smaller Transformer-based language models as initiated by TinyStories (a 10 million parameter model that can produce coherent English) and the follow-up work on phi-1, a 1.3 billion parameter model with Python coding performance close to the state-of-the-art. The latter work proposed to use existing Large Language Models (LLMs) to generate "textbook quality" data as a way to enhance the learning process compared to traditional web data. We follow the "Textbooks Are All You Need" approach, focusing this time on common sense reasoning in natural language, and create a new 1.3 billion parameter model named phi-1.5, with performance on natural language tasks comparable to models 5x larger, and surpassing most non-frontier LLMs on more complex reasoning tasks such as grade-school mathematics and basic coding. More generally, phi-1.5 exhibits many of the traits of much larger LLMs, both good (such as the ability to "think step by step" or perform some rudimentary in-context learning) and bad, including hallucinations and the potential for toxic and biased generations. Encouragingly, though, we are seeing improvement on that front thanks to the absence of web data. We open-source phi-1.5 to promote further research on these urgent topics.

Submitted 11 September, 2023; originally announced September 2023.

arXiv:2306.11644 [pdf, other]

Textbooks Are All You Need

Authors: Suriya Gunasekar, Yi Zhang, Jyoti Aneja, Caio César Teodoro Mendes, Allie Del Giorno, Sivakanth Gopi, Mojan Javaheripi, Piero Kauffmann, Gustavo de Rosa, Olli Saarikivi, Adil Salim, Shital Shah, Harkirat Singh Behl, Xin Wang, Sébastien Bubeck, Ronen Eldan, Adam Tauman Kalai, Yin Tat Lee, Yuanzhi Li

Abstract: We introduce phi-1, a new large language model for code, with significantly smaller size than competing models: phi-1 is a Transformer-based model with 1.3B parameters, trained for 4 days on 8 A100s, using a selection of "textbook quality" data from the web (6B tokens) and synthetically generated textbooks and exercises with GPT-3.5 (1B tokens). Despite this small scale, phi-1 attains pass@1 accuracy of 50.6% on HumanEval and 55.5% on MBPP. It also displays surprising emergent properties compared to phi-1-base, our model before our finetuning stage on a dataset of coding exercises, and phi-1-small, a smaller model with 350M parameters trained with the same pipeline as phi-1 that still achieves 45% on HumanEval.

Submitted 2 October, 2023; v1 submitted 20 June, 2023; originally announced June 2023.

Comments: 26 pages; changed color scheme of plot. fixed minor typos and added couple clarifications

arXiv:2305.07759 [pdf, other]

TinyStories: How Small Can Language Models Be and Still Speak Coherent English?

Authors: Ronen Eldan, Yuanzhi Li

Abstract: Language models (LMs) are powerful tools for natural language processing, but they often struggle to produce coherent and fluent text when they are small. Models with around 125M parameters such as GPT-Neo (small) or GPT-2 (small) can rarely generate coherent and consistent English text beyond a few words even after extensive training. This raises the question of whether the emergence of the ability to produce coherent English text only occurs at larger scales (with hundreds of millions of parameters or more) and complex architectures (with many layers of global attention). In this work, we introduce TinyStories, a synthetic dataset of short stories that only contain words that typical 3 to 4-year-olds usually understand, generated by GPT-3.5 and GPT-4. We show that TinyStories can be used to train and evaluate LMs that are much smaller than the state-of-the-art models (below 10 million total parameters), or have much simpler architectures (with only one transformer block), yet still produce fluent and consistent stories with several paragraphs that are diverse and have almost perfect grammar, and demonstrate reasoning capabilities. We also introduce a new paradigm for the evaluation of language models: We suggest a framework which uses GPT-4 to grade the content generated by these models as if those were stories written by students and graded by a (human) teacher. This new paradigm overcomes the flaws of standard benchmarks, which often require the model's output to be very structured, and moreover provides a multidimensional score for the model, providing scores for different capabilities such as grammar, creativity and consistency. We hope that TinyStories can facilitate the development, analysis and research of LMs, especially for low-resource or specialized domains, and shed light on the emergence of language capabilities in LMs.

Submitted 24 May, 2023; v1 submitted 12 May, 2023; originally announced May 2023.

arXiv:2303.12712 [pdf, other]

Sparks of Artificial General Intelligence: Early experiments with GPT-4

Authors: Sébastien Bubeck, Varun Chandrasekaran, Ronen Eldan, Johannes Gehrke, Eric Horvitz, Ece Kamar, Peter Lee, Yin Tat Lee, Yuanzhi Li, Scott Lundberg, Harsha Nori, Hamid Palangi, Marco Tulio Ribeiro, Yi Zhang

Abstract: Artificial intelligence (AI) researchers have been developing and refining large language models (LLMs) that exhibit remarkable capabilities across a variety of domains and tasks, challenging our understanding of learning and cognition. The latest model developed by OpenAI, GPT-4, was trained using an unprecedented scale of compute and data. In this paper, we report on our investigation of an early version of GPT-4, when it was still in active development by OpenAI. We contend that (this early version of) GPT-4 is part of a new cohort of LLMs (along with ChatGPT and Google's PaLM for example) that exhibit more general intelligence than previous AI models. We discuss the rising capabilities and implications of these models. We demonstrate that, beyond its mastery of language, GPT-4 can solve novel and difficult tasks that span mathematics, coding, vision, medicine, law, psychology and more, without needing any special prompting. Moreover, in all of these tasks, GPT-4's performance is strikingly close to human-level performance, and often vastly surpasses prior models such as ChatGPT. Given the breadth and depth of GPT-4's capabilities, we believe that it could reasonably be viewed as an early (yet still incomplete) version of an artificial general intelligence (AGI) system. In our exploration of GPT-4, we put special emphasis on discovering its limitations, and we discuss the challenges ahead for advancing towards deeper and more comprehensive versions of AGI, including the possible need for pursuing a new paradigm that moves beyond next-word prediction. We conclude with reflections on societal influences of the recent technological leap and future research directions.

Submitted 13 April, 2023; v1 submitted 22 March, 2023; originally announced March 2023.

arXiv:2212.00297 [pdf, other]

Hit-and-run mixing via localization schemes

Authors: Yuansi Chen, Ronen Eldan

Abstract: We analyze the hit-and-run algorithm for sampling uniformly from an isotropic convex body $K$ in $n$ dimensions. We show that the algorithm mixes in time $\tilde{O}(n^2/\psi_n^2)$, where $\psi_n$ is the smallest isoperimetric constant for any isotropic logconcave distribution, also known as the Kannan-Lovász-Simonovits (KLS) constant. Our bound improves upon previous bounds of the form $\tilde{O}(n^2 R^2/r^2)$, which depend on the ratio $R/r$ of the radii of the circumscribed and inscribed balls of $K$, gaining a factor of $n$ in the case of isotropic convex bodies. Consequently, our result gives a mixing time estimate for the hit-and-run which matches the state-of-the-art bounds for the ball walk. Our main proof technique is based on an annealing of localization schemes introduced in Chen and Eldan (2022), which allows us to reduce the problem to the analysis of the mixing time on truncated Gaussian distributions.

Submitted 1 December, 2022; originally announced December 2022.

Comments: 37 pages, 2 figures

arXiv:2208.06508 [pdf, ps, other]

Noise stability on the Boolean hypercube via a renormalized Brownian motion

Authors: Ronen Eldan, Dan Mikulincer, Prasad Raghavendra

Abstract: We consider a variant of the classical notion of noise on the Boolean hypercube which gives rise to a new approach to inequalities regarding noise stability. We use this approach to give a new proof of the Majority is Stablest theorem by Mossel, O'Donnell, and Oleszkiewicz, improving the dependence of the bound on the maximal influence of the function from logarithmic to polynomial. We also show that a variant of the conjecture by Courtade and Kumar regarding the most informative Boolean function, where the classical noise is replaced by our notion, holds true. Our approach is based on a stochastic construction that we call the renormalized Brownian motion, which facilitates the use of inequalities in Gaussian space in the analysis of Boolean functions.

Submitted 12 August, 2022; originally announced August 2022.

Comments: 21 pages

arXiv:2208.03450 [pdf, other]

An Optimal "It Ain't Over Till It's Over" Theorem

Authors: Ronen Eldan, Avi Wigderson, Pei Wu

Abstract: We study the probability of Boolean functions with small max influence to become constant under random restrictions. Let $f$ be a Boolean function such that the variance of $f$ is $\Omega(1)$ and all its individual influences are bounded by $\tau$. We show that when restricting all but a $\rho=\tilde{\Omega}((\log(1/\tau))^{-1})$ fraction of the coordinates, the restricted function remains nonconstant with overwhelming probability. This bound is essentially optimal, as witnessed by the tribes function $\mathrm{TRIBES}=\mathrm{AND}_{n/C\log n}\circ\mathrm{OR}_{C\log n}$. We extend it to an anti-concentration result, showing that the restricted function has nontrivial variance with probability $1-o(1)$. This gives a sharp version of the "it ain't over till it's over" theorem due to Mossel, O'Donnell, and Oleszkiewicz. Our proof is discrete, and avoids the use of the invariance principle. We also show two consequences of our above result: (i) As a corollary, we prove that for a uniformly random input $x$, the block sensitivity of $f$ at $x$ is $\tilde{\Omega}(\log(1/\tau))$ with probability $1-o(1)$. This should be compared with the implication of Kahn, Kalai, and Linial's result, which implies that the average block sensitivity of $f$ is $\Omega(\log(1/\tau))$. (ii) Combining our proof with a well-known result due to O'Donnell, Saks, Schramm, and Servedio, one can also conclude that when restricting all but a $\rho=\tilde{\Omega}(1/\sqrt{\log(1/\tau)})$ fraction of the coordinates of a monotone function $f$, the restricted function has decision tree complexity $\Omega(\tau^{-\Theta(\rho)})$ with probability $\Omega(1)$.

Submitted 6 August, 2022; originally announced August 2022.

Comments: 31 pages

arXiv:2206.04301 [pdf, other]

Unveiling Transformers with LEGO: a synthetic reasoning task

Authors: Yi Zhang, Arturs Backurs, Sébastien Bubeck, Ronen Eldan, Suriya Gunasekar, Tal Wagner

Abstract: We propose a synthetic reasoning task, LEGO (Learning Equality and Group Operations), that encapsulates the problem of following a chain of reasoning, and we study how the Transformer architectures learn this task. We pay special attention to data effects such as pretraining (on seemingly unrelated NLP tasks) and dataset composition (e.g., differing chain length at training and test time), as well as architectural variants such as weight-tied layers or adding convolutional components. We study how the trained models eventually succeed at the task, and in particular, we manage to understand some of the attention heads as well as how the information flows in the network. In particular, we have identified a novel association pattern that globally attends only to identical tokens. Based on these observations we propose a hypothesis that here pretraining helps for LEGO tasks due to certain structured attention patterns, and we experimentally verify this hypothesis. We also observe that in some data regimes the trained transformer finds "shortcut" solutions to follow the chain of reasoning, which impedes the model's robustness, and moreover we propose ways to prevent it. Motivated by our findings on structured attention patterns, we propose the LEGO attention module, a drop-in replacement for vanilla attention heads. This architectural change significantly reduces FLOPs and maintains or even improves the model's performance at large-scale pretraining.

Submitted 17 February, 2023; v1 submitted 9 June, 2022; originally announced June 2022.

arXiv:2204.06686 [pdf, ps, other]

Isoperimetric Inequalities Made Simpler

Authors: Ronen Eldan, Guy Kindler, Noam Lifshitz, Dor Minzer

Abstract: We give an alternative, simple method to prove isoperimetric inequalities over the hypercube. In particular, we show: 1. An elementary proof of classical isoperimetric inequalities of Talagrand, as well as a stronger isoperimetric result conjectured by Talagrand and recently proved by Eldan and Gross. 2. A strengthening of the Friedgut junta theorem, asserting that if the $p$-moment of the sensitivity of a function is constant for some $1/2 + \varepsilon\leq p\leq 1$, then the function is close to a junta. In this language, Friedgut's theorem is the special case that $p=1$.

Submitted 1 August, 2024; v1 submitted 13 April, 2022; originally announced April 2022.

arXiv:2203.04163 [pdf, ps, other]

Localization Schemes: A Framework for Proving Mixing Bounds for Markov Chains

Authors: Yuansi Chen, Ronen Eldan

Abstract: Two recent and seemingly unrelated techniques for proving mixing bounds for Markov chains are: (i) the framework of Spectral Independence, introduced by Anari, Liu and Oveis Gharan, and its numerous extensions, which have given rise to several breakthroughs in the analysis of mixing times of discrete Markov chains and (ii) the Stochastic Localization technique which has proven useful in establishing mixing and expansion bounds for both log-concave measures and for measures on the discrete hypercube. In this paper, we introduce a framework which connects ideas from both techniques. Our framework unifies, simplifies and extends those two techniques. At its center is the concept of a localization scheme which, to every probability measure, assigns a martingale of probability measures which localize in space as time evolves. As it turns out, to every such scheme corresponds a Markov chain, and many chains of interest appear naturally in this framework. This viewpoint provides tools for deriving mixing bounds for the dynamics through the analysis of the corresponding localization process. Generalizations of concepts of Spectral Independence and Entropic Independence naturally arise from our definitions, and in particular we recover the main theorems in the spectral and entropic independence frameworks via simple martingale arguments (completely bypassing the need to use the theory of high-dimensional expanders). We demonstrate the strength of our proposed machinery by giving short and (arguably) simpler proofs to many mixing bounds in the recent literature, including giving the first $O(n \log n)$ bound for the mixing time of Glauber dynamics on the hardcore-model (of arbitrary degree) in the tree-uniqueness regime.

Submitted 6 June, 2022; v1 submitted 8 March, 2022; originally announced March 2022.

Comments: 62 pages, 1 figure, fixed mistakes in an earlier version

arXiv:2102.08668 [pdf, ps, other]

Non-asymptotic approximations of neural networks by Gaussian processes

Authors: Ronen Eldan, Dan Mikulincer, Tselil Schramm

Abstract: We study the extent to which wide neural networks may be approximated by Gaussian processes when initialized with random weights. It is a well-established fact that as the width of a network goes to infinity, its law converges to that of a Gaussian process. We make this quantitative by establishing explicit convergence rates for the central limit theorem in an infinite-dimensional functional space, metrized with a natural transportation distance. We identify two regimes of interest; when the activation function is polynomial, its degree determines the rate of convergence, while for non-polynomial activations, the rate is governed by the smoothness of the function.

Submitted 17 February, 2021; originally announced February 2021.

Comments: 18 pages

arXiv:2006.15574 [pdf, ps, other]

doi 10.1017/S0963548322000098

Community detection and percolation of information in a geometric setting

Authors: Ronen Eldan, Dan Mikulincer, Hester Pieters

Abstract: We make the first steps towards generalizing the theory of stochastic block models, in the sparse regime, towards a model where the discrete community structure is replaced by an underlying geometry. We consider a geometric random graph over a homogeneous metric space where the probability of two vertices to be connected is an arbitrary function of the distance. We give sufficient conditions under which the locations can be recovered (up to an isomorphism of the space) in the sparse regime. Moreover, we define a geometric counterpart of the model of flow of information on trees, due to Mossel and Peres, in which one considers a branching random walk on a sphere and the goal is to recover the location of the root based on the locations of leaves. We give some sufficient conditions for percolation and for non-percolation of information in this model.

Submitted 1 July, 2022; v1 submitted 28 June, 2020; originally announced June 2020.

Comments: 23 pages. Changes to Lemma 16

arXiv:2006.13073 [pdf, ps, other]

Reduction From Non-Unique Games To Boolean Unique Games

Authors: Ronen Eldan, Dana Moshkovitz

Abstract: We reduce the problem of proving a "Boolean Unique Games Conjecture" (with gap 1-delta vs. 1-C*delta, for any C> 1, and sufficiently small delta>0) to the problem of proving a PCP Theorem for a certain non-unique game. In a previous work, Khot and Moshkovitz suggested an inefficient candidate reduction (i.e., without a proof of soundness). The current work is the first to provide an efficient reduction along with a proof of soundness. The non-unique game we reduce from is similar to non-unique games for which PCP theorems are known. Our proof relies on a new concentration theorem for functions in Gaussian space that are restricted to a random hyperplane. We bound the typical Euclidean distance between the low degree part of the restriction of the function to the hyperplane and the restriction to the hyperplane of the low degree part of the function.

Submitted 8 July, 2021; v1 submitted 23 June, 2020; originally announced June 2020.

arXiv:2006.02855 [pdf, ps, other]

Network size and weights size for memorization with two-layers neural networks

Authors: Sébastien Bubeck, Ronen Eldan, Yin Tat Lee, Dan Mikulincer

Abstract: In 1988, Eric B. Baum showed that two-layers neural networks with threshold activation function can perfectly memorize the binary labels of $n$ points in general position in $\mathbb{R}^d$ using only $\ulcorner n/d \urcorner$ neurons. We observe that with ReLU networks, using four times as many neurons one can fit arbitrary real labels. Moreover, for approximate memorization up to error $ε$, the neural tangent kernel can also memorize with only $O\left(\frac{n}{d} \cdot \log(1/ε) \right)$ neurons (assuming that the data is well dispersed too). We show however that these constructions give rise to networks where the magnitude of the neurons' weights are far from optimal. In contrast we propose a new training procedure for ReLU networks, based on complex (as opposed to real) recombination of the neurons, for which we show approximate memorization with both $O\left(\frac{n}{d} \cdot \frac{\log(1/ε)}ε\right)$ neurons, as well as nearly-optimal size of the weights.

Submitted 3 November, 2020; v1 submitted 4 June, 2020; originally announced June 2020.

Comments: 27 pages

arXiv:1909.12067 [pdf, other]

Concentration on the Boolean hypercube via pathwise stochastic analysis

Authors: Ronen Eldan, Renan Gross

Abstract: We develop a new technique for proving concentration inequalities which relate between the variance and influences of Boolean functions. Using this technique, we 1. Settle a conjecture of Talagrand [Tal97] proving that $$\int_{\left\{ -1,1\right\} ^{n}}\sqrt{h_{f}\left(x\right)}dμ\geq C\cdot\mathrm{var}\left(f\right)\cdot\left(\log\left(\frac{1}{\sum\mathrm{Inf}_{i}^{2}\left(f\right)}\right)\right)^{1/2},$$ where $h_{f}\left(x\right)$ is the number of edges at $x$ along which $f$ changes its value, and $\mathrm{Inf}_{i}\left(f\right)$ is the influence of the $i$-th coordinate. 2. Strengthen several classical inequalities concerning the influences of a Boolean function, showing that near-maximizers must have large vertex boundaries. An inequality due to Talagrand states that for a Boolean function $f$, $\mathrm{var}\left(f\right)\leq C\sum_{i=1}^{n}\frac{\mathrm{Inf}_{i}\left(f\right)}{1+\log\left(1/\mathrm{Inf}_{i}\left(f\right)\right)}$. We give a lower bound for the size of the vertex boundary of functions saturating this inequality. As a corollary, we show that for sets that satisfy the edge-isoperimetric inequality or the Kahn-Kalai-Linial inequality up to a constant, a constant proportion of the mass is in the inner vertex boundary. 3. Improve a quantitative relation between influences and noise stability given by Keller and Kindler. Our proofs rely on techniques based on stochastic calculus, and bypass the use of hypercontractivity common to previous proofs.

Submitted 12 March, 2020; v1 submitted 26 September, 2019; originally announced September 2019.

Comments: 48 pages, 2 figures

arXiv:1906.10615 [pdf, ps, other]

Krivine diffusions attain the Goemans--Williamson approximation ratio

Authors: Ronen Eldan, Assaf Naor

Abstract: Answering a question of Abbasi-Zadeh, Bansal, Guruganesh, Nikolov, Schwartz and Singh (2018), we prove the existence of a slowed-down sticky Brownian motion whose induced rounding for MAXCUT attains the Goemans--Williamson approximation ratio. This is an especially simple particular case of the general rounding framework of Krivine diffusions that we investigate elsewhere.

Submitted 25 June, 2019; originally announced June 2019.

arXiv:1904.06984 [pdf, other]

Depth Separations in Neural Networks: What is Actually Being Separated?

Authors: Itay Safran, Ronen Eldan, Ohad Shamir

Abstract: Existing depth separation results for constant-depth networks essentially show that certain radial functions in $\mathbb{R}^d$, which can be easily approximated with depth $3$ networks, cannot be approximated by depth $2$ networks, even up to constant accuracy, unless their size is exponential in $d$. However, the functions used to demonstrate this are rapidly oscillating, with a Lipschitz parameter scaling polynomially with the dimension $d$ (or equivalently, by scaling the function, the hardness result applies to $\mathcal{O}(1)$-Lipschitz functions only when the target accuracy $ε$ is at most $\text{poly}(1/d)$). In this paper, we study whether such depth separations might still hold in the natural setting of $\mathcal{O}(1)$-Lipschitz radial functions, when $ε$ does not scale with $d$. Perhaps surprisingly, we show that the answer is negative: In contrast to the intuition suggested by previous work, it \emph{is} possible to approximate $\mathcal{O}(1)$-Lipschitz radial functions with depth $2$, size $\text{poly}(d)$ networks, for every constant $ε$. We complement it by showing that approximating such functions is also possible with depth $2$, size $\text{poly}(1/ε)$ networks, for every constant $d$. Finally, we show that it is not possible to have polynomial dependence in both $d,1/ε$ simultaneously.
Overall, our results indicate that in order to show depth separations for expressing $\mathcal{O}(1)$-Lipschitz functions with constant accuracy -- if at all possible -- one would need fundamentally different techniques than existing ones in the literature.

Submitted 2 June, 2021; v1 submitted 15 April, 2019; originally announced April 2019.

arXiv:1903.07140 [pdf, ps, other]

doi 10.1007/s00440-020-00967-w

Stability of the Shannon-Stam inequality via the Föllmer process

Authors: Ronen Eldan, Dan Mikulincer

Abstract: We prove stability estimates for the Shannon-Stam inequality (also known as the entropy-power inequality) for log-concave random vectors in terms of entropy and transportation distance. In particular, we give the first stability estimate for general log-concave random vectors in the following form: for log-concave random vectors $X,Y \in \mathbb{R}^d$, the deficit in the Shannon-Stam inequality is bounded from below by the expression $$ C \left(\mathrm{D}\left(X||G\right) + \mathrm{D}\left(Y||G\right)\right), $$ where $\mathrm{D}\left( \cdot ~ ||G\right)$ denotes the relative entropy with respect to the standard Gaussian and the constant $C$ depends only on the covariance structures and the spectral gaps of $X$ and $Y$. In the case of uniformly log-concave vectors our analysis gives dimension-free bounds. Our proofs are based on a new approach which uses an entropy-minimizing process from stochastic control theory.

Submitted 17 March, 2019; originally announced March 2019.

Comments: 24 pages

Journal ref: Probab. Theory Relat. Fields 177, 891-922 (2020)

arXiv:1707.01227 [pdf, other]

Exponential random graphs behave like mixtures of stochastic block models

Authors: Ronen Eldan, Renan Gross

Abstract: We study the behavior of exponential random graphs in both the sparse and the dense regime. We show that exponential random graphs are approximate mixtures of graphs with independent edges whose probability matrices are critical points of an associated functional, thereby satisfying a certain matrix equation. In the dense regime, every solution to this equation is close to a block matrix, concluding that the exponential random graph behaves roughly like a mixture of stochastic block models. We also show existence and uniqueness of solutions to this equation for several families of exponential random graphs, including the case where the subgraphs are counted with positive weights and the case where all weights are small in absolute value. In particular, this generalizes some of the results in a paper by Chatterjee and Diaconis from the dense regime to the sparse regime and strengthens their bounds from the cut-metric to the one-metric.

Submitted 19 April, 2018; v1 submitted 5 July, 2017; originally announced July 2017.

arXiv:1609.02490 [pdf, ps, other]

Information and dimensionality of anisotropic random geometric graphs

Authors: Ronen Eldan, Dan Mikulincer

Abstract: This paper deals with the problem of detecting non-isotropic high-dimensional geometric structure in random graphs. Namely, we study a model of a random geometric graph in which vertices correspond to points generated randomly and independently from a non-isotropic $d$-dimensional Gaussian distribution, and two vertices are connected if the distance between them is smaller than some pre-specified threshold. We derive new notions of dimensionality which depend upon the eigenvalues of the covariance of the Gaussian distribution. If $α$ denotes the vector of eigenvalues, and $n$ is the number of vertices, then the quantities $\left(\frac{||α||_2}{||α||_3}\right)^6/n^3$ and $\left(\frac{||α||_2}{||α||_4}\right)^4/n^3$ determine upper and lower bounds for the possibility of detection. This generalizes a recent result by Bubeck, Ding, Rácz and the first named author from [BDER14] which shows that the quantity $d/n^3$ determines the boundary of detection for isotropic geometry. Our methods involve Fourier analysis and the theory of characteristic functions to investigate the underlying probabilities of the model. The proof of the lower bound uses information theoretic tools, based on the method presented in [BG15].

Submitted 23 February, 2020; v1 submitted 8 September, 2016; originally announced September 2016.

Comments: 38 pages

arXiv:1607.03084 [pdf, ps, other]

Kernel-based methods for bandit convex optimization

Authors: Sébastien Bubeck, Ronen Eldan, Yin Tat Lee

Abstract: We consider the adversarial convex bandit problem and we build the first $\mathrm{poly}(T)$-time algorithm with $\mathrm{poly}(n) \sqrt{T}$-regret for this problem. To do so we introduce three new ideas in the derivative-free optimization literature: (i) kernel methods, (ii) a generalization of Bernoulli convolutions, and (iii) a new annealing schedule for exponential weights (with increasing learning rate). The basic version of our algorithm achieves $\tilde{O}(n^{9.5} \sqrt{T})$-regret, and we show that a simple variant of this algorithm can be run in $\mathrm{poly}(n \log(T))$-time per step at the cost of an additional $\mathrm{poly}(n) T^{o(1)}$ factor in the regret. These results improve upon the $\tilde{O}(n^{11} \sqrt{T})$-regret and $\exp(\mathrm{poly}(T))$-time result of the first two authors, and the $\log(T)^{\mathrm{poly}(n)} \sqrt{T}$-regret and $\log(T)^{\mathrm{poly}(n)}$-time result of Hazan and Li. Furthermore we conjecture that another variant of the algorithm could achieve $\tilde{O}(n^{1.5} \sqrt{T})$-regret, and moreover that this regret is unimprovable (the current best lower bound being $Ω(n \sqrt{T})$ and it is achieved with linear functions). For the simpler situation of zeroth order stochastic convex optimization this corresponds to the conjecture that the optimal query complexity is of order $n^3 / ε^2$.

Submitted 11 July, 2016; originally announced July 2016.

Comments: 45 pages

arXiv:1512.03965 [pdf, other]

The Power of Depth for Feedforward Neural Networks

Authors: Ronen Eldan, Ohad Shamir

Abstract: We show that there is a simple (approximately radial) function on $\mathbb{R}^d$, expressible by a small 3-layer feedforward neural network, which cannot be approximated by any 2-layer network, to more than a certain constant accuracy, unless its width is exponential in the dimension. The result holds for virtually all known activation functions, including rectified linear units, sigmoids and thresholds, and formally demonstrates that depth -- even if increased by 1 -- can be exponentially more valuable than width for standard feedforward neural networks. Moreover, compared to related results in the context of Boolean functions, our result requires fewer assumptions, and the proof techniques and construction are very different.

Submitted 8 May, 2016; v1 submitted 12 December, 2015; originally announced December 2015.

Comments: Accepted to COLT 2016; Fixed a bug in the proof of claim 2 (now requiring the mild assumption that the activations are polynomially bounded); Other minor revisions

arXiv:1507.06580 [pdf, ps, other]

Multi-scale exploration of convex functions and bandit convex optimization

Authors: Sébastien Bubeck, Ronen Eldan

Abstract: We construct a new map from a convex function to a distribution on its domain, with the property that this distribution is a multi-scale exploration of the function. We use this map to solve a decade-old open problem in adversarial bandit convex optimization by showing that the minimax regret for this problem is $\tilde{O}(\mathrm{poly}(n) \sqrt{T})$, where $n$ is the dimension and $T$ the number of rounds. This bound is obtained by studying the dual Bayesian maximin regret via the information ratio analysis of Russo and Van Roy, and then using the multi-scale exploration to solve the Bayesian problem.

Submitted 23 July, 2015; originally announced July 2015.

Comments: Preliminary version; 22 pages

arXiv:1507.02564 [pdf, other]

Sampling from a log-concave distribution with Projected Langevin Monte Carlo

Authors: Sébastien Bubeck, Ronen Eldan, Joseph Lehec

Abstract: We extend the Langevin Monte Carlo (LMC) algorithm to compactly supported measures via a projection step, akin to projected Stochastic Gradient Descent (SGD). We show that (projected) LMC allows to sample in polynomial time from a log-concave distribution with smooth potential. This gives a new Markov chain to sample from a log-concave distribution. Our main result shows in particular that when the target distribution is uniform, LMC mixes in $\tilde{O}(n^7)$ steps (where $n$ is the dimension). We also provide preliminary experimental evidence that LMC performs at least as well as hit-and-run, for which a better mixing time of $\tilde{O}(n^4)$ was proved by Lovász and Vempala.

Submitted 9 July, 2015; originally announced July 2015.

Comments: Preliminary version; 23 pages
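
The projection step described in the abstract above can be illustrated with a small sketch. It assumes one common form of the update, $x_{k+1} = P_K\left(x_k - \frac{η}{2}\nabla f(x_k) + \sqrt{η}\, ξ_k\right)$ with standard Gaussian $ξ_k$, and takes $K$ to be a Euclidean ball so that the projection $P_K$ has a closed form. The function names, the step size and the choice of body are illustrative assumptions, not taken from the paper.

```python
import math
import random


def project_to_ball(x, radius=1.0):
    """Euclidean projection of x onto the centered ball of the given radius."""
    norm = math.sqrt(sum(v * v for v in x))
    if norm <= radius:
        return list(x)
    scale = radius / norm
    return [v * scale for v in x]


def projected_lmc(grad_f, x0, step, n_steps, rng, radius=1.0):
    """Run n_steps of projected Langevin Monte Carlo and return the last iterate.

    grad_f is the gradient of the potential f, where the target density is
    proportional to exp(-f) restricted to the ball (an assumed setup).
    """
    x = project_to_ball(x0, radius)
    for _ in range(n_steps):
        g = grad_f(x)
        noise = [rng.gauss(0.0, 1.0) for _ in x]
        # Gradient step plus Gaussian noise, then project back onto the body.
        y = [xi - 0.5 * step * gi + math.sqrt(step) * ni
             for xi, gi, ni in zip(x, g, noise)]
        x = project_to_ball(y, radius)
    return x


def zero_grad(x):
    """Gradient of a constant potential, i.e. a uniform target on the ball."""
    return [0.0] * len(x)


rng = random.Random(0)
sample = projected_lmc(zero_grad, [0.0, 0.0, 0.0], step=0.01, n_steps=500, rng=rng)
```

With a zero gradient the chain reduces to a projected Gaussian random walk, which corresponds to the uniform-target case mentioned in the abstract; the sketch makes no claim about how many steps are needed for mixing.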

arXiv:1412.1587 [pdf, ps, other]

The entropic barrier: a simple and optimal universal self-concordant barrier

Authors: Sébastien Bubeck, Ronen Eldan

Abstract: We prove that the Cramér transform of the uniform measure on a convex body in $\mathbb{R}^n$ is a $(1+o(1)) n$-self-concordant barrier, improving a seminal result of Nesterov and Nemirovski. This gives the first explicit construction of a universal barrier for convex bodies with optimal self-concordance parameter. The proof is based on basic geometry of log-concave distributions, and elementary duality in exponential families.

Submitted 11 April, 2015; v1 submitted 4 December, 2014; originally announced December 2014.

Comments: 15 pages

arXiv:1411.5713 [pdf, ps, other]

Testing for high-dimensional geometry in random graphs

Authors: Sébastien Bubeck, Jian Ding, Ronen Eldan, Miklós Rácz

Abstract: We study the problem of detecting the presence of an underlying high-dimensional geometric structure in a random graph. Under the null hypothesis, the observed graph is a realization of an Erdős-Rényi random graph $G(n,p)$. Under the alternative, the graph is generated from the $G(n,p,d)$ model, where each vertex corresponds to a latent independent random vector uniformly distributed on the sphere $\mathbb{S}^{d-1}$, and two vertices are connected if the corresponding latent vectors are close enough. In the dense regime (i.e., $p$ is a constant), we propose a near-optimal and computationally efficient testing procedure based on a new quantity which we call signed triangles. The proof of the detection lower bound is based on a new bound on the total variation distance between a Wishart matrix and an appropriately normalized GOE matrix. In the sparse regime, we make a conjecture for the optimal detection boundary. We conclude the paper with some preliminary steps on the problem of estimating the dimension in $G(n,p,d)$.

Submitted 21 November, 2015; v1 submitted 20 November, 2014; originally announced November 2014.

Comments: 28 pages; v2 contains minor changes
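
As a rough illustration of the signed-triangles quantity named in the abstract above, the sketch below assumes the usual definition $τ(G) = \sum_{\{i,j,k\}} (A_{ij}-p)(A_{ik}-p)(A_{jk}-p)$, where $A$ is the adjacency matrix and the sum runs over unordered vertex triples. The function name and the toy graphs are illustrative assumptions, and no detection threshold is implied.

```python
from itertools import combinations


def signed_triangles(adj, p):
    """Sum of centered edge-indicator products over all unordered vertex triples.

    adj is a symmetric 0/1 adjacency matrix given as a list of lists, and p is
    the edge density under the Erdos-Renyi null hypothesis.
    """
    n = len(adj)
    total = 0.0
    for i, j, k in combinations(range(n), 3):
        total += (adj[i][j] - p) * (adj[i][k] - p) * (adj[j][k] - p)
    return total


# Toy inputs: a single triangle and an empty graph on three vertices.
triangle = [[0, 1, 1],
            [1, 0, 1],
            [1, 1, 0]]
empty = [[0, 0, 0],
         [0, 0, 0],
         [0, 0, 0]]

print(signed_triangles(triangle, 0.5))  # (0.5)^3 = 0.125
print(signed_triangles(empty, 0.5))     # (-0.5)^3 = -0.125
```

Centering each edge indicator by $p$ gives the statistic mean zero under $G(n,p)$, so an excess of triangles relative to the null pushes it upward. The triple loop costs $O(n^3)$ time and is meant only for small examples.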

arXiv:1409.7685 [pdf, other]

From trees to seeds: on the inference of the seed from large trees in the uniform attachment model

Authors: Sébastien Bubeck, Ronen Eldan, Elchanan Mossel, Miklós Z. Rácz

Abstract: We study the influence of the seed in random trees grown according to the uniform attachment model, also known as uniform random recursive trees. We show that different seeds lead to different distributions of limiting trees from a total variation point of view. To do this, we construct statistics that measure, in a certain well-defined sense, global "balancedness" properties of such trees. Our paper follows recent results on the same question for the preferential attachment model.

Submitted 20 October, 2014; v1 submitted 26 September, 2014; originally announced September 2014.

Comments: 26 pages, 5 figures

arXiv:1409.2913 [pdf, ps, other]

Efficient Algorithms for Discrepancy Minimization in Convex Sets

Authors: Ronen Eldan, Mohit Singh

Abstract: A result of Spencer states that every collection of $n$ sets over a universe of size $n$ has a coloring of the ground set with $\{-1,+1\}$ of discrepancy $O(\sqrt{n})$. A geometric generalization of this result was given by Gluskin (see also Giannopoulos) who showed that every symmetric convex body $K\subseteq R^n$ with Gaussian measure at least $e^{-εn}$, for a small $ε>0$, contains a point $y\in K$ where a constant fraction of coordinates of $y$ are in $\{-1,1\}$. This is often called a partial coloring result. While both these results were inherently non-algorithmic, recently Bansal (see also Lovett-Meka) gave a polynomial time algorithm for Spencer's setting and Rothvoß gave a randomized polynomial time algorithm obtaining the same guarantee as the result of Gluskin and Giannopoulos. This paper has several related results. First we prove another constructive version of the result of Gluskin and Giannopoulos via an optimization of a linear function. This implies a linear programming based algorithm for combinatorial discrepancy obtaining the same result as Spencer. Our second result gives a new approach to obtaining partial colorings and shows that every convex body $K\subseteq R^n$, possibly non-symmetric, with Gaussian measure at least $e^{-εn}$, for a small $ε>0$, contains a point $y\in K$ where a constant fraction of coordinates of $y$ are in $\{-1,1\}$.
Finally, we give a simple proof that shows that for any $δ>0$ there exists a constant $c>0$ such that given a body $K$ with $γ_n(K)\geq δ$, a uniformly random $x$ from $\{-1,1\}^n$ is in $cK$ with constant probability. This gives an algorithmic version of a special case of the result of Banaszczyk.

Submitted 9 September, 2014; originally announced September 2014.

Comments: Preliminary version
