Private Adaptive Gradient Methods for Convex Optimization

Authors: Hilal Asi, John Duchi, Alireza Fallah, Omid Javidbakht, Kunal Talwar

Abstract: We study adaptive methods for differentially private convex optimization, proposing and analyzing differentially private variants of a Stochastic Gradient Descent (SGD) algorithm with adaptive stepsizes, as well as the AdaGrad algorithm. We provide upper bounds on the regret of both algorithms and show that the bounds are (worst-case) optimal. As a consequence of our development, we show that our private versions of AdaGrad outperform adaptive SGD, which in turn outperforms traditional SGD in scenarios with non-isotropic gradients where (non-private) AdaGrad provably outperforms SGD. The major challenge is that the isotropic noise typically added for privacy dominates the signal in gradient geometry for high-dimensional problems; approaches to this that effectively optimize over lower-dimensional subspaces simply ignore the actual problems that varying gradient geometries introduce. In contrast, we study non-isotropic clipping and noise addition, developing a principled theoretical approach; the consequent procedures also enjoy significantly stronger empirical performance than prior approaches.

Submitted 25 June, 2021; originally announced June 2021.

Comments: To appear in 38th International Conference on Machine Learning (ICML 2021)

arXiv:2106.07537

A Wasserstein Minimax Framework for Mixed Linear Regression

Authors: Theo Diamandis, Yonina C. Eldar, Alireza Fallah, Farzan Farnia, Asuman Ozdaglar

Abstract: Multi-modal distributions are commonly used to model clustered data in statistical learning tasks. In this paper, we consider the Mixed Linear Regression (MLR) problem. We propose an optimal transport-based framework for MLR problems, Wasserstein Mixed Linear Regression (WMLR), which minimizes the Wasserstein distance between the learned and target mixture regression models. Through a model-based duality analysis, WMLR reduces the underlying MLR task to a nonconvex-concave minimax optimization problem, which can be provably solved to find a minimax stationary point by the Gradient Descent Ascent (GDA) algorithm. In the special case of mixtures of two linear regression models, we show that WMLR enjoys global convergence and generalization guarantees. We prove that WMLR's sample complexity grows linearly with the dimension of data. Finally, we discuss the application of WMLR to the federated learning task where the training samples are collected by multiple agents in a network. Unlike the Expectation Maximization algorithm, WMLR directly extends to the distributed, federated learning setting. We support our theoretical results through several numerical experiments, which highlight our framework's ability to handle the federated learning setting with mixture models.

Submitted 16 June, 2021; v1 submitted 14 June, 2021; originally announced June 2021.

Comments: To appear in 38th International Conference on Machine Learning (ICML 2021)

arXiv:2105.09893

A flexible Bayesian non-confounding spatial model for analysis of dispersed count data in clinical studies

Authors: Mahsa Nadifar, Hossein Baghishani, Afshin Fallah

Abstract: In employing spatial regression models for counts, we usually meet two issues. First, ignoring the inherent collinearity between covariates and the spatial effect can lead to misleading causal inferences. Second, real count data usually reveal over- or under-dispersion, for which the classical Poisson model is not appropriate. We propose a flexible Bayesian hierarchical modeling approach that joins non-confounding spatial methodology with a newly reconsidered dispersed count model from renewal theory to control these issues. Specifically, we extend the methodology for analyzing spatial count data based on the gamma distribution assumption for waiting times. The model can be formulated as a latent Gaussian model, and consequently, we can carry out fast computation using the integrated nested Laplace approximation method. We also examine different popular approaches for handling spatial confounding and compare their performances in the presence of dispersion. We use the proposed methodology to analyze a clinical dataset related to stomach cancer incidence in Slovenia and perform a simulation study to better understand the proposed approach's merits.

Submitted 20 May, 2021; originally announced May 2021.

Comments: arXiv admin note: text overlap with arXiv:1908.02344

arXiv:2105.08686

Flexible Bayesian Modeling of Counts: Constructing Penalized Complexity Priors

Authors: Mahsa Nadifar, Hossein Baghishani, Thomas Kneib, Afshin Fallah

Abstract: Many data sets, particularly in medicine and disease mapping, consist of counts. Under- or overdispersion in count data undermines the performance of the classical Poisson model. To account for this problem, we introduce a new Bayesian structured additive regression model, called gamma count, with enough flexibility in modeling dispersion. Setting convenient prior distributions on the model parameters is an important issue in Bayesian statistics, since the priors characterize the nature of our uncertainty about the parameters. Relying on a recently proposed class of penalized complexity priors, motivated by a general set of construction principles, we derive the prior structure. The model can be formulated as a latent Gaussian model, and consequently, we can carry out fast computation using the integrated nested Laplace approximation method. We investigate the proposed methodology in a simulation study. Different appropriate prior distributions are examined to provide a reasonable sensitivity analysis. To illustrate the applicability of the proposed model, we analyze two real-world data sets related to larynx cancer mortality in Germany and the handball champions league.

Submitted 18 May, 2021; originally announced May 2021.

arXiv:2102.03832

Generalization of Model-Agnostic Meta-Learning Algorithms: Recurring and Unseen Tasks

Authors: Alireza Fallah, Aryan Mokhtari, Asuman Ozdaglar

Abstract: In this paper, we study the generalization properties of Model-Agnostic Meta-Learning (MAML) algorithms for supervised learning problems. We focus on the setting in which we train the MAML model over $m$ tasks, each with $n$ data points, and characterize its generalization error from two points of view: First, we assume the new task at test time is one of the training tasks, and we show that, for strongly convex objective functions, the expected excess population loss is bounded by ${\mathcal{O}}(1/mn)$. Second, we consider the MAML algorithm's generalization to an unseen task and show that the resulting generalization error depends on the total variation distance between the underlying distributions of the new task and the tasks observed during the training process. Our proof techniques rely on the connections between algorithmic stability and generalization bounds of algorithms. In particular, we propose a new definition of stability for meta-learning algorithms, which allows us to capture the role of both the number of tasks $m$ and number of samples per task $n$ on the generalization error of MAML.

Submitted 16 November, 2021; v1 submitted 7 February, 2021; originally announced February 2021.

Comments: 35th Conference on Neural Information Processing Systems (NeurIPS 2021)

arXiv:2002.07948

Personalized Federated Learning: A Meta-Learning Approach

Authors: Alireza Fallah, Aryan Mokhtari, Asuman Ozdaglar

Abstract: In Federated Learning, we aim to train models across multiple computing units (users), while users can only communicate with a common central server, without exchanging their data samples. This mechanism exploits the computational power of all users and allows users to obtain a richer model as their models are trained over a larger set of data points. However, this scheme only develops a common output for all the users, and, therefore, it does not adapt the model to each user. This is an important missing feature, especially given the heterogeneity of the underlying data distribution for various users. In this paper, we study a personalized variant of federated learning in which our goal is to find an initial shared model that current or new users can easily adapt to their local dataset by performing one or a few steps of gradient descent with respect to their own data. This approach keeps all the benefits of the federated learning architecture, and, by structure, leads to a more personalized model for each user. We show this problem can be studied within the Model-Agnostic Meta-Learning (MAML) framework. Inspired by this connection, we study a personalized variant of the well-known Federated Averaging algorithm and evaluate its performance in terms of gradient norm for non-convex loss functions. Further, we characterize how this performance is affected by the closeness of underlying distributions of user data, measured in terms of distribution distances such as Total Variation and 1-Wasserstein metric.

Submitted 22 October, 2020; v1 submitted 18 February, 2020; originally announced February 2020.

Comments: To appear in 34th Conference on Neural Information Processing Systems (NeurIPS 2020)

arXiv:2002.05683

An Optimal Multistage Stochastic Gradient Method for Minimax Problems

Authors: Alireza Fallah, Asuman Ozdaglar, Sarath Pattathil

Abstract: In this paper, we study the minimax optimization problem in the smooth and strongly convex-strongly concave setting when we have access to noisy estimates of gradients. In particular, we first analyze the stochastic Gradient Descent Ascent (GDA) method with constant stepsize, and show that it converges to a neighborhood of the solution of the minimax problem. We further provide tight bounds on the convergence rate and the size of this neighborhood. Next, we propose a multistage variant of stochastic GDA (M-GDA) that runs in multiple stages with a particular learning rate decay schedule and converges to the exact solution of the minimax problem. We show M-GDA achieves the lower bounds in terms of noise dependence without any assumptions on the knowledge of noise characteristics. We also show that M-GDA obtains a linear decay rate with respect to the error's dependence on the initial error, although the dependence on condition number is suboptimal. In order to improve this dependence, we apply the multistage machinery to the stochastic Optimistic Gradient Descent Ascent (OGDA) algorithm and propose the M-OGDA algorithm which also achieves the optimal linear decay rate with respect to the initial error. To the best of our knowledge, this method is the first to simultaneously achieve the best dependence on noise characteristic as well as the initial error and condition number.

Submitted 13 February, 2020; originally announced February 2020.
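
Illustration (not from the paper): a minimal Python sketch of stochastic GDA on a toy strongly convex-strongly concave problem, assuming the objective f(x, y) = 0.5x^2 + xy - 0.5y^2 with saddle point (0, 0) and additive Gaussian gradient noise. The function name `stochastic_gda`, the stepsizes, and the halving schedule are all hypothetical choices made for this sketch; they are not the M-GDA schedule analyzed in the abstract above.

```python
import random

def stochastic_gda(stepsizes, sigma, x0=1.0, y0=1.0, seed=0):
    """Run stochastic GDA on f(x, y) = 0.5*x**2 + x*y - 0.5*y**2.

    The unique saddle point is (0, 0). `stepsizes` is the per-iteration
    stepsize schedule; `sigma` is the std of the additive gradient noise.
    Returns the final Euclidean distance to the saddle point.
    """
    rng = random.Random(seed)
    x, y = x0, y0
    for eta in stepsizes:
        gx = x + y + sigma * rng.gauss(0.0, 1.0)  # noisy df/dx
        gy = x - y + sigma * rng.gauss(0.0, 1.0)  # noisy df/dy
        x, y = x - eta * gx, y + eta * gy         # descent in x, ascent in y
    return (x * x + y * y) ** 0.5

constant = [0.1] * 2000
# Hypothetical decaying schedule: halve the stepsize and double the
# stage length at every stage (an assumption of this sketch).
multistage = [0.1 / 2**s for s in range(6) for _ in range(100 * 2**s)]

print("noise-free, constant stepsize:", stochastic_gda([0.1] * 200, sigma=0.0))
print("noisy, constant stepsize:     ", stochastic_gda(constant, sigma=1.0))
print("noisy, decaying stages:       ", stochastic_gda(multistage, sigma=1.0))
```

Without noise the iterates contract to the saddle point; with noise and a constant stepsize they only settle into a neighborhood whose size shrinks with the stepsize, which is the motivation for running stages with decreasing stepsizes.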

arXiv:2002.05135

On the Convergence Theory of Debiased Model-Agnostic Meta-Reinforcement Learning

Authors: Alireza Fallah, Kristian Georgiev, Aryan Mokhtari, Asuman Ozdaglar

Abstract: We consider Model-Agnostic Meta-Learning (MAML) methods for Reinforcement Learning (RL) problems, where the goal is to find a policy using data from several tasks represented by Markov Decision Processes (MDPs) that can be updated by one step of stochastic policy gradient for the realized MDP. In particular, using stochastic gradients in MAML update steps is crucial for RL problems since computation of exact gradients requires access to a large number of possible trajectories. For this formulation, we propose a variant of the MAML method, named Stochastic Gradient Meta-Reinforcement Learning (SG-MRL), and study its convergence properties. We derive the iteration and sample complexity of SG-MRL to find an $ε$-first-order stationary point, which, to the best of our knowledge, provides the first convergence guarantee for model-agnostic meta-reinforcement learning algorithms. We further show how our results extend to the case where more than one step of stochastic policy gradient method is used at test time. Finally, we empirically compare SG-MRL and MAML in several deep RL environments.

Submitted 16 November, 2021; v1 submitted 12 February, 2020; originally announced February 2020.

Comments: 35th Conference on Neural Information Processing Systems (NeurIPS 2021)

arXiv:1910.08701

Robust Distributed Accelerated Stochastic Gradient Methods for Multi-Agent Networks

Authors: Alireza Fallah, Mert Gurbuzbalaban, Asuman Ozdaglar, Umut Simsekli, Lingjiong Zhu

Abstract: We study the distributed stochastic gradient (D-SG) method and its accelerated variant (D-ASG) for solving decentralized strongly convex stochastic optimization problems where the objective function is distributed over several computational units, lying on a fixed but arbitrary connected communication graph, subject to local communication constraints where noisy estimates of the gradients are available. We develop a framework that allows choosing the stepsize and the momentum parameters of these algorithms in a way that optimizes performance by systematically trading off the bias, variance, robustness to gradient noise and dependence on network effects. When gradients do not contain noise, we also prove that distributed accelerated methods can achieve acceleration, requiring $\mathcal{O}(\sqrt{κ}\log(1/\varepsilon))$ gradient evaluations and $\mathcal{O}(\sqrt{κ}\log(1/\varepsilon))$ communications to converge to the same fixed point as the non-accelerated variant, where $κ$ is the condition number and $\varepsilon$ is the target accuracy. To our knowledge, this is the first acceleration result where the iteration complexity scales with the square root of the condition number in the context of primal distributed inexact first-order methods. For quadratic functions, we also provide finer performance bounds that are tight with respect to bias and variance terms. Finally, we study a multistage version of D-ASG with parameters carefully varied over stages to ensure exact $\mathcal{O}(\exp(-k/\sqrt{κ}))$ linear decay in the bias term as well as optimal $\mathcal{O}(σ^2/k)$ in the variance term. We illustrate through numerical experiments that our approach results in practical algorithms that are robust to gradient noise and that can outperform existing methods.

Submitted 4 October, 2021; v1 submitted 19 October, 2019; originally announced October 2019.

arXiv:1908.10400

On the Convergence Theory of Gradient-Based Model-Agnostic Meta-Learning Algorithms

Authors: Alireza Fallah, Aryan Mokhtari, Asuman Ozdaglar

Abstract: We study the convergence of a class of gradient-based Model-Agnostic Meta-Learning (MAML) methods and characterize their overall complexity as well as their best achievable accuracy in terms of gradient norm for nonconvex loss functions. We start with the MAML method and its first-order approximation (FO-MAML) and highlight the challenges that emerge in their analysis. By overcoming these challenges, we not only provide the first theoretical guarantees for MAML and FO-MAML in nonconvex settings, but also answer some of the open questions about the implementation of these algorithms, including how to choose their learning rate and the batch size for both tasks and datasets corresponding to tasks. In particular, we show that MAML can find an $ε$-first-order stationary point ($ε$-FOSP) for any positive $ε$ after at most $\mathcal{O}(1/ε^2)$ iterations at the expense of requiring second-order information. We also show that FO-MAML, which ignores the second-order information required in the update of MAML, cannot achieve any small desired level of accuracy, i.e., FO-MAML cannot find an $ε$-FOSP for any $ε>0$. We further propose a new variant of the MAML algorithm called Hessian-free MAML which preserves all theoretical guarantees of MAML, without requiring access to second-order information.

Submitted 15 May, 2020; v1 submitted 27 August, 2019; originally announced August 2019.

Comments: To appear in the proceedings of the $23^{rd}$ International Conference on Artificial Intelligence and Statistics (AISTATS) 2020
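
Illustration (not from the paper): a minimal Python sketch of why dropping the second-order term can bias FO-MAML, assuming two hypothetical one-dimensional quadratic tasks f_i(w) = 0.5 a_i (w - c_i)^2 and a single inner gradient step. The task values, stepsizes, and function names are assumptions of this sketch. For these tasks the Hessian factor is the scalar 1 - ALPHA * a_i, so the exact MAML gradient is available in closed form.

```python
# Toy tasks f_i(w) = 0.5 * a_i * (w - c_i)**2, given as (a_i, c_i) pairs.
# All numeric values here are hypothetical.
TASKS = [(1.0, 0.0), (4.0, 1.0)]
ALPHA = 0.1  # inner (adaptation) stepsize

def maml_grad(w):
    """Exact gradient of F(w) = mean_i f_i(w - ALPHA * f_i'(w))."""
    total = 0.0
    for a, c in TASKS:
        adapted = w - ALPHA * a * (w - c)  # one inner gradient step
        # Chain rule: (1 - ALPHA * f_i''(w)) * f_i'(adapted), with f_i'' = a.
        total += (1.0 - ALPHA * a) * a * (adapted - c)
    return total / len(TASKS)

def fo_maml_grad(w):
    """First-order approximation: the Hessian factor is dropped."""
    total = 0.0
    for a, c in TASKS:
        adapted = w - ALPHA * a * (w - c)
        total += a * (adapted - c)
    return total / len(TASKS)

def descend(grad, w=0.0, beta=0.5, steps=100):
    """Plain gradient descent on the meta-parameter w."""
    for _ in range(steps):
        w -= beta * grad(w)
    return w

w_maml = descend(maml_grad)
w_fo = descend(fo_maml_grad)
print("MAML limit:   ", w_maml, "MAML gradient there:", maml_grad(w_maml))
print("FO-MAML limit:", w_fo, "MAML gradient there:", maml_grad(w_fo))
```

In this toy setting the FO-MAML iterates converge, but to a point where the gradient of the MAML objective is bounded away from zero, which is consistent with the abstract's statement that FO-MAML cannot reach arbitrary accuracy.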

arXiv:1908.02344

Statistical modeling of groundwater quality assessment in Iran using a flexible Poisson likelihood

Authors: Mahsa Nadifar, Hossein Baghishani, Afshin Fallah, Havard Rue

Abstract: Assessing water quality and recognizing its associated risks to human health and the broader environment is undoubtedly essential. Groundwater is widely used to supply water for drinking, industry, and agricultural purposes. The groundwater quality measurements vary for different climates and various human behaviors, and consequently, their spatial variability can be substantial. In this paper, we aim to analyze a groundwater dataset from the Golestan province, Iran, from November 2003 to November 2013. Our target response variable for monitoring groundwater quality is the number of times the water quality is good enough for drinking. Hence, we are facing spatial count data. Due to the ubiquity of over- or underdispersion in count data, we propose a Bayesian hierarchical modeling approach based on renewal theory that relates nonexponential waiting times between events and the distribution of the counts, relaxing the assumption of equidispersion at the cost of an additional parameter. In particular, we extend the methodology for the analysis of spatial count data based on the gamma distribution assumption for waiting times. The model can be formulated as a latent Gaussian model, and therefore, we can carry out fast computation using the integrated nested Laplace approximation method. The analysis of the groundwater dataset and a simulation study show a significant improvement over both Poisson and negative binomial models.

Submitted 6 August, 2019; originally announced August 2019.

Comments: 24 pages, 6 figures

arXiv:1901.08022

A Universally Optimal Multistage Accelerated Stochastic Gradient Method

Authors: Necdet Serhat Aybat, Alireza Fallah, Mert Gurbuzbalaban, Asuman Ozdaglar

Abstract: We study the problem of minimizing a strongly convex, smooth function when we have noisy estimates of its gradient. We propose a novel multistage accelerated algorithm that is universally optimal in the sense that it achieves the optimal rate both in the deterministic and stochastic case and operates without knowledge of noise characteristics. The algorithm consists of stages that use a stochastic version of Nesterov's method with a specific restart and parameters selected to achieve the fastest reduction in the bias-variance terms in the convergence rate bounds.

Submitted 27 October, 2019; v1 submitted 23 January, 2019; originally announced January 2019.

Comments: 33rd Conference on Neural Information Processing Systems (NeurIPS 2019)

arXiv:1805.10579

Robust Accelerated Gradient Methods for Smooth Strongly Convex Functions

Authors: Necdet Serhat Aybat, Alireza Fallah, Mert Gurbuzbalaban, Asuman Ozdaglar

Abstract: We study the trade-offs between convergence rate and robustness to gradient errors in designing a first-order algorithm. We focus on gradient descent (GD) and accelerated gradient (AG) methods for minimizing strongly convex functions when the gradient has random errors in the form of additive white noise. With gradient errors, the function values of the iterates need not converge to the optimal value; hence, we define the robustness of an algorithm to noise as the ratio of the asymptotic expected suboptimality of the iterate sequence to the input noise power. For this robustness measure, we provide exact expressions for the quadratic case using tools from robust control theory and tight upper bounds for the smooth strongly convex case using Lyapunov functions certified through matrix inequalities. We use these characterizations within an optimization problem which selects parameters of each algorithm to achieve a particular trade-off between rate and robustness. Our results show that AG can achieve acceleration while being more robust to random gradient errors. This behavior is quite different from that previously reported in the deterministic gradient noise setting. We also establish some connections between the robustness of an algorithm and how quickly it can converge back to the optimal solution if it is perturbed from the optimal point with deterministic noise. Our framework also leads to practical algorithms that can perform better than other state-of-the-art methods in the presence of random gradient noise.

Submitted 5 November, 2019; v1 submitted 27 May, 2018; originally announced May 2018.

Comments: To appear in SIAM Journal on Optimization (SIOPT)
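
Illustration (not from the paper): a worked scalar example of the rate versus robustness trade-off, assuming the one-dimensional quadratic $f(x) = \frac{μ}{2}x^2$, gradient descent with stepsize $η \in (0, 2/μ)$, and i.i.d. zero-mean gradient noise $w_k$ with variance $σ^2$. These symbols and the scalar setting are assumptions of this sketch, not the paper's general result.

$$
x_{k+1} = x_k - η(μ x_k + w_k) = (1 - ημ)\,x_k - η\,w_k
\quad\Longrightarrow\quad
\mathbb{E}[x_{k+1}^2] = (1 - ημ)^2\,\mathbb{E}[x_k^2] + η^2 σ^2 .
$$

$$
\lim_{k\to\infty} \mathbb{E}[x_k^2] = \frac{η^2 σ^2}{1 - (1 - ημ)^2} = \frac{η σ^2}{μ(2 - ημ)},
\qquad
\lim_{k\to\infty} \frac{\mathbb{E}[f(x_k)] - f^*}{σ^2} = \frac{η}{2(2 - ημ)} .
$$

In this scalar case the noise-free contraction factor is $|1 - ημ|$, so increasing $η$ toward $1/μ$ speeds up convergence while increasing the limiting ratio of expected suboptimality to noise power, which is the kind of trade-off the abstract describes.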

Showing 1–13 of 13 results for author: Fallah, A