Distributionally Robust Optimization with Bias and Variance Reduction

Authors: Ronak Mehta, Vincent Roulet, Krishna Pillutla, Zaid Harchaoui

Abstract: We consider the distributionally robust optimization (DRO) problem with spectral risk-based uncertainty set and $f$-divergence penalty. This formulation includes common risk-sensitive learning objectives such as regularized conditional value-at-risk (CVaR) and average top-$k$ loss. We present Prospect, a stochastic gradient-based algorithm that only requires tuning a single learning rate hyperparameter, and prove that it enjoys linear convergence for smooth regularized losses. This contrasts with previous algorithms that either require tuning multiple hyperparameters or potentially fail to converge due to biased gradient estimates or inadequate regularization. Empirically, we show that Prospect can converge 2-3$\times$ faster than baselines such as stochastic gradient and stochastic saddle-point methods on distribution shift and fairness benchmarks spanning tabular, vision, and language domains.

Submitted 20 October, 2023; originally announced October 2023.

arXiv:2212.05149

Stochastic Optimization for Spectral Risk Measures

Authors: Ronak Mehta, Vincent Roulet, Krishna Pillutla, Lang Liu, Zaid Harchaoui

Abstract: Spectral risk objectives - also called $L$-risks - allow for learning systems to interpolate between optimizing average-case performance (as in empirical risk minimization) and worst-case performance on a task. We develop stochastic algorithms to optimize these quantities by characterizing their subdifferential and addressing challenges such as biasedness of subgradient estimates and non-smoothness of the objective. We show theoretically and experimentally that out-of-the-box approaches such as stochastic subgradient and dual averaging are hindered by bias and that our approach outperforms them.

Submitted 9 December, 2022; originally announced December 2022.
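
Illustrative note (not from the paper): the interpolation between average-case and worst-case performance mentioned in the abstract above can be sketched with the textbook form of a spectral risk, a weighted sum of the sorted losses with nonnegative, nondecreasing weights that sum to 1. The function names `spectral_risk`, `uniform_weights`, and `top_k_weights` are hypothetical, and the sketch only evaluates the objective; it does not implement the authors' optimization algorithm.

```python
from typing import List, Sequence


def spectral_risk(losses: Sequence[float], weights: Sequence[float]) -> float:
    """Return sum_i weights[i] * (i-th smallest loss)."""
    if len(losses) != len(weights):
        raise ValueError("losses and weights must have the same length")
    if any(w < 0 for w in weights) or abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must be nonnegative and sum to 1")
    if any(a > b + 1e-12 for a, b in zip(weights, weights[1:])):
        raise ValueError("weights must be nondecreasing")
    # Sorting pairs the largest weights with the largest losses.
    return sum(w * loss for w, loss in zip(weights, sorted(losses)))


def uniform_weights(n: int) -> List[float]:
    """Equal weights: the spectral risk reduces to the average loss."""
    return [1.0 / n] * n


def top_k_weights(n: int, k: int) -> List[float]:
    """Equal weight on the k largest losses: the average top-k loss."""
    return [0.0] * (n - k) + [1.0 / k] * k


losses = [3.0, 1.0, 4.0, 2.0]
average_case = spectral_risk(losses, uniform_weights(4))   # mean of all losses
top_2 = spectral_risk(losses, top_k_weights(4, 2))         # mean of the 2 largest
worst_case = spectral_risk(losses, top_k_weights(4, 1))    # maximum loss
```

With `k = n` the top-k weights recover the average and with `k = 1` the maximum, so the choice of weights moves the objective between the two extremes.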

arXiv:2212.04014

Statistical and Computational Guarantees for Influence Diagnostics

Authors: Jillian Fisher, Lang Liu, Krishna Pillutla, Yejin Choi, Zaid Harchaoui

Abstract: Influence diagnostics such as influence functions and approximate maximum influence perturbations are popular in machine learning and in AI domain applications. Influence diagnostics are powerful statistical tools to identify influential datapoints or subsets of datapoints. We establish finite-sample statistical bounds, as well as computational complexity bounds, for influence functions and approximate maximum influence perturbations using efficient inverse-Hessian-vector product implementations. We illustrate our results with generalized linear models and large attention-based models on synthetic and real data.

Submitted 19 September, 2023; v1 submitted 7 December, 2022; originally announced December 2022.

Comments: For AISTATS 2023. Software see https://github.com/jfisher52/influence_theory

arXiv:2112.09429

doi 10.1007/s10994-023-06332-x

Federated Learning with Superquantile Aggregation for Heterogeneous Data

Authors: Krishna Pillutla, Yassine Laguel, Jérôme Malick, Zaid Harchaoui

Abstract: We present a federated learning framework that is designed to robustly deliver good predictive performance across individual clients with heterogeneous data. The proposed approach hinges upon a superquantile-based learning objective that captures the tail statistics of the error distribution over heterogeneous clients. We present a stochastic training algorithm that interleaves differentially private client filtering with federated averaging steps. We prove finite-time convergence guarantees for the algorithm: $O(1/\sqrt{T})$ in the nonconvex case in $T$ communication rounds and $O(\exp(-T/κ^{3/2}) + κ/T)$ in the strongly convex case with local condition number $κ$. Experimental results on benchmark datasets for federated learning demonstrate that our approach is competitive with classical ones in terms of average error and outperforms them in terms of tail statistics of the error.

Submitted 6 December, 2022; v1 submitted 17 December, 2021; originally announced December 2021.

Comments: Machine Learning Journal, Special Issue on Safe and Fair Machine Learning (To appear)

Journal ref: Machine Learning (2023): 1-68

arXiv:2106.07898

Divergence Frontiers for Generative Models: Sample Complexity, Quantization Effects, and Frontier Integrals

Authors: Lang Liu, Krishna Pillutla, Sean Welleck, Sewoong Oh, Yejin Choi, Zaid Harchaoui

Abstract: The spectacular success of deep generative models calls for quantitative tools to measure their statistical performance. Divergence frontiers have recently been proposed as an evaluation framework for generative models, due to their ability to measure the quality-diversity trade-off inherent to deep generative modeling. We establish non-asymptotic bounds on the sample complexity of divergence frontiers. We also introduce frontier integrals which provide summary statistics of divergence frontiers. We show how smoothed estimators such as Good-Turing or Krichevsky-Trofimov can overcome the missing mass problem and lead to faster rates of convergence. We illustrate the theoretical results with numerical examples from natural language processing and computer vision.

Submitted 11 December, 2021; v1 submitted 15 June, 2021; originally announced June 2021.

arXiv:2002.11223

doi 10.1007/s10994-023-06332-x

Device Heterogeneity in Federated Learning: A Superquantile Approach

Authors: Yassine Laguel, Krishna Pillutla, Jérôme Malick, Zaid Harchaoui

Abstract: We propose a federated learning framework to handle heterogeneous client devices which do not conform to the population data distribution. The approach hinges upon a parameterized superquantile-based objective, where the parameter ranges over levels of conformity. We present an optimization algorithm and establish its convergence to a stationary point. We show how to practically implement it using secure aggregation by interleaving iterations of the usual federated averaging method with device filtering. We conclude with numerical experiments on neural networks as well as linear models on tasks from computer vision and natural language processing.

Submitted 25 February, 2020; originally announced February 2020.

Journal ref: Machine Learning (2023): 1-68

arXiv:1912.13445

doi 10.1109/TSP.2022.3153135

Robust Aggregation for Federated Learning

Authors: Krishna Pillutla, Sham M. Kakade, Zaid Harchaoui

Abstract: Federated learning is the centralized training of statistical models from decentralized data on mobile devices while preserving the privacy of each device. We present a robust aggregation approach to make federated learning robust to settings when a fraction of the devices may be sending corrupted updates to the server. The approach relies on a robust aggregation oracle based on the geometric median, which returns a robust aggregate using a constant number of iterations of a regular non-robust averaging oracle. The robust aggregation oracle is privacy-preserving, similar to the non-robust secure average oracle it builds upon. We establish its convergence for least squares estimation of additive models. We provide experimental results with linear models and deep networks for three tasks in computer vision and natural language processing. The robust aggregation approach is agnostic to the level of corruption; it outperforms the classical aggregation approach in terms of robustness when the level of corruption is high, while being competitive in the regime of low corruption. Two variants, a faster one with one-step robust aggregation and another one with on-device personalization, round off the paper.

Submitted 17 January, 2022; v1 submitted 31 December, 2019; originally announced December 2019.

Journal ref: IEEE Transactions on Signal Processing 70 (2022): 1142-1154
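
Illustrative note (not from the paper): the abstract above describes a geometric-median aggregate obtained from a few calls to an ordinary averaging routine. One classical way to get that structure is a Weiszfeld-style iteration, where each step is a weighted average with weights inversely proportional to the distance from the current estimate. Whether this matches the paper's oracle in its details is an assumption; the names `weighted_average` and `geometric_median`, the iteration count, and the `eps` safeguard are hypothetical choices for this sketch.

```python
import math
from typing import List, Sequence

Vector = List[float]


def weighted_average(points: Sequence[Vector], weights: Sequence[float]) -> Vector:
    """Plain (non-robust) weighted mean, standing in for the averaging oracle."""
    total = sum(weights)
    dim = len(points[0])
    return [sum(w * p[j] for w, p in zip(weights, points)) / total
            for j in range(dim)]


def geometric_median(points: Sequence[Vector], num_iters: int = 50,
                     eps: float = 1e-8) -> Vector:
    """Approximate the geometric median by repeated re-weighted averaging."""
    estimate = weighted_average(points, [1.0] * len(points))  # start at the mean
    for _ in range(num_iters):
        # Far-away points get small weights; eps avoids division by zero.
        weights = [1.0 / max(eps, math.dist(estimate, p)) for p in points]
        estimate = weighted_average(points, weights)
    return estimate


# Four nearby "honest" updates and one corrupted update far away.
updates = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1], [100.0, 100.0]]
mean_update = weighted_average(updates, [1.0] * len(updates))
robust_update = geometric_median(updates)
```

In this toy input the plain mean is dragged toward the corrupted point, while the re-weighted estimate stays near the cluster of four nearby updates.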

arXiv:1902.03228

A Smoother Way to Train Structured Prediction Models

Authors: Krishna Pillutla, Vincent Roulet, Sham M. Kakade, Zaid Harchaoui

Abstract: We present a framework to train a structured prediction model by performing smoothing on the inference algorithm it builds upon. Smoothing overcomes the non-smoothness inherent to the maximum margin structured prediction objective, and paves the way for the use of fast primal gradient-based optimization algorithms. We illustrate the proposed framework by developing a novel primal incremental optimization algorithm for the structural support vector machine. The proposed algorithm blends an extrapolation scheme for acceleration and an adaptive smoothing scheme and builds upon the stochastic variance-reduced gradient algorithm. We establish its worst-case global complexity bound and study several practical variants, including extensions to deep structured prediction. We present experimental results on two real-world problems, namely named entity recognition and visual object localization. The experimental results show that the proposed framework allows us to build upon efficient inference algorithms to develop large-scale optimization algorithms for structured prediction which can achieve competitive performance on the two real-world problems.

Submitted 8 February, 2019; originally announced February 2019.

Comments: Short version appeared in Neural Information Processing Systems (NeurIPS) 2018

arXiv:1710.09430

doi 10.4230/LIPIcs.FSTTCS.2017.2

A Markov Chain Theory Approach to Characterizing the Minimax Optimality of Stochastic Gradient Descent (for Least Squares)

Authors: Prateek Jain, Sham M. Kakade, Rahul Kidambi, Praneeth Netrapalli, Venkata Krishna Pillutla, Aaron Sidford

Abstract: This work provides a simplified proof of the statistical minimax optimality of (iterate averaged) stochastic gradient descent (SGD), for the special case of least squares. This result is obtained by analyzing SGD as a stochastic process and by sharply characterizing the stationary covariance matrix of this process. The finite rate optimality characterization captures the constant factors and addresses model mis-specification.

Submitted 21 July, 2018; v1 submitted 25 October, 2017; originally announced October 2017.

Comments: Lemma 1 has been updated in v2
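
Illustrative note (not from the paper): iterate-averaged SGD for least squares, the procedure named in the abstract above, can be sketched in one dimension as constant-step SGD that also keeps a running mean of its iterates. The synthetic data, step size, noise level, and the function name `averaged_sgd` are assumptions made for this sketch, which says nothing about the paper's proof technique.

```python
import random
from typing import List, Tuple


def averaged_sgd(data: List[Tuple[float, float]],
                 step_size: float) -> Tuple[float, float]:
    """One pass of SGD on 0.5 * (w * x - y) ** 2.

    Returns (last iterate, average of all iterates).
    """
    w = 0.0
    running_sum = 0.0
    for x, y in data:
        gradient = (w * x - y) * x      # derivative of the per-sample loss
        w -= step_size * gradient
        running_sum += w
    return w, running_sum / len(data)


# Synthetic 1-D regression data: y = true_w * x + Gaussian noise.
rng = random.Random(0)
true_w = 2.0
samples = []
for _ in range(2000):
    x = rng.uniform(-1.0, 1.0)
    samples.append((x, true_w * x + rng.gauss(0.0, 0.1)))

last_w, avg_w = averaged_sgd(samples, step_size=0.5)
```

The last iterate keeps fluctuating around the solution because of gradient noise, whereas the averaged iterate smooths those fluctuations out.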

arXiv:1512.04848

Data Driven Resource Allocation for Distributed Learning

Authors: Travis Dick, Mu Li, Venkata Krishna Pillutla, Colin White, Maria Florina Balcan, Alex Smola

Abstract: In distributed machine learning, data is dispatched to multiple machines for processing. Motivated by the fact that similar data points often belong to the same or similar classes, and more generally, classification rules of high accuracy tend to be "locally simple but globally complex" (Vapnik & Bottou 1993), we propose data dependent dispatching that takes advantage of such structure. We present an in-depth analysis of this model, providing new algorithms with provable worst-case guarantees, analysis proving existing scalable heuristics perform well in natural non worst-case conditions, and techniques for extending a dispatching rule from a small sample to the entire distribution. We overcome novel technical challenges to satisfy important conditions for accurate distributed learning, including fault tolerance and balancedness. We empirically compare our approach with baselines based on random partitioning, balanced partition trees, and locality sensitive hashing, showing that we achieve significantly higher accuracy on both synthetic and real world image and advertising datasets. We also demonstrate that our technique strongly scales with the available computing power.

Submitted 15 December, 2016; v1 submitted 15 December, 2015; originally announced December 2015.

Showing 1–10 of 10 results for author: Pillutla, K