Rejection via Learning Density Ratios

Authors: Alexander Soen, Hisham Husain, Philip Schulz, Vu Nguyen

Abstract: Classification with rejection emerges as a learning paradigm which allows models to abstain from making predictions. The predominant approach is to alter the supervised learning pipeline by augmenting typical loss functions, letting model rejection incur a lower loss than an incorrect prediction. Instead, we propose a different distributional perspective, where we seek to find an idealized data distribution which maximizes a pretrained model's performance. This can be formalized via the optimization of a loss's risk with a $φ$-divergence regularization term. Through this idealized distribution, a rejection decision can be made by utilizing the density ratio between this distribution and the data distribution. We focus on the setting where our $φ$-divergences are specified by the family of $α$-divergence. Our framework is tested empirically over clean and noisy datasets.

Submitted 28 May, 2024; originally announced May 2024.

arXiv:2203.02128 [pdf, other]

Distributionally Robust Bayesian Optimization with $\varphi$-divergences

Authors: Hisham Husain, Vu Nguyen, Anton van den Hengel

Abstract: The study of robustness has received much attention due to its inevitability in data-driven settings where many systems face uncertainty. One such example of concern is Bayesian Optimization (BO), where uncertainty is multi-faceted, yet there only exists a limited number of works dedicated to this direction. In particular, there is the work of Kirschner et al. (2020), which bridges the existing literature of Distributionally Robust Optimization (DRO) by casting the BO problem from the lens of DRO. While this work is pioneering, it admittedly suffers from various practical shortcomings such as finite contexts assumptions, leaving behind the main question: Can one devise a computationally tractable algorithm for solving this DRO-BO problem? In this work, we tackle this question to a large degree of generality by considering robustness against data-shift in $\varphi$-divergences, which subsumes many popular choices, such as the $χ^2$-divergence, Total Variation, and the extant Kullback-Leibler (KL) divergence. We show that the DRO-BO problem in this setting is equivalent to a finite-dimensional optimization problem which, even in the continuous context setting, can be easily implemented with provable sublinear regret bounds. We then show experimentally that our method surpasses existing methods, attesting to the theoretical results.

Submitted 27 October, 2023; v1 submitted 3 March, 2022; originally announced March 2022.

Comments: NeurIPS 2023 camera ready paper

arXiv:2102.08093

A Law of Robustness for Weight-bounded Neural Networks

Authors: Hisham Husain, Borja Balle

Abstract: Robustness of deep neural networks against adversarial perturbations is a pressing concern motivated by recent findings showing the pervasive nature of such vulnerabilities. One method of characterizing the robustness of a neural network model is through its Lipschitz constant, which forms a robustness certificate. A natural question to ask is, for a fixed model class (such as neural networks) and a dataset of size $n$, what is the smallest achievable Lipschitz constant among all models that fit the dataset? Recently, (Bubeck et al., 2020) conjectured that when using two-layer networks with $k$ neurons to fit a generic dataset, the smallest Lipschitz constant is $Ω(\sqrt{\frac{n}{k}})$. This implies that one would require one neuron per data point to robustly fit the data. In this work we derive a lower bound on the Lipschitz constant for any arbitrary model class with bounded Rademacher complexity. Our result coincides with that conjectured in (Bubeck et al., 2020) for two-layer networks under the assumption of bounded weights. However, due to our result's generality, we also derive bounds for multi-layer neural networks, discovering that one requires $\log n$ constant-sized layers to robustly fit the data. Thus, our work establishes a law of robustness for weight bounded neural networks and provides formal evidence on the necessity of over-parametrization in deep learning.

Submitted 12 March, 2021; v1 submitted 16 February, 2021; originally announced February 2021.

Comments: The main result does not resolve the conjecture as claimed. However the proof technique can be used to obtain a weaker result. The manuscript will be updated at a later date

arXiv:2101.07012 [pdf, other]

Regularized Policies are Reward Robust

Authors: Hisham Husain, Kamil Ciosek, Ryota Tomioka

Abstract: Entropic regularization of policies in Reinforcement Learning (RL) is a commonly used heuristic to ensure that the learned policy explores the state-space sufficiently before overfitting to a local optimal policy. The primary motivation for using entropy is for exploration and disambiguating optimal policies; however, the theoretical effects are not entirely understood. In this work, we study the more general regularized RL objective and, using Fenchel duality, derive the dual problem, which takes the form of an adversarial reward problem. In particular, we find that the optimal policy found by a regularized objective is precisely an optimal policy of a reinforcement learning problem under a worst-case adversarial reward. Our result allows us to reinterpret the popular entropic regularization scheme as a form of robustification. Furthermore, due to the generality of our results, we apply them to other existing regularization schemes. Our results thus give insights into the effects of regularization of policies and deepen our understanding of exploration through robust rewards at large.

Submitted 18 January, 2021; originally announced January 2021.

arXiv:2012.00188 [pdf, other]

Fair Densities via Boosting the Sufficient Statistics of Exponential Families

Authors: Alexander Soen, Hisham Husain, Richard Nock

Abstract: We introduce a boosting algorithm to pre-process data for fairness. Starting from an initial fair but inaccurate distribution, our approach shifts towards better data fitting while still ensuring a minimal fairness guarantee. To do so, it learns the sufficient statistics of an exponential family with boosting-compliant convergence. Importantly, we are able to theoretically prove that the learned distribution will have a representation rate and statistical rate data fairness guarantee. Unlike recent optimization based pre-processing methods, our approach can be easily adapted for continuous domain features. Furthermore, when the weak learners are specified to be decision trees, the sufficient statistics of the learned distribution can be examined to provide clues on sources of (un)fairness. Empirical results are presented to display the quality of results on real-world data.

Submitted 15 August, 2023; v1 submitted 30 November, 2020; originally announced December 2020.

Comments: Published in Proceedings of the 40th International Conference on Machine Learning (ICML2023)

arXiv:2006.05188 [pdf, other]

Optimal Continual Learning has Perfect Memory and is NP-hard

Authors: Jeremias Knoblauch, Hisham Husain, Tom Diethe

Abstract: Continual Learning (CL) algorithms incrementally learn a predictor or representation across multiple sequentially observed tasks. Designing CL algorithms that perform reliably and avoid so-called catastrophic forgetting has proven a persistent challenge. The current paper develops a theoretical approach that explains why. In particular, we derive the computational properties which CL algorithms would have to possess in order to avoid catastrophic forgetting. Our main finding is that such optimal CL algorithms generally solve an NP-hard problem and will require perfect memory to do so. The findings are of theoretical interest, but also explain the excellent performance of CL algorithms using experience replay, episodic memory and core sets relative to regularization-based approaches.

Submitted 9 June, 2020; originally announced June 2020.

Comments: Accepted for publication at ICML (International Conference on Machine Learning) 2020; 13 pages, 8 Figures

arXiv:2006.04349 [pdf, other]

Distributional Robustness with IPMs and links to Regularization and GANs

Authors: Hisham Husain

Abstract: Robustness to adversarial attacks is an important concern due to the fragility of deep neural networks to small perturbations and has received an abundance of attention in recent years. Distributionally Robust Optimization (DRO), a particularly promising way of addressing this challenge, studies robustness via divergence-based uncertainty sets and has provided valuable insights into robustification strategies such as regularization. In the context of machine learning, the majority of existing results have chosen $f$-divergences, Wasserstein distances and more recently, the Maximum Mean Discrepancy (MMD) to construct uncertainty sets. We extend this line of work for the purposes of understanding robustness via regularization by studying uncertainty sets constructed with Integral Probability Metrics (IPMs) - a large family of divergences including the MMD, Total Variation and Wasserstein distances. Our main result shows that DRO under \textit{any} choice of IPM corresponds to a family of regularization penalties, which recover and improve upon existing results in the setting of MMD and Wasserstein distances. Due to the generality of our result, we show that other choices of IPMs correspond to other commonly used penalties in machine learning. Furthermore, we extend our results to shed light on adversarial generative modelling via $f$-GANs, constituting the first study of distributional robustness for the $f$-GAN objective. Our results unveil the inductive properties of the discriminator set with regards to robustness, allowing us to give positive comments for several penalty-based GAN methods such as Wasserstein-, MMD- and Sobolev-GANs. In summary, our results intimately link GANs to distributional robustness, extend previous results on DRO and contribute to our understanding of the link between regularization and robustness at large.

Submitted 8 June, 2020; originally announced June 2020.

arXiv:1909.09436 [pdf, other]

CodeSearchNet Challenge: Evaluating the State of Semantic Code Search

Authors: Hamel Husain, Ho-Hsiang Wu, Tiferet Gazit, Miltiadis Allamanis, Marc Brockschmidt

Abstract: Semantic code search is the task of retrieving relevant code given a natural language query. While related to other information retrieval tasks, it requires bridging the gap between the language used in code (often abbreviated and highly technical) and natural language more suitable to describe vague concepts and ideas. To enable evaluation of progress on code search, we are releasing the CodeSearchNet Corpus and are presenting the CodeSearchNet Challenge, which consists of 99 natural language queries with about 4k expert relevance annotations of likely results from CodeSearchNet Corpus. The corpus contains about 6 million functions from open-source code spanning six programming languages (Go, Java, JavaScript, PHP, Python, and Ruby). The CodeSearchNet Corpus also contains automatically generated query-like natural language for 2 million functions, obtained from mechanically scraping and preprocessing associated function documentation. In this article, we describe the methodology used to obtain the corpus and expert labels, as well as a number of simple baseline solutions for the task. We hope that CodeSearchNet Challenge encourages researchers and practitioners to study this interesting task further and will host a competition and leaderboard to track the progress on the challenge. We are also keen on extending CodeSearchNet Challenge to more queries and programming languages in the future.

Submitted 8 June, 2020; v1 submitted 20 September, 2019; originally announced September 2019.

Comments: Updated evaluation numbers after fixing indexing bug

arXiv:1902.00985 [pdf, ps, other]

Adversarial Networks and Autoencoders: The Primal-Dual Relationship and Generalization Bounds

Authors: Hisham Husain, Richard Nock, Robert C. Williamson

Abstract: Since the introduction of Generative Adversarial Networks (GANs) and Variational Autoencoders (VAE), the literature on generative modelling has witnessed an overwhelming resurgence. The impressive, yet elusive empirical performance of GANs has led to the rise of many GAN-VAE hybrids, with the hopes of GAN level performance and additional benefits of VAE, such as an encoder for feature reduction, which is not offered by GANs. Recently, the Wasserstein Autoencoder (WAE) was proposed, achieving performance similar to that of GANs, yet it is still unclear whether the two are fundamentally different or can be further improved into a unified model. In this work, we study the $f$-GAN and WAE models and make two main discoveries. First, we find that the $f$-GAN and WAE objectives partake in a primal-dual relationship and are equivalent under some assumptions, which then allows us to explicate the success of WAE. Second, the equivalence result allows us to, for the first time, prove generalization bounds for Autoencoder models, which is a pertinent problem when it comes to theoretical analyses of generative models. Furthermore, we show that the WAE objective is related to other statistical quantities such as the $f$-divergence and in particular, upper bounded by the Wasserstein distance, which then allows us to tap into existing efficient (regularized) optimal transport solvers. Our findings thus present the first primal-dual relationship between GANs and Autoencoder models, comment on generalization abilities and make a step towards unifying these models.

Submitted 26 April, 2019; v1 submitted 3 February, 2019; originally announced February 2019.

arXiv:1806.04819 [pdf, other]

Integral Privacy for Sampling

Authors: Hisham Husain, Zac Cranko, Richard Nock

Abstract: Differential privacy is a leading protection setting, focused by design on individual privacy. Many applications, in medical / pharmaceutical domains or social networks, rather posit privacy at a group level, a setting we call integral privacy. We aim for the strongest form of privacy: the group size is in particular not known in advance. We study a problem with related applications in domains cited above that have recently met with substantial press: sampling. Keeping correct utility levels in such a strong model of statistical indistinguishability looks difficult to achieve with the usual differential privacy toolbox because it would typically scale in the worst case the sensitivity by the sample size and so the noise variance by up to its square. We introduce a trick specific to sampling that bypasses the sensitivity analysis. Privacy enforces an information theoretic barrier on approximation, and we show how to reach this barrier with guarantees on the approximation of the target non private density. We do so using a recent approach to non private density estimation relying on the original boosting theory, learning the sufficient statistics of an exponential family with classifiers. Approximation guarantees cover the mode capture problem. In the context of learning, the sampling problem is particularly important: because integral privacy enjoys the same closure under post-processing as differential privacy does, any algorithm using integrally privately sampled data would result in an output equally integrally private. We also show that this brings fairness guarantees on post-processing that would eventually elude classical differential privacy: any decision process has bounded data-dependent bias when the data is integrally privately sampled. Experimental results against private kernel density estimation and private GANs display the quality of our results.

Submitted 2 July, 2019; v1 submitted 12 June, 2018; originally announced June 2018.

Showing 1–10 of 10 results for author: Husain, H