Search | arXiv e-print repository

Learning Continually by Spectral Regularization

Authors: Alex Lewandowski, Saurabh Kumar, Dale Schuurmans, András György, Marlos C. Machado

Abstract: Loss of plasticity is a phenomenon where neural networks become more difficult to train during the course of learning. Continual learning algorithms seek to mitigate this effect by sustaining good predictive performance while maintaining network trainability. We develop new techniques for improving continual learning by first reconsidering how initialization can ensure trainability during early ph… ▽ More Loss of plasticity is a phenomenon where neural networks become more difficult to train during the course of learning. Continual learning algorithms seek to mitigate this effect by sustaining good predictive performance while maintaining network trainability. We develop new techniques for improving continual learning by first reconsidering how initialization can ensure trainability during early phases of learning. From this perspective, we derive new regularization strategies for continual learning that ensure beneficial initialization properties are better maintained throughout training. In particular, we investigate two new regularization techniques for continual learning: (i) Wasserstein regularization toward the initial weight distribution, which is less restrictive than regularizing toward initial weights; and (ii) regularizing weight matrix singular values, which directly ensures gradient diversity is maintained throughout training. We present an experimental analysis that shows these alternative regularizers can improve continual learning performance across a range of supervised learning tasks and model architectures. The alternative regularizers prove to be less sensitive to hyperparameters while demonstrating better training in individual tasks, sustaining trainability as new tasks arrive, and achieving better generalization performance. △ Less

Submitted 10 June, 2024; originally announced June 2024.

arXiv:2406.02543 [pdf, other]

To Believe or Not to Believe Your LLM

Authors: Yasin Abbasi Yadkori, Ilja Kuzborskij, András György, Csaba Szepesvári

Abstract: We explore uncertainty quantification in large language models (LLMs), with the goal to identify when uncertainty in responses given a query is large. We simultaneously consider both epistemic and aleatoric uncertainties, where the former comes from the lack of knowledge about the ground truth (such as about facts or the language), and the latter comes from irreducible randomness (such as multiple… ▽ More We explore uncertainty quantification in large language models (LLMs), with the goal to identify when uncertainty in responses given a query is large. We simultaneously consider both epistemic and aleatoric uncertainties, where the former comes from the lack of knowledge about the ground truth (such as about facts or the language), and the latter comes from irreducible randomness (such as multiple possible answers). In particular, we derive an information-theoretic metric that allows to reliably detect when only epistemic uncertainty is large, in which case the output of the model is unreliable. This condition can be computed based solely on the output of the model obtained simply by some special iterative prompting based on the previous responses. Such quantification, for instance, allows to detect hallucinations (cases when epistemic uncertainty is high) in both single- and multi-answer responses. This is in contrast to many standard uncertainty quantification strategies (such as thresholding the log-likelihood of a response) where hallucinations in the multi-answer case cannot be detected. We conduct a series of experiments which demonstrate the advantage of our formulation. Further, our investigations shed some light on how the probabilities assigned to a given output by an LLM can be amplified by iterative prompting, which might be of independent interest. △ Less

Submitted 4 June, 2024; originally announced June 2024.

arXiv:2405.01563 [pdf, other]

Mitigating LLM Hallucinations via Conformal Abstention

Authors: Yasin Abbasi Yadkori, Ilja Kuzborskij, David Stutz, András György, Adam Fisch, Arnaud Doucet, Iuliya Beloshapka, Wei-Hung Weng, Yao-Yuan Yang, Csaba Szepesvári, Ali Taylan Cemgil, Nenad Tomasev

Abstract: We develop a principled procedure for determining when a large language model (LLM) should abstain from responding (e.g., by saying "I don't know") in a general domain, instead of resorting to possibly "hallucinating" a non-sensical or incorrect answer. Building on earlier approaches that use self-consistency as a more reliable measure of model confidence, we propose using the LLM itself to self-e… ▽ More We develop a principled procedure for determining when a large language model (LLM) should abstain from responding (e.g., by saying "I don't know") in a general domain, instead of resorting to possibly "hallucinating" a non-sensical or incorrect answer. Building on earlier approaches that use self-consistency as a more reliable measure of model confidence, we propose using the LLM itself to self-evaluate the similarity between each of its sampled responses for a given query. We then further leverage conformal prediction techniques to develop an abstention procedure that benefits from rigorous theoretical guarantees on the hallucination rate (error rate). Experimentally, our resulting conformal abstention method reliably bounds the hallucination rate on various closed-book, open-domain generative question answering datasets, while also maintaining a significantly less conservative abstention rate on a dataset with long responses (Temporal Sequences) compared to baselines using log-probability scores to quantify uncertainty, while achieveing comparable performance on a dataset with short answers (TriviaQA). To evaluate the experiments automatically, one needs to determine if two responses are equivalent given a question. Following standard practice, we use a thresholded similarity function to determine if two responses match, but also provide a method for calibrating the threshold based on conformal prediction, with theoretical guarantees on the accuracy of the match prediction, which might be of independent interest. △ Less

Submitted 4 April, 2024; originally announced May 2024.

arXiv:2403.01518 [pdf, other]

Revisiting Dynamic Evaluation: Online Adaptation for Large Language Models

Authors: Amal Rannen-Triki, Jorg Bornschein, Razvan Pascanu, Marcus Hutter, Andras György, Alexandre Galashov, Yee Whye Teh, Michalis K. Titsias

Abstract: We consider the problem of online fine tuning the parameters of a language model at test time, also known as dynamic evaluation. While it is generally known that this approach improves the overall predictive performance, especially when considering distributional shift between training and evaluation data, we here emphasize the perspective that online adaptation turns parameters into temporally ch… ▽ More We consider the problem of online fine tuning the parameters of a language model at test time, also known as dynamic evaluation. While it is generally known that this approach improves the overall predictive performance, especially when considering distributional shift between training and evaluation data, we here emphasize the perspective that online adaptation turns parameters into temporally changing states and provides a form of context-length extension with memory in weights, more in line with the concept of memory in neuroscience. We pay particular attention to the speed of adaptation (in terms of sample efficiency),sensitivity to the overall distributional drift, and the computational overhead for performing gradient computations and parameter updates. Our empirical study provides insights on when online adaptation is particularly interesting. We highlight that with online adaptation the conceptual distinction between in-context learning and fine tuning blurs: both are methods to condition the model on previously observed tokens. △ Less

Submitted 3 March, 2024; originally announced March 2024.

arXiv:2402.05878 [pdf, other]

Prior-Dependent Allocations for Bayesian Fixed-Budget Best-Arm Identification in Structured Bandits

Authors: Nicolas Nguyen, Imad Aouali, András György, Claire Vernade

Abstract: We study the problem of Bayesian fixed-budget best-arm identification (BAI) in structured bandits. We propose an algorithm that uses fixed allocations based on the prior information and the structure of the environment. We provide theoretical bounds on its performance across diverse models, including the first prior-dependent upper bounds for linear and hierarchical BAI. Our key contribution is in… ▽ More We study the problem of Bayesian fixed-budget best-arm identification (BAI) in structured bandits. We propose an algorithm that uses fixed allocations based on the prior information and the structure of the environment. We provide theoretical bounds on its performance across diverse models, including the first prior-dependent upper bounds for linear and hierarchical BAI. Our key contribution is introducing new proof methods that result in tighter bounds for multi-armed BAI compared to existing methods. We extensively compare our approach to other fixed-budget BAI methods, demonstrating its consistent and robust performance in various settings. Our work improves our understanding of Bayesian fixed-budget BAI in structured bandits and highlights the effectiveness of our approach in practical scenarios. △ Less

Submitted 8 February, 2024; originally announced February 2024.

arXiv:2402.03928 [pdf, other]

Approximating the Core via Iterative Coalition Sampling

Authors: Ian Gemp, Marc Lanctot, Luke Marris, Yiran Mao, Edgar Duéñez-Guzmán, Sarah Perrin, Andras Gyorgy, Romuald Elie, Georgios Piliouras, Michael Kaisers, Daniel Hennes, Kalesha Bullard, Kate Larson, Yoram Bachrach

Abstract: The core is a central solution concept in cooperative game theory, defined as the set of feasible allocations or payments such that no subset of agents has incentive to break away and form their own subgroup or coalition. However, it has long been known that the core (and approximations, such as the least-core) are hard to compute. This limits our ability to analyze cooperative games in general, a… ▽ More The core is a central solution concept in cooperative game theory, defined as the set of feasible allocations or payments such that no subset of agents has incentive to break away and form their own subgroup or coalition. However, it has long been known that the core (and approximations, such as the least-core) are hard to compute. This limits our ability to analyze cooperative games in general, and to fully embrace cooperative game theory contributions in domains such as explainable AI (XAI), where the core can complement the Shapley values to identify influential features or instances supporting predictions by black-box models. We propose novel iterative algorithms for computing variants of the core, which avoid the computational bottleneck of many other approaches; namely solving large linear programs. As such, they scale better to very large problems as we demonstrate across different classes of cooperative games, including weighted voting games, induced subgraph games, and marginal contribution networks. We also explore our algorithms in the context of XAI, providing further evidence of the power of the core for such applications. △ Less

Submitted 6 February, 2024; originally announced February 2024.

Comments: Published in AAMAS 2024

arXiv:2310.07811 [pdf, ps, other]

Online RL in Linearly $q^π$-Realizable MDPs Is as Easy as in Linear MDPs If You Learn What to Ignore

Authors: Gellért Weisz, András György, Csaba Szepesvári

Abstract: We consider online reinforcement learning (RL) in episodic Markov decision processes (MDPs) under the linear $q^π$-realizability assumption, where it is assumed that the action-values of all policies can be expressed as linear functions of state-action features. This class is known to be more general than linear MDPs, where the transition kernel and the reward function are assumed to be linear fun… ▽ More We consider online reinforcement learning (RL) in episodic Markov decision processes (MDPs) under the linear $q^π$-realizability assumption, where it is assumed that the action-values of all policies can be expressed as linear functions of state-action features. This class is known to be more general than linear MDPs, where the transition kernel and the reward function are assumed to be linear functions of the feature vectors. As our first contribution, we show that the difference between the two classes is the presence of states in linearly $q^π$-realizable MDPs where for any policy, all the actions have approximately equal values, and skipping over these states by following an arbitrarily fixed policy in those states transforms the problem to a linear MDP. Based on this observation, we derive a novel (computationally inefficient) learning algorithm for linearly $q^π$-realizable MDPs that simultaneously learns what states should be skipped over and runs another learning algorithm on the linear MDP hidden in the problem. The method returns an $ε$-optimal policy after $\text{polylog}(H, d)/ε^2$ interactions with the MDP, where $H$ is the time horizon and $d$ is the dimension of the feature vectors, giving the first polynomial-sample-complexity online RL algorithm for this setting. The results are proved for the misspecified case, where the sample complexity is shown to degrade gracefully with the misspecification error. △ Less

Submitted 20 December, 2023; v1 submitted 11 October, 2023; originally announced October 2023.

arXiv:2305.13934 [pdf, other]

doi 10.1038/s41598-023-38964-3

Perception, performance, and detectability of conversational artificial intelligence across 32 university courses

Authors: Hazem Ibrahim, Fengyuan Liu, Rohail Asim, Balaraju Battu, Sidahmed Benabderrahmane, Bashar Alhafni, Wifag Adnan, Tuka Alhanai, Bedoor AlShebli, Riyadh Baghdadi, Jocelyn J. Bélanger, Elena Beretta, Kemal Celik, Moumena Chaqfeh, Mohammed F. Daqaq, Zaynab El Bernoussi, Daryl Fougnie, Borja Garcia de Soto, Alberto Gandolfi, Andras Gyorgy, Nizar Habash, J. Andrew Harris, Aaron Kaufman, Lefteris Kirousis, Korhan Kocak , et al. (14 additional authors not shown)

Abstract: The emergence of large language models has led to the development of powerful tools such as ChatGPT that can produce text indistinguishable from human-generated work. With the increasing accessibility of such technology, students across the globe may utilize it to help with their school work -- a possibility that has sparked discussions on the integrity of student evaluations in the age of artific… ▽ More The emergence of large language models has led to the development of powerful tools such as ChatGPT that can produce text indistinguishable from human-generated work. With the increasing accessibility of such technology, students across the globe may utilize it to help with their school work -- a possibility that has sparked discussions on the integrity of student evaluations in the age of artificial intelligence (AI). To date, it is unclear how such tools perform compared to students on university-level courses. Further, students' perspectives regarding the use of such tools, and educators' perspectives on treating their use as plagiarism, remain unknown. Here, we compare the performance of ChatGPT against students on 32 university-level courses. We also assess the degree to which its use can be detected by two classifiers designed specifically for this purpose. Additionally, we conduct a survey across five countries, as well as a more in-depth survey at the authors' institution, to discern students' and educators' perceptions of ChatGPT's use. We find that ChatGPT's performance is comparable, if not superior, to that of students in many courses. Moreover, current AI-text classifiers cannot reliably detect ChatGPT's use in school work, due to their propensity to classify human-written answers as AI-generated, as well as the ease with which AI-generated text can be edited to evade detection. Finally, we find an emerging consensus among students to use the tool, and among educators to treat this as plagiarism. Our findings offer insights that could guide policy discussions addressing the integration of AI into educational frameworks. △ Less

Submitted 7 May, 2023; originally announced May 2023.

Comments: 17 pages, 4 figures

arXiv:2305.11032 [pdf, ps, other]

Optimistic Natural Policy Gradient: a Simple Efficient Policy Optimization Framework for Online RL

Authors: Qinghua Liu, Gellért Weisz, András György, Chi Jin, Csaba Szepesvári

Abstract: While policy optimization algorithms have played an important role in recent empirical success of Reinforcement Learning (RL), the existing theoretical understanding of policy optimization remains rather limited -- they are either restricted to tabular MDPs or suffer from highly suboptimal sample complexity, especial in online RL where exploration is necessary. This paper proposes a simple efficie… ▽ More While policy optimization algorithms have played an important role in recent empirical success of Reinforcement Learning (RL), the existing theoretical understanding of policy optimization remains rather limited -- they are either restricted to tabular MDPs or suffer from highly suboptimal sample complexity, especial in online RL where exploration is necessary. This paper proposes a simple efficient policy optimization framework -- Optimistic NPG for online RL. Optimistic NPG can be viewed as a simple combination of the classic natural policy gradient (NPG) algorithm [Kakade, 2001] with optimistic policy evaluation subroutines to encourage exploration. For $d$-dimensional linear MDPs, Optimistic NPG is computationally efficient, and learns an $\varepsilon$-optimal policy within $\tilde{O}(d^2/\varepsilon^3)$ samples, which is the first computationally efficient algorithm whose sample complexity has the optimal dimension dependence $\tildeΘ(d^2)$. It also improves over state-of-the-art results of policy optimization algorithms [Zanette et al., 2021] by a factor of $d$. In the realm of general function approximation, which subsumes linear MDPs, Optimistic NPG, to our best knowledge, stands as the first policy optimization algorithm that achieves polynomial sample complexity for learning near-optimal policies. △ Less

Submitted 3 December, 2023; v1 submitted 18 May, 2023; originally announced May 2023.

arXiv:2302.05371 [pdf, ps, other]

A Second-Order Method for Stochastic Bandit Convex Optimisation

Authors: Tor Lattimore, András György

Abstract: We introduce a simple and efficient algorithm for unconstrained zeroth-order stochastic convex bandits and prove its regret is at most $(1 + r/d)[d^{1.5} \sqrt{n} + d^3] polylog(n, d, r)$ where $n$ is the horizon, $d$ the dimension and $r$ is the radius of a known ball containing the minimiser of the loss. We introduce a simple and efficient algorithm for unconstrained zeroth-order stochastic convex bandits and prove its regret is at most $(1 + r/d)[d^{1.5} \sqrt{n} + d^3] polylog(n, d, r)$ where $n$ is the horizon, $d$ the dimension and $r$ is the radius of a known ball containing the minimiser of the loss. △ Less

Submitted 10 February, 2023; originally announced February 2023.

Comments: 27 pages

arXiv:2301.03236 [pdf, other]

Optimistic Meta-Gradients

Authors: Sebastian Flennerhag, Tom Zahavy, Brendan O'Donoghue, Hado van Hasselt, András György, Satinder Singh

Abstract: We study the connection between gradient-based meta-learning and convex op-timisation. We observe that gradient descent with momentum is a special case of meta-gradients, and building on recent results in optimisation, we prove convergence rates for meta-learning in the single task setting. While a meta-learned update rule can yield faster convergence up to constant factor, it is not sufficient fo… ▽ More We study the connection between gradient-based meta-learning and convex op-timisation. We observe that gradient descent with momentum is a special case of meta-gradients, and building on recent results in optimisation, we prove convergence rates for meta-learning in the single task setting. While a meta-learned update rule can yield faster convergence up to constant factor, it is not sufficient for acceleration. Instead, some form of optimism is required. We show that optimism in meta-learning can be captured through Bootstrapped Meta-Gradients (Flennerhag et al., 2022), providing deeper insight into its underlying mechanics. △ Less

Submitted 9 January, 2023; originally announced January 2023.

arXiv:2212.12532 [pdf, other]

Generalization Bounds for Few-Shot Transfer Learning with Pretrained Classifiers

Authors: Tomer Galanti, András György, Marcus Hutter

Abstract: We study the ability of foundation models to learn representations for classification that are transferable to new, unseen classes. Recent results in the literature show that representations learned by a single classifier over many classes are competitive on few-shot learning problems with representations learned by special-purpose algorithms designed for such problems. We offer a theoretical expl… ▽ More We study the ability of foundation models to learn representations for classification that are transferable to new, unseen classes. Recent results in the literature show that representations learned by a single classifier over many classes are competitive on few-shot learning problems with representations learned by special-purpose algorithms designed for such problems. We offer a theoretical explanation for this behavior based on the recently discovered phenomenon of class-feature-variability collapse, that is, that during the training of deep classification networks the feature embeddings of samples belonging to the same class tend to concentrate around their class means. More specifically, we show that the few-shot error of the learned feature map on new classes (defined as the classification error of the nearest class-center classifier using centers learned from a small number of random samples from each new class) is small in case of class-feature-variability collapse, under the assumption that the classes are selected independently from a fixed distribution. This suggests that foundation models can provide feature maps that are transferable to new downstream tasks, even with very few samples; to our knowledge, this is the first performance bound for transfer-learning that is non-vacuous in the few-shot setting. △ Less

Submitted 16 July, 2023; v1 submitted 23 December, 2022; originally announced December 2022.

Comments: arXiv admin note: substantial text overlap with arXiv:2112.15121

arXiv:2212.03319 [pdf, other]

Understanding Self-Predictive Learning for Reinforcement Learning

Authors: Yunhao Tang, Zhaohan Daniel Guo, Pierre Harvey Richemond, Bernardo Ávila Pires, Yash Chandak, Rémi Munos, Mark Rowland, Mohammad Gheshlaghi Azar, Charline Le Lan, Clare Lyle, András György, Shantanu Thakoor, Will Dabney, Bilal Piot, Daniele Calandriello, Michal Valko

Abstract: We study the learning dynamics of self-predictive learning for reinforcement learning, a family of algorithms that learn representations by minimizing the prediction error of their own future latent representations. Despite its recent empirical success, such algorithms have an apparent defect: trivial representations (such as constants) minimize the prediction error, yet it is obviously undesirabl… ▽ More We study the learning dynamics of self-predictive learning for reinforcement learning, a family of algorithms that learn representations by minimizing the prediction error of their own future latent representations. Despite its recent empirical success, such algorithms have an apparent defect: trivial representations (such as constants) minimize the prediction error, yet it is obviously undesirable to converge to such solutions. Our central insight is that careful designs of the optimization dynamics are critical to learning meaningful representations. We identify that a faster paced optimization of the predictor and semi-gradient updates on the representation, are crucial to preventing the representation collapse. Then in an idealized setup, we show self-predictive learning dynamics carries out spectral decomposition on the state transition matrix, effectively capturing information of the transition dynamics. Building on the theoretical insights, we propose bidirectional self-predictive learning, a novel self-predictive algorithm that learns two representations simultaneously. We examine the robustness of our theoretical insights with a number of small-scale experiments and showcase the promise of the novel representation learning algorithm with large-scale experiments. △ Less

Submitted 6 December, 2022; originally announced December 2022.

arXiv:2210.15755 [pdf, ps, other]

Confident Approximate Policy Iteration for Efficient Local Planning in $q^π$-realizable MDPs

Authors: Gellért Weisz, András György, Tadashi Kozuno, Csaba Szepesvári

Abstract: We consider approximate dynamic programming in $γ$-discounted Markov decision processes and apply it to approximate planning with linear value-function approximation. Our first contribution is a new variant of Approximate Policy Iteration (API), called Confident Approximate Policy Iteration (CAPI), which computes a deterministic stationary policy with an optimal error bound scaling linearly with t… ▽ More We consider approximate dynamic programming in $γ$-discounted Markov decision processes and apply it to approximate planning with linear value-function approximation. Our first contribution is a new variant of Approximate Policy Iteration (API), called Confident Approximate Policy Iteration (CAPI), which computes a deterministic stationary policy with an optimal error bound scaling linearly with the product of the effective horizon $H$ and the worst-case approximation error $ε$ of the action-value functions of stationary policies. This improvement over API (whose error scales with $H^2$) comes at the price of an $H$-fold increase in memory cost. Unlike Scherrer and Lesner [2012], who recommended computing a non-stationary policy to achieve a similar improvement (with the same memory overhead), we are able to stick to stationary policies. This allows for our second contribution, the application of CAPI to planning with local access to a simulator and $d$-dimensional linear function approximation. As such, we design a planning algorithm that applies CAPI to obtain a sequence of policies with successively refined accuracies on a dynamically evolving set of states. The algorithm outputs an $\tilde O(\sqrt{d}Hε)$-optimal policy after issuing $\tilde O(dH^4/ε^2)$ queries to the simulator, simultaneously achieving the optimal accuracy bound and the best known query complexity bound, while earlier algorithms in the literature achieve only one of them. This query complexity is shown to be tight in all parameters except $H$. These improvements come at the expense of a mild (polynomial) increase in memory and computational costs of both the algorithm and its output policy. △ Less

Submitted 27 October, 2022; originally announced October 2022.

arXiv:2205.13170 [pdf, other]

Distributed Contextual Linear Bandits with Minimax Optimal Communication Cost

Authors: Sanae Amani, Tor Lattimore, András György, Lin F. Yang

Abstract: We study distributed contextual linear bandits with stochastic contexts, where $N$ agents act cooperatively to solve a linear bandit-optimization problem with $d$-dimensional features over the course of $T$ rounds. For this problem, we derive the first ever information-theoretic lower bound $Ω(dN)$ on the communication cost of any algorithm that performs optimally in a regret minimization setup. W… ▽ More We study distributed contextual linear bandits with stochastic contexts, where $N$ agents act cooperatively to solve a linear bandit-optimization problem with $d$-dimensional features over the course of $T$ rounds. For this problem, we derive the first ever information-theoretic lower bound $Ω(dN)$ on the communication cost of any algorithm that performs optimally in a regret minimization setup. We then propose a distributed batch elimination version of the LinUCB algorithm, DisBE-LUCB, where the agents share information among each other through a central server. We prove that the communication cost of DisBE-LUCB matches our lower bound up to logarithmic factors. In particular, for scenarios with known context distribution, the communication cost of DisBE-LUCB is only $\tilde{\mathcal{O}}(dN)$ and its regret is ${\tilde{\mathcal{O}}}(\sqrt{dNT})$, which is of the same order as that incurred by an optimal single-agent algorithm for $NT$ rounds. We also provide similar bounds for practical settings where the context distribution can only be estimated. Therefore, our proposed algorithm is nearly minimax optimal in terms of \emph{both regret and communication cost}. Finally, we propose DecBE-LUCB, a fully decentralized version of DisBE-LUCB, which operates without a central server, where agents share information with their \emph{immediate neighbors} through a carefully designed consensus procedure. △ Less

Submitted 7 December, 2022; v1 submitted 26 May, 2022; originally announced May 2022.

arXiv:2202.13001 [pdf, other]

Non-stationary Bandits and Meta-Learning with a Small Set of Optimal Arms

Authors: MohammadJavad Azizi, Thang Duong, Yasin Abbasi-Yadkori, András György, Claire Vernade, Mohammad Ghavamzadeh

Abstract: We study a sequential decision problem where the learner faces a sequence of $K$-armed bandit tasks. The task boundaries might be known (the bandit meta-learning setting), or unknown (the non-stationary bandit setting). For a given integer $M\le K$, the learner aims to compete with the best subset of arms of size $M$. We design an algorithm based on a reduction to bandit submodular maximizati… ▽ More We study a sequential decision problem where the learner faces a sequence of $K$-armed bandit tasks. The task boundaries might be known (the bandit meta-learning setting), or unknown (the non-stationary bandit setting). For a given integer $M\le K$, the learner aims to compete with the best subset of arms of size $M$. We design an algorithm based on a reduction to bandit submodular maximization, and show that, for $T$ rounds comprised of $N$ tasks, in the regime of large number of tasks and small number of optimal arms $M$, its regret in both settings is smaller than the simple baseline of $\tilde{O}(\sqrt{KNT})$ that can be obtained by using standard algorithms designed for non-stationary bandit problems. For the bandit meta-learning problem with fixed task length $τ$, we show that the regret of the algorithm is bounded as $\tilde{O}(NM\sqrt{M τ}+N^{2/3}Mτ)$. Under additional assumptions on the identifiability of the optimal arms in each task, we show a bandit meta-learning algorithm with an improved $\tilde{O}(N\sqrt{M τ}+N^{1/2}\sqrt{M K τ})$ regret. △ Less

Submitted 18 October, 2022; v1 submitted 25 February, 2022; originally announced February 2022.

arXiv:2201.06532 [pdf, ps, other]

A New Look at Dynamic Regret for Non-Stationary Stochastic Bandits

Authors: Yasin Abbasi-Yadkori, Andras Gyorgy, Nevena Lazic

Abstract: We study the non-stationary stochastic multi-armed bandit problem, where the reward statistics of each arm may change several times during the course of learning. The performance of a learning algorithm is evaluated in terms of their dynamic regret, which is defined as the difference between the expected cumulative reward of an agent choosing the optimal arm in every time step and the cumulative r… ▽ More We study the non-stationary stochastic multi-armed bandit problem, where the reward statistics of each arm may change several times during the course of learning. The performance of a learning algorithm is evaluated in terms of their dynamic regret, which is defined as the difference between the expected cumulative reward of an agent choosing the optimal arm in every time step and the cumulative reward of the learning algorithm. One way to measure the hardness of such environments is to consider how many times the identity of the optimal arm can change. We propose a method that achieves, in $K$-armed bandit problems, a near-optimal $\widetilde O(\sqrt{K N(S+1)})$ dynamic regret, where $N$ is the time horizon of the problem and $S$ is the number of times the identity of the optimal arm changes, without prior knowledge of $S$. Previous works for this problem obtain regret bounds that scale with the number of changes (or the amount of change) in the reward functions, which can be much larger, or assume prior knowledge of $S$ to achieve similar bounds. △ Less

Submitted 8 March, 2022; v1 submitted 17 January, 2022; originally announced January 2022.

arXiv:2112.15121 [pdf, other]

On the Role of Neural Collapse in Transfer Learning

Authors: Tomer Galanti, András György, Marcus Hutter

Abstract: We study the ability of foundation models to learn representations for classification that are transferable to new, unseen classes. Recent results in the literature show that representations learned by a single classifier over many classes are competitive on few-shot learning problems with representations learned by special-purpose algorithms designed for such problems. In this paper we provide an… ▽ More We study the ability of foundation models to learn representations for classification that are transferable to new, unseen classes. Recent results in the literature show that representations learned by a single classifier over many classes are competitive on few-shot learning problems with representations learned by special-purpose algorithms designed for such problems. In this paper we provide an explanation for this behavior based on the recently observed phenomenon that the features learned by overparameterized classification networks show an interesting clustering property, called neural collapse. We demonstrate both theoretically and empirically that neural collapse generalizes to new samples from the training classes, and -- more importantly -- to new classes as well, allowing foundation models to provide feature maps that work well in transfer learning and, specifically, in the few-shot setting. △ Less

Submitted 3 January, 2022; v1 submitted 30 December, 2021; originally announced December 2021.

arXiv:2110.02195 [pdf, other]

TensorPlan and the Few Actions Lower Bound for Planning in MDPs under Linear Realizability of Optimal Value Functions

Authors: Gellért Weisz, Csaba Szepesvári, András György

Abstract: We consider the minimax query complexity of online planning with a generative model in fixed-horizon Markov decision processes (MDPs) with linear function approximation. Following recent works, we consider broad classes of problems where either (i) the optimal value function $v^\star$ or (ii) the optimal action-value function $q^\star$ lie in the linear span of some features; or (iii) both… ▽ More We consider the minimax query complexity of online planning with a generative model in fixed-horizon Markov decision processes (MDPs) with linear function approximation. Following recent works, we consider broad classes of problems where either (i) the optimal value function $v^\star$ or (ii) the optimal action-value function $q^\star$ lie in the linear span of some features; or (iii) both $v^\star$ and $q^\star$ lie in the linear span when restricted to the states reachable from the starting state. Recently, Weisz et al. (2021b) showed that under (ii) the minimax query complexity of any planning algorithm is at least exponential in the horizon $H$ or in the feature dimension $d$ when the size $A$ of the action set can be chosen to be exponential in $\min(d,H)$. On the other hand, for the setting (i), Weisz et al. (2021a) introduced TensorPlan, a planner whose query cost is polynomial in all relevant quantities when the number of actions is fixed. Among other things, these two works left open the question whether polynomial query complexity is possible when $A$ is subexponential in $\min(d,H)$. In this paper we answer this question in the negative: we show that an exponentially large lower bound holds when $A=Ω(\min(d^{1/4},H^{1/2}))$, under either (i), (ii) or (iii). In particular, this implies a perhaps surprising exponential separation of query complexity compared to the work of Du et al. (2021) who prove a polynomial upper bound when (iii) holds for all states. Furthermore, we show that the upper bound of TensorPlan can be extended to hold under (iii) and, for MDPs with deterministic transitions and stochastic rewards, also under (ii). △ Less

Submitted 10 March, 2022; v1 submitted 5 October, 2021; originally announced October 2021.

arXiv:2106.16037 [pdf, ps, other]

Learning to Minimize Age of Information over an Unreliable Channel with Energy Harvesting

Authors: Elif Tugce Ceran, Deniz Gunduz, Andras Gyorgy

Abstract: The time average expected age of information (AoI) is studied for status updates sent over an error-prone channel from an energy-harvesting transmitter with a finite-capacity battery. Energy cost of sensing new status updates is taken into account as well as the transmission energy cost better capturing practical systems. The optimal scheduling policy is first studied under the hybrid automatic re… ▽ More The time average expected age of information (AoI) is studied for status updates sent over an error-prone channel from an energy-harvesting transmitter with a finite-capacity battery. Energy cost of sensing new status updates is taken into account as well as the transmission energy cost better capturing practical systems. The optimal scheduling policy is first studied under the hybrid automatic repeat request (HARQ) protocol when the channel and energy harvesting statistics are known, and the existence of a threshold-based optimal policy is shown. For the case of unknown environments, average-cost reinforcement-learning algorithms are proposed that learn the system parameters and the status update policy in real-time. The effectiveness of the proposed methods is demonstrated through numerical results. △ Less

Submitted 30 June, 2021; originally announced June 2021.

Comments: arXiv admin note: substantial text overlap with arXiv:1902.09467

arXiv:2106.08199 [pdf, other]

On Multi-objective Policy Optimization as a Tool for Reinforcement Learning: Case Studies in Offline RL and Finetuning

Authors: Abbas Abdolmaleki, Sandy H. Huang, Giulia Vezzani, Bobak Shahriari, Jost Tobias Springenberg, Shruti Mishra, Dhruva TB, Arunkumar Byravan, Konstantinos Bousmalis, Andras Gyorgy, Csaba Szepesvari, Raia Hadsell, Nicolas Heess, Martin Riedmiller

Abstract: Many advances that have improved the robustness and efficiency of deep reinforcement learning (RL) algorithms can, in one way or another, be understood as introducing additional objectives or constraints in the policy optimization step. This includes ideas as far ranging as exploration bonuses, entropy regularization, and regularization toward teachers or data priors. Often, the task reward and au… ▽ More Many advances that have improved the robustness and efficiency of deep reinforcement learning (RL) algorithms can, in one way or another, be understood as introducing additional objectives or constraints in the policy optimization step. This includes ideas as far ranging as exploration bonuses, entropy regularization, and regularization toward teachers or data priors. Often, the task reward and auxiliary objectives are in conflict, and in this paper we argue that this makes it natural to treat these cases as instances of multi-objective (MO) optimization problems. We demonstrate how this perspective allows us to develop novel and more effective RL algorithms. In particular, we focus on offline RL and finetuning as case studies, and show that existing approaches can be understood as MO algorithms relying on linear scalarization. We hypothesize that replacing linear scalarization with a better algorithm can improve performance. We introduce Distillation of a Mixture of Experts (DiME), a new MORL algorithm that outperforms linear scalarization and can be applied to these non-standard MO problems. We demonstrate that for offline RL, DiME leads to a simple new algorithm that outperforms state-of-the-art. For finetuning, we derive new algorithms that learn to outperform the teacher policy. △ Less

Submitted 1 August, 2023; v1 submitted 15 June, 2021; originally announced June 2021.

arXiv:2104.01086 [pdf, other]

Defending Against Image Corruptions Through Adversarial Augmentations

Authors: Dan A. Calian, Florian Stimberg, Olivia Wiles, Sylvestre-Alvise Rebuffi, Andras Gyorgy, Timothy Mann, Sven Gowal

Abstract: Modern neural networks excel at image classification, yet they remain vulnerable to common image corruptions such as blur, speckle noise or fog. Recent methods that focus on this problem, such as AugMix and DeepAugment, introduce defenses that operate in expectation over a distribution of image corruptions. In contrast, the literature on $\ell_p$-norm bounded perturbations focuses on defenses agai… ▽ More Modern neural networks excel at image classification, yet they remain vulnerable to common image corruptions such as blur, speckle noise or fog. Recent methods that focus on this problem, such as AugMix and DeepAugment, introduce defenses that operate in expectation over a distribution of image corruptions. In contrast, the literature on $\ell_p$-norm bounded perturbations focuses on defenses against worst-case corruptions. In this work, we reconcile both approaches by proposing AdversarialAugment, a technique which optimizes the parameters of image-to-image models to generate adversarially corrupted augmented images. We theoretically motivate our method and give sufficient conditions for the consistency of its idealized version as well as that of DeepAugment. Our classifiers improve upon the state-of-the-art on common image corruption benchmarks conducted in expectation on CIFAR-10-C and improve worst-case performance against $\ell_p$-norm bounded perturbations on both CIFAR-10 and ImageNet. △ Less

Submitted 16 December, 2021; v1 submitted 2 April, 2021; originally announced April 2021.

arXiv:2102.09774 [pdf, ps, other]

A Reinforcement Learning Approach to Age of Information in Multi-User Networks with HARQ

Authors: Elif Tugce Ceran, Deniz Gunduz, Andras Gyorgy

Abstract: Scheduling the transmission of time-sensitive information from a source node to multiple users over error-prone communication channels is studied with the goal of minimizing the long-term average age of information (AoI) at the users. A long-term average resource constraint is imposed on the source, which limits the average number of transmissions. The source can transmit only to a single user at… ▽ More Scheduling the transmission of time-sensitive information from a source node to multiple users over error-prone communication channels is studied with the goal of minimizing the long-term average age of information (AoI) at the users. A long-term average resource constraint is imposed on the source, which limits the average number of transmissions. The source can transmit only to a single user at each time slot, and after each transmission, it receives an instantaneous ACK/NACK feedback from the intended receiver, and decides when and to which user to transmit the next update. Assuming the channel statistics are known, the optimal scheduling policy is studied for both the standard automatic repeat request (ARQ) and hybrid ARQ (HARQ) protocols. Then, a reinforcement learning(RL) approach is introduced to find a near-optimal policy, which does not assume any a priori information on the random processes governing the channel states. Different RL methods including average-cost SARSAwith linear function approximation (LFA), upper confidence reinforcement learning (UCRL2), and deep Q-network (DQN) are applied and compared through numerical simulations △ Less

Submitted 19 February, 2021; originally announced February 2021.

arXiv:2102.07140 [pdf, other]

Perceptually Constrained Adversarial Attacks

Authors: Muhammad Zaid Hameed, Andras Gyorgy

Abstract: Motivated by previous observations that the usually applied $L_p$ norms ($p=1,2,\infty$) do not capture the perceptual quality of adversarial examples in image classification, we propose to replace these norms with the structural similarity index (SSIM) measure, which was developed originally to measure the perceptual similarity of images. Through extensive experiments with adversarially trained c… ▽ More Motivated by previous observations that the usually applied $L_p$ norms ($p=1,2,\infty$) do not capture the perceptual quality of adversarial examples in image classification, we propose to replace these norms with the structural similarity index (SSIM) measure, which was developed originally to measure the perceptual similarity of images. Through extensive experiments with adversarially trained classifiers for MNIST and CIFAR-10, we demonstrate that our SSIM-constrained adversarial attacks can break state-of-the-art adversarially trained classifiers and achieve similar or larger success rate than the elastic net attack, while consistently providing adversarial images of better perceptual quality. Utilizing SSIM to automatically identify and disallow adversarial images of low quality, we evaluate the performance of several defense schemes in a perceptually much more meaningful way than was done previously in the literature. △ Less

Submitted 14 February, 2021; originally announced February 2021.

arXiv:2012.00708 [pdf, other]

Mutual Information Constraints for Monte-Carlo Objectives

Authors: Gábor Melis, András György, Phil Blunsom

Abstract: A common failure mode of density models trained as variational autoencoders is to model the data without relying on their latent variables, rendering these variables useless. Two contributing factors, the underspecification of the model and the looseness of the variational lower bound, have been studied separately in the literature. We weave these two strands of research together, specifically the… ▽ More A common failure mode of density models trained as variational autoencoders is to model the data without relying on their latent variables, rendering these variables useless. Two contributing factors, the underspecification of the model and the looseness of the variational lower bound, have been studied separately in the literature. We weave these two strands of research together, specifically the tighter bounds of Monte-Carlo objectives and constraints on the mutual information between the observable and the latent variables. Estimating the mutual information as the average Kullback-Leibler divergence between the easily available variational posterior $q(z|x)$ and the prior does not work with Monte-Carlo objectives because $q(z|x)$ is no longer a direct approximation to the model's true posterior $p(z|x)$. Hence, we construct estimators of the Kullback-Leibler divergence of the true posterior from the prior by recycling samples used in the objective, with which we train models of continuous and discrete latents at much improved rate-distortion and no posterior collapse. While alleviated, the tradeoff between modelling the data and using the latents still remains, and we urge for evaluating inference methods across a range of mutual information values. △ Less

Submitted 9 May, 2022; v1 submitted 1 December, 2020; originally announced December 2020.

Comments: 32 pages, 29 figures

arXiv:2010.06022 [pdf, ps, other]

Adapting to Delays and Data in Adversarial Multi-Armed Bandits

Authors: András György, Pooria Joulani

Abstract: We consider the adversarial multi-armed bandit problem under delayed feedback. We analyze variants of the Exp3 algorithm that tune their step-size using only information (about the losses and delays) available at the time of the decisions, and obtain regret guarantees that adapt to the observed (rather than the worst-case) sequences of delays and/or losses. First, through a remarkably simple proof… ▽ More We consider the adversarial multi-armed bandit problem under delayed feedback. We analyze variants of the Exp3 algorithm that tune their step-size using only information (about the losses and delays) available at the time of the decisions, and obtain regret guarantees that adapt to the observed (rather than the worst-case) sequences of delays and/or losses. First, through a remarkably simple proof technique, we show that with proper tuning of the step size, the algorithm achieves an optimal (up to logarithmic factors) regret of order $\sqrt{\log(K)(TK + D)}$ both in expectation and in high probability, where $K$ is the number of arms, $T$ is the time horizon, and $D$ is the cumulative delay. The high-probability version of the bound, which is the first high-probability delay-adaptive bound in the literature, crucially depends on the use of implicit exploration in estimating the losses. Then, following Zimmert and Seldin [2019], we extend these results so that the algorithm can "skip" rounds with large delays, resulting in regret bounds of order $\sqrt{TK\log(K)} + |R| + \sqrt{D_{\bar{R}}\log(K)}$, where $R$ is an arbitrary set of rounds (which are skipped) and $D_{\bar{R}}$ is the cumulative delay of the feedback for other rounds. Finally, we present another, data-adaptive (AdaGrad-style) version of the algorithm for which the regret adapts to the observed (delayed) losses instead of only adapting to the cumulative delay (this algorithm requires an a priori upper bound on the maximum delay, or the advance knowledge of the delay for each decision when it is made). The resulting bound can be orders of magnitude smaller on benign problems, and it can be shown that the delay only affects the regret through the loss of the best arm. △ Less

Submitted 12 October, 2020; originally announced October 2020.

arXiv:2009.12228 [pdf, ps, other]

Mirror Descent and the Information Ratio

Authors: Tor Lattimore, András György

Abstract: We establish a connection between the stability of mirror descent and the information ratio by Russo and Van Roy [2014]. Our analysis shows that mirror descent with suitable loss estimators and exploratory distributions enjoys the same bound on the adversarial regret as the bounds on the Bayesian regret for information-directed sampling. Along the way, we develop the theory for information-directe… ▽ More We establish a connection between the stability of mirror descent and the information ratio by Russo and Van Roy [2014]. Our analysis shows that mirror descent with suitable loss estimators and exploratory distributions enjoys the same bound on the adversarial regret as the bounds on the Bayesian regret for information-directed sampling. Along the way, we develop the theory for information-directed sampling and provide an efficient algorithm for adversarial bandits for which the regret upper bound matches exactly the best known information-theoretic upper bound. △ Less

Submitted 25 September, 2020; originally announced September 2020.

arXiv:2006.10460 [pdf, other]

Confident Off-Policy Evaluation and Selection through Self-Normalized Importance Weighting

Authors: Ilja Kuzborskij, Claire Vernade, András György, Csaba Szepesvári

Abstract: We consider off-policy evaluation in the contextual bandit setting for the purpose of obtaining a robust off-policy selection strategy, where the selection strategy is evaluated based on the value of the chosen policy in a set of proposal (target) policies. We propose a new method to compute a lower bound on the value of an arbitrary target policy given some logged data in contextual bandits for a… ▽ More We consider off-policy evaluation in the contextual bandit setting for the purpose of obtaining a robust off-policy selection strategy, where the selection strategy is evaluated based on the value of the chosen policy in a set of proposal (target) policies. We propose a new method to compute a lower bound on the value of an arbitrary target policy given some logged data in contextual bandits for a desired coverage. The lower bound is built around the so-called Self-normalized Importance Weighting (SN) estimator. It combines the use of a semi-empirical Efron-Stein tail inequality to control the concentration and a new multiplicative (rather than additive) control of the bias. The new approach is evaluated on a number of synthetic and real datasets and is found to be superior to its main competitors, both in terms of tightness of the confidence intervals and the quality of the policies chosen. △ Less

Submitted 21 March, 2022; v1 submitted 18 June, 2020; originally announced June 2020.

arXiv:2006.02119 [pdf, other]

Non-Stationary Delayed Bandits with Intermediate Observations

Authors: Claire Vernade, Andras Gyorgy, Timothy Mann

Abstract: Online recommender systems often face long delays in receiving feedback, especially when optimizing for some long-term metrics. While mitigating the effects of delays in learning is well-understood in stationary environments, the problem becomes much more challenging when the environment changes. In fact, if the timescale of the change is comparable to the delay, it is impossible to learn about th… ▽ More Online recommender systems often face long delays in receiving feedback, especially when optimizing for some long-term metrics. While mitigating the effects of delays in learning is well-understood in stationary environments, the problem becomes much more challenging when the environment changes. In fact, if the timescale of the change is comparable to the delay, it is impossible to learn about the environment, since the available observations are already obsolete. However, the arising issues can be addressed if intermediate signals are available without delay, such that given those signals, the long-term behavior of the system is stationary. To model this situation, we introduce the problem of stochastic, non-stationary, delayed bandits with intermediate observations. We develop a computationally efficient algorithm based on UCRL, and prove sublinear regret guarantees for its performance. Experimental results demonstrate that our method is able to learn in non-stationary delayed environments where existing methods fail. △ Less

Submitted 11 August, 2020; v1 submitted 3 June, 2020; originally announced June 2020.

Comments: 18 pages, 17 figures, ICML 2020

arXiv:1910.05558 [pdf, other]

Minimal Assumptions Refinement for GR(1) Specifications

Authors: Davide G. Cavezza, Dalal Alrajeh, Andras Gyorgy

Abstract: Reactive synthesis is concerned with finding a correct-by-construction controller from formal specifications, typically expressed in Linear Temporal Logic (LTL). The specifications describe assumptions about an environment and guarantees to be achieved by the controller operating in that environment. If a controller exists, given the assumptions, the specification is said to be realizable. This pa… ▽ More Reactive synthesis is concerned with finding a correct-by-construction controller from formal specifications, typically expressed in Linear Temporal Logic (LTL). The specifications describe assumptions about an environment and guarantees to be achieved by the controller operating in that environment. If a controller exists, given the assumptions, the specification is said to be realizable. This paper focuses on finding a minimal set of assumptions that guarantee realizability in the context of counterstrategy-guided assumption refinement procedures. Specifically, we introduce the notion of minimal assumptions refinements and provide an algorithm that provably computes these with little time overhead. We show experimentally, using common benchmarks, that embedding our algorithm in state-of-the-art approaches for assumption refinement results in consistently shorter solutions than without such embedding, and allows to explore a higher number of candidate solutions. We also propose a hybrid variant for dealing with the higher sparsity of solutions in the space of minimal refinements and show that its application speeds up the identification of a solution. △ Less

Submitted 12 October, 2019; originally announced October 2019.

arXiv:1905.03030 [pdf, other]

Meta-learning of Sequential Strategies

Authors: Pedro A. Ortega, Jane X. Wang, Mark Rowland, Tim Genewein, Zeb Kurth-Nelson, Razvan Pascanu, Nicolas Heess, Joel Veness, Alex Pritzel, Pablo Sprechmann, Siddhant M. Jayakumar, Tom McGrath, Kevin Miller, Mohammad Azar, Ian Osband, Neil Rabinowitz, András György, Silvia Chiappa, Simon Osindero, Yee Whye Teh, Hado van Hasselt, Nando de Freitas, Matthew Botvinick, Shane Legg

Abstract: In this report we review memory-based meta-learning as a tool for building sample-efficient strategies that learn from past experience to adapt to any task within a target class. Our goal is to equip the reader with the conceptual foundations of this tool for building new, scalable agents that operate on broad domains. To do so, we present basic algorithmic templates for building near-optimal pred… ▽ More In this report we review memory-based meta-learning as a tool for building sample-efficient strategies that learn from past experience to adapt to any task within a target class. Our goal is to equip the reader with the conceptual foundations of this tool for building new, scalable agents that operate on broad domains. To do so, we present basic algorithmic templates for building near-optimal predictors and reinforcement learners which behave as if they had a probabilistic model that allowed them to efficiently exploit task structure. Furthermore, we recast memory-based meta-learning within a Bayesian framework, showing that the meta-learned strategies are near-optimal because they amortize Bayes-filtered data, where the adaptation is implemented in the memory dynamics as a state-machine of sufficient statistics. Essentially, memory-based meta-learning translates the hard problem of probabilistic sequential inference into a regression problem. △ Less

Submitted 18 July, 2019; v1 submitted 8 May, 2019; originally announced May 2019.

Comments: DeepMind Technical Report (15 pages, 6 figures). Version V1.1

arXiv:1903.02380 [pdf, other]

Detecting Overfitting via Adversarial Examples

Authors: Roman Werpachowski, András György, Csaba Szepesvári

Abstract: The repeated community-wide reuse of test sets in popular benchmark problems raises doubts about the credibility of reported test-error rates. Verifying whether a learned model is overfitted to a test set is challenging as independent test sets drawn from the same data distribution are usually unavailable, while other test sets may introduce a distribution shift. We propose a new hypothesis test t… ▽ More The repeated community-wide reuse of test sets in popular benchmark problems raises doubts about the credibility of reported test-error rates. Verifying whether a learned model is overfitted to a test set is challenging as independent test sets drawn from the same data distribution are usually unavailable, while other test sets may introduce a distribution shift. We propose a new hypothesis test that uses only the original test data to detect overfitting. It utilizes a new unbiased error estimate that is based on adversarial examples generated from the test data and importance weighting. Overfitting is detected if this error estimate is sufficiently different from the original test error rate. We develop a specialized variant of our test for multiclass image classification, and apply it to testing overfitting of recent models to the popular ImageNet benchmark. Our method correctly indicates overfitting of the trained model to the training set, but is not able to detect any overfitting to the test set, in line with other recent work on this topic. △ Less

Submitted 14 November, 2019; v1 submitted 6 March, 2019; originally announced March 2019.

Comments: 17 pages

Journal ref: Part of: Advances in Neural Information Processing Systems 32 (NIPS 2019) pre-proceedings

arXiv:1902.10730 [pdf, other]

doi 10.1145/3306618.3314288

Degenerate Feedback Loops in Recommender Systems

Authors: Ray Jiang, Silvia Chiappa, Tor Lattimore, András György, Pushmeet Kohli

Abstract: Machine learning is used extensively in recommender systems deployed in products. The decisions made by these systems can influence user beliefs and preferences which in turn affect the feedback the learning system receives - thus creating a feedback loop. This phenomenon can give rise to the so-called "echo chambers" or "filter bubbles" that have user and societal implications. In this paper, we… ▽ More Machine learning is used extensively in recommender systems deployed in products. The decisions made by these systems can influence user beliefs and preferences which in turn affect the feedback the learning system receives - thus creating a feedback loop. This phenomenon can give rise to the so-called "echo chambers" or "filter bubbles" that have user and societal implications. In this paper, we provide a novel theoretical analysis that examines both the role of user dynamics and the behavior of recommender systems, disentangling the echo chamber from the filter bubble effect. In addition, we offer practical solutions to slow down system degeneracy. Our study contributes toward understanding and developing solutions to commonly cited issues in the complex temporal scenario, an area that is still largely unexplored. △ Less

Submitted 27 March, 2019; v1 submitted 27 February, 2019; originally announced February 2019.

Journal ref: Proceedings of AAAI/ACM Conference on AI, Ethics, and Society, Honolulu, HI, USA, January 27-28, 2019 (AIES '19)

arXiv:1902.10674 [pdf, other]

The Best Defense Is a Good Offense: Adversarial Attacks to Avoid Modulation Detection

Authors: Muhammad Zaid Hameed, Andras Gyorgy, Deniz Gunduz

Abstract: We consider a communication scenario, in which an intruder tries to determine the modulation scheme of the intercepted signal. Our aim is to minimize the accuracy of the intruder, while guaranteeing that the intended receiver can still recover the underlying message with the highest reliability. This is achieved by perturbing channel input symbols at the encoder, similarly to adversarial attacks a… ▽ More We consider a communication scenario, in which an intruder tries to determine the modulation scheme of the intercepted signal. Our aim is to minimize the accuracy of the intruder, while guaranteeing that the intended receiver can still recover the underlying message with the highest reliability. This is achieved by perturbing channel input symbols at the encoder, similarly to adversarial attacks against classifiers in machine learning. In image classification, the perturbation is limited to be imperceptible to a human observer, while in our case the perturbation is constrained so that the message can still be reliably decoded by the legitimate receiver, which is oblivious to the perturbation. Simulation results demonstrate the viability of our approach to make wireless communication secure against state-of-the-art intruders (using deep learning or decision trees) with minimal sacrifice in the communication performance. On the other hand, we also demonstrate that using diverse training data and curriculum learning can significantly boost the accuracy of the intruder. △ Less

Submitted 7 April, 2020; v1 submitted 27 February, 2019; originally announced February 2019.

arXiv:1902.09467 [pdf, ps, other]

Reinforcement Learning to Minimize Age of Information with an Energy Harvesting Sensor with HARQ and Sensing Cost

Authors: Elif Tuğçe Ceran, Deniz Gündüz, András György

Abstract: The time average expected age of information (AoI) is studied for status updates sent from an energy-harvesting transmitter with a finite-capacity battery. The optimal scheduling policy is first studied under different feedback mechanisms when the channel and energy harvesting statistics are known. For the case of unknown environments, an average-cost reinforcement learning algorithm is proposed t… ▽ More The time average expected age of information (AoI) is studied for status updates sent from an energy-harvesting transmitter with a finite-capacity battery. The optimal scheduling policy is first studied under different feedback mechanisms when the channel and energy harvesting statistics are known. For the case of unknown environments, an average-cost reinforcement learning algorithm is proposed that learns the system parameters and the status update policy in real time. The effectiveness of the proposed methods is verified through numerical results. △ Less

Submitted 24 January, 2019; originally announced February 2019.

arXiv:1807.09387 [pdf, other]

Learning from Delayed Outcomes via Proxies with Applications to Recommender Systems

Authors: Timothy A. Mann, Sven Gowal, András György, Ray Jiang, Huiyi Hu, Balaji Lakshminarayanan, Prav Srinivasan

Abstract: Predicting delayed outcomes is an important problem in recommender systems (e.g., if customers will finish reading an ebook). We formalize the problem as an adversarial, delayed online learning problem and consider how a proxy for the delayed outcome (e.g., if customers read a third of the book in 24 hours) can help minimize regret, even though the proxy is not available when making a prediction.… ▽ More Predicting delayed outcomes is an important problem in recommender systems (e.g., if customers will finish reading an ebook). We formalize the problem as an adversarial, delayed online learning problem and consider how a proxy for the delayed outcome (e.g., if customers read a third of the book in 24 hours) can help minimize regret, even though the proxy is not available when making a prediction. Motivated by our regret analysis, we propose two neural network architectures: Factored Forecaster (FF) which is ideal if the proxy is informative of the outcome in hindsight, and Residual Factored Forecaster (RFF) that is robust to a non-informative proxy. Experiments on two real-world datasets for predicting human behavior show that RFF outperforms both FF and a direct forecaster that does not make use of the proxy. Our results suggest that exploiting proxies by factorization is a promising way to mitigate the impact of long delays in human-behavior prediction tasks. △ Less

Submitted 15 October, 2019; v1 submitted 24 July, 2018; originally announced July 2018.

arXiv:1807.00755 [pdf, other]

LeapsAndBounds: A Method for Approximately Optimal Algorithm Configuration

Authors: Gellért Weisz, András György, Csaba Szepesvári

Abstract: We consider the problem of configuring general-purpose solvers to run efficiently on problem instances drawn from an unknown distribution. The goal of the configurator is to find a configuration that runs fast on average on most instances, and do so with the least amount of total work. It can run a chosen solver on a random instance until the solver finishes or a timeout is reached. We propose Lea… ▽ More We consider the problem of configuring general-purpose solvers to run efficiently on problem instances drawn from an unknown distribution. The goal of the configurator is to find a configuration that runs fast on average on most instances, and do so with the least amount of total work. It can run a chosen solver on a random instance until the solver finishes or a timeout is reached. We propose LeapsAndBounds, an algorithm that tests configurations on randomly selected problem instances for longer and longer time. We prove that the capped expected runtime of the configuration returned by LeapsAndBounds is close to the optimal expected runtime, while our algorithm's running time is near-optimal. Our results show that LeapsAndBounds is more efficient than the recent algorithm of Kleinberg et al. (2017), which, to our knowledge, is the only other algorithm configuration method with non-trivial theoretical guarantees. Experimental results on configuring a public SAT solver on a new benchmark dataset also stand witness to the superiority of our method. △ Less

Submitted 2 July, 2018; originally announced July 2018.

Comments: to appear at ICML 2018

arXiv:1806.03816 [pdf, ps, other]

Adaptive MCMC via Combining Local Samplers

Authors: Kiarash Shaloudegi, András György

Abstract: Markov chain Monte Carlo (MCMC) methods are widely used in machine learning. One of the major problems with MCMC is the question of how to design chains that mix fast over the whole state space; in particular, how to select the parameters of an MCMC algorithm. Here we take a different approach and, similarly to parallel MCMC methods, instead of trying to find a single chain that samples from the w… ▽ More Markov chain Monte Carlo (MCMC) methods are widely used in machine learning. One of the major problems with MCMC is the question of how to design chains that mix fast over the whole state space; in particular, how to select the parameters of an MCMC algorithm. Here we take a different approach and, similarly to parallel MCMC methods, instead of trying to find a single chain that samples from the whole distribution, we combine samples from several chains run in parallel, each exploring only parts of the state space (e.g., a few modes only). The chains are prioritized based on kernel Stein discrepancy, which provides a good measure of performance locally. The samples from the independent chains are combined using a novel technique for estimating the probability of different regions of the sample space. Experimental results demonstrate that the proposed algorithm may provide significant speedups in different sampling problems. Most importantly, when combined with the state-of-the-art NUTS algorithm as the base MCMC sampler, our method remained competitive with NUTS on sampling from unimodal distributions, while significantly outperforming state-of-the-art competitors on synthetic multimodal problems as well as on a challenging sensor localization task. △ Less

Submitted 12 July, 2019; v1 submitted 11 June, 2018; originally announced June 2018.

arXiv:1806.00336 [pdf, ps, other]

A Reinforcement Learning Approach to Age of Information in Multi-User Networks

Authors: Elif Tuğçe Ceran, Deniz Gündüz, András György

Abstract: Scheduling the transmission of time-sensitive data to multiple users over error-prone communication channels is studied with the goal of minimizing the long-term average age of information (AoI) at the users under a constraint on the average number of transmissions at the source node. After each transmission, the source receives an instantaneous ACK/NACK feedback from the intended receiver and dec… ▽ More Scheduling the transmission of time-sensitive data to multiple users over error-prone communication channels is studied with the goal of minimizing the long-term average age of information (AoI) at the users under a constraint on the average number of transmissions at the source node. After each transmission, the source receives an instantaneous ACK/NACK feedback from the intended receiver and decides on what time and to which user to transmit the next update. The optimal scheduling policy is first studied under different feedback mechanisms when the channel statistics are known; in particular, the standard automatic repeat request (ARQ) and hybrid ARQ (HARQ) protocols are considered. Then a reinforcement learning (RL) approach is introduced, which does not assume any a priori information on the random processes governing the channel states. Different RL methods are verified and compared through numerical simulations. △ Less

Submitted 1 June, 2018; originally announced June 2018.

arXiv:1805.03151 [pdf, other]

A Weakness Measure for GR(1) Formulae

Authors: Davide G. Cavezza, Dalal Alrajeh, András György

Abstract: In spite of the theoretical and algorithmic developments for system synthesis in recent years, little effort has been dedicated to quantifying the quality of the specifications used for synthesis. When dealing with unrealizable specifications, finding the weakest environment assumptions that would ensure realizability is typically a desirable property; in such context the weakness of the assumptio… ▽ More In spite of the theoretical and algorithmic developments for system synthesis in recent years, little effort has been dedicated to quantifying the quality of the specifications used for synthesis. When dealing with unrealizable specifications, finding the weakest environment assumptions that would ensure realizability is typically a desirable property; in such context the weakness of the assumptions is a major quality parameter. The question of whether one assumption is weaker than another is commonly interpreted using implication or, equivalently, language inclusion. However, this interpretation does not provide any further insight into the weakness of assumptions when implication does not hold. To our knowledge, the only measure that is capable of comparing two formulae in this case is entropy, but even it fails to provide a sufficiently refined notion of weakness in case of GR(1) formulae, a subset of linear temporal logic formulae which is of particular interest in controller synthesis. In this paper we propose a more refined measure of weakness based on the Hausdorff dimension, a concept that captures the notion of size of the omega-language satisfying a linear temporal logic formula. We identify the conditions under which this measure is guaranteed to distinguish between weaker and stronger GR(1) formulae. We evaluate our proposed weakness measure in the context of computing GR(1) assumptions refinements. △ Less

Submitted 8 May, 2018; originally announced May 2018.

Comments: To appear in FM2018 proceedings

arXiv:1802.03041 [pdf, ps, other]

Detection of Adversarial Training Examples in Poisoning Attacks through Anomaly Detection

Authors: Andrea Paudice, Luis Muñoz-González, Andras Gyorgy, Emil C. Lupu

Abstract: Machine learning has become an important component for many systems and applications including computer vision, spam filtering, malware and network intrusion detection, among others. Despite the capabilities of machine learning algorithms to extract valuable information from data and produce accurate predictions, it has been shown that these algorithms are vulnerable to attacks. Data poisoning is… ▽ More Machine learning has become an important component for many systems and applications including computer vision, spam filtering, malware and network intrusion detection, among others. Despite the capabilities of machine learning algorithms to extract valuable information from data and produce accurate predictions, it has been shown that these algorithms are vulnerable to attacks. Data poisoning is one of the most relevant security threats against machine learning systems, where attackers can subvert the learning process by injecting malicious samples in the training data. Recent work in adversarial machine learning has shown that the so-called optimal attack strategies can successfully poison linear classifiers, degrading the performance of the system dramatically after compromising a small fraction of the training dataset. In this paper we propose a defence mechanism to mitigate the effect of these optimal poisoning attacks based on outlier detection. We show empirically that the adversarial examples generated by these attack strategies are quite different from genuine points, as no detectability constrains are considered to craft the attack. Hence, they can be detected with an appropriate pre-filtering of the training dataset. △ Less

Submitted 8 February, 2018; originally announced February 2018.

Comments: 10 pages, 3 figures

arXiv:1712.07084 [pdf, ps, other]

A Reinforcement-Learning Approach to Proactive Caching in Wireless Networks

Authors: Samuel O. Somuyiwa, Andras Gyorgy, Deniz Gunduz

Abstract: We consider a mobile user accessing contents in a dynamic environment, where new contents are generated over time (by the user's contacts), and remain relevant to the user for random lifetimes. The user, equipped with a finite-capacity cache memory, randomly accesses the system, and requests all the relevant contents at the time of access. The system incurs an energy cost associated with the numbe… ▽ More We consider a mobile user accessing contents in a dynamic environment, where new contents are generated over time (by the user's contacts), and remain relevant to the user for random lifetimes. The user, equipped with a finite-capacity cache memory, randomly accesses the system, and requests all the relevant contents at the time of access. The system incurs an energy cost associated with the number of contents downloaded and the channel quality at that time. Assuming causal knowledge of the channel quality, the content profile, and the user-access behavior, we model the proactive caching problem as a Markov decision process with the goal of minimizing the long-term average energy cost. We first prove the optimality of a threshold-based proactive caching scheme, which dynamically caches or removes appropriate contents from the memory, prior to being requested by the user, depending on the channel state. The optimal threshold values depend on the system state, and hence, are computationally intractable. Therefore, we propose parametric representations for the threshold values, and use reinforcement-learning algorithms to find near-optimal parametrizations. We demonstrate through simulations that the proposed schemes significantly outperform classical reactive downloading, and perform very close to a genie-aided lower bound. △ Less

Submitted 19 December, 2017; originally announced December 2017.

arXiv:1710.04971 [pdf, ps, other]

Average Age of Information with Hybrid ARQ under a Resource Constraint

Authors: Elif Tugce Ceran, Deniz Gunduz, Andras Gyorgy

Abstract: Scheduling of the transmission of status updates over an error-prone communication channel is studied in order to minimize the long-term average age of information (AoI) at the destination, under an average resource constraint at the source node, which limits the average number of transmissions. After each transmission, the source receives an instantaneous ACK/NACK feedback, and decides on the nex… ▽ More Scheduling of the transmission of status updates over an error-prone communication channel is studied in order to minimize the long-term average age of information (AoI) at the destination, under an average resource constraint at the source node, which limits the average number of transmissions. After each transmission, the source receives an instantaneous ACK/NACK feedback, and decides on the next update, without a priori knowledge on the success of the future transmissions. The optimal scheduling policy is studied under different feedback mechanisms; in particular, standard automatic repeat request (ARQ) and hybrid ARQ (HARQ) protocols are considered. Average-cost reinforcement learning algorithms are proposed when the error probabilities for the HARQ system are unknown. △ Less

Submitted 5 December, 2018; v1 submitted 1 October, 2017; originally announced October 2017.

arXiv:1709.02726 [pdf, other]

A Modular Analysis of Adaptive (Non-)Convex Optimization: Optimism, Composite Objectives, and Variational Bounds

Authors: Pooria Joulani, András György, Csaba Szepesvári

Abstract: Recently, much work has been done on extending the scope of online learning and incremental stochastic optimization algorithms. In this paper we contribute to this effort in two ways: First, based on a new regret decomposition and a generalization of Bregman divergences, we provide a self-contained, modular analysis of the two workhorses of online learning: (general) adaptive versions of Mirror De… ▽ More Recently, much work has been done on extending the scope of online learning and incremental stochastic optimization algorithms. In this paper we contribute to this effort in two ways: First, based on a new regret decomposition and a generalization of Bregman divergences, we provide a self-contained, modular analysis of the two workhorses of online learning: (general) adaptive versions of Mirror Descent (MD) and the Follow-the-Regularized-Leader (FTRL) algorithms. The analysis is done with extra care so as not to introduce assumptions not needed in the proofs and allows to combine, in a straightforward way, different algorithmic ideas (e.g., adaptivity, optimism, implicit updates) and learning settings (e.g., strongly convex or composite objectives). This way we are able to reprove, extend and refine a large body of the literature, while keeping the proofs concise. The second contribution is a byproduct of this careful analysis: We present algorithms with improved variational bounds for smooth, composite objectives, including a new family of optimistic MD algorithms with only one projection step per round. Furthermore, we provide a simple extension of adaptive regret bounds to practically relevant non-convex problem settings with essentially no extra effort. △ Less

Submitted 8 September, 2017; originally announced September 2017.

Comments: Accepted to The 28th International Conference on Algorithmic Learning Theory (ALT 2017). 40 pages

arXiv:1702.06382 [pdf, ps, other]

Energy-Efficient Wireless Content Delivery with Proactive Caching

Authors: Samuel O. Somuyiwa, András György, Deniz Gündüz

Abstract: We propose an intelligent proactive content caching scheme to reduce the energy consumption in wireless downlink. We consider an online social network (OSN) setting where new contents are generated over time, and remain \textit{relevant} to the user for a random lifetime. Contents are downloaded to the user equipment (UE) through a time-varying wireless channel at an energy cost that depends on th… ▽ More We propose an intelligent proactive content caching scheme to reduce the energy consumption in wireless downlink. We consider an online social network (OSN) setting where new contents are generated over time, and remain \textit{relevant} to the user for a random lifetime. Contents are downloaded to the user equipment (UE) through a time-varying wireless channel at an energy cost that depends on the channel state and the number of contents downloaded. The user accesses the OSN at random time instants, and consumes all the relevant contents. To reduce the energy consumption, we propose \textit{proactive caching} of contents under favorable channel conditions to a finite capacity cache memory. Assuming that the channel quality (or equivalently, the cost of downloading data) is memoryless over time slots, we show that the optimal caching policy, which may replace contents in the cache with shorter remaining lifetime with contents at the server that remain relevant longer, has certain threshold structure with respect to the channel quality. Since the optimal policy is computationally demanding in practice, we introduce a simplified caching scheme and optimize its parameters using policy search. We also present two lower bounds on the energy consumption. We demonstrate through numerical simulations that the proposed caching scheme significantly reduces the energy consumption compared to traditional reactive caching tools, and achieves close-to-optimal performance for a wide variety of system parameters. △ Less

Submitted 21 February, 2017; originally announced February 2017.

Comments: 6 pages, 2 figures. Submitted, under review

arXiv:1702.03040 [pdf, other]

Following the Leader and Fast Rates in Linear Prediction: Curved Constraint Sets and Other Regularities

Authors: Ruitong Huang, Tor Lattimore, András György, Csaba Szepesvári

Abstract: The follow the leader (FTL) algorithm, perhaps the simplest of all online learning algorithms, is known to perform well when the loss functions it is used on are convex and positively curved. In this paper we ask whether there are other "lucky" settings when FTL achieves sublinear, "small" regret. In particular, we study the fundamental problem of linear prediction over a non-empty convex, compact… ▽ More The follow the leader (FTL) algorithm, perhaps the simplest of all online learning algorithms, is known to perform well when the loss functions it is used on are convex and positively curved. In this paper we ask whether there are other "lucky" settings when FTL achieves sublinear, "small" regret. In particular, we study the fundamental problem of linear prediction over a non-empty convex, compact domain. Amongst other results, we prove that the curvature of the boundary of the domain can act as if the losses were curved: In this case, we prove that as long as the mean of the loss vectors have positive lengths bounded away from zero, FTL enjoys a logarithmic growth rate of regret, while, e.g., for polytope domains and stochastic data it enjoys finite expected regret. Building on a previously known meta-algorithm, we also get an algorithm that simultaneously enjoys the worst-case guarantees and the bound available for FTL. △ Less

Submitted 9 February, 2017; originally announced February 2017.

arXiv:1610.09491 [pdf, ps, other]

SDP Relaxation with Randomized Rounding for Energy Disaggregation

Authors: Kiarash Shaloudegi, András György, Csaba Szepesvári, Wilsun Xu

Abstract: We develop a scalable, computationally efficient method for the task of energy disaggregation for home appliance monitoring. In this problem the goal is to estimate the energy consumption of each appliance over time based on the total energy-consumption signal of a household. The current state of the art is to model the problem as inference in factorial HMMs, and use quadratic programming to find… ▽ More We develop a scalable, computationally efficient method for the task of energy disaggregation for home appliance monitoring. In this problem the goal is to estimate the energy consumption of each appliance over time based on the total energy-consumption signal of a household. The current state of the art is to model the problem as inference in factorial HMMs, and use quadratic programming to find an approximate solution to the resulting quadratic integer program. Here we take a more principled approach, better suited to integer programming problems, and find an approximate optimum by combining convex semidefinite relaxations randomized rounding, as well as a scalable ADMM method that exploits the special structure of the resulting semidefinite program. Simulation results both in synthetic and real-world datasets demonstrate the superiority of our method. △ Less

Submitted 29 October, 2016; originally announced October 2016.

arXiv:1609.07087 [pdf, other]

(Bandit) Convex Optimization with Biased Noisy Gradient Oracles

Authors: Xiaowei Hu, Prashanth L. A., András György, Csaba Szepesvári

Abstract: Algorithms for bandit convex optimization and online learning often rely on constructing noisy gradient estimates, which are then used in appropriately adjusted first-order algorithms, replacing actual gradients. Depending on the properties of the function to be optimized and the nature of ``noise'' in the bandit feedback, the bias and variance of gradient estimates exhibit various tradeoffs. In t… ▽ More Algorithms for bandit convex optimization and online learning often rely on constructing noisy gradient estimates, which are then used in appropriately adjusted first-order algorithms, replacing actual gradients. Depending on the properties of the function to be optimized and the nature of ``noise'' in the bandit feedback, the bias and variance of gradient estimates exhibit various tradeoffs. In this paper we propose a novel framework that replaces the specific gradient estimation methods with an abstract oracle. With the help of the new framework we unify previous works, reproducing their results in a clean and concise fashion, while, perhaps more importantly, the framework also allows us to formally show that to achieve the optimal root-$n$ rate either the algorithms that use existing gradient estimators, or the proof techniques used to analyze them have to go beyond what exists today. △ Less

Submitted 4 July, 2020; v1 submitted 22 September, 2016; originally announced September 2016.

arXiv:1609.01872 [pdf, ps, other]

Chaining Bounds for Empirical Risk Minimization

Authors: Gábor Balázs, András György, Csaba Szepesvári

Abstract: This paper extends the standard chaining technique to prove excess risk upper bounds for empirical risk minimization with random design settings even if the magnitude of the noise and the estimates is unbounded. The bound applies to many loss functions besides the squared loss, and scales only with the sub-Gaussian or subexponential parameters without further statistical assumptions such as the bo… ▽ More This paper extends the standard chaining technique to prove excess risk upper bounds for empirical risk minimization with random design settings even if the magnitude of the noise and the estimates is unbounded. The bound applies to many loss functions besides the squared loss, and scales only with the sub-Gaussian or subexponential parameters without further statistical assumptions such as the bounded kurtosis condition over the hypothesis class. A detailed analysis is provided for slope constrained and penalized linear least squares regression with a sub-Gaussian setting, which often proves tight sample complexity bounds up to logartihmic factors. △ Less

Submitted 7 September, 2016; originally announced September 2016.

arXiv:1510.08108 [pdf, ps, other]

Online Learning with Gaussian Payoffs and Side Observations

Authors: Yifan Wu, András György, Csaba Szepesvári

Abstract: We consider a sequential learning problem with Gaussian payoffs and side information: after selecting an action $i$, the learner receives information about the payoff of every action $j$ in the form of Gaussian observations whose mean is the same as the mean payoff, but the variance depends on the pair $(i,j)$ (and may be infinite). The setup allows a more refined information transfer from one act… ▽ More We consider a sequential learning problem with Gaussian payoffs and side information: after selecting an action $i$, the learner receives information about the payoff of every action $j$ in the form of Gaussian observations whose mean is the same as the mean payoff, but the variance depends on the pair $(i,j)$ (and may be infinite). The setup allows a more refined information transfer from one action to another than previous partial monitoring setups, including the recently introduced graph-structured feedback case. For the first time in the literature, we provide non-asymptotic problem-dependent lower bounds on the regret of any algorithm, which recover existing asymptotic problem-dependent lower bounds and finite-time minimax lower bounds available in the literature. We also provide algorithms that achieve the problem-dependent lower bound (up to some universal constant factor) or the minimax lower bounds (up to logarithmic factors). △ Less

Submitted 27 October, 2015; originally announced October 2015.

Showing 1–50 of 58 results for author: György, A