
Csaba Szepesvári

Abstract We consider structured multi-armed bandit problems based on the Generalized Linear Model (GLM) framework of statistics. For these bandits, we propose a new algorithm, called GLM-UCB. We derive finite-time, high-probability bounds on the regret of the algorithm, extending previous analyses developed for linear bandits to the non-linear case.
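To make the optimistic principle behind such an algorithm concrete, here is a minimal Python sketch of a UCB-style arm-selection rule for a GLM bandit: each arm is scored by its estimated mean reward under an assumed logistic link plus an exploration bonus measured in the inverse design-matrix norm. The function names and the exact bonus form are illustrative assumptions, not the paper's algorithm.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def glm_ucb_choose(arms, theta_hat, V, beta):
    """Pick the arm maximizing mu(x' theta) + beta * ||x||_{V^{-1}}.

    arms:      (K, d) array of arm feature vectors
    theta_hat: (d,) current GLM parameter estimate
    V:         (d, d) regularized design matrix sum_s x_s x_s'
    beta:      exploration coefficient from the confidence bound
    """
    V_inv = np.linalg.inv(V)
    scores = []
    for x in arms:
        mean = sigmoid(x @ theta_hat)          # estimated mean reward (assumed logistic link)
        bonus = beta * np.sqrt(x @ V_inv @ x)  # optimism: width of the confidence interval
        scores.append(mean + bonus)
    return int(np.argmax(scores))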
Abstract Dyna planning is an efficient way of learning from real and imaginary experience. Existing tabular and linear Dyna algorithms are single-step, because an “imaginary” feature is predicted only one step into the future. In this paper, we introduce a multi-step Dyna planning method that predicts more steps into the future.
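As a rough illustration of the multi-step idea, the sketch below (hypothetical names, assuming a learned linear feature-prediction matrix F and a linear reward model b) rolls an imagined trajectory several steps forward instead of a single step:

import numpy as np

def multi_step_imagined_rollout(phi, F, b, gamma, n_steps):
    """Roll a learned linear model forward for several imagined steps.

    phi:   (d,) current feature vector
    F:     (d, d) learned feature-to-next-feature prediction matrix
    b:     (d,) learned linear reward model (reward ~ b' phi)
    gamma: discount factor
    Returns the discounted imaginary return and the final predicted feature.
    """
    ret, discount = 0.0, 1.0
    for _ in range(n_steps):
        ret += discount * float(b @ phi)  # predicted one-step reward
        phi = F @ phi                     # imagined next feature vector
        discount *= gamma
    return ret, phi

# toy usage with made-up model parameters
phi0 = np.array([1.0, 0.0])
F = np.array([[0.9, 0.1], [0.0, 0.8]])
b = np.array([0.5, 1.0])
print(multi_step_imagined_rollout(phi0, F, b, gamma=0.9, n_steps=3))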
Abstract: We consider structured multi-armed bandit problems in which the agent is guided by prior knowledge about the structure of the reward, which can be exploited to efficiently select an optimal arm in situations where the number of arms is very large or even infinite. We propose a new optimistic algorithm for non-linear parametric bandit problems using the Generalized Linear Model (GLM) framework.
Abstract Inputs coming from high-dimensional spaces are common in many real-world problems such as robot control with visual inputs. Yet learning in such cases is in general difficult, a fact often referred to as the “curse of dimensionality”. In particular, in regression or classification, algorithms are known to require a number of samples exponential in the dimension in order to achieve a given accuracy.
Abstract Motivated by value function estimation in reinforcement learning, we study statistical linear inverse problems, i.e., problems where the coefficients of a linear system to be solved are observed in noise. We consider penalized estimators, where performance is evaluated using a matrix-weighted two-norm of the defect of the estimator measured with respect to the true, unknown coefficients.
Abstract: The analysis of online least squares estimation is at the heart of many stochastic sequential decision-making problems. We employ tools from the theory of self-normalized processes to provide a simple and self-contained proof of a tail bound for a vector-valued martingale. We use the bound to construct new, tighter confidence sets for the least squares estimate.
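Confidence sets of this kind are typically ellipsoids around the regularized least-squares estimate. The sketch below (a minimal illustration under that assumed form, not the paper's construction) computes the estimate and tests ellipsoid membership:

import numpy as np

def ridge_estimate(X, y, lam=1.0):
    """Regularized least squares: theta_hat = (X'X + lam I)^{-1} X'y.
    Returns the estimate and the design matrix V = X'X + lam I."""
    d = X.shape[1]
    V = X.T @ X + lam * np.eye(d)
    theta_hat = np.linalg.solve(V, X.T @ y)
    return theta_hat, V

def in_confidence_set(theta, theta_hat, V, beta):
    """Test membership in the ellipsoid { th : ||th - theta_hat||_V <= beta }."""
    diff = theta - theta_hat
    return float(diff @ V @ diff) <= beta ** 2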
Abstract: We present a new anytime algorithm that achieves near-optimal regret for any instance of finite stochastic partial monitoring. In particular, the new algorithm achieves the minimax regret, within logarithmic factors, for both "easy" and "hard" problems. For easy problems, it additionally achieves logarithmic individual regret. Most importantly, the algorithm is adaptive in the sense that if the opponent strategy is in an "easy region" of the strategy space then the regret grows as if the problem were easy.
Abstract: In this paper, we consider planning in stochastic shortest path (SSP) problems, a subclass of Markov Decision Problems (MDP). We focus on medium-size problems whose state space can be fully enumerated. This problem has numerous important applications, such as navigation and planning under uncertainty. We propose a new approach for constructing a multi-level hierarchy of progressively simpler abstractions of the original problem.
Abstract We consider the problem of anytime planning in continuous state and action spaces with non-linear deterministic dynamics. We review the existing approaches to this problem and find no algorithms that both quickly find feasible solutions and also eventually approach optimal solutions with additional time. The state-of-the-art solution to this problem is the rapidly-exploring random tree (RRT) algorithm that quickly finds a feasible solution. However, the RRT algorithm does not return better results with additional time.
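For reference, here is a bare-bones RRT loop for a point robot in the unit square (an illustrative toy with no obstacle checking, not the planner studied in the paper): sample a random point, extend the nearest tree node a small step toward it, and stop once a node lands near the goal.

import numpy as np

def rrt(start, goal, n_iters=2000, step=0.1, goal_tol=0.15, seed=0):
    rng = np.random.default_rng(seed)
    goal = np.asarray(goal, dtype=float)
    nodes = [np.asarray(start, dtype=float)]
    parent = {0: None}
    for _ in range(n_iters):
        target = rng.uniform(0.0, 1.0, size=2)        # random sample in the unit square
        dists = [np.linalg.norm(n - target) for n in nodes]
        i = int(np.argmin(dists))                     # nearest existing tree node
        direction = target - nodes[i]
        new = nodes[i] + step * direction / (np.linalg.norm(direction) + 1e-12)
        nodes.append(new)
        parent[len(nodes) - 1] = i
        if np.linalg.norm(new - goal) < goal_tol:     # reached the goal region
            path, j = [], len(nodes) - 1
            while j is not None:                      # walk parents back to the root
                path.append(nodes[j])
                j = parent[j]
            return path[::-1]
    return None  # no feasible path found within the budget

Note the behaviour the abstract criticizes: once a path is returned, running longer does not improve it.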
Abstract: In real supervised learning scenarios, it is not uncommon that the training and test samples follow different probability distributions, making it necessary to correct for the sampling bias. Focusing on a particular covariate shift problem, we derive high-probability confidence bounds for the kernel mean matching (KMM) estimator, whose convergence rate turns out to depend on a regularity measure of the regression function and also on a capacity measure of the kernel.
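KMM reweights the training points so that their kernel mean embedding matches that of the test sample. The following is a simplified sketch that solves the underlying box-constrained quadratic objective by projected gradient descent (an assumed, illustrative solver; the usual KMM formulation additionally constrains the average weight, which is omitted here):

import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

def kmm_weights(X_tr, X_te, B=10.0, lr=1e-3, n_iters=2000):
    """Minimize || (1/n) sum_i beta_i k(x_i,.) - (1/m) sum_j k(x'_j,.) ||^2
    over box-constrained weights beta in [0, B]."""
    n = len(X_tr)
    K = rbf_kernel(X_tr, X_tr)               # (n, n) Gram matrix of the training sample
    kappa = rbf_kernel(X_tr, X_te).mean(1)   # (n,) match against the test mean embedding
    beta = np.ones(n)
    for _ in range(n_iters):
        grad = (K @ beta) / n ** 2 - kappa / n
        beta = np.clip(beta - lr * grad, 0.0, B)  # project onto the box [0, B]
    return beta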
Abstract: We consider the problem of learning a predictor by combining possibly infinitely many linear predictors whose weights are to be learned, too, an instance of multiple kernel learning. To control overfitting, a group p-norm penalty is added to the empirical loss. We consider a reformulation of the problem that lets us implement a randomized version of the proximal point algorithm.
Abstract The behaviour of reinforcement learning (RL) algorithms is best understood in completely observable, finite state- and action-space, discrete-time controlled Markov chains. Robot-learning domains, on the other hand, are inherently infinite both in time and space, and moreover they are only partially observable.
Abstract: We consider online learning in partial-monitoring games against an oblivious adversary. We show that when the number of actions available to the learner is two and the game is nontrivial then it is reducible to a bandit-like game and thus the minimax regret is $\Theta(\sqrt{T})$.
A functional model of the basal ganglia-thalamocortical (BTC) loops is described. In our modeling effort, we try to minimize the complexity of our starting hypotheses. For that reason, we call this type of modeling Ockham's razor modeling. We have the additional constraint that the starting assumptions should not contradict experimental findings about the brain. First assumption: The brain lacks direct representation of paths but represents directions (called speed fields in control theory).
Abstract When data is scarce or the alphabet is large, smoothing the probability estimates becomes inescapable when estimating n-gram models. In this paper we propose a method that implements a form of smoothing by exploiting similarity information of the alphabet elements. The idea is to view the log-conditional probability function as a smooth function defined over the similarity graph.
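One way to realize the smoothness idea is to regularize the log-probabilities with the Laplacian of the similarity graph. A minimal sketch under that assumed formulation, with hypothetical names (the paper's method may differ in detail):

import numpy as np

def laplacian_smooth_logprobs(log_p, W, lam=0.5, lr=0.05, n_iters=200):
    """Smooth log-probability estimates over a similarity graph by
    gradient descent on ||f - log_p||^2 + lam * f' L f, where L is the
    graph Laplacian of the symmetric nonnegative similarity matrix W.
    The step size lr may need tuning to the spectrum of L."""
    L = np.diag(W.sum(1)) - W   # graph Laplacian
    f = log_p.copy()
    for _ in range(n_iters):
        grad = 2.0 * (f - log_p) + 2.0 * lam * (L @ f)
        f -= lr * grad
    return f  # renormalize (e.g., via a softmax) to recover probabilities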
We present a new strategy, Hierarchical Optimistic Optimization (HOO). It is based on a tree representation of the search space, which we explore non-uniformly thanks to upper confidence bounds assigned to each node. Main theoretical result: if one knows the local regularity of the mean-payoff function around its maximum, then it is possible to obtain a cumulative regret of order
Abstract We consider online learning in a special class of episodic Markovian decision processes, namely, loop-free stochastic shortest path problems. In this problem, an agent has to traverse through a finite directed acyclic graph with random transitions while maximizing the obtained rewards along the way. We assume that the reward function can change arbitrarily between consecutive episodes, and is entirely revealed to the agent at the end of each episode.
Lecture slides: LINEAR PREDICTION WITH SIDE INFORMATION. Csaba Szepesvári, University of Alberta, CMPUT 654. E-mail: szepesva@ualberta.ca. UofA, November 23, 2006.
Abstract We consider the problem of model selection in the batch (offline, non-interactive) reinforcement learning setting when the goal is to find an action-value function with the smallest Bellman error among a countable set of candidate functions.
Abstract We compare scaling properties of several value-function estimation algorithms. In particular, we prove that Q-learning can scale exponentially slowly with the number of states. We identify the reasons for the slow convergence and show that both TD(λ) and learning with a fixed learning rate enjoy rather fast convergence, just like the model-based method.
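For context, the tabular Q-learning update in question has the following familiar form (a generic textbook sketch, not code from the paper); the choice of the step size alpha is exactly what drives the fast/slow convergence contrast:

import numpy as np

def q_update(Q, s, a, r, s_next, alpha, gamma=0.95):
    """One tabular Q-learning update on a (n_states, n_actions) array Q.
    With the classic decaying step size alpha_t = 1/t, value propagation
    can be very slow; a fixed alpha trades asymptotic accuracy for speed."""
    td_target = r + gamma * Q[s_next].max()
    Q[s, a] += alpha * (td_target - Q[s, a])
    return Q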
Lecture slides: DISCRETE PREDICTION PROBLEMS: RANDOMIZED PREDICTION. Csaba Szepesvári, University of Alberta, CMPUT 654. E-mail: szepesva@cs.ualberta.ca. UofA, October 1-6, 2009.
Lecture slides: NONPARAMETRIC BANDITS. Csaba Szepesvári, University of Alberta, CMPUT 654. E-mail: szepesva@ualberta.ca. UofA, September 26, 2006.
Abstract The problem of estimating the value function underlying a Markovian reward process is considered. As is well known, the value function underlying a Markovian reward process satisfies a linear fixed point equation. One approach to learning the value function from finite data is to find a good approximation to the value function in a given (linear) subspace of the space of value functions.
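A standard estimator in this linear-subspace setting is LSTD; the following minimal batch version (illustrative, with assumed notation: the rows of Phi are feature vectors of visited states and the rows of Phi_next those of their successors) solves the empirical version of the fixed point equation:

import numpy as np

def lstd(Phi, Phi_next, rewards, gamma=0.99, reg=1e-6):
    """Batch LSTD: solve A w = b with
    A = Phi' (Phi - gamma * Phi_next) and b = Phi' r,
    so that V(s) ~ phi(s)' w approximates the fixed point of the
    Bellman equation within the span of the features."""
    d = Phi.shape[1]
    A = Phi.T @ (Phi - gamma * Phi_next) + reg * np.eye(d)
    b = Phi.T @ rewards
    return np.linalg.solve(A, b)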
Lecture slides: RIDGE REGRESSION AND VARIANTS. Csaba Szepesvári, University of Alberta, CMPUT 654. E-mail: szepesva@ualberta.ca. UofA, November 30, 2006.
Abstract In this paper we consider sequence clustering problems and propose an algorithm for the estimation of the number of clusters based on the X-means algorithm. The sequences are modeled using mixtures of Hidden Markov Models. By means of experiments with synthetic data we analyze the proposed algorithm. This algorithm proved to be both computationally efficient and capable of providing accurate estimates of the number of clusters. Some results of experiments with real-world Web-log data are also given.
Sequential decision-making problems are of prominent importance in robotics. Many robotics tasks, such as motion planning for a robotic arm, gait optimization for a quadruped robot, and dynamic balancing of humanoids, can all be formulated as sequential decision-making problems.
Abstract. We consider bounded resource planning in a Markovian decision problem, i.e., the problem of finding a good policy given access to a generative model of the environment and a limit on the computational resources. We propose to use the fitted Q-iteration algorithm with penalized (or regularized) least-squares regression as the regression subroutine to address the problem of selecting an appropriate function approximator in each iteration.
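A minimal sketch of fitted Q-iteration with ridge regression as the regression subroutine (illustrative only; featurize, the batch arrays, and the hyperparameters are assumptions, and the paper's penalization may differ):

import numpy as np

def fitted_q_iteration(S, A, R, S_next, actions, featurize,
                       n_iters=50, gamma=0.99, lam=1.0):
    """Fitted Q-iteration on a batch of transitions (S, A, R, S_next).
    featurize(s, a) -> (d,) feature vector; each iteration regresses the
    Bellman targets onto the features with a ridge penalty."""
    X = np.array([featurize(s, a) for s, a in zip(S, A)])
    d = X.shape[1]
    w = np.zeros(d)
    for _ in range(n_iters):
        # Bellman targets under the current Q estimate
        q_next = np.array([max(featurize(s2, a2) @ w for a2 in actions)
                           for s2 in S_next])
        y = R + gamma * q_next
        # penalized least squares as the regression subroutine
        w = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
    return w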
Abstract The task of image auto-annotation, namely assigning a set of relevant tags to an image, is challenging due to the size and variability of tag vocabularies. Consequently, most existing algorithms focus on tag assignment and fix an often large number of hand-crafted features to describe image characteristics.
Lecture slides: NON-STOCHASTIC BANDIT PROBLEMS. Csaba Szepesvári, University of Alberta, CMPUT 654. E-mail: szepesva@ualberta.ca. UofA, November 9, 2006.
Talk slides: ON THE LINEAR BELLMAN OPTIMALITY EQUATION AND ALL THAT. Csaba Szepesvári, University of Alberta, RLAI LLL-Talk. E-mail: szepesva@ualberta.ca. UofA, May 5, 2010.
Abstract Bayesian priors offer a compact yet general means of incorporating domain knowledge into many learning tasks. The correctness of the Bayesian analysis and inference, however, largely depends on the accuracy and correctness of these priors. PAC-Bayesian methods overcome this problem by providing bounds that hold regardless of the correctness of the prior distribution. This paper introduces the first PAC-Bayesian bound for the batch reinforcement learning problem with function approximation.
Abstract: The success of kernel-based learning methods depends on the choice of kernel. Recently, kernel learning methods have been proposed that use data to select the most appropriate kernel, usually by combining a set of base kernels. We introduce a new algorithm for kernel learning that combines a continuous set of base kernels, without the common step of discretizing the space of base kernels.
Abstract. In a partial-monitoring problem, in every round a learner chooses an action and, simultaneously, an opponent chooses an outcome; then the learner suffers some loss and receives some feedback. The goal of the learner is to minimize his (unobserved) cumulative loss. In this paper we explore a variant of this problem where in every round, before the learner makes his decision, he receives some side-information.
Talk slides: Policy iteration is strongly polynomial. Based on Yinyu Ye's working paper http://www.stanford.edu/~yyye/simplexmdp1.pdf. Csaba Szepesvári, University of Alberta. E-mail: szepesva@ualberta.ca. Tea-Time Talk, July 29, 2010. Outline: 1. Motivation; 2. Background; 3. Policy iteration, linear programming and the simplex method; 4. The algorithm; 5. Proof; 6. Conclusions.
Abstract. In this paper we consider two novel kernel machine based feature extraction algorithms in a regression setting. The first method is derived based on the principles underlying the recently introduced Maximum Margin Discrimination Analysis (MMDA) algorithm. However, here it is shown that the orthogonalization principle employed by the original MMDA algorithm can be motivated using the well-known ambiguity decomposition, thus providing a firm ground for the good performance of the algorithm.
Abstract The TTS system described in this paper is based on the analysis and resynthesis of a given speaker's voice. First, the speaker's voice definition is prepared off-line: a diphone database is recorded, segmented, and analyzed every 6 ms to obtain the filter parameters of an all-pole (AR) filter. During the on-line synthesis, the filters are excited with a mixture of a predefined periodic glottal source and white noise.
Abstract We consider linear prediction problems in a stochastic environment. The least mean square (LMS) algorithm is a well-known, easy-to-implement and computationally cheap solution to this problem. However, as is well known, the LMS algorithm, being a stochastic gradient descent rule, may converge slowly. The recursive least squares (RLS) algorithm overcomes this problem, but its computational cost is quadratic in the problem dimension.
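The trade-off is easy to see side by side; below are generic per-step updates for both algorithms (standard textbook forms, not code from the paper):

import numpy as np

def lms_step(w, x, y, lr=0.01):
    """Least mean squares: an O(d) stochastic-gradient step."""
    return w + lr * (y - w @ x) * x

def rls_step(w, P, x, y, lam=1.0):
    """Recursive least squares: O(d^2) per step, but typically much
    faster to converge than LMS. P tracks the inverse of the
    (forgetting-factor weighted) design matrix; lam is the forgetting factor."""
    Px = P @ x
    k = Px / (lam + x @ Px)          # gain vector
    w = w + k * (y - w @ x)
    P = (P - np.outer(k, Px)) / lam  # rank-one update of the inverse
    return w, P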
The goal of this research is to apply the most advanced methods of stochastic systems to the modeling of financial markets, and to further develop these methods themselves. One of the greatest challenges in financial mathematics today is the design of good hedging strategies in incomplete markets. Mathematically, this amounts to a particular stochastic adaptive control problem in which the dynamics are described by a high-dimensional switching diffusion process. Numerous subproblems are connected to this general problem.
The use of function approximators to represent the value function of Reinforcement Learning (RL) and Dynamic Programming (DP) problems with large state spaces is inevitable. Although different methods for value function approximation have been considered (such as generalized linear models with predefined basis functions, regression trees, neural networks, etc.), the designer still needs to make non-trivial design choices such as basis function selection or the stopping rule for growing the tree.
Abstract A Markov-chain Monte Carlo based algorithm is provided to solve the Simultaneous Localization and Mapping (SLAM) problem with general dynamics and observation models, under open-loop control and provided that the map representation is finite-dimensional. To our knowledge this is the first provably consistent yet (close-to) practical solution to this problem. The superiority of our algorithm over alternative SLAM algorithms is demonstrated in a difficult loop-closing situation.
Abstract Most learning algorithms assume that a data set is given initially. We address the common situation where data is not available initially, but can be obtained, at a cost. We focus on learning Bayesian belief networks (BNs) over discrete variables. As such BNs are models of probabilistic distributions, we consider the “generative” challenge of learning the parameters, for a fixed structure, that best match the true distribution.
We analyze the rate of convergence of the estimation error in regularized least-squares regression when the data is exponentially β-mixing. The results are proven under the assumption that the metric entropy of the balls in the chosen function space grows at most polynomially. In order to prove our main result, we also derive a relative deviation concentration inequality for β-mixing processes, which might be of independent interest.
Abstract In this paper we consider the problem of finding a good policy given some batch data. We propose a new approach, LAMAPI, that first builds a so-called linear action model (LAM) from the data and then uses the learned model and the collected data in approximate policy iteration (API) to find a good policy. A natural choice for the policy evaluation step in this algorithm is to use the least-squares temporal difference (LSTD) learning algorithm.
Abstract We consider a generalization of stochastic bandits where the set of arms is allowed to be a generic measurable space and the mean-payoff function is “locally Lipschitz” with respect to a dissimilarity function that is known to the decision maker. Under this condition we construct an arm selection policy, called HOO (hierarchical optimistic optimization), with improved regret bounds compared to previous results for a large class of problems.
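To convey the mechanics, here is a much-simplified HOO round on a one-dimensional arm space (illustrative only: the B-value form, the midpoint play, and the parameters nu and rho are assumptions standing in for the paper's exact definitions):

import numpy as np

class HooNode:
    def __init__(self, lo, hi, depth):
        self.lo, self.hi, self.depth = lo, hi, depth
        self.count, self.mean = 0, 0.0
        self.children = None
        self.b_value = np.inf   # unvisited regions are maximally optimistic

def hoo_step(root, payoff, t, rho=0.5, nu=1.0):
    # descend: follow the child with the largest B-value down to a leaf
    path, node = [], root
    while node.children is not None:
        node = max(node.children, key=lambda c: c.b_value)
        path.append(node)
    x = 0.5 * (node.lo + node.hi)   # play the midpoint of the selected cell
    r = payoff(x)
    node.children = [HooNode(node.lo, x, node.depth + 1),
                     HooNode(x, node.hi, node.depth + 1)]
    # update counts, means and B-values from the leaf back to the root
    for n in reversed([root] + path):
        n.count += 1
        n.mean += (r - n.mean) / n.count
        # U-value: empirical mean + exploration width + cell-diameter term
        u = n.mean + np.sqrt(2.0 * np.log(t + 1) / n.count) + nu * rho ** n.depth
        n.b_value = min(u, max(c.b_value for c in n.children))
    return x, r

A driver would create root = HooNode(0.0, 1.0, 0) and call hoo_step(root, f, t) once per round t = 1, 2, ..., where f is the stochastic payoff function.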
We investigate the problem of learning the transition dynamics of deterministic, discrete-state environments. We assume that an agent exploring such an environment is able to perform actions (from a finite set of actions) in the environment and to sense the state changes. The question investigated is whether the agent can learn the dynamics without visiting all states. Such a goal is unrealistic in general, hence we assume that the environment has structural properties an agent might exploit.

And 108 more