Abstract We consider structured multi-armed bandit problems based on the Generalized Linear Model (GLM) framework of statistics. For these bandits, we propose a new algorithm, called GLM-UCB. We derive finite-time, high-probability bounds on the regret of the algorithm, extending previous analyses developed for linear bandits to the non-linear case.
Abstract Dyna planning is an efficient way of learning from real and imaginary experience. Existing tabular and linear Dyna algorithms are single-step, because an “imaginary” feature is predicted only one step into the future. In this paper, we introduce multi-step Dyna planning, which predicts more steps into the future.
Abstract: We consider structured multi-armed bandit problems in which the agent is guided by prior knowledge about the structure of the reward, which can be exploited to efficiently select an optimal arm in situations where the number of arms is very large or even infinite. We propose a new optimistic algorithm for non-linear parametric bandit problems using the framework of generalized linear models (GLM).
Abstract Inputs coming from high-dimensional spaces are common in many real-world problems, such as robot control with visual inputs. Yet learning in such cases is in general difficult, a fact often referred to as the “curse of dimensionality”. In particular, in regression or classification, in order to achieve a certain accuracy, algorithms are known to require exponentially many samples in the dimension
Abstract Motivated by value function estimation in reinforcement learning, we study statistical linear inverse problems, i.e., problems where the coefficients of a linear system to be solved are observed in noise. We consider penalized estimators, where performance is evaluated using a matrix-weighted two-norm of the defect of the estimator measured with respect to the true, unknown coefficients.
Abstract: The analysis of online least squares estimation is at the heart of many stochastic sequential decision making problems. We employ tools from self-normalized processes to provide a simple and self-contained proof of a tail bound for a vector-valued martingale. We use the bound to construct new, tighter confidence sets for the least squares estimate.
Abstract: We present a new anytime algorithm that achieves near-optimal regret for any instance of finite stochastic partial monitoring. In particular, the new algorithm achieves the minimax regret, within logarithmic factors, for both "easy" and "hard" problems. For easy problems, it additionally achieves logarithmic individual regret. Most importantly, the algorithm is adaptive in the sense that if the opponent strategy is in an "easy region" of the strategy space, then the regret grows as if the problem were easy.
Abstract: In this paper, we consider planning in stochastic shortest path (SSP) problems, a subclass of Markov Decision Problems (MDPs). We focus on medium-size problems whose state space can be fully enumerated. This problem has numerous important applications, such as navigation and planning under uncertainty. We propose a new approach for constructing a multi-level hierarchy of progressively simpler abstractions of the original problem.
Abstract We consider the problem of anytime planning in continuous state and action spaces with non-linear deterministic dynamics. We review the existing approaches to this problem and find no algorithms that both quickly find feasible solutions and also eventually approach optimal solutions with additional time. The state-of-the-art solution to this problem is the rapidly-exploring random tree (RRT) algorithm, which quickly finds a feasible solution. However, the RRT algorithm does not return better results with additional time.
Abstract: In real supervised learning scenarios, it is not uncommon for the training and test samples to follow different probability distributions, which makes it necessary to correct the sampling bias. Focusing on a particular covariate shift problem, we derive high-probability confidence bounds for the kernel mean matching (KMM) estimator, whose convergence rate turns out to depend on a regularity measure of the regression function and also on a capacity measure of the kernel.
Abstract: We consider the problem of learning a predictor by combining possibly infinitely many linear predictors whose weights are also to be learned, an instance of multiple kernel learning. To control overfitting, a group p-norm penalty is used to penalize the empirical loss. We consider a reformulation of the problem that lets us implement a randomized version of the proximal point algorithm.
Abstract The behaviour of reinforcement learning (RL) algorithms is best understood in completely observable, finite state- and action-space, discrete-time controlled Markov chains. Robot-learning domains, on the other hand, are inherently infinite both in time and space, and moreover they are only partially observable.
Abstract: We consider online learning in partial-monitoring games against an oblivious adversary. We show that when the number of actions available to the learner is two and the game is nontrivial, then it is reducible to a bandit-like game and thus the minimax regret is $\Theta(\sqrt{T})$.
A functional model of the basal ganglia-thalamocortical (BTC) loops is described. In our modeling effort, we try to minimize the complexity of our starting hypotheses. For that reason, we call this type of modeling Ockham's razor modeling. We have the additional constraint that the starting assumptions should not contradict experimental findings about the brain. First assumption: The brain lacks direct representation of paths but represents directions (called speed fields in control theory).
Abstract When data is scarce or the alphabet is large, smoothing the probability estimates becomes inescapable when estimating n-gram models. In this paper we propose a method that implements a form of smoothing by exploiting similarity information about the alphabet elements. The idea is to view the log-conditional probability function as a smooth function defined over the similarity graph.
We present a new strategy, Hierarchical Optimistic Optimization (HOO). It is based on a tree representation of the search space, which we explore non-uniformly thanks to upper confidence bounds assigned to each node. Main theoretical result: if one knows the local regularity of the mean-payoff function around its maximum, then it is possible to obtain a cumulative regret of order
Abstract We consider online learning in a special class of episodic Markovian decision processes, namely, loop-free stochastic shortest path problems. In this problem, an agent has to traverse a finite directed acyclic graph with random transitions while maximizing the rewards obtained along the way. We assume that the reward function can change arbitrarily between consecutive episodes, and is entirely revealed to the agent at the end of each episode.
LINEAR PREDICTION WITH SIDE INFORMATION. Csaba Szepesvári, University of Alberta, CMPUT 654. E-mail: szepesva@ualberta.ca. UofA, November 23, 2006.
Abstract We consider the problem of model selection in the batch (offline, non-interactive) reinforcement learning setting when the goal is to find an action-value function with the smallest Bellman error among a countable set of candidate functions.
Papers by Csaba Szepesvari