
Csaba Szepesvári

Abstract We consider structured multi-armed bandit problems based on the Generalized Linear Model (GLM) framework of statistics. For these bandits, we propose a new algorithm, called GLM-UCB. We derive finite-time, high-probability bounds on the regret of the algorithm, extending previous analyses developed for linear bandits to the non-linear case.
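To make the optimistic principle behind such an algorithm concrete, here is a minimal Python sketch of a UCB-style arm-selection rule for a GLM bandit: each arm is scored by its estimated mean reward under an assumed logistic link plus an exploration bonus measured in the inverse design-matrix norm. The function names and the exact bonus form are illustrative assumptions, not the paper's algorithm.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def glm_ucb_choose(arms, theta_hat, V, beta):
    """Pick the arm maximizing mu(x' theta) + beta * ||x||_{V^{-1}}.

    arms:      (K, d) array of arm feature vectors
    theta_hat: (d,) current GLM parameter estimate
    V:         (d, d) regularized design matrix sum_s x_s x_s'
    beta:      exploration coefficient from the confidence bound
    """
    V_inv = np.linalg.inv(V)
    scores = []
    for x in arms:
        mean = sigmoid(x @ theta_hat)          # estimated mean reward (assumed logistic link)
        bonus = beta * np.sqrt(x @ V_inv @ x)  # optimism: width of the confidence interval
        scores.append(mean + bonus)
    return int(np.argmax(scores))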
Abstract Dyna planning is an efficient way of learning from real and imaginary experience. Existing tabular and linear Dyna algorithms are single-step, because an “imaginary” feature is predicted only one step into the future. In this paper, we introduce a multi-step Dyna planning method that predicts more steps into the future.
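As a rough illustration of the multi-step idea, the sketch below (hypothetical names, assuming a learned linear feature-prediction matrix F and a linear reward model b) rolls an imagined trajectory several steps forward instead of a single step:

import numpy as np

def multi_step_imagined_rollout(phi, F, b, gamma, n_steps):
    """Roll a learned linear model forward for several imagined steps.

    phi:   (d,) current feature vector
    F:     (d, d) learned feature-to-next-feature prediction matrix
    b:     (d,) learned linear reward model (reward ~ b' phi)
    gamma: discount factor
    Returns the discounted imaginary return and the final predicted feature.
    """
    ret, discount = 0.0, 1.0
    for _ in range(n_steps):
        ret += discount * float(b @ phi)  # predicted one-step reward
        phi = F @ phi                     # imagined next feature vector
        discount *= gamma
    return ret, phi

# toy usage with made-up model parameters
phi0 = np.array([1.0, 0.0])
F = np.array([[0.9, 0.1], [0.0, 0.8]])
b = np.array([0.5, 1.0])
print(multi_step_imagined_rollout(phi0, F, b, gamma=0.9, n_steps=3))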
Abstract: We consider structured multi-armed bandit problems in which the agent is guided by prior knowledge about the structure of the reward, which can be exploited to efficiently select an optimal arm in situations where the number of arms is very large or even infinite. We propose a new optimistic algorithm for non-linear parametric bandit problems using the Generalized Linear Model (GLM) framework.
Abstract Inputs coming from high-dimensional spaces are common in many real-world problems such as robot control with visual inputs. Yet learning in such cases is in general difficult, a fact often referred to as the “curse of dimensionality”. In particular, in regression or classification, algorithms are known to require a number of samples exponential in the dimension in order to achieve a given accuracy.
Abstract Motivated by value function estimation in reinforcement learning, we study statistical linear inverse problems, i.e., problems where the coefficients of a linear system to be solved are observed in noise. We consider penalized estimators, where performance is evaluated using a matrix-weighted two-norm of the defect of the estimator measured with respect to the true, unknown coefficients.
Abstract: The analysis of online least squares estimation is at the heart of many stochastic sequential decision-making problems. We employ tools from the theory of self-normalized processes to provide a simple and self-contained proof of a tail bound for a vector-valued martingale. We use the bound to construct new, tighter confidence sets for the least squares estimate.
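Confidence sets of this kind are typically ellipsoids around the regularized least-squares estimate. The sketch below (a minimal illustration under that assumed form, not the paper's construction) computes the estimate and tests ellipsoid membership:

import numpy as np

def ridge_estimate(X, y, lam=1.0):
    """Regularized least squares: theta_hat = (X'X + lam I)^{-1} X'y.
    Returns the estimate and the design matrix V = X'X + lam I."""
    d = X.shape[1]
    V = X.T @ X + lam * np.eye(d)
    theta_hat = np.linalg.solve(V, X.T @ y)
    return theta_hat, V

def in_confidence_set(theta, theta_hat, V, beta):
    """Test membership in the ellipsoid { th : ||th - theta_hat||_V <= beta }."""
    diff = theta - theta_hat
    return float(diff @ V @ diff) <= beta ** 2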
Abstract: We present a new anytime algorithm that achieves near-optimal regret for any instance of finite stochastic partial monitoring. In particular, the new algorithm achieves the minimax regret, within logarithmic factors, for both "easy" and "hard" problems. For easy problems, it additionally achieves logarithmic individual regret. Most importantly, the algorithm is adaptive in the sense that if the opponent strategy is in an "easy region" of the strategy space then the regret grows as if the problem were easy.
Abstract: In this paper, we consider planning in stochastic shortest path (SSP) problems, a subclass of Markov Decision Problems (MDP). We focus on medium-size problems whose state space can be fully enumerated. This problem has numerous important applications, such as navigation and planning under uncertainty. We propose a new approach for constructing a multi-level hierarchy of progressively simpler abstractions of the original problem.
Abstract We consider the problem of anytime planning in continuous state and action spaces with non-linear deterministic dynamics. We review the existing approaches to this problem and find no algorithms that both quickly find feasible solutions and also eventually approach optimal solutions with additional time. The state-of-the-art solution to this problem is the rapidly-exploring random tree (RRT) algorithm that quickly finds a feasible solution. However, the RRT algorithm does not return better results with additional time.
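For reference, here is a bare-bones RRT loop for a point robot in the unit square (an illustrative toy with no obstacle checking, not the planner studied in the paper): sample a random point, extend the nearest tree node a small step toward it, and stop once a node lands near the goal.

import numpy as np

def rrt(start, goal, n_iters=2000, step=0.1, goal_tol=0.15, seed=0):
    rng = np.random.default_rng(seed)
    goal = np.asarray(goal, dtype=float)
    nodes = [np.asarray(start, dtype=float)]
    parent = {0: None}
    for _ in range(n_iters):
        target = rng.uniform(0.0, 1.0, size=2)        # random sample in the unit square
        dists = [np.linalg.norm(n - target) for n in nodes]
        i = int(np.argmin(dists))                     # nearest existing tree node
        direction = target - nodes[i]
        new = nodes[i] + step * direction / (np.linalg.norm(direction) + 1e-12)
        nodes.append(new)
        parent[len(nodes) - 1] = i
        if np.linalg.norm(new - goal) < goal_tol:     # reached the goal region
            path, j = [], len(nodes) - 1
            while j is not None:                      # walk parents back to the root
                path.append(nodes[j])
                j = parent[j]
            return path[::-1]
    return None  # no feasible path found within the budget

Note the behaviour the abstract criticizes: once a path is returned, running longer does not improve it.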
Abstract: In real supervised learning scenarios, it is not uncommon that the training and test samples follow different probability distributions, making it necessary to correct for the sampling bias. Focusing on a particular covariate shift problem, we derive high-probability confidence bounds for the kernel mean matching (KMM) estimator, whose convergence rate turns out to depend on a regularity measure of the regression function and also on a capacity measure of the kernel.
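KMM reweights the training points so that their kernel mean embedding matches that of the test sample. The following is a simplified sketch that solves the underlying box-constrained quadratic objective by projected gradient descent (an assumed, illustrative solver; the usual KMM formulation additionally constrains the average weight, which is omitted here):

import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

def kmm_weights(X_tr, X_te, B=10.0, lr=1e-3, n_iters=2000):
    """Minimize || (1/n) sum_i beta_i k(x_i,.) - (1/m) sum_j k(x'_j,.) ||^2
    over box-constrained weights beta in [0, B]."""
    n = len(X_tr)
    K = rbf_kernel(X_tr, X_tr)               # (n, n) Gram matrix of the training sample
    kappa = rbf_kernel(X_tr, X_te).mean(1)   # (n,) match against the test mean embedding
    beta = np.ones(n)
    for _ in range(n_iters):
        grad = (K @ beta) / n ** 2 - kappa / n
        beta = np.clip(beta - lr * grad, 0.0, B)  # project onto the box [0, B]
    return beta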
Abstract: We consider the problem of learning a predictor by combining possibly infinitely many linear predictors whose weights are to be learned, too, an instance of multiple kernel learning. To control overfitting, a group p-norm penalty is added to the empirical loss. We consider a reformulation of the problem that lets us implement a randomized version of the proximal point algorithm.
Abstract The behaviour of reinforcement learning (RL) algorithms is best understood in completely observable, finite state- and action-space, discrete-time controlled Markov chains. Robot-learning domains, on the other hand, are inherently infinite both in time and space, and moreover they are only partially observable.
Abstract: We consider online learning in partial-monitoring games against an oblivious adversary. We show that when the number of actions available to the learner is two and the game is nontrivial then it is reducible to a bandit-like game and thus the minimax regret is $\Theta(\sqrt{T})$.
A functional model of the basal ganglia-thalamocortical (BTC) loops is described. In our modeling effort, we try to minimize the complexity of our starting hypotheses. For that reason, we call this type of modeling Ockham's razor modeling. We have the additional constraint that the starting assumptions should not contradict experimental findings about the brain. First assumption: The brain lacks direct representation of paths but represents directions (called speed fields in control theory).
Abstract When data is scarce or the alphabet is large, smoothing the probability estimates becomes inescapable when estimating n-gram models. In this paper we propose a method that implements a form of smoothing by exploiting similarity information of the alphabet elements. The idea is to view the log-conditional probability function as a smooth function defined over the similarity graph.
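One way to realize the smoothness idea is to regularize the log-probabilities with the Laplacian of the similarity graph. A minimal sketch under that assumed formulation, with hypothetical names (the paper's method may differ in detail):

import numpy as np

def laplacian_smooth_logprobs(log_p, W, lam=0.5, lr=0.05, n_iters=200):
    """Smooth log-probability estimates over a similarity graph by
    gradient descent on ||f - log_p||^2 + lam * f' L f, where L is the
    graph Laplacian of the symmetric nonnegative similarity matrix W.
    The step size lr may need tuning to the spectrum of L."""
    L = np.diag(W.sum(1)) - W   # graph Laplacian
    f = log_p.copy()
    for _ in range(n_iters):
        grad = 2.0 * (f - log_p) + 2.0 * lam * (L @ f)
        f -= lr * grad
    return f  # renormalize (e.g., via a softmax) to recover probabilities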
We present a new strategy, Hierarchical Optimistic Optimization (HOO). It is based on a tree representation of the search space, which we explore non-uniformly thanks to upper confidence bounds assigned to each node. Main theoretical result: if one knows the local regularity of the mean-payoff function around its maximum, then it is possible to obtain a cumulative regret of order
Abstract We consider online learning in a special class of episodic Markovian decision processes, namely, loop-free stochastic shortest path problems. In this problem, an agent has to traverse through a finite directed acyclic graph with random transitions while maximizing the obtained rewards along the way. We assume that the reward function can change arbitrarily between consecutive episodes, and is entirely revealed to the agent at the end of each episode.
Lecture slides: LINEAR PREDICTION WITH SIDE INFORMATION. Csaba Szepesvári, University of Alberta, CMPUT 654. E-mail: szepesva@ualberta.ca. UofA, November 23, 2006.
Abstract We consider the problem of model selection in the batch (offline, non-interactive) reinforcement learning setting when the goal is to find an action-value function with the smallest Bellman error among a countable set of candidate functions.
Abstract We compare scaling properties of several value-function estimation algorithms. In particular, we prove that Q-learning can scale exponentially slowly with the number of states. We identify the reasons for the slow convergence and show that both TD(λ) and learning with a fixed learning rate enjoy rather fast convergence, just like the model-based method.
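For context, the tabular Q-learning update in question has the following familiar form (a generic textbook sketch, not code from the paper); the choice of the step size alpha is exactly what drives the fast/slow convergence contrast:

import numpy as np

def q_update(Q, s, a, r, s_next, alpha, gamma=0.95):
    """One tabular Q-learning update on a (n_states, n_actions) array Q.
    With the classic decaying step size alpha_t = 1/t, value propagation
    can be very slow; a fixed alpha trades asymptotic accuracy for speed."""
    td_target = r + gamma * Q[s_next].max()
    Q[s, a] += alpha * (td_target - Q[s, a])
    return Q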
Lecture slides: DISCRETE PREDICTION PROBLEMS: RANDOMIZED PREDICTION. Csaba Szepesvári, University of Alberta, CMPUT 654. E-mail: szepesva@cs.ualberta.ca. UofA, October 1-6, 2009.
Lecture slides: NONPARAMETRIC BANDITS. Csaba Szepesvári, University of Alberta, CMPUT 654. E-mail: szepesva@ualberta.ca. UofA, September 26, 2006.
Abstract The problem of estimating the value function underlying a Markovian reward process is considered. As is well known, the value function underlying a Markovian reward process satisfies a linear fixed point equation. One approach to learning the value function from finite data is to find a good approximation to the value function in a given (linear) subspace of the space of value functions.
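A standard estimator in this linear-subspace setting is LSTD; the following minimal batch version (illustrative, with assumed notation: the rows of Phi are feature vectors of visited states and the rows of Phi_next those of their successors) solves the empirical version of the fixed point equation:

import numpy as np

def lstd(Phi, Phi_next, rewards, gamma=0.99, reg=1e-6):
    """Batch LSTD: solve A w = b with
    A = Phi' (Phi - gamma * Phi_next) and b = Phi' r,
    so that V(s) ~ phi(s)' w approximates the fixed point of the
    Bellman equation within the span of the features."""
    d = Phi.shape[1]
    A = Phi.T @ (Phi - gamma * Phi_next) + reg * np.eye(d)
    b = Phi.T @ rewards
    return np.linalg.solve(A, b)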
Lecture slides: RIDGE REGRESSION AND VARIANTS. Csaba Szepesvári, University of Alberta, CMPUT 654. E-mail: szepesva@ualberta.ca. UofA, November 30, 2006.
Abstract In this paper we consider sequence clustering problems and propose an algorithm for the estimation of the number of clusters based on the X-means algorithm. The sequences are modeled using mixtures of Hidden Markov Models. By means of experiments with synthetic data we analyze the proposed algorithm. This algorithm proved to be both computationally efficient and capable of providing accurate estimates of the number of clusters. Some results of experiments with real-world Web-log data are also given.
Sequential decision-making problems are of prominent importance in robotics. Many robotics tasks, such as motion planning for a robotic arm, gait optimization for a quadruped robot, and dynamic balancing of humanoids, can all be formulated as sequential decision-making problems.
Abstract. We consider bounded resource planning in a Markovian decision problem, i.e., the problem of finding a good policy given access to a generative model of the environment and a limit on the computational resources. We propose to use the fitted Q-iteration algorithm with penalized (or regularized) least-squares regression as the regression subroutine to address the problem of selecting an appropriate function approximator in each iteration.
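A minimal sketch of fitted Q-iteration with ridge regression as the regression subroutine (illustrative only; featurize, the batch arrays, and the hyperparameters are assumptions, and the paper's penalization may differ):

import numpy as np

def fitted_q_iteration(S, A, R, S_next, actions, featurize,
                       n_iters=50, gamma=0.99, lam=1.0):
    """Fitted Q-iteration on a batch of transitions (S, A, R, S_next).
    featurize(s, a) -> (d,) feature vector; each iteration regresses the
    Bellman targets onto the features with a ridge penalty."""
    X = np.array([featurize(s, a) for s, a in zip(S, A)])
    d = X.shape[1]
    w = np.zeros(d)
    for _ in range(n_iters):
        # Bellman targets under the current Q estimate
        q_next = np.array([max(featurize(s2, a2) @ w for a2 in actions)
                           for s2 in S_next])
        y = R + gamma * q_next
        # penalized least squares as the regression subroutine
        w = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
    return w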
Abstract The task of image auto-annotation, namely assigning a set of relevant tags to an image, is challenging due to the size and variability of tag vocabularies. Consequently, most existing algorithms focus on tag assignment and fix an often large number of hand-crafted features to describe image characteristics.
Lecture slides: NON-STOCHASTIC BANDIT PROBLEMS. Csaba Szepesvári, University of Alberta, CMPUT 654. E-mail: szepesva@ualberta.ca. UofA, November 9, 2006.
Talk slides: ON THE LINEAR BELLMAN OPTIMALITY EQUATION AND ALL THAT. Csaba Szepesvári, University of Alberta, RLAI LLL-Talk. E-mail: szepesva@ualberta.ca. UofA, May 5, 2010.
Abstract Bayesian priors offer a compact yet general means of incorporating domain knowledge into many learning tasks. The correctness of the Bayesian analysis and inference, however, largely depends on the accuracy and correctness of these priors. PAC-Bayesian methods overcome this problem by providing bounds that hold regardless of the correctness of the prior distribution. This paper introduces the first PAC-Bayesian bound for the batch reinforcement learning problem with function approximation.
Abstract: The success of kernel-based learning methods depends on the choice of kernel. Recently, kernel learning methods have been proposed that use data to select the most appropriate kernel, usually by combining a set of base kernels. We introduce a new algorithm for kernel learning that combines a continuous set of base kernels, without the common step of discretizing the space of base kernels.
Abstract. In a partial-monitoring problem, in every round a learner chooses an action and, simultaneously, an opponent chooses an outcome; then the learner suffers some loss and receives some feedback. The goal of the learner is to minimize his (unobserved) cumulative loss. In this paper we explore a variant of this problem where in every round, before the learner makes his decision, he receives some side-information.
Talk slides: Policy iteration is strongly polynomial. Based on Yinyu Ye's working paper http://www.stanford.edu/~yyye/simplexmdp1.pdf. Csaba Szepesvári, University of Alberta. E-mail: szepesva@ualberta.ca. Tea-Time Talk, July 29, 2010. Outline: 1. Motivation; 2. Background; 3. Policy iteration, linear programming and the simplex method; 4. The algorithm; 5. Proof; 6. Conclusions.
Abstract. In this paper we consider two novel kernel machine based feature extraction algorithms in a regression setting. The first method is derived based on the principles underlying the recently introduced Maximum Margin Discrimination Analysis (MMDA) algorithm. However, here it is shown that the orthogonalization principle employed by the original MMDA algorithm can be motivated using the well-known ambiguity decomposition, thus providing a firm ground for the good performance of the algorithm.
Abstract The TTS system described in this paper is based on the analysis and resynthesis of a given speaker's voice. First, the speaker's voice definition is prepared off-line: a diphone database is recorded, segmented, and analyzed every 6 ms to obtain the filter parameters of an all-pole (AR) filter. During the on-line synthesis, the filters are excited with a mixture of a predefined periodic glottal source and white noise.
Abstract We consider linear prediction problems in a stochastic environment. The least mean square (LMS) algorithm is a well-known, easy-to-implement and computationally cheap solution to this problem. However, as is well known, the LMS algorithm, being a stochastic gradient descent rule, may converge slowly. The recursive least squares (RLS) algorithm overcomes this problem, but its computational cost is quadratic in the problem dimension.
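The trade-off is easy to see side by side; below are generic per-step updates for both algorithms (standard textbook forms, not code from the paper):

import numpy as np

def lms_step(w, x, y, lr=0.01):
    """Least mean squares: an O(d) stochastic-gradient step."""
    return w + lr * (y - w @ x) * x

def rls_step(w, P, x, y, lam=1.0):
    """Recursive least squares: O(d^2) per step, but typically much
    faster to converge than LMS. P tracks the inverse of the
    (forgetting-factor weighted) design matrix; lam is the forgetting factor."""
    Px = P @ x
    k = Px / (lam + x @ Px)          # gain vector
    w = w + k * (y - w @ x)
    P = (P - np.outer(k, Px)) / lam  # rank-one update of the inverse
    return w, P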
The goal of this research is to apply the most advanced methods of stochastic systems to the modeling of financial markets, and to further develop these methods themselves. One of the greatest challenges in financial mathematics today is the design of good hedging strategies in incomplete markets. Mathematically, this amounts to a particular stochastic adaptive control problem in which the dynamics are described by a high-dimensional switching diffusion process. Numerous subproblems are connected to this general problem.
The use of function approximators to represent the value function of Reinforcement Learning (RL) and Dynamic Programming (DP) problems with large state spaces is inevitable. Although different methods for value function approximation have been considered (such as generalized linear models with predefined basis functions, regression trees, neural networks, etc.), the designer still needs to make non-trivial design choices such as basis function selection or the stopping rule for growing the tree.
Abstract A Markov-chain Monte Carlo based algorithm is provided to solve the Simultaneous Localization and Mapping (SLAM) problem with general dynamics and observation models, under open-loop control and provided that the map representation is finite-dimensional. To our knowledge this is the first provably consistent yet (close-to) practical solution to this problem. The superiority of our algorithm over alternative SLAM algorithms is demonstrated in a difficult loop-closing situation.
Abstract Most learning algorithms assume that a data set is given initially. We address the common situation where data is not available initially, but can be obtained, at a cost. We focus on learning Bayesian belief networks (BNs) over discrete variables. As such BNs are models of probabilistic distributions, we consider the “generative” challenge of learning the parameters, for a fixed structure, that best match the true distribution.
We analyze the rate of convergence of the estimation error in regularized least-squares regression when the data is exponentially β-mixing. The results are proven under the assumption that the metric entropy of the balls in the chosen function space grows at most polynomially. In order to prove our main result, we also derive a relative deviation concentration inequality for β-mixing processes, which might be of independent interest.
Abstract In this paper we consider the problem of finding a good policy given some batch data. We propose a new approach, LAMAPI, that first builds a so-called linear action model (LAM) from the data and then uses the learned model and the collected data in approximate policy iteration (API) to find a good policy. A natural choice for the policy evaluation step in this algorithm is to use the least-squares temporal difference (LSTD) learning algorithm.
Abstract We consider a generalization of stochastic bandits where the set of arms is allowed to be a generic measurable space and the mean-payoff function is “locally Lipschitz” with respect to a dissimilarity function that is known to the decision maker. Under this condition we construct an arm selection policy, called HOO (hierarchical optimistic optimization), with improved regret bounds compared to previous results for a large class of problems.
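To convey the mechanics, here is a much-simplified HOO round on a one-dimensional arm space (illustrative only: the B-value form, the midpoint play, and the parameters nu and rho are assumptions standing in for the paper's exact definitions):

import numpy as np

class HooNode:
    def __init__(self, lo, hi, depth):
        self.lo, self.hi, self.depth = lo, hi, depth
        self.count, self.mean = 0, 0.0
        self.children = None
        self.b_value = np.inf   # unvisited regions are maximally optimistic

def hoo_step(root, payoff, t, rho=0.5, nu=1.0):
    # descend: follow the child with the largest B-value down to a leaf
    path, node = [], root
    while node.children is not None:
        node = max(node.children, key=lambda c: c.b_value)
        path.append(node)
    x = 0.5 * (node.lo + node.hi)   # play the midpoint of the selected cell
    r = payoff(x)
    node.children = [HooNode(node.lo, x, node.depth + 1),
                     HooNode(x, node.hi, node.depth + 1)]
    # update counts, means and B-values from the leaf back to the root
    for n in reversed([root] + path):
        n.count += 1
        n.mean += (r - n.mean) / n.count
        # U-value: empirical mean + exploration width + cell-diameter term
        u = n.mean + np.sqrt(2.0 * np.log(t + 1) / n.count) + nu * rho ** n.depth
        n.b_value = min(u, max(c.b_value for c in n.children))
    return x, r

A driver would create root = HooNode(0.0, 1.0, 0) and call hoo_step(root, f, t) once per round t = 1, 2, ..., where f is the stochastic payoff function.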
We investigate the problem of learning the transition dynamics of deterministic, discrete-state environments. We assume that an agent exploring such an environment is able to perform actions (from a finite set of actions) in the environment and to sense the state changes. The question investigated is whether the agent can learn the dynamics without visiting all states. Such a goal is unrealistic in general, hence we assume that the environment has structural properties an agent might exploit.

And 108 more