Learning Rates for Q-learning
In this paper we derive convergence rates for Q-learning. We show an interesting relationship between the convergence rate and the learning rate used in Q-learning. For a polynomial learning rate, one which is 1/t^ω at time t, where ω ∈ (1/2, 1), we show ...
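The update rule in question is easy to state concretely. Below is a minimal sketch of a tabular Q-learning update with a polynomial learning rate; the state/action indexing, the discount factor, and the use of the counter t are illustrative assumptions, not the paper's exact setup.

```python
import numpy as np

def q_update(Q, s, a, r, s_next, t, omega=0.7, gamma=0.9):
    """One tabular Q-learning update with the polynomial learning rate
    alpha_t = 1 / t**omega, omega in (1/2, 1), as in the abstract.
    Q is a 2-D array indexed by (state, action); t would typically be
    the visit count of the pair (s, a)."""
    alpha = 1.0 / t**omega                   # polynomial learning rate
    target = r + gamma * np.max(Q[s_next])   # bootstrapped one-step target
    Q[s, a] += alpha * (target - Q[s, a])    # standard Q-learning correction
    return Q
```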
Learning the Kernel Matrix with Semidefinite Programming
Kernel-based learning algorithms work by embedding the data into a Euclidean space, and then searching for linear relations among the embedded data points. The embedding is performed implicitly, by specifying the inner products between each pair of ...
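As a rough illustration of optimizing over the kernel (Gram) matrix itself, here is a hedged sketch that learns a nonnegative combination of candidate Gram matrices by maximizing alignment with the labels. Restricting the weights to be nonnegative keeps the combination positive semidefinite without an explicit semidefinite constraint, so this is a simplification of the paper's full SDP formulation; the cvxpy modeling and the alignment objective are illustrative choices.

```python
import cvxpy as cp
import numpy as np

def learn_kernel_weights(kernels, y, trace_bound=1.0):
    """Sketch: learn K = sum_i mu_i K_i over candidate Gram matrices by
    maximizing the Frobenius inner product <K, y y^T> (kernel-target
    alignment), with mu >= 0 and a trace constraint on K."""
    m = len(kernels)
    mu = cp.Variable(m, nonneg=True)          # nonnegative combination weights
    K = sum(mu[i] * kernels[i] for i in range(m))
    yyT = np.outer(y, y)
    alignment = cp.sum(cp.multiply(K, yyT))   # <K, y y^T>
    problem = cp.Problem(cp.Maximize(alignment), [cp.trace(K) == trace_bound])
    problem.solve()
    return mu.value
```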
Dimensionality Reduction for Supervised Learning with Reproducing Kernel Hilbert Spaces
We propose a novel method of dimensionality reduction for supervised learning problems. Given a regression or classification problem in which we wish to predict a response variable Y from an explanatory variable X, we treat the problem of dimensionality ...
In Defense of One-Vs-All Classification
We consider the problem of multiclass classification. Our main thesis is that a simple "one-vs-all" scheme is as accurate as any other approach, assuming that the underlying binary classifiers are well-tuned regularized classifiers such as support ...
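The scheme itself takes only a few lines. A minimal sketch, assuming scikit-learn's SVC as the well-tuned regularized base learner (any comparable binary classifier would do):

```python
import numpy as np
from sklearn.svm import SVC

class OneVsAll:
    """One-vs-all multiclass: one binary classifier per class, predict the
    class whose classifier is most confident on the input."""
    def fit(self, X, y):
        y = np.asarray(y)
        self.classes_ = np.unique(y)
        # Train one binary SVM per class with +1/-1 targets.
        self.models_ = [SVC(kernel="rbf", C=1.0).fit(X, np.where(y == c, 1, -1))
                        for c in self.classes_]
        return self
    def predict(self, X):
        scores = np.column_stack([m.decision_function(X) for m in self.models_])
        return self.classes_[np.argmax(scores, axis=1)]
```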
Lossless Online Bayesian Bagging
Bagging frequently improves the predictive performance of a model. An online version has recently been introduced, which attempts to gain the benefits of an online algorithm while approximating regular bagging. However, regular online bagging is an ...
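For context, the standard online approximation to bagging (due to Oza and Russell) presents each incoming example to each base learner k ~ Poisson(1) times. The sketch below illustrates that scheme; the partial_fit-style interface is an assumption about the base learner, and the comments note how a Bayesian variant with continuous weights avoids the approximation.

```python
import numpy as np

rng = np.random.default_rng(0)

def online_bagging_update(learners, x, y):
    """Poisson-based online bagging: each learner sees the new example
    k ~ Poisson(1) times, approximating the multinomial resampling weights
    of batch bagging. A lossless Bayesian variant instead draws continuous
    Gamma(1) weights, so no example is ever fully dropped."""
    for learner in learners:
        k = rng.poisson(1.0)                 # approximate bootstrap count
        for _ in range(k):
            learner.partial_fit([x], [y])    # assumed incremental interface
```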
Subgroup Discovery with CN2-SD
This paper investigates how to adapt standard classification rule learning approaches to subgroup discovery. The goal of subgroup discovery is to find rules describing subsets of the population that are sufficiently large and statistically unusual. The ...
Generalization Error Bounds for Threshold Decision Lists
In this paper we consider the generalization accuracy of classification methods based on the iterative use of linear classifiers. The resulting classifiers, which we call threshold decision lists, act as follows. Some points of the data set to be ...
On the Importance of Small Coordinate Projections
It has been recently shown that sharp generalization bounds can be obtained when the function class from which the algorithm chooses its hypotheses is "small" in the sense that the Rademacher averages of this function class are small. We show that a new ...
Weather Data Mining Using Independent Component Analysis
In this article, we apply independent component analysis to the mining of spatio-temporal data. The technique has been applied to mine for patterns in weather data using the North Atlantic Oscillation (NAO) as a specific example. We find that ...
Online Choice of Active Learning Algorithms
This work is concerned with the question of how to combine online an ensemble of active learners so as to expedite the learning progress in pool-based active learning. We develop an active-learning master algorithm, based on a known competitive ...
A Compression Approach to Support Vector Model Selection
In this paper we investigate connections between statistical learning theory and data compression on the basis of support vector machine (SVM) model selection. Inspired by several generalization bounds we construct "compression coefficients" for SVMs ...
A Geometric Approach to Multi-Criterion Reinforcement Learning
We consider the problem of reinforcement learning in a controlled Markov environment with multiple objective functions of the long-term average reward type. The environment is initially unknown, and furthermore may be affected by the actions of other ...
RCV1: A New Benchmark Collection for Text Categorization Research
Reuters Corpus Volume I (RCV1) is an archive of over 800,000 manually categorized newswire stories recently made available by Reuters, Ltd. for research purposes. Use of this data for research on text categorization requires a detailed understanding of ...
Distributional Scaling: An Algorithm for Structure-Preserving Embedding of Metric and Nonmetric Spaces
We present a novel approach for embedding general metric and nonmetric spaces into low-dimensional Euclidean spaces. As opposed to traditional multidimensional scaling techniques, which minimize the distortion of pairwise distances, our embedding ...
Learning Ensembles from Bites: A Scalable and Accurate Approach
Bagging and boosting are two popular ensemble methods that typically achieve better accuracy than a single classifier. On massive data sets, however, these techniques are limited because the sheer size of the data becomes a bottleneck. Voting many classifiers ...
Robust Principal Component Analysis with Adaptive Selection for Tuning Parameters
The present paper discusses robustness against outliers in principal component analysis (PCA). We propose a class of procedures for PCA based on the minimum psi principle, which unifies various approaches, including the classical procedure and ...
PAC-learnability of Probabilistic Deterministic Finite State Automata
We study the learnability of Probabilistic Deterministic Finite State Automata under a modified PAC-learning criterion. We argue that it is necessary to add further parameters to the sample complexity polynomial, namely a bound on the expected length ...
Sources of Success for Boosted Wrapper Induction
In this paper, we examine an important recent rule-based information extraction (IE) technique named Boosted Wrapper Induction (BWI) by conducting experiments on a wider variety of tasks than previously studied, including tasks using several collections ...
Computable Shell Decomposition Bounds
Haussler, Kearns, Seung and Tishby introduced the notion of a shell decomposition of the union bound as a means of understanding certain empirical phenomena in learning curves such as phase transitions. Here we use a variant of their ideas to derive an ...
Exact Bayesian Structure Discovery in Bayesian Networks
Learning a Bayesian network structure from data is a well-motivated but computationally hard task. We present an algorithm that computes the exact posterior probability of a subnetwork, e.g., a directed edge; a modified version of the algorithm finds ...
A Universal Well-Calibrated Algorithm for On-line Classification
We study the problem of on-line classification in which the prediction algorithm, for each "significance level" δ, is required to output as its prediction a range of labels (intuitively, those labels deemed compatible with the available data at the ...
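A region predictor of this kind can be sketched with a conformal-prediction-style p-value test; the nonconformity scores and their source are illustrative assumptions here, not the paper's construction.

```python
import numpy as np

def prediction_region(scores_per_label, past_scores, delta):
    """Include every candidate label whose nonconformity score is not among
    the most extreme delta-fraction relative to previously observed scores,
    i.e. whose p-value exceeds the significance level delta."""
    region = []
    n = len(past_scores) + 1
    past = np.asarray(past_scores)
    for label, s in scores_per_label.items():
        p_value = (np.sum(past >= s) + 1) / n   # rank-based p-value
        if p_value > delta:                     # label still compatible
            region.append(label)
    return region
```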
New Techniques for Disambiguation in Natural Language and Their Application to Biological Text
We study the problems of disambiguation in natural language, focusing on the problem of gene vs. protein name disambiguation in biological text and also considering the problem of context-sensitive spelling error correction. We introduce a new family of ...
The Sample Complexity of Exploration in the Multi-Armed Bandit Problem
We consider the multi-armed bandit problem under the PAC ("probably approximately correct") model. It was shown by Even-Dar et al. (2002) that given n arms, a total of O((n/ε²) log(1/δ)) trials suffices in order to find an ε-optimal arm with probability ...
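For orientation, the naive baseline behind such bounds is uniform sampling plus Hoeffding's inequality and a union bound, which costs an extra log n factor over the bound quoted above; a minimal sketch:

```python
import math

def naive_pac_best_arm(arms, eps, delta):
    """Pull each of the n arms m = ceil((2/eps^2) * ln(2n/delta)) times and
    return the arm with the best empirical mean. By Hoeffding plus a union
    bound, each mean is within eps/2 of its true value with probability
    >= 1 - delta, so the winner is eps-optimal: O((n/eps^2) log(n/delta))
    pulls in total. Median elimination (Even-Dar et al.) removes the log n
    factor, matching the bound quoted in the abstract.
    `arms` is assumed to be a list of zero-argument callables returning
    rewards in [0, 1]."""
    n = len(arms)
    m = math.ceil((2 / eps**2) * math.log(2 * n / delta))
    means = [sum(arm() for _ in range(m)) / m for arm in arms]
    return max(range(n), key=lambda i: means[i])
```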
Preference Elicitation and Query Learning
In this paper we explore the relationship between "preference elicitation", a learning-style problem that arises in combinatorial auctions, and the problem of learning via queries studied in computational learning theory. Preference elicitation is the ...
Distance-Based Classification with Lipschitz Functions
The goal of this article is to develop a framework for large margin classification in metric spaces. We want to find a generalization of linear decision functions for metric spaces and define a corresponding notion of margin such that the decision ...
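One concrete member of such a function class is the difference of distances to the nearest positive and nearest negative training points, which is Lipschitz with respect to the metric. This tiny sketch only illustrates the kind of decision function involved, not the paper's algorithm:

```python
def lipschitz_decision(x, pos, neg, d):
    """Lipschitz decision function on a metric space: positive values favor
    the positive class. d is any metric, e.g. edit distance on strings;
    pos and neg are the positive and negative training points."""
    return min(d(x, z) for z in neg) - min(d(x, z) for z in pos)
```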
Hierarchical Latent Class Models for Cluster Analysis
Latent class models are used for cluster analysis of categorical data. Underlying such a model is the assumption that the observed variables are mutually independent given the class variable. A serious problem with the use of latent class models, known ...
Bias-Variance Analysis of Support Vector Machines for the Development of SVM-Based Ensemble Methods
Bias-variance analysis provides a tool to study learning algorithms and can be used to properly design ensemble methods well tuned to the properties of a specific base learner. Indeed the effectiveness of ensemble methods critically depends on accuracy, ...
A Fast Algorithm for Joint Diagonalization with Non-orthogonal Transformations and its Application to Blind Source Separation
A new efficient algorithm is presented for joint diagonalization of several matrices. The algorithm is based on the Frobenius-norm formulation of the joint diagonalization problem, and addresses diagonalization with a general, non-orthogonal ...
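The Frobenius-norm formulation referred to here measures the off-diagonal energy left in the transformed matrices; a minimal sketch of that cost follows (the paper's fast update rules for minimizing it over non-orthogonal transforms are not reproduced):

```python
import numpy as np

def off_diag_cost(W, matrices):
    """Joint-diagonalization objective: sum over the target matrices A_k of
    the squared off-diagonal entries of W A_k W^T. A perfect joint
    diagonalizer W drives this cost to zero."""
    cost = 0.0
    for A in matrices:
        M = W @ A @ W.T
        cost += np.sum(M**2) - np.sum(np.diag(M)**2)  # off-diagonal energy
    return cost
```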
Feature Discovery in Non-Metric Pairwise Data
Pairwise proximity data, given as a similarity or dissimilarity matrix, can violate metricity. This occurs due to noise, fallible estimates, or intrinsic non-metric features such as those arising from human judgments. So far the problem of non-...
Probability Product Kernels
The advantages of discriminative learning algorithms and kernel machines are combined with generative modeling using a novel kernel between distributions. In the probability product kernel, data points in the input space are mapped to distributions over ...
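The probability product kernel evaluates K(p, p') = ∫ p(x)^ρ p'(x)^ρ dx between the distributions fitted to the data points. For two Gaussians with ρ = 1 (the expected-likelihood special case) the integral has a simple closed form; the sketch below assumes that case, and other choices such as ρ = 1/2 (the Bhattacharyya kernel) also admit closed forms not shown here.

```python
import numpy as np
from scipy.stats import multivariate_normal

def expected_likelihood_kernel(mu1, cov1, mu2, cov2):
    """Probability product kernel with rho = 1 between two Gaussians:
    the integral of N(x; mu1, cov1) * N(x; mu2, cov2) over x equals the
    density N(mu1; mu2, cov1 + cov2)."""
    return multivariate_normal.pdf(mu1, mean=mu2,
                                   cov=np.asarray(cov1) + np.asarray(cov2))
```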