Sparse Machine Learning Methods for Prediction and Personalized Medicine
- Citation
- MLA: Yu, Hang. Sparse Machine Learning Methods for Prediction and Personalized Medicine. 2021. https://doi.org/10.17615/zrg5-a013
- APA: Yu, H. (2021). Sparse Machine Learning Methods for Prediction and Personalized Medicine. https://doi.org/10.17615/zrg5-a013
- Chicago: Yu, Hang. 2021. Sparse Machine Learning Methods for Prediction and Personalized Medicine. https://doi.org/10.17615/zrg5-a013
- Creator
- Yu, Hang
- Affiliation: College of Arts and Sciences, Department of Statistics and Operations Research
- Abstract
- With growing interest in using black-box machine learning for complex data with many feature variables, it is critical to obtain a prediction model that depends on only a small set of features in order to maximize generalizability. Feature selection therefore remains an important and challenging problem in modern applications. Most existing methods for feature selection are based on either parametric or semiparametric models, so their performance can suffer severely from model misspecification when high-order nonlinear interactions among the features are present. Only a very limited number of approaches for nonparametric feature selection have been proposed, and they are computationally intensive and may not even converge. Thus, nonparametric feature selection for high-dimensional data is an important open problem in statistics and machine learning. Furthermore, in precision medicine, machine learning techniques are usually applied to large health datasets containing patients' information to find an optimal individual treatment rule (ITR), which makes the learning process computationally demanding; identifying the truly important feature variables shortens the computation time and saves the cost of collecting redundant data. This dissertation therefore develops machine learning techniques that perform variable selection for both prediction and personalized medicine. In the first project, we propose a novel and computationally efficient approach for nonparametric feature selection in regression based on a tensor-product kernel function over the feature space. The importance of each feature is governed by a parameter in the kernel function, which can be computed efficiently and iteratively with a modified alternating direction method of multipliers (ADMM) algorithm. We prove the oracle selection property of the proposed method.
Finally, we demonstrate the superior performance of our approach over existing methods via simulation studies and an application to the prediction of Alzheimer's disease. In the second project, we propose a new framework for nonparametric feature selection in both regression and classification problems. Under this framework, we learn prediction functions through empirical risk minimization over a reproducing kernel Hilbert space (RKHS) generated by a novel tensor-product kernel that depends on a set of parameters determining the importance of the features. Computationally, we minimize a penalized empirical risk to estimate the prediction function and the kernel parameters simultaneously; the solution is obtained by iteratively solving convex optimization problems. We study the theoretical properties of the kernel feature space and prove the oracle selection property and Fisher consistency of the proposed method. We demonstrate the superior performance of our approach over existing methods via extensive simulation studies and an application to a microarray study of eye disease in animals. In the third project, we apply the nonparametric feature selection framework to treatment decision making with high-dimensional data. We directly estimate the decision function in an RKHS generated by a novel tensor-product kernel with parameters capturing the importance of each variable. Computationally, we separate the estimation and tuning procedures into two steps, which makes the computation faster and more stable. Finally, we demonstrate the superior performance of our approach over existing methods via a simulation study and an application to type 2 diabetes.
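As a rough, hypothetical illustration of the common ingredient in all three projects — a tensor-product kernel whose per-feature parameters act as importance weights, with a zero weight effectively removing a feature — the following sketch pairs a weighted tensor-product Gaussian kernel with a naive alternating scheme: a kernel-ridge solve for the prediction function, then a penalized gradient step for the weights. The function names, the exact kernel parametrization, and the update rule are assumptions for illustration only; the dissertation's actual kernels, penalties, and algorithms are defined in the chapters themselves.

```python
import numpy as np

def tensor_product_kernel(X, Z, theta):
    """Weighted tensor-product Gaussian kernel (illustrative parametrization):
    K(x, z) = prod_j exp(-theta_j * (x_j - z_j)^2).
    theta_j = 0 makes K constant in feature j, i.e. the feature is dropped."""
    d2 = (np.asarray(X)[:, None, :] - np.asarray(Z)[None, :, :]) ** 2
    return np.exp(-(d2 * np.asarray(theta)).sum(axis=2))

def alternating_fit(X, y, lam=0.1, mu=0.05, steps=5, lr=0.1):
    """Naive alternating scheme (illustration only, not the dissertation's algorithm):
    1) kernel ridge solve for the prediction function given theta;
    2) penalized central-difference gradient step for the importance weights theta."""
    n, p = X.shape
    theta = np.full(p, 1.0 / p)                           # start with equal importance
    for _ in range(steps):
        K = tensor_product_kernel(X, X, theta)
        alpha = np.linalg.solve(K + lam * np.eye(n), y)   # ridge step for f
        def penalized_risk(th):
            resid = y - tensor_product_kernel(X, X, th) @ alpha
            return resid @ resid / n + mu * np.abs(th).sum()  # L1 sparsity penalty
        grad = np.array([
            (penalized_risk(theta + e) - penalized_risk(theta - e)) / 2e-5
            for e in 1e-5 * np.eye(p)
        ])
        theta = np.clip(theta - lr * grad, 0.0, None)     # keep weights nonnegative
    return theta, alpha
```

In the dissertation itself the kernel parameters are updated with a modified ADMM (project 1) or by iteratively solving convex optimization problems (project 2), not by a numerical gradient; the sketch only conveys the alternating structure and the role of the sparsity penalty.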
- Date of publication
- 2021
- Keyword
- DOI
- https://doi.org/10.17615/zrg5-a013
- Resource type
- Rights statement
- In Copyright - Educational Use Permitted
- Advisor
- Zeng, Donglin
- Zhang, Kai
- Bhamidi, Shankar
- Ji, Chuanshu
- Wang, Yuanjia
- Degree
- Doctor of Philosophy
- Degree granting institution
- University of North Carolina at Chapel Hill Graduate School
- Graduation year
- 2021
- Language
- English
Items

Title | Date Uploaded | Visibility | Actions
---|---|---|---
Yu_unc_0153D_20412.pdf | 2021-08-17 | Public | Download