
SVM Quantile DNA


459 © 2008 Schattauer GmbH

Support Vector Machine Quantile Regression for Detecting Differentially Expressed Genes in Microarray Analysis
I. Sohn¹, S. Kim², C. Hwang³, J. W. Lee¹, J. Shim⁴
¹ Department of Statistics, Korea University, Seoul, Korea
² Skin Research Institute, AmorePacific R&D Center, Kyounggi-do, Korea
³ Division of Information and Computer Sciences, Dankook University, Kyounggi-do, Korea
⁴ Department of Applied Statistics, Catholic University of Daegu, Kyungbuk, Korea

Summary

Objectives: One of the main objectives of microarray analysis is to identify genes that are differentially expressed under two distinct experimental conditions. This task is complicated by the noisiness of the data and the large number of genes examined. Fold change (FC)-based gene selection is often misleading because the error variability of each gene is heterogeneous across intensity ranges. Several statistical methods have been suggested, but some of them result in high false positive rates because they make very strong parametric assumptions.
Methods: We present support vector machine quantile regression (SVMQR) using an iterative reweighted least squares (IRWLS) procedure based on the Newton method instead of the usual quadratic programming algorithms. This procedure makes it possible to derive the generalized approximate cross validation (GACV) method for choosing the parameters that affect the performance of SVMQR. We propose a novel SVMQR-based method for identifying differentially expressed genes with a small number of replicated microarrays.
Results: We applied SVMQR to three biological datasets and a simulated dataset and showed that it performed more reliably and consistently than FC-based gene selection, Newton's method based on the posterior odds of change, or the nonparametric t-test variant implemented in significance analysis of microarrays (SAM).
Conclusions: The SVMQR method is an exploratory method for cDNA microarray experiments that identifies genes with different expression levels between two types of samples (e.g., tumor versus normal tissue). It performed well when the error variability of each gene was heterogeneous across intensity ranges.

Keywords: cDNA microarray, support vector machine, support vector machine quantile regression

Methods Inf Med 2008; 47: 459-467
doi:10.3414/ME0396
Received: January 16, 2006; accepted: June 9, 2008

1. Introduction
The DNA microarray is a new tool in biotechnology that allows the simultaneous monitoring of the expression of thousands of genes in cells [1]. It has important applications in pharmaceutical and clinical research, including tumor classification, molecular pathway modeling, and functional genomics. One important and widely accepted use is the comparison of gene expression under two distinct experimental conditions (treated vs. untreated samples, diseased vs. normal tissue, mutant vs. wild-type organisms, etc.). In this kind of experimental setup, the main challenge is to determine which genes are differentially expressed across two tissue samples or samples obtained under two experimental conditions, i.e., to find the genes whose expression levels are strongly associated with the response of interest. In the early days, a simple fold change (FC) rule with an arbitrary cutoff point was applied to detect differentially expressed genes [2]. However, simply using an FC rule is known to be unreliable and inefficient [3]. Newton et al. [4] considered a hierarchical Gamma-Gamma-Bernoulli model, but its assumptions seem too strong for routine data analysis; they are clearly violated by the biological variation in a number of common experimental designs. Moreover, methods for identifying differentially expressed genes in microarray data where the error variability of each gene is heterogeneous over intensity ranges have not been investigated. Therefore, we propose support vector machine quantile regression (SVMQR), which utilizes the support vector machine (SVM) and performs well on microarray data with heterogeneous error variability depending on signal intensity. We introduce this nonparametric quantile regression method for identifying differentially expressed genes with a small number of replicated microarrays.

Quantile regression, first introduced by Koenker and Bassett [5], is a popular method for estimating the quantiles of a distribution conditional on the values of covariates. Just as classical linear regression methods, which minimize the sum of squared residuals, enable one to estimate a wide variety of models for conditional mean functions, quantile regression methods enable one to estimate models for conditional quantile functions [6]. By supplementing the estimation of conditional mean functions with techniques for estimating an entire family of conditional quantile functions, quantile regression can potentially give a more complete statistical analysis of the stochastic relationships among random variables [6].

Originally, SVM was developed by Vapnik [7, 8] to solve classification problems, but its application has been expanded to regression problems. Because it is based on the structural risk minimization (SRM) principle, which minimizes an upper bound on the expected risk, unlike traditional empirical risk minimization (ERM), which minimizes the error on the training data, we believe that SVM regression will perform better for prediction and estimation of regression functions than neural networks [9] and multivariate adaptive regression splines (MARS) [10]. Detailed information about SVM regression can be found in Cristianini and Shawe-Taylor [11], Gunn [12], Smola and Schölkopf [13], and Vapnik [7, 8].
Methods Inf Med 5/2008

Downloaded from www.methods-online.com on 2011-12-17 | IP: 129.215.5.255 For personal or educational use only. No other uses without permission. All rights reserved.

460 Sohn et al.

In this article, we propose SVMQR using an IRWLS procedure based on the Newton method instead of the usual quadratic programming algorithms. We present an SVMQR method for identifying differentially expressed genes with a small number of replicated microarrays. We applied our SVMQR method to three real cDNA microarray datasets and a simulated dataset, and compared its performance with that of the fold change (FC) rule, Newton's method, and the significance analysis of microarrays (SAM) method [14].

2. Materials and Methods

2.1 Support Vector Machine Quantile Regression (SVMQR)
Conditional quantile estimation has long been studied in the literature. The most commonly used approach is the quantile regression introduced by Koenker and Bassett [6]. In this section we review nonlinear quantile regression implemented with the idea of SVM [15]. Consider a random sample $\{x_i, y_i\}_{i=1}^{n}$ with input vector $x_i \in R^d$ and output variable $y_i \in R$. Here the output variable $y_i$ is related to the vector $x_i$ of covariates, possibly including a constant 1. In the nonlinear quantile regression model, the quantile function of the response $y_i$ for a given $x_i$ is assumed to be nonlinearly related to the input vector $x_i \in R^d$. To allow for nonlinear quantile regression, the input vectors $x_i$ are nonlinearly transformed into a potentially higher-dimensional feature space F by a nonlinear mapping function $\phi(\cdot)$. The quantile function of the response $y_i$ for a given $x_i$ can be given as

$Q(\theta \mid x_i) = w_\theta^t \phi(x_i)$ for $\theta \in (0, 1)$, (1)

where $w_\theta$ is the $\theta$th regression quantile. Here, as in SVM for nonlinear regression, the nonlinear regression quantile estimator cannot be given in an explicit form, since we use the kernel function of the input vectors instead of the dot product of their feature mapping functions, except for the identity feature mapping function $\phi(x) = x$. Its estimator is defined as any solution to the optimization problem [6]

$\min_{w_\theta} \sum_{i=1}^{n} \rho_\theta\left(y_i - w_\theta^t \phi(x_i)\right)$ for $\theta \in (0, 1)$, (2)

where $\rho_\theta$ is the check function defined as $\rho_\theta(r) = \theta r I(r \ge 0) + (\theta - 1) r I(r < 0)$; here $I(\cdot)$ is the indicator function, that is, I(true) = 1 and I(false) = 0.

Since quantile regression is in principle based on the absolute deviation loss, quantile regression is derived from the idea of SVM by adopting the case $\varepsilon = 0$ of a standard SVM. With $x_i = (1, x_i^t)^t$, the quantile regression problem in the SVM formulation can be expressed as

Minimize $\frac{1}{2} w_\theta^t w_\theta + C \sum_{i=1}^{n} \rho_\theta\left(y_i - w_\theta^t \phi(x_i)\right)$ for $\theta \in (0, 1)$. (3)

The regularization parameter C > 0 determines the trade-off between the flatness of the quantile function estimate and the amount up to which deviations larger than 0 are tolerated. By introducing slack variables $\xi_i$, $\xi_i^*$, we can rewrite (3) as the following optimization problem:

Minimize $\frac{1}{2} w_\theta^t w_\theta + C \sum_{i=1}^{n} \left(\theta \xi_i + (1 - \theta) \xi_i^*\right)$ for $\theta \in (0, 1)$, (4)

subject to $y_i - w_\theta^t \phi(x_i) \le \xi_i$, $w_\theta^t \phi(x_i) - y_i \le \xi_i^*$, and $\xi_i, \xi_i^* \ge 0$,

where $\xi_i$ is the upper training error, $\xi_i^*$ is the lower training error, and C > 0 is the regularization parameter. Equation 4 corresponds to dealing with an absolute deviation loss function. Since the symmetry of the absolute value yields the median, simply giving different weights to positive and negative residuals yields the quantiles by minimizing a sum of asymmetrically weighted absolute residuals. This is indeed the case of finding quantile regression. Solving (4) under the constraints yields the $\theta$th sample quantile as its solution. The second term of Equation 4 is, in fact, the tilted absolute value function. The Lagrange function is constructed as can be seen in Figure 1.

Fig. 1 Equation 5: Lagrange function

Fig. 2 Equation 6: dual optimization problem with kernel function K(·,·)
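The role of the check function in (2) can be illustrated numerically. The following is a minimal sketch in Python with NumPy (a language chosen here only for illustration; the function names `check_loss` and `best_constant` and the toy data are hypothetical and not from the paper). It evaluates the check loss and searches the observed values for the constant that minimizes the summed check loss, which for a single constant fit is a sample quantile.

```python
import numpy as np

def check_loss(r, theta):
    """Check function: theta * r for r >= 0 and (theta - 1) * r for r < 0."""
    r = np.asarray(r, dtype=float)
    return np.where(r >= 0, theta * r, (theta - 1.0) * r)

def best_constant(y, theta):
    """Return the observed value c that minimizes sum(check_loss(y - c, theta))."""
    y = np.asarray(y, dtype=float)
    totals = [check_loss(y - c, theta).sum() for c in y]
    return y[int(np.argmin(totals))]

# Toy sample with one large outlier (illustrative values only).
y = np.array([1.0, 2.0, 3.0, 4.0, 100.0])
median_fit = best_constant(y, 0.5)   # symmetric weights give the median
upper_fit = best_constant(y, 0.9)    # heavier weight on positive residuals
```

With symmetric weights the minimizer is the median, while a larger $\theta$ penalizes positive residuals more and pushes the fit toward the upper tail.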


461 SVQM Regression Methods for Microarray Analysis

Notice that the positivity constraints $\alpha_i, \alpha_i^*, \eta_i, \eta_i^* \ge 0$ should be satisfied. After taking partial derivatives of Equation 5 with regard to the primal variables $(w_\theta, \xi_i, \xi_i^*)$ and plugging them into Equation 5, the dual optimization problem with kernel function K(·,·) is obtained as can be seen in Figure 2, subject to $\alpha_i \in [0, \theta C]$ and $\alpha_i^* \in [0, (1 - \theta) C]$. By writing $\alpha_i$ for $\alpha_i - \alpha_i^*$, it is possible to rewrite the above dual problem as follows:

Maximize $-\frac{1}{2} \sum_{i,j=1}^{n} \alpha_i \alpha_j K(x_i, x_j) + \sum_{i=1}^{n} \alpha_i y_i$ (7)

subject to $\alpha_i \in [(\theta - 1) C, \theta C]$. Solving the dual optimization problem with the constraints determines the optimal Lagrange multipliers $\alpha_i$; the $\theta$th regression quantile estimators and the $\theta$th quantile function predictors of the input vector x are then obtained, respectively, as

$w_\theta = \sum_{i=1}^{n} \alpha_i \phi(x_i)$ and $Q(\theta \mid x) = \sum_{i=1}^{n} \alpha_i K(x_i, x)$. (8)

Here, $w_\theta$ and $Q(\theta \mid x)$ depend implicitly on $\theta$ through $\alpha_i$, which depends on $\theta$. We use a Gaussian kernel function, which is the most commonly used kernel and is defined as $K(x_i, x_j) = \exp(-\|x_i - x_j\|^2 / \sigma^2)$, where $\sigma$ is the kernel parameter. The kernel parameter is determined by generalized approximate cross validation (GACV).

2.2 SVMQR Using IRWLS

In this section we propose an IRWLS procedure to estimate SVMQR. This procedure makes it possible to derive GACV for parameter selection of SVMQR; thus we also illustrate the model selection method that uses GACV to choose appropriate parameters of SVMQR. The IRWLS procedure basically uses differentiation of the check function $\rho_\theta$. To overcome the nondifferentiability of $\rho_\theta$ at 0, Nychka et al. [16] suggested employing the modified check function $\rho_{\theta,\delta}$ instead, which differs from $\rho_\theta$ only in the region $(-\delta, \delta)$, where

$\rho_{\theta,\delta}(r) = \rho_\theta(r)$ if $|r| \ge \delta$; $\theta r^2 / \delta$ if $0 \le r < \delta$; $(1 - \theta) r^2 / \delta$ if $-\delta < r < 0$. (9)

By setting $\delta$ small enough, we can get a good approximate solution to (3). Substituting (8) and (9) into (3), $\alpha$ is obtained by minimizing

$L(\alpha) = \frac{1}{2} \alpha^t K \alpha + C \sum_{i=1}^{n} \rho_{\theta,\delta}(y_i - K_i \alpha)$ for $\theta \in (0, 1)$, (10)

where $K_i$ is the ith row of the kernel matrix K. Taking partial derivatives of (10) with regard to $\alpha$ leads to the optimal values of $\alpha$ being the solution to

$0 = K\alpha - CKWy + CKWK\alpha$. (11)

Here W is a diagonal matrix whose ith diagonal element is obtained from the derivative of the modified check function as

$W_{ii} = \rho'_{\theta,\delta}(r_i) / r_i$, that is, $\theta / r_i$ if $r_i \ge \delta$; $2\theta / \delta$ if $0 \le r_i < \delta$; $2(1 - \theta) / \delta$ if $-\delta < r_i < 0$; $(\theta - 1) / r_i$ if $r_i \le -\delta$, (12)

where $r_i = y_i - K_i \alpha$. The solution to (11) cannot be obtained in a single step, since W itself contains $\alpha$. Thus we apply the IRWLS procedure, which starts with initialized values of $\alpha$, as follows:
1) Calculate W with $\alpha$.
2) Calculate $\alpha$ from $\alpha = (K/C + KWK)^{-1} KWy$.
3) Iterate steps 1) and 2) until convergence.

The problem of choosing the smoothing parameters is ubiquitous in function estimation. Thus we now illustrate the model selection method that chooses the appropriate parameters of SVMQR. The functional structure of SVMQR is characterized by the penalty constant C and the kernel parameter $\sigma$. To choose the parameters of SVMQR we first need to consider the cross validation (CV) function

$CV(\lambda) = \frac{1}{n} \sum_{i=1}^{n} \rho_\theta\left(y_i - Q_\lambda^{(-i)}(\theta \mid x_i)\right)$, (13)

where $\lambda$ is the set of parameters and $Q_\lambda^{(-i)}(\theta \mid x_i)$ is the quantile function estimated without the ith observation. But the computational cost associated with the CV function is formidable, since for each candidate set of parameters $Q_\lambda^{(-i)}(\theta \mid x_i)$ must be evaluated for i = 1, ..., n. Thus we adopt the GACV derived by Muan [17] as a remedy. Muan [17] proposed GACV for the selection of the smoothing parameter for quantile smoothing spline estimates,

$GACV(\lambda) = \frac{\sum_{i=1}^{n} \rho_\theta\left(y_i - Q_\lambda(\theta \mid x_i)\right)}{n - \mathrm{trace}(H)}$, (14)

where H is the hat matrix such that $Q(\theta \mid x) = Hy$, with the (i, j)th element $h_{ij} = \partial Q(\theta \mid x_i) / \partial y_j$. This GACV function cannot be applied to SVMQR using QP, since the hat matrix H is not computable. But it can be applied to SVMQR using IRWLS, since H can be obtained from (8) and (11) as

$H = K (K/C + KWK)^{-1} KW$. (15)

Thus our proposed GACV is given as

$GACV(\lambda) = \frac{\sum_{i=1}^{n} \rho_\theta\left(y_i - Q_\lambda(\theta \mid x_i)\right)}{n - \mathrm{trace}(H)}$, (16)

where $\lambda$ is the set of the penalty constant and the kernel parameter and H is the hat matrix in (15).
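The IRWLS steps and the GACV score of this section can be sketched in a few lines of Python with NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the weights assume the modified check function takes the quadratic form $\theta r^2/\delta$ (or $(1-\theta) r^2/\delta$) inside $(-\delta, \delta)$; the Gaussian kernel is assumed to be scaled as $\exp(-\|x - z\|^2/\sigma^2)$; the update solves $(I/C + WK)\alpha = Wy$, which agrees with $\alpha = (K/C + KWK)^{-1}KWy$ when K is nonsingular and is better conditioned numerically. All function names, parameter values, and the toy data are hypothetical.

```python
import numpy as np

def gaussian_kernel(X1, X2, sigma):
    """Assumed kernel form: exp(-||x - z||^2 / sigma^2) for all pairs of rows."""
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / sigma**2)

def irwls_weights(r, theta, delta):
    """Diagonal of W: derivative of the modified check function divided by r."""
    side = np.where(r >= 0, theta, 1.0 - theta)
    abs_r = np.abs(r)
    return np.where(abs_r >= delta, side / np.maximum(abs_r, delta), 2.0 * side / delta)

def fit_svmqr_irwls(K, y, theta, C, delta=1e-3, max_iter=100, tol=1e-6):
    """Alternate step 1 (recompute W) and step 2 (re-solve for alpha)."""
    n = len(y)
    alpha = np.zeros(n)
    for _ in range(max_iter):
        w = irwls_weights(y - K @ alpha, theta, delta)
        # (I/C + W K) alpha = W y; w[:, None] * K scales the rows of K by W.
        A = np.eye(n) / C + w[:, None] * K
        new_alpha = np.linalg.solve(A, w * y)
        converged = np.max(np.abs(new_alpha - alpha)) < tol
        alpha = new_alpha
        if converged:
            break
    # Hat matrix at the final weights: H = K (I/C + W K)^{-1} W.
    w = irwls_weights(y - K @ alpha, theta, delta)
    A = np.eye(n) / C + w[:, None] * K
    H = K @ np.linalg.solve(A, np.diag(w))
    return alpha, H

def check_loss(r, theta):
    return np.where(r >= 0, theta * r, (theta - 1.0) * r)

def gacv(y, fitted, H, theta):
    """Sum of check losses divided by n - trace(H)."""
    return check_loss(y - fitted, theta).sum() / (len(y) - np.trace(H))

# Toy heteroscedastic data (illustrative only, not microarray data).
rng = np.random.default_rng(0)
x = np.linspace(0.0, 3.0, 60)[:, None]
y = np.sin(2.0 * x[:, 0]) + rng.normal(scale=0.2 + 0.3 * x[:, 0])
K = gaussian_kernel(x, x, sigma=1.0)
alpha_lo, H_lo = fit_svmqr_irwls(K, y, theta=0.1, C=10.0)
alpha_hi, H_hi = fit_svmqr_irwls(K, y, theta=0.9, C=10.0)
lower, upper = K @ alpha_lo, K @ alpha_hi
score_hi = gacv(y, upper, H_hi, theta=0.9)
```

In practice the GACV score would be evaluated over a grid of candidate (C, $\sigma$) pairs for each $\theta$, and the pair with the smallest score retained.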

2.3 Statistical Methods for Identifying Differential Gene Expression

In this section, we review several statistical methods for identifying differential expression in microarray data. For a gene (spot), let R and G denote the measured fluorescence intensities for the red and green dyes, respectively. The gene expression data consist of log-intensity ratios $M_{ij} = \log_2(R_{ij}/G_{ij})$, where i = 1, 2, ..., p (genes) and j = 1, 2, ..., n (samples). Denote the mean and the standard deviation of $M_{ij}$ for gene i as $\bar{M}_i$ and $s_i$, respectively.

2.3.1 Fold Change

When a nonzero fold change m is stated, a gene i must satisfy $\bar{M}_i \ge \log_2 m$ to be considered a positive significant gene and $\bar{M}_i \le \log_2(1/m)$ to be considered a negative significant gene.

2.3.2 Newton's Method

Newton's method [4] computes the posterior odds of change at each gene. The odds summarize inference about actual differential expression at each gene using all the data on the microarray. With $D = \{R_i, G_i\}$ denoting the expression measurements on the whole microarray, the posterior odds of change at gene i is

$\mathrm{odds}_i = P(z_i = 1 \mid D) / P(z_i = 0 \mid D)$,

where the binary indicator variable $z_i$ is equal to 0 unless there is true differential expression and $P(z_i = 1 \mid D) = \int_0^1 P(z_i = 1 \mid p, R_i, G_i) P(p \mid D)\, dp$. This follows from the conditional independence of the data at different genes given the parameter p. For details, see Newton et al. [4].

2.3.3 SAM (Significance Analysis of Microarrays)

The SAM model gives a score to each gene based on the change of the gene expression relative to the standard deviation of repeated measurements [19]. For genes with scores beyond an adjustable threshold, SAM uses permutations of the repeated measurements to estimate the percentage of genes identified by chance, the false discovery rate (FDR). The SAM statistic of gene i is $d_i = \bar{M}_i / (s_i + s_0)$, where $s_0$ is a fudge factor. For details, see Tusher et al. [14].

2.3.4 SVMQR Method

Let $SVMQ_\theta(A_i)$ and $SVMQ_{1-\theta}(A_i)$ be the SVM $\theta$th quantile regression lower curve and the $(1 - \theta)$th quantile regression upper curve, respectively, as functions of the A-value ($= \log_2 \sqrt{R_i G_i}$) of gene i. A gene i must satisfy $M_i - SVMQ_{1-\theta}(A_i) > 0$ to be called a positive significant gene and $M_i - SVMQ_\theta(A_i) < 0$ to be called a negative significant gene. The regularization parameter C and the kernel parameter $\sigma$ of the Gaussian kernel are chosen by 10-fold cross validation in the implementation of SVMQR.

2.4 Datasets

2.4.1 Microarray Dataset of a Diet-induced Obese (DIO) Mouse Model

The experimental group consisted of six mice fed a high-fat diet (HFD) for 12 weeks. The control group consisted of six age- and weight-matched mice fed a low-fat diet (LFD) for 12 weeks. Equal amounts of RNA from the six mice of each group were pooled. Each sample was divided equally: one half was used to generate Cy3-labeled cDNA, and the other half was used to generate Cy5-labeled cDNA for dye swapping. Six replicates of hybridization were performed. Three of these were repeated with the fluorophores reversed to prevent dye bias. The Cy5 and Cy3 probes were mixed and hybridized to a microarray containing 10,336 cDNA probes. Probes were spotted onto glass slides using a 4 × 8 print head. Two fluorescent images (Cy3 and Cy5) were scanned separately using a GMS 418 Array Scanner (Affymetrix, Santa Clara, CA, USA). Signal intensity values were obtained from the ImaGene 4.2 (Biodiscovery, Santa Monica, CA, USA) and MAAS (Gaiagene, Seoul, Korea) software applications. At first, 3996 genes were excluded by the following criteria: 1) The PCR amplification of the sequence spotted on the array was deemed acceptable only if the amplification was confirmed and a single size product was obtained. 2) Accurate printing of each spot was required, as shown by an emission signal from more than 40% of the spot area. 3) The signal from the fluorophore labels had to be higher than 28. The datasets were further processed by print-tip-dependent normalization and dye-swap normalization. The k-nearest neighbor (KNN) method was used to fill in the missing values of the datasets. The final output datasets were composed of 6340 genes. Sohn et al. [15] used this microarray dataset of the DIO mouse model.

2.4.2 Microarray Dataset of E. coli Data

We reanalyzed two slides of the E. coli data [18] used by Newton et al. [4]. The two microarrays were replicates of treatment with isopropyl-beta-D-thiogalactopyranoside (IPTG), for which common protocol suggests that only a few transcripts should be induced. Control RNA (labeled Cy3) was cohybridized with RNA (labeled Cy5) from E. coli treated with IPTG. Richmond et al. [18] provide details. Following these authors, we use a simple normalization method: on each microarray, we first subtract the background intensity from each spot, and then divide each adjusted intensity by the total intensity obtained by combining all positive adjusted measurements.

2.4.3 Microarray Dataset of High-density Lipoprotein (HDL)-deficient Mouse

This study examined gene expression in a mouse model with a low level of HDL. Expression data for the apoAI knockout mouse were obtained from http://www.stat.berkeley.edu/users/terry/zarray/Html/matt.html [19]. The dataset contained three hybridizations of Cy5-labeled mutant mouse mRNA, each hybridized against a pool of Cy3-labeled mRNA from the same three wild-type mice. A total of 6384 genes and controls were present on each array. Sohn et al. [15] used this microarray dataset of the HDL-deficient mouse.
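The fold change rule and the SVMQR rule of Section 2.3 differ only in the threshold applied to the log-ratio M: a constant cutoff for FC and an intensity-dependent quantile curve for SVMQR. The sketch below, in Python with NumPy, makes the contrast concrete. The function names and intensity values are hypothetical, and the `upper` and `lower` arrays are made-up placeholder values standing in for fitted quantile curves evaluated at each gene's A-value; they are not fitted results.

```python
import numpy as np

def ma_values(R, G):
    """Log-ratio M = log2(R/G) and average log-intensity A = log2(sqrt(R*G))."""
    M = np.log2(R / G)
    A = 0.5 * np.log2(R * G)
    return M, A

def fold_change_calls(M, m=2.0):
    """+1 if M >= log2(m), -1 if M <= log2(1/m), otherwise 0."""
    cut = np.log2(m)
    return np.where(M >= cut, 1, np.where(M <= -cut, -1, 0))

def quantile_curve_calls(M, upper, lower):
    """+1 if M lies above the upper curve, -1 if below the lower curve, otherwise 0."""
    return np.where(M > upper, 1, np.where(M < lower, -1, 0))

# Toy red and green intensities for four genes (illustrative values only).
R = np.array([400.0, 100.0, 50.0, 1600.0])
G = np.array([100.0, 100.0, 200.0, 1000.0])
M, A = ma_values(R, G)

fc = fold_change_calls(M, m=2.0)

# Placeholder curve values at each gene's A-value (hypothetical, not fitted).
upper = np.array([2.5, 1.0, 1.0, 0.5])
lower = -upper
qc = quantile_curve_calls(M, upper, lower)
```

In this toy example the first gene passes the twofold cutoff but stays inside a wide band, while the fourth gene falls short of twofold yet exceeds a narrow band, which is the kind of disagreement between the two rules that the comparison in the Results examines.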

3. Results
In this section, we apply our SVMQR method to three real cDNA microarray datasets and a simulated dataset. We compare the performance of our method with that of the fold change (FC) rule, Newton's method, and the significance analysis of microarrays (SAM) method [14].

3.1 Real Data Analysis


The comparison of the performance of algorithms for identifying differentially expressed genes is not an easy task as it is usually not known which genes are the true positives for a specific biological sample. Even the verification of microarray results by any conventional technique such as quantitative RT-PCR is just replacing one error-prone method by another [15]. If a gene is likely to be detected as differentially expressed each time the same experiment is repeatedly performed, then it might be defined as a true positive gene. We now use this concept to analyze the above three datasets. In addition, we assess the quality and consistency of the results using previously established biological knowledge.
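One way to quantify this repeat-detection concept is the repeat recovery rate reported in Table 1: the share of genes flagged on at least one slide that are flagged on every slide. The following Python sketch (hypothetical function name and toy call matrix) computes it, and then rechecks two of the percentages using only the counts given in Table 1 for the DIO mouse model.

```python
import numpy as np

def repeat_recovery_rate(calls):
    """calls: boolean array (genes x slides), True where a gene is flagged.

    Returns (genes flagged on any slide, genes flagged on all slides, percent).
    """
    calls = np.asarray(calls, dtype=bool)
    n_any = int(calls.any(axis=1).sum())
    n_all = int(calls.all(axis=1).sum())
    rate = 100.0 * n_all / n_any if n_any else float("nan")
    return n_any, n_all, rate

# Toy example: five genes on three slides (illustrative values only).
toy = np.array([[1, 1, 1],
                [1, 0, 0],
                [0, 0, 0],
                [0, 1, 1],
                [1, 1, 1]], dtype=bool)
n_any, n_all, rate = repeat_recovery_rate(toy)

# Counts taken from Table 1 (DIO mouse model).
svmqr_rate = 100.0 * 96 / 759
fc_rate = 100.0 * 106 / 1286
```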

3.1.1 Diet-induced Obese (DIO) Mouse Model

If a true positive gene is one that has an increased likelihood of being detected as differentially expressed in any repetition of the experiment, a good algorithm is one that consistently detects differentially expressed genes across independently repeated hybridization experiments. In a first attempt to evaluate the consistency of our algorithm, we compared the performance of the SVMQR method, gene selection based on FC values, and Newton's method, using our microarray data from the DIO mouse model [20]. We selected the C and $\sigma$ parameter values using the GACV function for $\theta = 0.975$ and $\theta = 0.025$, respectively. For Newton's method, we selected genes whose log posterior odds were higher than 0. The DIO mouse model is illustrated with the M vs. A plot in Figure 3, where the log-ratios are given by $M = \log_2(R/G)$ and the average log-intensity by $A = \log_2 \sqrt{RG}$. Figure 3 shows the upper and lower quantiles of the SVMQR method, the twofold change, and contours for Newton's method. The upper and lower curves stand for $\theta = 0.975$ and $\theta = 0.025$, respectively. The log posterior odds of change of 1:1, 10:1, and 100:1 of Newton's method are indicated as 0, 1, and 2, respectively. This MA plot shows a tendency of increasing dispersion of the log-ratio M as the spot intensity A decreases. The SVMQR lines have narrower spacing in the lower ranges of intensity, but have wider spacing in the higher ranges of intensity. The conditional distribution of the log-ratio M may be asymmetric and heteroscedastic.

Fig. 3 An MA plot comparing HFD vs. LFD groups. M represents the log ratio of the two fluorescent dyes used to label probes. A represents the averaged logarithmic intensity. The SVMQR curves represent $\theta = 0.025$ and $\theta = 0.975$, respectively. The log posterior odds of change of 1:1, 10:1, and 100:1 are indicated as 0, 1, and 2, respectively.

The number of significant genes from at least one of three slides by the three different methods and the number of repeated detections are shown in Table 1. It can be seen that the FC and the SVMQR detect about the same number of differentially expressed spots when the upper and lower quantiles for the data were $\theta = 0.975$ and $\theta = 0.025$, respectively, and the fold change cutoff was twofold. According to the repeat recovery rate, i.e., the percentage of spots that are also identified as differentially regulated in the corresponding spots of the second and third slides, the performance of the SVMQR was slightly better; that is, it detected differentially expressed genes more consistently in the three repeated slides. In FC, this rate was 8.2%, which means that about 8% of the detected genes were found simultaneously in the three repeated slides. The corresponding rate in the SVMQR is 12.6%. The Newton method identified only a few significant genes with three replicates and was not able to identify many of the differentially expressed genes that were detected by the FC or SVMQR method.

Table 1 The number of significant genes from at least one of three slides using three different methods and the number of repeated detections in the analysis

The diet-induced obese mouse model
Method    Cut-off threshold          Significant in at least one of three slides    Repeated detection in three slides (repeat recovery rate)
SVMQR     >0.975 or <0.025           759                                            96 (12.6%)
FC        fold change > 2            1286                                           106 (8.2%)
Newton    log posterior odds > 0     145                                            12 (8.2%)

High-density lipoprotein-deficient mouse model
Method    Cut-off threshold          Significant in at least one of three slides    Repeated detection in three slides (repeat recovery rate)
SVMQR     >0.975 or <0.025           772                                            98 (12.7%)
FC        fold change > 2            984                                            90 (9.1%)
Newton    log posterior odds > 0     263                                            25 (9.0%)

Figure 4 is the Venn diagram showing the number of genes identified as differentially regulated by the three methods in the DIO mouse model. The number of significant genes selected by the SVMQR and FC methods was 96 and 106, respectively. Sixty-nine genes were commonly selected as significant by both methods. The lists of significant genes selected by the SVMQR or FC method are presented in Table 2.

Fig. 4 A comparison among three methods using the microarray dataset of the diet-induced obese mouse model. A Venn diagram shows the number of genes identified by each experimental method when using a cutoff of twofold for the FC method, $\theta = 0.975$ and $\theta = 0.025$ for the SVMQR method, and log posterior odds > 0 for the Newton method.

Next, we assessed the quality of the results using previously established biological knowledge. According to their biological function, several interesting and important genes were identified by our SVMQR method. Cytochrome P450, family 4, subfamily a, polypeptide 14 (Mm.250901) is a good example. Previous studies showed that this gene is likely to be functionally relevant for a DIO mouse model [21, 22]. It was found differentially regulated in the DIO mouse model only by the SVMQR method. Genes involved in metabolism, such as glycerol-3-phosphate acyltransferase, mitochondrial; lactate dehydrogenase 1, A chain; and glucose phosphate isomerase 1 complex, were also found differentially expressed in the DIO mouse model by the SVMQR method, but were missed by the FC.

3.1.2 E. coli Model

In the second analysis, we compared the performance of the SVMQR, the FC, and the Newton method on the E. coli data. For the SVMQR, the selected C, $\sigma$, and $\theta$ values were the same as in the analysis of the DIO mouse model. The number of significant genes from at least one of two slides by the three different methods and the number of repeated detections are shown in Table 2. According to the repeat recovery rate, the performance of the SVMQR was better; that is, it detected differentially expressed genes more consistently in the two repeated slides. The Newton method identified only a few significant genes with two replicates and was not able to identify many of the differentially expressed genes that were detected by the FC or SVMQR method. Figure 5 is the Venn diagram showing the number of genes identified as differentially regulated by the three methods in the E. coli model. The number of significant genes selected by the SVMQR and FC methods was 117 and 96, respectively. Sixty-eight genes were commonly selected as significant by both methods.

Table 2 The number of significant genes from at least one of two slides using three different methods and the number of repeated detections in the analysis on E. coli data

The E. coli model
Method    Cut-off threshold          Significant in at least one of two slides    Repeated detection in two slides (repeat recovery rate)
SVMQR     >0.975 or <0.025           351                                          117 (33.3%)
FC        fold change > 2            460                                          96 (20.7%)
Newton    log posterior odds > 0     18                                           1 (5.5%)

Fig. 5 A comparison among three methods using the microarray dataset of the E. coli model. A Venn diagram shows the number of genes identified by each experimental method when using a cutoff of twofold for the FC method, $\theta = 0.975$ and $\theta = 0.025$ for the SVMQR method, and log posterior odds > 0 for the Newton method.

3.1.3 The High-density Lipoprotein (HDL)-deficient Mouse Model

In the third analysis, we compared the performance of the SVMQR, the FC, and the SAM on expression data from the mouse model with a low level of HDL. For the SVMQR, the selected C, $\sigma$, and $\theta$ values were the same as in the analysis of the DIO mouse model. For SAM, we selected the top-ranking genes in order of absolute value. The FC and SVMQR methods selected differentially expressed genes when the upper and lower quantiles for the data were $\theta = 0.975$ and $\theta = 0.025$, respectively, and the twofold change cutoff was used. The SAM selected the top 19 significant genes. Figure 6 is the Venn diagram showing the number of genes identified as differentially regulated by the three methods in the HDL-deficient mouse model. The numbers and the names of these genes are presented in Table 3. Since the underlying biology of the experiment is well understood, we can use it to assess the quality of the algorithms. The genes previously verified as differentially expressed in the HDL-deficient mouse were detected by at least two different methods: 1) EST, highly similar to APOLIPOPROTEIN A-I PRECURSOR (Mus musculus), lipid-UG; 2) ApoAI, lipid-Img; 3) ESTs, highly similar to APOLIPOPROTEIN C-III PRECURSOR (Mus musculus), lipid-UG; 4) EST, weakly similar to C-5 STEROL DESATURASE (Saccharomyces cerevisiae), lipid-UG; 5) EST (1496); 6) Apo CIII, lipid-Img; 7) CATECHOL O-METHYLTRANSFERASE, MEMBRANE-BOUND FORM, Brain-Img; and 8) EST, similar to yeast sterol desaturase, lipid-Img [23]. Long-chain fatty acyl CoA synthetase mRNA, complete cds, lipid-UG MDB0368, which is likely to be functionally relevant for an HDL-deficient mouse model [24, 25], was found differentially expressed by the SVMQR method, but was missed by the FC and SAM methods.

Fig. 6 A comparison among the three methods using the microarray dataset of the HDL-deficient mouse model. A Venn diagram shows the numbers of genes identified by each experimental method when using a cutoff of twofold for the FC method, $\theta = 0.975$ and $\theta = 0.025$ for the SVMQR method, and selecting top 22 significant genes for the SAM. The list of genes identified by each experimental method is presented in Table 3.

3.2 Simulation Study


We carried out a simulation study to evaluate the SVMQR method. We simulated an artificial dataset as in Balagurunathan et al. [26], Fujita et al. [27], and Haldermans et al. [28]. The simulated data were generated by the following steps: 1) The true expression signal is generated from an exponential distribution with parameter 1/3000. The red (R) and green (G) channel intensities for each gene are simulated from a normal distribution whose mean is the true expression signal and whose standard deviation is 15% of that mean. 2) We select 10% of the genes to be either over- or under-expressed. The selected genes have a targeted expression ratio that was generated by t = 10 b

Table 3 A comparison of the differentially expressed genes in an HDL-deficient mouse model identified by the support vector machine quantile regression (Q), fold change (FC), and significance analysis of microarrays (SAM) methods. * The genes were confirmed by biological methods.

ID: 540, 2149, 2537, 4139, 1496, 5356, 4941, 1337, 2932, 2989, 4390, 4533, 4942, 5188, 5249, 5731, 6050, 6117, 6134

NAME:
EST, Highly similar to APOLIPOPROTEIN A-I PRECURSOR [Mus musculus], lipid-UG*
Apo AI, lipid-Img*
ESTs, Highly similar to APOLIPOPROTEIN C-III PRECURSOR [Mus musculus], lipid-UG*
EST, Weakly similar to C-5 STEROL DESATURASE [S. cerevisiae], lipid-UG*
EST*
CATECHOL O-METHYLTRANSFERASE, MEMBRANE-BOUND FORM, Brain-Img*
EST, Similar to yeast sterol desaturase, lipid-Img*
Psoriasis-associated fatty acid binding protein, lipid-Img
Mus musculus long chain fatty acyl CoA synthetase mRNA, complete cds, lipid-UG MDB0368
Cy3RT
EST, Highly similar to CALCINEURIN B SUBUNIT [Drosophila melanogaster], heart-UG
ORPHAN NUCLEAR RECEPTOR OF STEROID/THYROID SUPERFAMILY, Brain-Img
Cy3RT
5' similar to SW:ACT_VOLCA P20904 ACTIN. ;. gi|2186634|gb|AA461743|AA461743 [2186634]
Mus musculus paraoxonase-3 (Pon3) mRNA, complete cds, lipid-UG
5'. gi|2186640|gb|AA461749|AA461749 [2186640]
EST, Highly similar to CATECHOL O-METHYLTRANSFERASE, MEMBRANE-BOUND FORM [R. norvegicus], Brain-UG
Mouse MAPK mRNA for mitogen-activated protein kinase (p42), heart-UG

Columns Q, FC, and SAM: per-gene marks not recoverable.

Methods Inf Med 5/2008

Downloaded from www.methods-online.com on 2011-12-17 | IP: 129.215.5.255 For personal or educational use only. No other uses without permission. All rights reserved.

466 Sohn et al.

Fig. 7 Two patterns for the simulation study. Left panels show MA plot before Lowess normalization, and right panels show MA plot after Lowess normalization. denotes differently expressed genes.

ated external validation using the simulated data. We considered two models. In the first model (Model 1 in Table 4), a training dataset and a second dataset are a sinusoid shape and a banana shape, respectively. In the second model (Model 2 in Table 4), a training dataset and a second dataset are a banana shape and a sinusoid shape, respectively. We generated 500 genes and Lowess normalization [29] was done. The training dataset is used to fit the models and the second dataset is used to estimate the true predictive performance [30]. This procedure was repeated 100 times. We compared both average sensitivity and specificity from the simulated data. We selected C and parameter values using GACV function for = 0.975 and = 0.025, respectively. For Newtons method, we selected genes whose posterior odd values were higher than 0. Table 4 shows the average number of genes selected, the average number of true genes selected, average sensitivity, and average specificity from the simulated data. As shown in Table 4, SVMQR method gives a higher average sensitivity but a little lower average specificity than Newtons method. Although SVMQR method gives a little lower average specificity than Newtons method, Newtons method missed many true genes.

Model 1 SVMQR Average number of genes selected Average number of true genes selected Average sensitivity Average specificity 50.76 22.96 0.45 0.94 Newton 18.76 16.48 0.32 0.99

Model 2 SVMQR 46.64 23.8 0.47 0.94 Newton 18.72 17.58 0.35 0.99

Table 4 The average number of genes selected, the average number of true genes selected, average sensitivity, and average specificity from the simulated data

4. Discussion
In this paper, we proposed support vector quantile regression (SVMQR) using iterative reweighted least squares (IRWLS) procedure based on the Newton method and new SVMQR method for identifying differentially expressed genes with a small number of replicated microarrays. In microarray studies, gene selection based on foldchange (FC) values is often misleading especially when the error variability for each gene is heterogeneous over the intensity ranges. The FC values calculated from the measured intensity levels may give a different interpretation for a gene whose absolute expression level is low. The old methods, such as by Chen et al.[3] and Newton et al.[4], are based on the assumed parametric models (e.g. Gamma or Gaussian) for the (R, G) intensities, but these assumptions

where b satisfies a beta distribution, b B(1.7, 4.8). R and G intensities of these genes then are converted by R = R t and G = G/t. 3) In order to transform these intensities to nonlinear patterns, Rand Gintensities of all genes are converted by

and
Methods Inf Med 5/2008

(17)

The two patterns were considered. The first pattern is a sinusoid shape with parameters 0 1 2 3 (a1 = 0, a1 = 100 1/0.9, a1 = 0.9, a1 = 1) and 0 = 0, a1 = 100 1/0.7, a2 = 0.7, a3 = 1) in (a1 1 1 1 Equation 17 (see Fig. 7a). The second pattern is a banana shape with parameters 0 1 2 3 (a1 = 0, a1 = 10, a1 = 1, a1 = 1) and 0 1 2 3 (a1 = 0, a1 = 500, a1 = 1, a1 = 1) (see Fig. 7c). To investigate relative performances of SVMQR and Newtons methods, we evalu-

Downloaded from www.methods-online.com on 2011-12-17 | IP: 129.215.5.255 For personal or educational use only. No other uses without permission. All rights reserved.

467 SVQM Regression Methods for Microarray Analysis

seem to be too strong for routine data analysis use. However, our SVMQR method deals with the estimation of the th quantile of the log-ratios (M = log2 (R/G)) given the average log-intensity (A = log2 RG). Therefore, if we use the information on the quantile of the log-ratios (M = log2 (R/G)) for identifying differentially expressed genes, for data with heteroscedasticity (Fig. 3), SVMQR method performs much better than the fold change which uses only absolute log-ratios (M = log2 (R/G) and does not need the parametric assumptions. The SVMQR method was an exploratory method for cDNA microarray experiments to identify genes with different expression levels between two types of samples (e.g., tumor versus normal tissue). The SVMQR method performed well in the situation where error variability for each gene was heterogeneous in intensity ranges.
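As a rough illustration of the quantile-based selection discussed above, the following sketch simulates data along the lines of steps 1 and 2 of Section 3.2, computes M and A, flags genes outside intensity-dependent 0.025 and 0.975 quantile limits, and reports sensitivity and specificity. It is not the authors' implementation, and several choices are assumptions: the exponential parameter is read as a rate (mean 3000), over- and under-expression are taken to be equally likely, the conversion multiplies R by t and divides G by t as the text states, the nonlinear distortion of step 3 is omitted, and simple binned empirical quantiles stand in for the SVMQR fit. All function names are hypothetical.

```python
import numpy as np


def simulate_two_channel(n_genes=500, frac_de=0.10, seed=0):
    """Steps 1-2 of the simulation; the nonlinear distortion (step 3) is omitted."""
    rng = np.random.default_rng(seed)
    # Step 1: true signal, exponential with rate 1/3000 (assumed: mean 3000).
    signal = rng.exponential(scale=3000.0, size=n_genes)
    # Channel intensities: normal around the signal, SD = 15% of the signal.
    red = rng.normal(signal, 0.15 * signal)
    green = rng.normal(signal, 0.15 * signal)
    # Step 2: 10% of genes get ratio t = 10**(+/- b), b ~ Beta(1.7, 4.8).
    is_de = np.zeros(n_genes, dtype=bool)
    n_de = int(round(frac_de * n_genes))
    is_de[rng.choice(n_genes, size=n_de, replace=False)] = True
    b = rng.beta(1.7, 4.8, size=n_genes)
    sign = rng.choice([-1.0, 1.0], size=n_genes)  # assumed equally likely
    t = np.where(is_de, 10.0 ** (sign * b), 1.0)
    red, green = red * t, green / t
    # Safety floor so that logarithms are defined (assumption, rarely active).
    return np.maximum(red, 1e-6), np.maximum(green, 1e-6), is_de


def ma_transform(red, green):
    """M = log2(R/G) and A = log2(sqrt(R*G))."""
    return np.log2(red / green), 0.5 * np.log2(red * green)


def flag_by_binned_quantiles(m, a, lower=0.025, upper=0.975, n_bins=10):
    """Flag genes whose M falls outside the quantile limits of their A bin.

    Stand-in for a fitted conditional quantile curve, not SVMQR itself.
    """
    flagged = np.zeros(m.size, dtype=bool)
    for idx in np.array_split(np.argsort(a), n_bins):  # equal-count bins along A
        lo, hi = np.quantile(m[idx], [lower, upper])
        flagged[idx] = (m[idx] < lo) | (m[idx] > hi)
    return flagged


def sensitivity_specificity(flagged, is_de):
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    tp = np.sum(flagged & is_de)
    fn = np.sum(~flagged & is_de)
    tn = np.sum(~flagged & ~is_de)
    fp = np.sum(flagged & ~is_de)
    return tp / (tp + fn), tn / (tn + fp)


red, green, is_de = simulate_two_channel()
m, a = ma_transform(red, green)
flagged = flag_by_binned_quantiles(m, a)
sens, spec = sensitivity_specificity(flagged, is_de)
print(f"flagged={flagged.sum()}, sensitivity={sens:.2f}, specificity={spec:.2f}")
```

Because the limits are estimated separately within each intensity bin, a gene is judged against the spread of M at its own intensity level rather than against a single global fold-change cut-off. Note that within-sample binned quantiles always flag a fixed share of genes, so the numbers printed here are not comparable to Table 4.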
Acknowledgements
This work was supported by a Korea Science and Engineering Foundation Grant (R14-2003-002-01002-0).

References

1. Brown PO, Botstein D. Exploring the new world of the genome with DNA microarrays. The chipping forecast 1999; 21: 33-37.
2. DeRisi JL, Iyer VR, Brown PO. Exploring the metabolic and genetic control of gene expression on a genomic scale. Science 1997; 278: 680-686.
3. Chen Y, Dougherty ER, Bittner ML. Ratio-based decisions and the quantitative analysis of cDNA microarray image. Biomedical Optics 1997; 2: 364-374.
4. Newton MA, Kendziorski CM, Richmond CS, Blattner FR, Tsui KW. On differential variability of expression ratios: improving statistical inference about gene expression changes from microarray data. J of Com Bio 2001; 8: 37-52.
5. Koenker R, Bassett G. Regression Quantiles. Econometrica 1978; 46: 33-50.
6. Koenker R, Xiao Z. Inference on the quantile regression process. Econometrica 2002; 70 (4): 1583-1612.
7. Vapnik VN. The nature of statistical learning theory. New York: Springer; 1995.
8. Vapnik VN. Statistical Learning Theory. New York: Springer; 1998.
9. Gunn SR, Brown M, Bossley KM. Network performance assessment for neurofuzzy data modelling. Lecture Notes in Computer Science 1997; 208: 313-323.
10. Ripley BD. Neural networks and related methods for classification. Journal of Royal Statistical Society 1994; 56: 409-456.
11. Cristianini N, Shawe-Taylor J. Support Vector Regression. Cambridge University Press; 2000.
12. Gunn S. Support Vector Machines for Classification and Regression. ISIS Technical Report, University of Southampton; 1998.
13. Smola A, Schölkopf B. On a Kernel-Based Method for Pattern Recognition, Regression, Approximation and Operator Inversion. Algorithmica 1998; 22: 211-231.
14. Tusher VG, Tibshirani R, Chu G. Significance analysis of microarrays applied to the ionizing radiation response. Proc Natl Acad Sci 2001; 98: 5116-5121.
15. Sohn I, Kim S, Hwang C, Lee JW. New normalization methods using support vector machine quantile regression approach in microarray analysis. Computational Statistics and Data Analysis. In press.
16. Nychka D, Gray G, Haaland P, Martin D, O'Connell M. A Nonparametric Regression Approach to Syringe Grading for Quality Improvement. Journal of the American Statistical Association 1995; 90: 1171-1178.
17. Yuan M. GACV for quantile smoothing splines. Computational Statistics and Data Analysis 2006; 50: 813-829.
18. Richmond CS, Glasner JD, Mau R, Jin H, Blattner FR. Genome-wide expression profiling in Escherichia coli K-12. Nucleic Acids Res 1999; 27 (19): 3821-3835.
19. Dudoit S, Yang YH, Speed TP, Callow MJ. Statistical methods for identifying differentially expressed genes in replicated cDNA microarray experiments. Statistica Sinica 2002; 12: 111-140.
20. Kim S, Sohn I, Ahn J-I, Lee K-H, Lee Y-S, Lee Y-S. Hepatic gene expression profile in long-term high-fat diet-induced obesity mouse model. Gene 2004; 340: 99-109.
21. Becker W, Kluge R, Kantner T, Linnartz K, Korn M, Tschank G, Plum L, Giesen K, Joost HG. Differential hepatic gene expression in a polygenic mouse model with insulin resistance and hyperglycemia: evidence for a combined transcriptional dysregulation of gluconeogenesis and fatty acid synthesis. J Mol Endocrinol 2004; 32: 195-208.
22. Enriquez A, Leclercq I, Farrell GC, Robertson G. Altered expression of hepatic CYP2E1 and CYP4A in obese, diabetic ob/ob mice, and fa/fa Zucker rats. Biochem Biophys Res Commun 1995; 255: 300-306.
23. Callow MJ, Dudoit S, Gong EL, Speed TP, Rubin EM. Microarray Expression Profiling Identifies Genes with Altered Expression in HDL-Deficient Mice. Genome Research 2000; 10: 2022-2029.
24. Memon RA, Fuller J, Moser AH, Smith PJ, Grunfeld C, Feingold KR. Regulation of putative fatty acid transporters and Acyl-CoA synthetase in liver and adipose tissue in ob/ob mice. Diabetes 1999; 48: 121-127.
25. Malewiak MI, Griglio S, Le Liepvre X. Relationship between lipogenesis, ketogenesis, and malonyl-CoA content in isolated hepatocytes from the obese Zucker rat adapted to a high-fat diet. Metabolism 1985; 34: 604-611.
26. Balagurunathan Y, Dougherty E, Chen Y, Bittner M, Trent J. Simulation of cDNA microarrays via a parameterized random signal model. Journal of Biomedical Optics 2002; 7: 507-523.
27. Fujita A, Sato JR, de Oliveira Rodrigues L, Ferreira CE, Sogayar MC. Evaluating different methods of microarray data normalization. BMC Bioinformatics 2006; 7: 469.
28. Haldermans P, Shkedy Z, Sanden SV, Burzykowski T, Aerts M. Using Linear Mixed Models for Normalization of cDNA Microarrays. Statistical Applications in Genetics and Molecular Biology 2007; 6 (1): 1-23.
29. Yang YH, Dudoit S, Luu P, Lin DM, Peng V, Ngai J, Speed TP. Normalization for cDNA microarray data: a robust composite method addressing single and multiple slide systematic variation. Nucleic Acids Research 2002; 30 (4): e15.
30. König IR, Malley JD, Weimar C, Diener H-C, Ziegler A. Practical experiences on the necessity of external validation. Statist Med 2007. In press.

Correspondence to:
Sujong Kim
Skin Research Institute
AmorePacific R&D Center
314-1 Sanggal-dong, Kiheung-gu, Yongin-si
Kyounggi-do 449-729
Korea
E-mail: sundance@amorepacific.com
