
    Chen-An Tsai

    For medical data mining, the development of a class prediction model has been widely used to deal with various kinds of data classification problems. Classification models especially for high-dimensional gene expression datasets have attracted many researchers in order to identify marker genes for distinguishing any type of cancer cells from their corresponding normal cells. However, skewed class distributions often occur in the medical datasets in which at least one of the classes has a relatively small number of observations. A classifier induced by such an imbalanced dataset typically has a high accuracy for the majority class and poor prediction for the minority class. In this study, we focus on an SVM classifier with a Gaussian radial basis kernel for a binary classification problem. In order to take advantage of an SVM and to achieve the best generalization ability for improving the classification performance, we will address two important problems: the class imbalance and par...
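A common remedy for the class imbalance described above is to make the SVM's misclassification penalty class-dependent. A minimal sketch of the standard "balanced" weighting heuristic follows; the function name and toy labels are illustrative, not taken from the paper:

```python
from collections import Counter

def balanced_class_weights(labels):
    # Weight each class inversely to its frequency, so the effective
    # SVM penalty for class k, C_k = C * w_k, is proportionally larger
    # for the minority class.
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * m) for c, m in counts.items()}

# 90 majority-class vs 10 minority-class samples
y = [0] * 90 + [1] * 10
weights = balanced_class_weights(y)
print(weights)  # the minority class receives a 9x larger penalty weight
```

Libraries such as scikit-learn apply this same heuristic to an RBF-kernel SVC when `class_weight='balanced'` is passed.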
    Background: Gene set enrichment analyses (GSEA) provide a useful and powerful approach to identify differentially expressed gene sets with prior biological knowledge. Several GSEA algorithms have been proposed to perform enrichment analyses on groups of genes. However, many of these algorithms have focused on identification of differentially expressed gene sets in a given phenotype. Objective: In this paper, we propose a gene set analytic framework, Gene Set Correlation Analysis (GSCoA), that simultaneously measures within- and between-gene-set variation to identify sets of genes enriched for differential expression and highly co-related pathways. Methods: We apply co-inertia analysis to comparisons across gene sets in gene expression data to measure the co-structure of expression profiles in pairs of gene sets. Co-inertia analysis (CIA) is a multivariate method for identifying trends or co-relationships in multiple datasets that contain the same samples. The objective of CIA i...
    Gene set analysis (GSA) aims to evaluate the association between the expression of biological pathways, or a priori defined gene sets, and a particular phenotype. Numerous GSA methods have been proposed to assess the enrichment of sets of genes. However, most methods are developed with respect to a specific alternative scenario, such as a differential mean pattern or a differential coexpression. Moreover, a very limited number of methods can handle either binary, categorical, or continuous phenotypes. In this paper, we develop two novel GSA tests, called SDRs, based on the sufficient dimension reduction technique, which aims to capture sufficient information about the relationship between genes and the phenotype. The advantages of our proposed methods are that they allow for categorical and continuous phenotypes, and they are also able to identify a variety of enriched gene sets. Through simulation studies, we compared the type I error and power of SDRs with existing GSA methods for...
    An important objective in mass spectrometry (MS) is to identify a set of biomarkers that can be used to potentially distinguish patients between distinct treatments (or conditions) from tens or hundreds of spectra. A common two-step approach involving peak extraction and quantification is employed to identify the features of scientific interest. The selected features are then used for further investigation to understand the underlying biological mechanisms of individual proteins, or to develop genomic biomarkers for early diagnosis. However, the use of inadequate or ineffective peak detection and peak alignment algorithms in the peak extraction step may lead to a high rate of false positives. Also, it is crucial to reduce the false positive rate in detecting biomarkers from tens or hundreds of spectra. Here a new procedure is introduced for feature extraction in mass spectrometry data that extends the continuous wavelet transform-based (CWT-based) algorithm to multiple spectra. The proposed multispectra CWT-based algorithm (MCWT) not only performs peak detection for multiple spectra but also carries out peak alignment at the same time. The authors' MCWT algorithm constructs a reference, which integrates information of multiple raw spectra, for feature extraction. The algorithm is applied to a SELDI-TOF mass spectra data set provided by CAMDA 2006 with known polypeptide m/z positions. This new approach is easy to implement and it outperforms the existing peak extraction method from the Bioconductor PROcess package.
    In DNA microarray studies, gene-set analysis (GSA) has become the focus of gene expression data analysis. GSA utilizes the gene expression profiles of functionally related gene sets in Gene Ontology (GO) categories or a priori defined biological classes to assess the significance of gene sets associated with clinical outcomes or phenotypes. Many statistical approaches have been proposed to determine whether such functionally related gene sets are differentially expressed (enrichment and/or deletion) across phenotypes. However, little attention has been given to the discriminatory power of gene sets and classification of patients. In this study, we propose a method of gene set analysis, in which gene sets are used to develop classifications of patients based on the Random Forest (RF) algorithm. The corresponding empirical p-value of an observed out-of-bag (OOB) error rate of the classifier is introduced to identify differentially expressed gene sets using an adequate resampling method. In addition, we discuss the impacts and correlations of genes within each gene set based on the measures of variable importance in the RF algorithm. Significant classifications are reported and visualized together with the underlying gene sets and their contribution to the phenotypes of interest. Numerical studies using both synthesized data and a series of publicly available gene expression data sets are conducted to evaluate the performance of the proposed methods. Compared with other hypothesis testing approaches, our proposed methods are reliable and successful in identifying enriched gene sets and in discovering the contributions of genes within a gene set. The classification results of identified gene sets can provide a valuable alternative to gene set testing to reveal the unknown, biologically relevant classes of samples or patients.
In summary, our proposed method allows one to simultaneously assess the discriminatory ability of gene sets and the importance of genes for interpretation of data in complex biological systems. The classifications of biologically defined gene sets can reveal the underlying interactions of gene sets associated with the phenotypes, and provide an insightful complement to conventional gene set analyses.
    This paper compares the type I error and power of the one- and two-sample t-tests, and the one- and two-sample permutation tests for detecting differences in gene expression between two microarray samples with replicates using Monte Carlo simulations. When data are generated from a normal distribution, type I errors and powers of the one-sample parametric t-test and one-sample permutation test are very close, as are the two-sample t-test and two-sample permutation test, provided that the number of replicates is adequate. When data are generated from a t-distribution, the permutation tests outperform the corresponding parametric tests if the number of replicates is at least five. For data from a two-color dye swap experiment, the one-sample test appears to perform better than the two-sample test since expression measurements for control and treatment samples from the same spot are correlated. For data from independent samples, such as the one-channel array or two-channel array experi...
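The two-sample permutation test compared above can be implemented in a few lines. The data here are toy values, not the paper's simulated microarray replicates:

```python
import random
from statistics import mean

def perm_test_two_sample(x, y, n_perm=2000, seed=0):
    # Permutation null: under H0 the group labels are exchangeable, so
    # reshuffle the pooled values and count how often the permuted
    # |mean difference| reaches the observed one (with a +1 correction
    # so the p-value is never exactly zero).
    rng = random.Random(seed)
    observed = abs(mean(x) - mean(y))
    pooled = list(x) + list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        if abs(mean(pooled[:len(x)]) - mean(pooled[len(x):])) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)

# toy "treatment vs control" expression values for one gene
p = perm_test_two_sample([1.2, 0.9, 1.4, 1.1, 1.3], [0.2, 0.1, 0.4, 0.0, 0.3])
print(p)
```

With only five replicates per group there are C(10, 5) = 252 distinct label assignments, which bounds how small the p-value can get — the granularity issue behind the paper's "at least five replicates" finding.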
    Microarray analysis is a powerful tool to identify the biological effects of drugs or chemicals on cellular gene expression. In this study, we compare the relationships between traditional measures of genetic toxicology and mutagen-induced alterations in gene expression profiles. TK6 cells were incubated with 0.01, 0.1, or 1.0 microM +/-anti-benzo(a)pyrene-trans-7,8-dihydrodiol-9,10-epoxide (BPDE) for 4 h and then cultured for an additional 20 h. Aliquots of the exposed cells were removed at 4 and 24 h in order to quantify DNA adduct levels by 32P post-labeling and measure cell viability by cloning efficiency and flow cytometry. Gene expression profiles were developed by extracting total RNA from the control and exposed cells at 4 and 24 h, labeling with Cy3 or Cy5 and hybridizing to a human 350 gene array. Mutant frequencies in the Thymidine Kinase and Hypoxanthine Phosphoribosyl Transferase genes were also determined. The 10alpha-(deoxyguanosin-N(2)-yl)-7alpha,8beta,9beta-trihydro...
    With the escalating amount of gene expression data being produced by microarray technology, one of the important issues in the analysis of expression data is quality assessment: whether a given chip shows artifactually high or low intensity relative to the majority of the chips. We propose a graphical tool implemented in R for visualizing the distributions of two gene chips. Moreover, a statistical test based on the chi-square test is employed to quantify the degree of array comparability for pairwise comparisons on a large number of arrays.
    The tumor suppressor protein p53 is a key regulatory element in the cell and is regarded as the "guardian of the genome". Much of the present knowledge of p53 function has come from studies of transgenic mice in which the p53 gene has undergone a targeted deletion. In order to provide additional insight into the impact on the cellular regulatory networks associated with the loss of this gene, microarray technology was utilized to assess gene expression in tissues from both the p53(-/-) and p53(+/-) mice. Six male mice from each genotype (p53(+/+), p53(+/-), and p53(-/-)) were humanely killed and the tissues processed for microarray analysis. The initial studies have been performed in the liver, for which the Dunnett test revealed 1406 genes to be differentially expressed between p53(+/+) and p53(+/-) or between p53(+/+) and p53(-/-) at the level of p ≤ 0.05. Both genes with increased expression and decreased expression were identified in p53(+/-) and in p53(-/-) mic...
    Gene set analysis methods aim to determine whether an a priori defined set of genes shows statistically significant difference in expression on either categorical or continuous outcomes. Although many methods for gene set analysis have been proposed, a systematic analysis tool for identification of different types of gene set significance modules has not been developed previously. This work presents an R package, called MAVTgsa, which includes three different methods for integrated gene set enrichment analysis. (1) The one-sided OLS (ordinary least squares) test detects coordinated changes of genes in a gene set in one direction, either up- or downregulation. (2) The two-sided MANOVA (multivariate analysis of variance) detects changes in both directions, up- and downregulation, for studying two or more experimental conditions. (3) A random forests-based procedure is used to identify gene sets that can accurately predict samples from different experimental conditions or are associated with the continuous phen...
    ... [12] have proposed two global statistics for one-sided tests and two statistics for two-sided tests. Dinu et al. [13] have proposed a test based on the SAM statistic [1]. Adewale et al. [14] have generalized the SAM-GS statistic to the framework of regression models. ...
    The approval of generic drugs requires the evidence of average bioequivalence (ABE) on both the area under the concentration-time curve and the peak concentration Cmax. The bioequivalence (BE) hypothesis can be decomposed into the non-inferiority (NI) and non-superiority (NS) hypotheses. Most regulatory agencies employ the two one-sided tests (TOST) procedure to test ABE between two formulations. As it is based on the intersection-union principle, the TOST procedure is conservative in terms of the type I error rate. However, the type II error rate is the sum of the type II error rates with respect to the null hypotheses of NI and NS. When the difference in population means between two treatments is not 0, no closed-form solution for the sample size for the BE hypothesis is available. Current methods provide sample sizes with either insufficient power or unnecessarily excessive power. We suggest an approximate method for sample size determination, which can also provide the type II error rate for each of the NI and NS hypotheses. In addition, the proposed method is flexible enough to extend from one pharmacokinetic (PK) response to the determination of the sample size required for multiple PK responses. We report the results of a numerical study. An R code is provided to calculate the sample size for BE testing based on the proposed methods.
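The TOST decision rule described above can be sketched as follows. This uses a large-sample normal approximation in place of the t distribution, with the conventional ln(1.25) margins on the log scale; the numbers are illustrative:

```python
from math import log
from statistics import NormalDist

def tost_abe(diff, se, theta=log(1.25), alpha=0.05):
    # Two one-sided tests on the log scale: H0: mu <= -theta (NI side)
    # and H0: mu >= +theta (NS side) against -theta < mu < theta.
    # A large-sample normal approximation stands in for the t distribution.
    z = NormalDist()
    p_ni = 1 - z.cdf((diff + theta) / se)   # non-inferiority side
    p_ns = z.cdf((diff - theta) / se)       # non-superiority side
    p = max(p_ni, p_ns)                     # intersection-union principle
    return p, p < alpha                     # (p-value, ABE concluded?)

p_be, ok = tost_abe(diff=0.02, se=0.05)     # mean log-difference near 0
p_no, bad = tost_abe(diff=0.30, se=0.05)    # mean shifted beyond ln(1.25)
print(p_be, ok, p_no, bad)
```

Taking the maximum of the two one-sided p-values is exactly the intersection-union construction that makes TOST conservative for the type I error rate.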
    This paper investigates the effects of the ratio of positive-to-negative samples on the sensitivity, specificity, and concordance. When the class sizes in the training samples are not equal, the classification rule derived will favor the majority class and result in a low sensitivity on the minority class prediction. We propose an ensemble classification approach to adjust for differential class sizes in a binary classifier system. An ensemble classifier consists of a set of base classifiers; its prediction rule is based on a summary measure of individual classifications by the base classifiers. Two re-sampling methods, augmentation and abatement, are proposed to generate different bootstrap samples of equal class size to build the base classifiers. The augmentation method balances the two class sizes by bootstrapping additional samples from the minority class, whereas the abatement method balances the two class sizes by sampling only a subset of samples from the majority class. The proposed procedure is applied to a data set to predict estrogen receptor binding activity and to a data set to predict animal liver carcinogenicity using SAR (structure-activity relationship) models as base classifiers. The abatement method appears to perform well in balancing sensitivity and specificity.
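The two re-sampling schemes can be sketched directly. The class labels and sample counts below are invented, and for simplicity both schemes draw with replacement:

```python
import random

def balanced_bootstrap(ids_by_class, method, rng):
    # One bootstrap replicate with equal class sizes, used to train a
    # base classifier of the ensemble.  "augmentation" up-samples every
    # class to the majority size; "abatement" down-samples every class
    # to the minority size.
    sizes = [len(ids) for ids in ids_by_class.values()]
    target = max(sizes) if method == "augmentation" else min(sizes)
    return {c: [rng.choice(ids) for _ in range(target)]
            for c, ids in ids_by_class.items()}

rng = random.Random(1)
# 30 active vs 90 inactive compounds (toy sample ids)
data = {"active": list(range(30)), "inactive": list(range(100, 190))}
aug = balanced_bootstrap(data, "augmentation", rng)
aba = balanced_bootstrap(data, "abatement", rng)
```

Repeating the draw B times and majority-voting the B base classifiers yields the ensemble prediction rule the abstract describes.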
    Standard classification algorithms are generally designed to maximize the number of correct predictions (concordance). The criterion of maximizing the concordance may not be appropriate in certain applications. In practice, some applications may emphasize high sensitivity (e.g., clinical diagnostic tests) and others may emphasize high specificity (e.g., epidemiology screening studies). This paper considers the effects of the decision threshold on sensitivity, specificity, and concordance for four classification methods: logistic regression, classification tree, Fisher's linear discriminant analysis, and a weighted k-nearest neighbor. We investigated the use of decision threshold adjustment to improve either the sensitivity or the specificity of a classifier under specific conditions. We conducted a Monte Carlo simulation showing that as the decision threshold increases, the sensitivity decreases and the specificity increases, but the concordance values in an interval around the maximum concordance are similar. For specified sensitivity and specificity levels, an optimal decision threshold might be determined in an interval around the maximum concordance that meets the specified requirement. Three example data sets were analyzed for illustration.
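The threshold trade-off the simulation describes can be seen with a toy scorer; the scores and labels below are invented for illustration:

```python
def sens_spec(scores, labels, threshold):
    # Predict positive when score >= threshold; return
    # (sensitivity, specificity).
    tp = sum(s >= threshold and y == 1 for s, y in zip(scores, labels))
    fn = sum(s <  threshold and y == 1 for s, y in zip(scores, labels))
    tn = sum(s <  threshold and y == 0 for s, y in zip(scores, labels))
    fp = sum(s >= threshold and y == 0 for s, y in zip(scores, labels))
    return tp / (tp + fn), tn / (tn + fp)

scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [1,   1,   0,   1,   1,   0,   0,   0]
low  = sens_spec(scores, labels, 0.25)  # lenient threshold
high = sens_spec(scores, labels, 0.65)  # strict threshold
print(low, high)
```

Lowering the threshold trades specificity for sensitivity, and raising it does the reverse, exactly the monotone behavior reported by the Monte Carlo study.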
    The percent active (A) and inactive (I) chemicals in a database can directly affect the sensitivity (% active chemicals predicted correctly) and specificity (% inactive chemicals predicted correctly) of structure-activity relationship (SAR) analyses. Subdividing the National Center for Toxicological Research (NCTR) liver cancer database (NCTRlcdb) into various A/I ratios, which varied from 0.2 to 5.5, resulted in sensitivity/specificity ratios that varied from 0.1 to 6.5. As percent active chemicals increased (increasing A/I ratio), the sensitivity rose, the specificity decreased, and the concordance (% total chemicals predicted correctly) remained fairly constant. The numbers of chemicals in the various data sets ranged from 187 to 999 and appeared to have no effect on any of the three measures: sensitivity, specificity, or concordance.
    Testing for significance with gene expression data from DNA microarray experiments involves simultaneous comparisons of hundreds or thousands of genes. In common exploratory microarray experiments, most genes are not expected to be differentially expressed. The family-wise error (FWE) rate and false discovery rate (FDR) are two common approaches used to account for multiple hypothesis tests to identify differentially expressed genes. When the number of hypotheses is very large and some null hypotheses are expected to be true, the power of an FWE or FDR procedure can be improved if the number of null hypotheses is known. The mean of differences (MD) of ranked p-values has been proposed to estimate the number of true null hypotheses under the independence model. This article proposes to incorporate the MD estimate into an FWE or FDR approach for gene identification. Simulation results show that the procedure appears to control the FWE and FDR well at the FWE=0.05 and FDR=0.05 significance levels; it exceeds the nominal level for FDR=0.01 when the null hypotheses are highly correlated (a correlation of 0.941). The proposed approach is applied to a public colon tumor data set for illustration.
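A simplified, spacing-based variant of the idea behind the MD estimator (not the article's exact procedure) shows how ranked p-values can reveal the number of true nulls:

```python
import random

def estimate_m0(pvalues):
    # Under the independence model, p-values from true null hypotheses
    # are uniform, so consecutive ranked null p-values are spaced about
    # 1/(m0 + 1) apart; inverting the mean spacing gives a rough m0.
    # Restricting to the upper half of the ranked p-values (where true
    # nulls dominate) is a simplification for illustration only.
    p = sorted(pvalues)
    upper = p[len(p) // 2:]
    gaps = [b - a for a, b in zip(upper, upper[1:])]
    md = sum(gaps) / len(gaps)
    return min(len(pvalues), max(0, round(1 / md - 1)))

rng = random.Random(42)
# 900 true nulls (uniform p-values) + 100 strong signals (tiny p-values)
pvals = [rng.random() for _ in range(900)] + [rng.random() * 1e-4 for _ in range(100)]
m0_hat = estimate_m0(pvals)
print(m0_hat)  # close to the true 900
```

Plugging such an estimate into a Bonferroni or FDR step-up procedure in place of the total hypothesis count m is what buys the power gain the abstract reports.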
    Several lines of evidence implicate glutamatergic neurotransmission in the pathophysiology of obsessive compulsive disorder (OCD). Sarcosine is an endogenous antagonist of glycine transporter-1. By blocking glycine uptake, sarcosine may increase the availability of synaptic glycine and enhance N-methyl-d-aspartate (NMDA) subtype glutamatergic neurotransmission. In this 10-week open-label trial, we examined the potential benefit of sarcosine treatment in OCD patients. Twenty-six outpatients with OCD and baseline Yale-Brown Obsessive Compulsive Scale (Y-BOCS) scores higher than 16 were enrolled. Drug-naive subjects (group 1, n = 8) and those who had discontinued serotonin reuptake inhibitors for at least 8 weeks at study entry (group 2, n = 6) received sarcosine monotherapy. The other subjects (group 3, n = 12) received sarcosine as adjunctive treatment. A flexible dosage schedule of sarcosine 500 to 2000 mg/d was applied. The primary outcome measures were the Y-BOCS and Hamilton Anxiety Inventory, rated at weeks 0, 2, 4, 6, 8, and 10. Results were analyzed by repeated-measures analysis of variance. Data of 25 subjects were eligible for analysis. The mean ± SD Y-BOCS scores decreased from 27.6 ± 5.8 to 22.7 ± 8.7, indicating a mean decrease of 19.8% ± 21.7% (P = 0.0035). Eight (32%) subjects were regarded as responders, with greater than 35% reduction of Y-BOCS scores. Five of the responders achieved a good response early, by week 4. Although not statistically significant, drug-naive (group 1) subjects had more profound and sustained improvement and more responders than the subjects who had received treatment before (groups 2 and 3). Sarcosine was well tolerated; only one subject withdrew owing to transient headache. Sarcosine treatment can achieve a fast therapeutic effect in some OCD patients, particularly those who are treatment naive. The study supports glycine transporter-1 as a novel target for developing new OCD treatment.
Large-series placebo-controlled, double-blind studies are recommended.
    We propose an integrated tree-based approach for prognostic grouping of localized melanoma patients. This approach incorporates the survival tree model with agglomerative hierarchical clustering to group terminal subgroups with similar prognoses together. The Brier score is used to evaluate the goodness of fit, and the k-fold cross-validation test is used to evaluate the reproducibility of the scheme for prediction. The proposed approach is applied to an American Joint Committee on Cancer (AJCC) localized melanoma data set and compared with the current AJCC staging system. This approach performs more efficiently than the standard tree methods and improves on the current AJCC melanoma staging system.
    Microarray technology allows the measurement of expression levels of a large number of genes simultaneously. There are inherent biases in microarray data generated from an experiment. Various statistical methods have been proposed for data normalization and data analysis. This paper proposes a generalized additive model for the analysis of gene expression data. This model consists of two sub-models: a non-linear model and a linear model. We propose a two-step normalization algorithm to fit the two sub-models sequentially. The first step involves a non-parametric regression using lowess fits to adjust for non-linear systematic biases. The second step uses a linear ANOVA model to estimate the remaining effects, including the interaction effect of genes and treatments, the effect of interest in a study. The proposed model is a generalization of the ANOVA model for microarray data analysis. We show correspondences between the lowess fit and the ANOVA model methods. The normalization procedure does not assume that the majority of genes do not change their expression levels, nor does it assume that the two channel intensities from the same spot are independent. The procedure can be applied to either one-channel or two-channel data from experiments with multiple treatments or multiple nuisance factors. Two toxicogenomic experiment data sets and a simulated data set are used to contrast the proposed method with the commonly known lowess fit and ANOVA methods.
    A common objective in microarray experiments is to select genes that are differentially expressed between two classes (two treatment groups). Selection of differentially expressed genes involves two steps. The first step is to calculate a discriminatory score that will rank the genes in order of evidence of differential expressions. The second step is to determine a cutoff for the ranked scores. Summary indices of the receiver operating characteristic (ROC) curve provide relative measures for a ranking of differential expressions. This article proposes using the hypothesis-testing approach to compute the raw p-values and/or adjusted p-values for three ROC discrimination measures. A cutoff p-value can be determined from the (ranked) p-values or the adjusted p-values to select differentially expressed genes. To quantify the degree of confidence in the selected top-ranked genes, the conditional false discovery rate (FDR) over the selected gene set and the "Type I" (false positive) error probability for each selected gene are estimated. The proposed approach is applied to a public colon tumor data set for illustration. The selected gene sets from three ROC summary indices and the commonly used two-sample t-statistic are applied to the sample classification to evaluate the predictability of the four discrimination measures.
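One widely used ROC summary index, the area under the curve, reduces to the Mann-Whitney statistic. A generic sketch (not the article's three specific indices), with invented expression values:

```python
def auc(pos_scores, neg_scores):
    # Empirical AUC = proportion of (positive, negative) score pairs
    # ranked correctly, with ties counted as 1/2 (Mann-Whitney form).
    wins = sum((p > n) + 0.5 * (p == n)
               for p in pos_scores for n in neg_scores)
    return wins / (len(pos_scores) * len(neg_scores))

# expression of one gene in tumor vs. normal samples (toy numbers)
print(auc([3.1, 2.7, 2.7], [1.9, 2.7, 1.2]))
```

Computing such an index per gene produces the ranking of differential expression that the first step of the selection procedure requires; the second step then thresholds the ranked (adjusted) p-values.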
    Identifying genes that are differentially expressed in response to DNA damage may help elucidate markers for genetic damage and provide insight into the cellular responses to specific genotoxic agents. We utilized cDNA microarrays to develop gene expression profiles for ionizing radiation-exposed human lymphoblastoid TK6 cells. In order to relate changes in the expression profiles to biological responses, the effects of ionizing radiation on cell viability, cloning efficiency, and micronucleus formation were measured. TK6 cells were exposed to 0.5, 1, 5, 10, and 20 Gy ionizing radiation and cultured for 4 or 24 hr. A significant (P < 0.0001) decrease in cloning efficiency was observed at all doses at 4 and 24 hr after exposure. Flow cytometry revealed significant decreases in cell viability at 24 hr in cells exposed to 5 (P < 0.001), 10 (P < 0.0001), and 20 Gy (P < 0.0001). An increase in micronucleus frequency occurred at both 4 and 24 hr at 0.5 and 1 Gy; however, insufficient binucleated cells were present for analysis at the higher doses. Gene expression profiles were developed from mRNA isolated from cells exposed to 5, 10, and 20 Gy using a 350 gene human cDNA array platform. Overall, more genes were differentially expressed at 24-hr than at the 4-hr time point. The genes upregulated (> 1.5-fold) or downregulated (< 0.67-fold) at 4 hr were those primarily involved in the cessation of the cell cycle, cellular detoxification pathways, DNA repair, and apoptosis. At 24 hr, glutathione-associated genes were induced in addition to genes involved in apoptosis. Genes involved in cell cycle progression and mitosis were downregulated at 24 hr. Real-time quantitative PCR was used to confirm the microarray results and to evaluate expression levels of selected genes at the low doses (0.5 and 1.0 Gy). 
The expression profiles reflect the cellular and molecular responses to ionizing radiation related to the recognition of DNA damage, a halt in progression through the cell cycle, activation of DNA-repair pathways, and the promotion of apoptosis.
    DNA microarray technology provides useful tools for profiling global gene expression patterns in different cell/tissue samples. One major challenge is the large number of genes relative to the number of samples. The use of all genes can suppress or reduce the performance of a classification rule due to the noise of nondiscriminatory genes. Selection of an optimal subset from the original gene set becomes an important preliminary step in sample classification. In this study, we propose a family-wise error (FWE) rate approach to selection of discriminatory genes for two-sample or multiple-sample classification. The FWE approach controls the probability of one or more false positives at a prespecified level. A public colon cancer data set is used to evaluate the performance of the proposed approach for two classification methods: k nearest neighbors (k-NN) and support vector machine (SVM). The gene sets selected by the proposed procedure appear to perform better than, or comparably to, several results reported in the literature that use univariate analysis without a multivariate search. In addition, we apply the FWE approach to a toxicogenomic data set with nine treatments (a control and eight metals, As, Cd, Ni, Cr, Sb, Pb, Cu, and AsV) for a total of 55 samples for a multisample classification. Two gene sets are considered: the gene set omegaF formed by the ANOVA F-test, and a gene set omegaT formed by the union of one-versus-all t-tests. The predicted accuracies are evaluated using internal and external cross-validation. Using the SVM classification, the overall accuracies to predict 55 samples into one of the nine treatments are above 80% for internal cross-validation. OmegaF has slightly higher accuracy rates than omegaT. The overall predicted accuracies are above 70% for the external cross-validation; the two gene sets omegaT and omegaF performed equally well.
    Recent developments in high-throughput technology have accelerated interest in the development of molecular biomarker classifiers for safety assessment, disease diagnostics and prognostics, and prediction of response for patient assignment. This article reviews and evaluates some important aspects and key issues in the development of biomarker classifiers. Development of a biomarker classifier for high-throughput data involves two components: (i) model building and (ii) performance assessment. This article focuses on feature selection in model building and cross-validation for performance assessment. A 'frequency' approach to feature selection is presented and compared to the 'conventional' approach in terms of the predictive accuracy and stability of the selected feature set. The two approaches are compared on four biomarker classifiers, each pairing a different feature selection method with a well-known classification algorithm. In each of the four classifiers, the predictor set selected by the frequency approach is more stable than that selected by the conventional approach.
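    A minimal sketch of a 'frequency' selection scheme is given below: features are re-selected on each bootstrap resample, and only features chosen in a sufficient fraction of resamples enter the final set. The scoring rule (absolute difference of class means), the 50% frequency threshold, and all parameter names are assumptions of this sketch, not the article's specific method.

```python
import random

def frequency_select(X, y, k, n_resamples=100, threshold=0.5, seed=0):
    """'Frequency' feature selection sketch: on each bootstrap resample,
    rank features by the absolute difference of class means and keep the
    top k; a feature enters the final set only if it is selected in at
    least `threshold` of the resamples."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    counts = [0] * p
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]  # bootstrap sample
        scores = []
        for j in range(p):
            g0 = [X[i][j] for i in idx if y[i] == 0]
            g1 = [X[i][j] for i in idx if y[i] == 1]
            if not g0 or not g1:          # a class missing from the resample
                scores.append(0.0)
            else:
                scores.append(abs(sum(g1) / len(g1) - sum(g0) / len(g0)))
        for j in sorted(range(p), key=lambda j: -scores[j])[:k]:
            counts[j] += 1
    return [j for j in range(p) if counts[j] >= threshold * n_resamples]

# Toy data: feature 0 separates the classes, feature 1 is constant
X = [[0.0, 1], [0.1, 1], [0.2, 1], [0.1, 1],
     [5.0, 1], [5.1, 1], [4.9, 1], [5.0, 1]]
y = [0, 0, 0, 0, 1, 1, 1, 1]
print(frequency_select(X, y, k=1))  # -> [0]
```

    The stability comparison in the article amounts to asking how much this final set changes across repeated resampling, versus a single one-shot selection.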
    Testing for significance with gene expression data from DNA microarray experiments involves simultaneous comparisons of hundreds or thousands of genes. If R denotes the number of rejections (genes declared significant) and V denotes the number of false rejections, then V/R, for R > 0, is the proportion of falsely rejected hypotheses. This paper proposes a model for the distribution of the number of rejections and for the conditional distribution of V given R (V | R). Under the independence assumption, the distribution of R is a convolution of two binomials and the distribution of V | R is noncentral hypergeometric. Under an equicorrelated model, the distributions are more complex and are also derived. Five false discovery rate probability error measures are considered: FDR = E(V/R), pFDR = E(V/R | R > 0) (positive FDR), cFDR = E(V/R | R = r) (conditional FDR), mFDR = E(V)/E(R) (marginal FDR), and eFDR = E(V)/r (empirical FDR). The pFDR, cFDR, and mFDR are shown to be equivalent under a Bayesian framework in which the number of true null hypotheses is modeled as a random variable. We present a parametric procedure and a bootstrap procedure to estimate the FDRs. Monte Carlo simulations were conducted to evaluate the performance of the two methods. The bootstrap procedure appears to perform reasonably well, even when the alternative hypotheses are correlated (ρ = 0.25). An example from a toxicogenomic microarray experiment is presented for illustration.
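    The independence model above can be illustrated directly by simulation: V is Binomial(m0, α) over the true nulls, the number of true rejections is Binomial(m1, power) over the alternatives, and R is their sum, i.e. a convolution of two binomials. The sketch below estimates FDR, pFDR, and mFDR this way; the parameter values in the example are hypothetical.

```python
import random

def simulate_fdr(m0, m1, alpha, power, n_sim=2000, seed=1):
    """Monte Carlo illustration of three error measures under
    independence: V ~ Binomial(m0, alpha) false rejections and
    S ~ Binomial(m1, power) true rejections, so R = V + S is a
    convolution of two binomials.  Returns (FDR, pFDR, mFDR)."""
    rng = random.Random(seed)
    sum_q = 0.0      # accumulates V/R, taken as 0 when R = 0 (for FDR)
    sum_q_pos = 0.0  # accumulates V/R only when R > 0 (for pFDR)
    n_pos = 0
    sum_v = 0        # marginal totals for mFDR = E(V)/E(R)
    sum_r = 0
    for _ in range(n_sim):
        v = sum(rng.random() < alpha for _ in range(m0))
        s = sum(rng.random() < power for _ in range(m1))
        r = v + s
        sum_v += v
        sum_r += r
        if r > 0:
            sum_q += v / r
            sum_q_pos += v / r
            n_pos += 1
    return sum_q / n_sim, sum_q_pos / n_pos, sum_v / sum_r

# Hypothetical setting: 900 true nulls tested at alpha = 0.01 and 100
# alternatives detected with power 0.8, so mFDR = E(V)/E(R) = 9/89 ~ 0.10.
fdr, pfdr, mfdr = simulate_fdr(900, 100, 0.01, 0.8)
```

    With R > 0 essentially always in this setting, the FDR and pFDR estimates nearly coincide, and both sit close to the marginal value 9/89.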
    Microarray experiments often involve hundreds or thousands of genes. In a typical experiment, only a fraction of genes are expected to be differentially expressed; in addition, the measured intensities among different genes may be correlated. Depending on the experimental objectives, sample size calculations can be based on one of three specified measures: sensitivity, the true discovery rate, or the accuracy rate. The sample size problem is formulated as finding the number of arrays needed to achieve the desired fraction of the specified measure at the desired family-wise power, given the type I error and (standardized) effect size. We present a general approach for estimating sample size under independent and equally correlated models using binomial and beta-binomial models, respectively. The sample sizes needed for a two-sample z-test are computed; the theoretical numbers agree well with the Monte Carlo simulation results. However, under more general correlation structures, the beta-binomial model can underestimate the needed sample size by about 1-5 arrays.
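    Under the independence (binomial) model, the calculation reduces to: the number of detected genes is Binomial(m1, power(n)), and n is increased until the tail probability of detecting the desired fraction reaches the target family-wise power. The sketch below implements that search for a two-sided two-sample z-test; the Bonferroni per-gene level and all parameter names are assumptions of this sketch.

```python
from math import ceil, comb, sqrt
from statistics import NormalDist

def arrays_needed(m, m1, effect, alpha=0.05, fraction=0.8,
                  fw_power=0.9, n_max=200):
    """Binomial-model sample size sketch: smallest n arrays per group
    such that at least `fraction` of the m1 truly changed genes are
    detected with probability >= fw_power, each gene being tested by a
    two-sided two-sample z-test at Bonferroni level alpha/m (assumed)."""
    nd = NormalDist()
    z_cut = nd.inv_cdf(1 - alpha / m / 2)   # two-sided critical value
    need = ceil(fraction * m1)              # genes that must be detected
    for n in range(2, n_max + 1):
        # per-gene power of the z-test with n arrays per group and
        # standardized effect size `effect`
        p = 1 - nd.cdf(z_cut - effect * sqrt(n / 2))
        # detected count ~ Binomial(m1, p); family-wise power is its tail
        tail = sum(comb(m1, k) * p**k * (1 - p)**(m1 - k)
                   for k in range(need, m1 + 1))
        if tail >= fw_power:
            return n
    return None  # not achievable within n_max

# e.g. 1000 genes, 50 truly changed, standardized effect size 1.0
n = arrays_needed(1000, 50, effect=1.0)
```

    Larger effect sizes shrink the required n quickly, since the per-gene power term grows with effect * sqrt(n/2).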
