The growing number of available treatment options has led to an urgent need for reliable answers when choosing the best course of treatment for a patient. As it is often infeasible to compare a large number of treatments in a single randomized controlled trial, multivariate network meta-analyses (NMAs) are used to synthesize evidence from existing trials of a subset of the available treatments, where outcomes related to both efficacy and safety are considered simultaneously. However, these large-scale multiple-outcome NMAs pose challenges to existing methods due to the increasing complexity of the unknown correlation structures between different outcomes and treatment comparisons. In this paper, we propose a new framework for PAtient-centered treatment ranking via Large-scale Multivariate network meta-analysis, termed PALM, which includes a parsimonious modeling approach, a fast algorithm for parameter estimation and inference, a novel visualization tool for comparing …
Motivation: The Illumina BeadArray is a popular platform for profiling DNA methylation, an important epigenetic event associated with gene silencing and chromosomal instability. However, current approaches rely on an arbitrary detection P-value cutoff for excluding probes and samples from subsequent analysis as a quality control step, which results in missing observations and information loss. It is desirable to have an approach that incorporates all of the data but accounts for the varying quality of individual observations. Results: We first investigate and propose a statistical framework for removing the sources of bias in the Illumina Methylation BeadArray based on several positive control samples. We then introduce a weighted model-based clustering method called LumiWCluster for the Illumina BeadArray that systematically weights each observation according to its detection P-value and avoids discarding subsets of the data. LumiWCluster allows for discovery of distinct methylation patterns …
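To make the weighting idea concrete, the sketch below evaluates a weighted two-component Gaussian mixture log-likelihood, in which each observation's log-density contribution is multiplied by a weight derived from its detection P-value. This is a minimal sketch, not the LumiWCluster implementation: the weight definition w_i = 1 - p_i and all function names are illustrative assumptions.

```python
import numpy as np
from scipy.stats import norm

def weighted_mixture_loglik(y, w, pi1, mu1, sd1, mu2, sd2):
    """Weighted log-likelihood of a two-component Gaussian mixture:
    each observation's log-density is multiplied by its weight w_i,
    so unreliable probes influence the fit less."""
    dens = pi1 * norm.pdf(y, mu1, sd1) + (1 - pi1) * norm.pdf(y, mu2, sd2)
    return np.sum(w * np.log(dens))

# Toy usage: down-weight two probes with large detection P-values,
# assuming (for illustration only) the weight w_i = 1 - p_i.
y = np.array([-2.1, -1.8, 0.2, 1.9, 2.3])   # e.g., logit-scale methylation
p_detect = np.array([0.001, 0.002, 0.40, 0.001, 0.75])
w = 1.0 - p_detect
print(weighted_mixture_loglik(y, w, pi1=0.5, mu1=-2.0, sd1=0.5,
                              mu2=2.0, sd2=0.5))
```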
Journal of the American Statistical Association, 2015
We have developed a statistical method named IsoDOT to assess differential isoform expression (DIE) and differential isoform usage (DIU) using RNA-seq data. Here isoform usage refers to relative isoform expression given the total expression of the corresponding gene. IsoDOT performs two tasks that cannot be accomplished by existing methods: testing DIE/DIU with respect to a continuous covariate, and testing DIE/DIU for one case versus one control. The latter task is not uncommon in practice, e.g., comparing the paternal and maternal alleles of one individual or comparing tumor and normal samples of one cancer patient. Simulation studies demonstrate the high sensitivity and specificity of IsoDOT. We apply IsoDOT to study the effects of haloperidol treatment on the mouse transcriptome and identify a group of genes whose isoform usage responds to haloperidol treatment.
Keywords: RNA-seq, isoform, penalized regression, differential isoform expression, differential isoform usage.
It is of great importance to understand the functional complexity of a living organism, the evolutionary changes in the transcriptome [Barbosa-Morais et al., 2012], and the genomic basis of human diseases [Wang and Cooper, 2007]. Gene expression is traditionally measured by microarrays. Most microarray platforms provide one measurement per gene, which does not distinguish the expression of multiple isoforms. Exon arrays can be used to study RNA isoform expression [Purdom et al., 2008, Richard et al., 2010]; however, RNA sequencing (RNA-seq) provides much better data for this purpose [Wang et al., 2009]. In an RNA-seq study, fragments of RNA molecules (typically 200-500 bps long) are reverse transcribed and amplified, and then sequenced on one end (single-end sequencing) or both ends (paired-end sequencing). A sequenced end is called an RNA-seq read, which can be 30-150 bps or even longer. These RNA-seq reads are mapped to a reference genome, and the number of RNA-seq fragments overlapping each gene can be counted. The expression of the j-th gene in the i-th sample can be measured by the normalized fragment count after adjusting for the read depth of the i-th sample and the length of the j-th gene [Mortazavi et al., 2008].
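As a concrete illustration of the normalized fragment count in the last sentence, here is a minimal FPKM-style sketch (fragments per kilobase of gene per million mapped fragments); the function and variable names are ours, not from IsoDOT.

```python
# Minimal sketch of the normalized expression measure described above:
# fragment count adjusted for the sample's read depth and the gene's
# length (an FPKM-style quantity). Names are illustrative.
def normalized_count(fragments_ij, total_fragments_i, gene_length_j):
    """Fragments per kilobase of gene per million mapped fragments.

    fragments_ij      : fragments overlapping gene j in sample i
    total_fragments_i : total mapped fragments in sample i (read depth)
    gene_length_j     : length of gene j in base pairs
    """
    depth_millions = total_fragments_i / 1e6
    length_kb = gene_length_j / 1e3
    return fragments_ij / (depth_millions * length_kb)

# Toy example: 500 fragments on a 2 kb gene in a 20-million-fragment sample.
print(normalized_count(500, 20_000_000, 2_000))  # 12.5
```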
Background: In epidemiologic research, little emphasis has been placed on methods to account for left-hand censoring of 'exposures' due to a limit of detection (LOD).
A zero-inflated log-normal mixture model (which assumes that the data have a probability mass at zero and a continuous response for values greater than zero), with left censoring due to assay measurements falling below detection limits, has been applied to compare treatment groups in randomized clinical trials and observational cohort studies. Sample size calculation (for a given type I error rate and a desired statistical power) has not been studied for this type of data under the assumption of equal proportions of true zeros in the treatment and control groups. In this article, we derive the sample sizes based on the expected differences between the non-zero values of individuals in the treatment and control groups. Methods for calculating statistical power are also presented. When computing the sample sizes, caution is needed because some irregularities occur: the location parameter is sometimes underestimated due to the mixture distribution and left censoring, and in such cases the aforementioned methods fail. We calculated the required sample size for a recent randomized chemoprevention trial estimating the effect of oltipraz on reducing aflatoxin. A Monte Carlo simulation study was also conducted to investigate the performance of the proposed methods. The simulation results illustrate that the proposed methods provide adequate sample size estimates. However, when the aforementioned irregularity occurs, our methods are limited and further research is needed.
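For intuition, the sketch below writes out the per-arm log-likelihood that such a model implies, assuming true zeros and positive values below the detection limit are both recorded only as censored. It illustrates the mixture-plus-censoring structure, not the authors' sample-size formulas; the function name and parameterization are ours.

```python
import numpy as np
from scipy.stats import norm

def loglik(y, censored, p, mu, sigma, lod):
    """Log-likelihood for one arm: mass p at zero plus a
    log-normal(mu, sigma) for positive values, left censored at lod.
    Entries of y flagged censored are ignored (only their count matters)."""
    y = np.asarray(y, dtype=float)
    censored = np.asarray(censored, dtype=bool)
    # A censored observation is a true zero (prob. p) or a positive
    # value whose log-normal draw fell below the detection limit.
    p_cens = p + (1 - p) * norm.cdf((np.log(lod) - mu) / sigma)
    ll = censored.sum() * np.log(p_cens)
    obs = y[~censored]
    # Log-normal density for values observed above the detection limit.
    ll += np.sum(np.log(1 - p) - np.log(obs * sigma)
                 + norm.logpdf((np.log(obs) - mu) / sigma))
    return ll

print(loglik([0.0, 1.2, 3.5], [True, False, False],
             p=0.2, mu=0.5, sigma=1.0, lod=0.5))
```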
A marginal approach and a variance-component mixed effects model approach (here called a conditional approach) are commonly used to analyze variables that are subject to a limit of detection. We examine the theoretical relationship between these two approaches, investigate their numerical performance, and make recommendations based on our results. The marginal approach is recommended for bivariate normal variables, and the variance-component mixed effects model is preferable for other multivariate analyses in most circumstances. The two approaches are illustrated through a case study from a preclinical experiment.
Journal of the Royal Statistical Society, Series C (Applied Statistics), 2005
Summary. In individuals who are infected with human immunodeficiency virus (HIV), distributions of quantitative HIV ribonucleic acid measurements may be highly left censored, with an extra spike below the limit of detection (LD) of the assay. A two-component mixture model with the lower component entirely supported on [0, LD] is recommended to better model the extra spike in univariate analysis. Let LD1 and LD2 be the limits of detection for two HIV viral load measurements. When estimating the correlation coefficient between two different measures of viral load obtained from each of a sample of patients, a bivariate Gaussian mixture model is recommended to better model the extra spike on [0, LD1] and [0, LD2] when the proportion below LD is incompatible with the left-hand tail of a bivariate Gaussian distribution. When the proportion of both variables falling below LD is very large, the parameters of the lower component may not be estimable, since almost all observations from the lower component fall below LD. A partial solution is to assume that the lower component's entire support is on [0, LD1]×[0, LD2]. Maximum likelihood is used to estimate the parameters of the lower and higher components. To evaluate whether there is a lower component, we apply a Monte Carlo approach to assess the p-value of the likelihood ratio test and two information criteria: a bootstrap-based information criterion and a cross-validation-based information criterion. We provide simulation results to evaluate the performance of the proposed method and compare it with two ad hoc estimators and a single-component bivariate Gaussian likelihood estimator. These methods are applied to data from a cohort study of HIV-infected men in Rio de Janeiro, Brazil, and data from the Women's Interagency HIV oral study. The results emphasize the need for caution when estimating correlation coefficients from data with a large proportion of non-detectable values, when the proportion below LD is incompatible with the left-hand tail of a bivariate Gaussian distribution.
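A univariate sketch of the corresponding maximum likelihood fit is given below: the lower component is supported entirely on [0, LD], so a censored observation arises from it with probability one, while the upper component is Gaussian. The bivariate model in the paper extends this likelihood; the parameterization and optimizer choice here are ours.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit
from scipy.stats import norm

def neg_loglik(theta, y, censored, ld):
    pi_low = expit(theta[0])                 # mixing weight kept in (0, 1)
    mu, sd = theta[1], np.exp(theta[2])      # upper-component parameters
    # A censored value comes from the lower component (probability 1,
    # since its whole support is [0, LD]) or the upper component's
    # tail below LD.
    p_cens = pi_low + (1 - pi_low) * norm.cdf((ld - mu) / sd)
    ll = censored.sum() * np.log(p_cens)
    ll += np.sum(np.log(1 - pi_low) + norm.logpdf(y[~censored], mu, sd))
    return -ll

# Toy log10 viral loads; zeros stand in for values below LD = 1.7.
y = np.array([0.0, 0.0, 0.0, 2.3, 2.8, 3.1, 3.9, 4.2])
censored = y == 0.0
fit = minimize(neg_loglik, x0=np.array([0.0, 3.0, 0.0]),
               args=(y, censored, 1.7), method="Nelder-Mead")
print(expit(fit.x[0]), fit.x[1], np.exp(fit.x[2]))  # pi_low, mu, sd
```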
PURPOSE: Misclassification can produce bias in measures of association. Sensitivity analyses have been suggested to explore the impact of such bias, but do not supply formally justified interval estimates. METHODS: To account for exposure misclassification, recently developed Bayesian approaches were extended to incorporate prior uncertainty and correlation of sensitivity and specificity. Under nondifferential misclassification, a contour plot is used to depict relations among the corrected odds ratio, sensitivity, and specificity. RESULTS: Methods are illustrated by application to a case-control study of cigarette smoking and invasive pneumococcal disease while varying the distributional assumptions about sensitivity and specificity. Results are compared with those of conventional methods, which do not account for misclassification, and a sensitivity analysis, which assumes fixed sensitivity and specificity. CONCLUSION: By using Bayesian methods, investigators can incorporate uncertainty about misclassification into probabilistic inferences.
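The sketch below illustrates the underlying correction in simple Monte Carlo form: draw sensitivity and specificity from Beta priors and back-correct an observed 2x2 table under nondifferential misclassification, yielding an interval for the corrected odds ratio. The counts and Beta prior parameters are illustrative, not the study's data, and this is not the authors' full Bayesian procedure.

```python
import numpy as np

rng = np.random.default_rng(1)
# Observed 2x2 counts (illustrative): exposed/unexposed cases, controls.
a, b = 120, 80
c, d = 90, 110

def corrected_or(se, sp):
    """Back-correct exposure proportions for nondifferential
    misclassification, then return the corrected odds ratio."""
    q1, q0 = a / (a + b), c / (c + d)        # observed exposure proportions
    t1 = (q1 + sp - 1) / (se + sp - 1)       # corrected, cases
    t0 = (q0 + sp - 1) / (se + sp - 1)       # corrected, controls
    if not (0 < t1 < 1 and 0 < t0 < 1):
        return np.nan                        # correction inconsistent here
    return (t1 / (1 - t1)) / (t0 / (1 - t0))

# Propagate Beta-distributed uncertainty in sensitivity and specificity.
draws = np.array([corrected_or(rng.beta(80, 20), rng.beta(95, 5))
                  for _ in range(10_000)])
ors = draws[np.isfinite(draws)]
print(np.percentile(ors, [2.5, 50, 97.5]))   # interval for corrected OR
```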
Background: The relative contributions of the different classes of antiretroviral therapy (ART), HIV infection per se, and aging to body shape changes in HIV-infected patients have not been clearly defined in longitudinal studies.
Journal of the American Statistical Association, 2009
In studies of the accuracy of diagnostic tests, it is common that both the diagnostic test itself and the reference test are imperfect. This is the case for the microsatellite instability test, which is routinely used as a prescreening procedure to identify individuals with Lynch syndrome, the most common hereditary colorectal cancer syndrome. The microsatellite instability test is known to have imperfect sensitivity and specificity. Meanwhile, the reference test, mutation analysis, is also imperfect. We evaluate this test via a random effects meta-analysis of 17 studies. Study-specific random effects account for between-study heterogeneity in mutation prevalence, test sensitivities, and specificities under a nonlinear mixed effects model and a Bayesian hierarchical model. Using model selection techniques, we explore a range of random effects models to identify the best-fitting model. We also evaluate sensitivity to the conditional independence assumption between the microsatellite instability test and the mutation analysis by allowing for correlation between them. Finally, we use simulations to illustrate the importance of including appropriate random effects and the impact of overfitting, underfitting, and misfitting on model performance. Our approach can be used to estimate the accuracy of two imperfect diagnostic tests from a meta-analysis of multiple studies or a multicenter study when the prevalence of disease, test sensitivities, and/or specificities may be heterogeneous among studies or centers.
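A basic building block of such models is the joint probability of the cross-classified results of two imperfect tests. The sketch below computes these cell probabilities under the conditional independence assumption discussed above; the prevalence and accuracy values are illustrative.

```python
import numpy as np

def cell_probs(prev, se1, sp1, se2, sp2):
    """Joint probabilities P(T1=i, T2=j) for two imperfect tests,
    i, j in {0: positive, 1: negative}, assuming the tests are
    independent given true disease status."""
    p = np.empty((2, 2))
    for i, (a1, b1) in enumerate([(se1, 1 - sp1), (1 - se1, sp1)]):
        for j, (a2, b2) in enumerate([(se2, 1 - sp2), (1 - se2, sp2)]):
            # P(T1,T2) = prev*P(T1|D+)P(T2|D+) + (1-prev)*P(T1|D-)P(T2|D-)
            p[i, j] = prev * a1 * a2 + (1 - prev) * b1 * b2
    return p

p = cell_probs(prev=0.3, se1=0.85, sp1=0.90, se2=0.80, sp2=0.95)
print(p, p.sum())  # the four cell probabilities sum to 1
```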