
    Dabao Zhang

    Understanding genome and chromosome evolution is important for understanding genetic inheritance and evolution. Universal events comprising DNA replication, transcription, repair, mobile genetic element transposition, chromosome rearrangements, mitosis, and meiosis underlie inheritance and variation of living organisms. Although the genome of a species as a whole is important, chromosomes are the basic units subjected to genetic events that shape evolution to a large extent. Now that many complete genome sequences are available, we can address evolution and variation of individual chromosomes across species. For example, “How are the repeat and nonrepeat proportions of genetic codes distributed among different chromosomes in a multichromosome species?” “Is there a general rule behind the intuitive observation that chromosome lengths tend to be similar in a species, and if so, can we generalize any findings in chromosome content and size across different taxonomic groups?” Here, we s...
    PROCEEDINGS
    An appropriate screening tool for excessive alcohol use (EAU) is clinically important, as it may help providers encourage early intervention and prevent adverse outcomes. We hypothesized that patients with excessive alcohol use would have distinct serum metabolites when compared to healthy controls. Serum metabolic profiling of 22 healthy controls and 147 patients with a history of EAU was performed. We employed seemingly unrelated regression to identify the unique metabolites and found 67 metabolites (out of 556) that were differentially expressed in patients with EAU. Sixteen metabolites belong to sphingolipid metabolism, 13 belong to phospholipid metabolism, and the remaining 38 were metabolites of 25 different pathways. We also found 93 serum metabolites that were significantly associated with the total quantity of alcohol consumed in the last 30 days. Of these, 15 metabolites belong to sphingolipid metabolism, 11 belong to phospholipid metabolism, and 7 belong to lysolipid metabolism. Using a Venn diagram approach, we found the top 10 metabolites that were both differentially expressed in EAU and significantly associated with the quantity of alcohol consumption: sphingomyelin (d18:2/18:1), sphingomyelin (d18:2/21:0, d16:2/23:0), guanosine, S-methylmethionine, 10-undecenoate (11:1n1), sphingomyelin (d18:1/20:1, d18:2/20:0), sphingomyelin (d18:1/17:0, d17:1/18:0, d19:1/16:0), N-acetylasparagine, sphingomyelin (d18:1/19:0, d19:1/18:0), and 1-palmitoyl-2-palmitoleoyl-GPC (16:0/16:1). The diagnostic performance of the top 10 metabolites, measured by the area under the ROC curve, was significantly higher than that of commonly used markers. We have identified a unique metabolic signature among patients with EAU. Future studies are needed to validate these markers and to determine their kinetics as a function of alcohol consumption.
    Logistic regression is an effective tool in case-control analysis. With advances in high-throughput technology, the search for a fast and efficient method of fitting high-dimensional logistic regression has attracted much interest. An empirical Bayes model for logistic regression is considered in this article. A spike-and-slab prior is used for variable selection, which plays a vital role in building an effective predictive model while keeping the model interpretable. To increase the power of variable selection, we incorporate biological knowledge through the Ising prior. We propose the iterated conditional modes/medians (ICM/M) algorithm to fit the logistic model; it has a computational advantage over Markov chain Monte Carlo (MCMC) algorithms. The implementation of the ICM/M algorithm for both linear and logistic models can be found in the R package icmm, which is freely available on the Comprehensive R Archive Network (CRAN). Simulation studies were carried out to assess the performance of our method, with lasso and adaptive lasso as benchmarks. Overall, the simulation studies show that ICM/M outperforms the others in terms of the number of false positives and has competitive predictive ability. An application to a real data set from a Parkinson's disease study was also carried out for illustration. To identify important variables, our approach provides the flexibility to select variables based on local posterior probabilities while controlling the false discovery rate at a desired level, rather than relying only on regression coefficients.
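    The selection step described at the end of this abstract can be illustrated with a generic Bayesian false discovery rate rule. The sketch below is not taken from the icmm package; it assumes that each variable comes with a local posterior probability of being relevant, and it keeps the largest set of top-ranked variables whose average posterior probability of being a false discovery stays at or below a target level. The function name and the example values are hypothetical.

```python
def select_by_bayesian_fdr(post_probs, alpha=0.05):
    """Select variables by posterior probability under an FDR target.

    post_probs: local posterior probabilities that each variable is relevant
                (assumed input; how they are estimated is outside this sketch).
    alpha:      target Bayesian false discovery rate.
    Returns the indices of the selected variables, most probable first.
    """
    # Rank variables from most to least likely to be relevant.
    order = sorted(range(len(post_probs)),
                   key=lambda j: post_probs[j], reverse=True)

    selected = []
    expected_false = 0.0  # running sum of (1 - posterior probability)
    for rank, j in enumerate(order, start=1):
        expected_false += 1.0 - post_probs[j]
        # The running average is non-decreasing along this ordering,
        # so the first violation ends the search.
        if expected_false / rank > alpha:
            break
        selected.append(j)
    return selected


if __name__ == "__main__":
    probs = [0.99, 0.20, 0.97, 0.60, 0.95]  # hypothetical values
    print(select_by_bayesian_fdr(probs, alpha=0.05))
```

    Ranking by posterior probability rather than by coefficient size means a variable with a small but well-supported effect can still be selected, which is the flexibility the abstract refers to.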
    Despite the increasing popularity and applicability of metabolomics for putative biomarker identification, analysis of the data is challenged by low statistical power resulting from the small sample sizes and large numbers of metabolites and other omics information, as well as confounding demographic and clinical variables. To enhance the statistical power and improve reproducibility of the identified metabolite-based biomarkers, we advocate the use of advanced statistical methods that can simultaneously evaluate the relationship between a group of metabolites and various types of variables including other omics profiles, demographic and clinical data, as well as the complex interactions between them. Accordingly, in this chapter, we describe the method of seemingly unrelated regression that can simultaneously analyze multiple metabolites while controlling the confounding effects of demographic and clinical variables (such as gender, age, BMI, smoking status). We also introduce penalized orthogonal components regression as a screening approach that can handle millions of omics predictors in the model.
    There is continued debate regarding the exact relation between lower cholesterol levels and increased respiratory disease mortality. One of the goals of this study is to reveal the relationship between subcomponents of cholesterol and pulmonary function. We consider the subcomponents of total cholesterol, namely high-density lipoprotein cholesterol and low-density lipoprotein cholesterol, to investigate the relationship of cholesterol levels with pulmonary function in a longitudinal study. To answer these questions, we propose new methodology for hierarchical reciprocal graphical models. We consider the identification and estimation of these models, and propose maximum likelihood estimation using a generalized EM algorithm. A simulation study of the algorithm and the corresponding estimates reveals excellent performance of the proposed procedures. Application of this methodology to the Normative Aging Study reveals complicated associations between pulmonary function and the subcomponents of total cholesterol.
    Commonly accepted intensity-dependent normalization in spotted microarray studies takes account of measurement errors in the differential expression ratio but ignores measurement errors in the total intensity, although the definitions imply the same measurement error components are involved in both statistics. Furthermore, identification of differentially expressed genes is usually considered separately following normalization, which is statistically problematic. By incorporating the measurement errors in both total intensities and differential expression ratios, we propose a measurement-error model for intensity-dependent normalization and identification of differentially expressed genes. This model is also flexible enough to incorporate intra-array and inter-array effects. A Bayesian framework is proposed for the analysis of the proposed measurement-error model to avoid the potential risk of using the common two-step procedure. We also propose a Bayesian identification of differentially expressed genes to control the false discovery rate instead of the ad hoc thresholding of the posterior odds ratio. The simulation study and an application to real microarray data demonstrate promising results.
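    The claim that the same measurement error components enter both statistics can be seen from the usual definitions of the log ratio M and the average log intensity A for a two-channel spot. The sketch below uses those standard definitions; the function name and the intensity values are hypothetical, and it illustrates only the definitions, not the measurement-error model proposed in the abstract.

```python
import math


def ma_values(red, green):
    """Return (M, A) for one spot from its red and green intensities.

    M = log2(red) - log2(green)        differential expression ratio
    A = (log2(red) + log2(green)) / 2  total (average log) intensity
    """
    log_r = math.log2(red)
    log_g = math.log2(green)
    return log_r - log_g, (log_r + log_g) / 2.0


if __name__ == "__main__":
    # Doubling the red reading (an error of one unit on the log2 scale)
    # moves M by 1 and A by 0.5, so an error in one channel affects both.
    print(ma_values(8.0, 2.0))
    print(ma_values(16.0, 2.0))
```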
    Gene regulation plays an important role in understanding the mechanisms of human biology and diseases. However, inferring causal relationships between all genes is challenging due to the large number of genes in the transcriptome. Here, we present SIGNET (Statistical Inference on Gene Regulatory Networks), a flexible software package that reveals networks of causal regulation between genes built upon large-scale transcriptomic and genotypic data at the population level. Like Mendelian randomization, SIGNET uses genotypic variants as natural instrumental variables to establish such causal relationships but constructs a transcriptome-wide gene regulatory network with high confidence. SIGNET makes such a computationally heavy task feasible by deploying a well-designed statistical algorithm over a parallel computing environment. It also provides a user-friendly interface allowing for parameter tuning, efficient parallel computing scheduling, interactive network visualization, and confir...
    Key message Association analysis for ionomic concentrations of 20 elements identified independent genetic factors underlying the root and shoot ionomes of rice, providing a platform for selecting and dissecting causal genetic variants. Abstract Understanding the genetic basis of mineral nutrient acquisition is key to fully describing how terrestrial organisms interact with the non-living environment. Rice (Oryza sativa L.) serves both as a model organism for genetic studies and as an important component of the global food system. Studies in rice ionomics have primarily focused on above ground tissues evaluated from field-grown plants. Here, we describe a comprehensive study of the genetic basis of the rice ionome in both roots and shoots of 6-week-old rice plants for 20 elements using a controlled hydroponics growth system. Building on the wealth of publicly available rice genomic resources, including a panel of 373 diverse rice lines, 4.8 M genome-wide single-nucleotide polymorphis...
    We developed a novel statistical method to identify structural differences between networks characterized by structural equation models. We propose to reparameterize the model to separate the differential structures from common structures, and then design an algorithm with calibration and construction stages to identify these differential structures. The calibration stage serves to obtain consistent predictions by building the L2 regularized regression of each endogenous variable against pre-screened exogenous variables, correcting for potential endogeneity issues. The construction stage consistently selects and estimates both common and differential effects by undertaking L1 regularized regression of each endogenous variable against the predictions of other endogenous variables as well as its anchoring exogenous variables. Our method allows easy parallel computation at each stage. Theoretical results are obtained to establish nonasymptotic error bounds of predictions and estimates at b...
    Constructing gene regulatory networks is crucial to unraveling the genetic architecture of complex traits and to understanding the mechanisms of diseases. On the basis of gene expression and single nucleotide polymorphism data in the yeast, Saccharomyces cerevisiae, we constructed gene regulatory networks using a two-stage penalized least squares method. A large system of structural equations via optimal prediction of a set of surrogate variables was established at the first stage, followed by consistent selection of regulatory effects at the second stage. Using this approach, we identified subnetworks that were enriched in gene ontology categories, revealing directional regulatory mechanisms controlling these biological pathways. Our mapping and analysis of expression-based quantitative trait loci uncovered a known alteration of gene expression within a biological pathway that results in regulatory effects on companion pathway genes in the phosphocholine network. In addition, we id...
    Many genetic variants have been linked to familial or sporadic Parkinson's disease (PD), among which those identified in PARK16, BST1, SNCA, LRRK2, GBA and MAPT genes have been demonstrated to be the most common risk factors worldwide. Moreover, complex gene-gene and gene-environment interactions have been highlighted in PD pathogenesis. Compared to studies focusing on the predisposing effects of genes, there is a relative lack of research investigating how these genes and their interactions influence the clinical profiles of PD. In a cohort consisting of 2,011 Chinese Han PD patients, we selected 9 representative variants from the 6 above-mentioned common PD genes to analyze their main and epistatic effects on the Unified Parkinson's Disease Rating Scale (UPDRS) and the Hoehn and Yahr (H-Y) stage of PD. With multiple linear regression models adjusting for medication status, disease duration, gender and age at onset, none of the variants displayed significant main effects on...
    Protein arginine methyltransferase 5 (PRMT5) symmetrically methylates arginine residues of histones and non-histone protein substrates and regulates a variety of cellular processes through epigenetic control of target gene expression or post-translational modification of signaling molecules. Recent evidence suggests that PRMT5 may function as an oncogene and its overexpression contributes to the development and progression of several human cancers. However, the mechanism underlying the regulation of PRMT5 expression in cancer cells remains largely unknown. In the present study, we have mapped the proximal promoter of PRMT5 to the -240bp region and identified nuclear transcription factor Y (NF-Y) as a critical transcription factor that binds to the two inverted CCAAT boxes and regulates PRMT5 expression in multiple cancer cell lines. Further, we present evidence that loss of PRMT5 is responsible for cell growth inhibition induced by knockdown of NF-YA, a subunit of NF-Y that forms a ...
    It is of particular interest to identify cancer-specific molecular signatures for early diagnosis, monitoring effects of treatment and predicting patient survival time. Molecular information about patients is usually generated from high throughput technologies such as microarray and mass spectrometry. Statistically, we are challenged by the large number of candidates but only a small number of patients in the study, and the right-censored clinical data further complicate the analysis. We present a two-stage procedure to profile molecular signatures for survival outcomes. Firstly, we group closely-related molecular features into linkage clusters, each portraying either similar or opposite functions and playing similar roles in prognosis; secondly, a Bayesian approach is developed to rank the centroids of these linkage clusters and provide a list of the main molecular features closely related to the outcome of interest. A simulation study showed the superior performance of our approac...
    ABSTRACT Graphical models for clustered data with mixed discrete and continuous responses are developed. Discrete responses are assumed to be regulated by some latent continuous variables, and particular link functions are used to describe the regulatory mechanisms. Inferential procedures are constructed using full-information maximum likelihood estimation and observed/empirical Fisher information matrices. Implementation is carried out by stochastic versions of the generalized EM algorithm. As an illustrative application, clustered data from a developmental toxicity study are re-investigated using the directed graphical model and the proposed algorithms. An interesting new directed association between two mixed outcomes is revealed. The proposed methods also apply to cross-sectional data with discrete and continuous responses.
    Genome-wide associations between single-nucleotide polymorphisms and clinical traits were analyzed simultaneously using penalized orthogonal-components regression. This method was developed to identify the genetic variants controlling phenotypes from a massive number of candidate variants. By investigating the association between all single-nucleotide polymorphisms and the phenotype of antibodies against cyclic citrullinated peptide, using the rheumatoid arthritis data provided by Genetic Analysis Workshop 16, we identified genetic regions that may contribute to the pathogenesis of rheumatoid arthritis. Bioinformatic analysis of these genomic regions showed that most of them harbor protein-coding gene(s).
    Genome-wide association studies have successfully identified numerous loci at which common variants influence disease risks or quantitative traits of interest. Despite these successes, the variants identified by these studies have generally explained only a small fraction of the variations in the phenotype. One explanation may be that many rare variants that are not included in the common genotyping platforms may contribute substantially to the genetic variations of the diseases. Next-generation sequencing, which would better allow for the analysis of rare variants, is now becoming available and affordable; however, the presence of a large number of rare variants challenges the statistical endeavor to stably identify these disease-causing genetic variants. We conduct a genome-wide association study of Genetic Analysis Workshop 17 case-control data produced by the next-generation sequencing technique and propose that collapsing rare variants within each genetic region through a super...
    We propose a simple approach, the multiplicative background correction, to solve a perplexing problem in spotted microarray data analysis: correcting the foreground intensities for the background noise, especially for spots with genes that are weakly expressed or not at all. The conventional approach, the additive background correction, directly subtracts the background intensities from foreground intensities. When the foreground intensities marginally dominate the background intensities, the additive background correction provides unreliable estimates of the differential gene expression levels and usually presents M–A plots with ‘fishtails’ or fans. Unreliable additive background correction makes it preferable to ignore the background noise, which may increase the number of false positives. Based on the more realistic multiplicative assumption instead of the conventional additive assumption, we propose to logarithmically transform the intensity readings before the background correc...
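    The contrast between the two corrections can be sketched as follows. The abstract is truncated before it gives the exact estimator, so this sketch assumes that the multiplicative correction subtracts the log background from the log foreground in each channel; the function names and the intensity values are hypothetical.

```python
import math


def additive_log_ratio(red_fg, red_bg, green_fg, green_bg):
    """Conventional correction: subtract background, then take the log ratio.

    Returns None when either corrected intensity is not positive, which is
    the case for spots whose foreground barely exceeds the background.
    """
    red = red_fg - red_bg
    green = green_fg - green_bg
    if red <= 0 or green <= 0:
        return None
    return math.log2(red / green)


def multiplicative_log_ratio(red_fg, red_bg, green_fg, green_bg):
    """Assumed multiplicative form: correct on the log scale.

    Each channel is corrected as log2(foreground) - log2(background), so the
    result is defined whenever all four readings are positive.
    """
    red = math.log2(red_fg) - math.log2(red_bg)
    green = math.log2(green_fg) - math.log2(green_bg)
    return red - green


if __name__ == "__main__":
    weak_spot = (100.0, 110.0, 120.0, 100.0)  # red foreground below background
    print(additive_log_ratio(*weak_spot))        # undefined for this spot
    print(multiplicative_log_ratio(*weak_spot))  # still a finite log ratio
```

    For weakly expressed spots the additive form either fails or divides by a value close to zero, which is one way the fan-shaped M–A plots mentioned above can arise; the log-scale form stays finite.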