Tree- and linear-shaped cell differentiation trajectories have been widely observed in developmental biology and can also be inferred through computational methods from single-cell RNA-sequencing datasets. However, trajectories with complicated topologies such as loops, disparate lineages and bifurcating hierarchies remain difficult to infer accurately. Here, we introduce a density-based trajectory inference method capable of constructing diverse shapes of topological patterns, including the most intriguing bifurcations. The novelty of our method lies in two steps: one exploits overlapping probability distributions to identify transition states of cells and thereby determine connectability between cell clusters, and the other infers a stable trajectory through base-topology-guided iterative fitting. Our method precisely reconstructed various benchmark reference trajectories. As a case study to demonstrate practical usefulness, our method was tested on single-cell RNA sequencing profiles of bl...
Motivation: A maximal match between two genomes is a contiguous, non-extendable sub-sequence common to the two genomes. DNA bases mutate very often from the genome of one individual to another. When a mutation occurs in a maximal match, it breaks the maximal match into shorter match segments. The coding cost of using these broken segments for reference-based genome compression is much higher than that of using a maximal match that is allowed to contain mutations. Results: We present memRGC, a novel reference-based genome compression algorithm that leverages mutation-containing matches (MCMs) for genome encoding. MemRGC detects maximal matches between two genomes using a coprime double-window k-mer sampling search scheme; the method then extends these matches to cover mismatches (mutations) and their neighbouring maximal matches to form long MCMs. Experiments reveal that memRGC boosts the compression performance by an average of 27% in reference-based genome compression. MemRGC is a...
Motivation: The 2009 H1N1 influenza A struck populations worldwide and caused substantial mortality. Epitope conservation in the HA1 protein allows antibodies to cross-neutralize the 1918 and 2009 H1N1 viruses. However, few studies have thoroughly examined the binding hot spots in the two antigen-antibody interfaces that are responsible for the antibody cross-neutralization. Results: We apply predictive methods to identify binding hot spots at the epitope sites of the HA1 proteins and at the paratope sites of the 2D1 antibody. We find that the six mutations at the HA1 epitopes from 1918 to 2009 do not contribute greatly to 2D1 antibody binding. However, the change in binding free energy on the whole tends to increase after these mutations, making the binding stronger. This is consistent with the observation that the 1918 H1N1 neutralizing antibody can cross-react with 2009 H1N1. We have identified five distinguished hot spot residues, including Lys166, which are ...
Motivation: Infection with strains of different subtypes and the subsequent crossover reading between the two strands of genomic RNAs by host cells’ reverse transcriptase are the main causes of the vast HIV-1 sequence diversity. Such inter-subtype genomic recombinants can become circulating recombinant forms (CRFs) after widespread transmission in a population. Complete prediction of all the subtype sources of a CRF strain is a complicated machine learning problem. It is also difficult to determine whether a strain is an emerging new subtype and, if so, how to accurately identify the new components of its genetic source. Results: We introduce a multi-label learning algorithm for the complete prediction of multiple sources of a CRF sequence as well as the prediction of its chronological number. The prediction is strengthened by voting among various multi-label learning methods to avoid biased decisions. In our steps, frequency and position features of the sequences are both extracted t...
RNA-binding proteins (RBPs) control RNA metabolism to orchestrate gene expression, and dysfunctional RBPs underlie many human diseases. Proteome-wide discovery efforts predict thousands of novel RBPs, many of which lack canonical RNA-binding domains. Here, we present a hybrid ensemble RBP classifier (HydRA) that leverages information from both intermolecular protein interactions and internal protein sequence patterns to predict RNA-binding capacity with unparalleled specificity and sensitivity using support vector machine, convolutional neural networks and transformer-based protein language models. HydRA enables Occlusion Mapping to robustly detect known RNA-binding domains and to predict hundreds of uncharacterized RNA-binding domains. Enhanced CLIP validation for a diverse collection of RBP candidates reveals genome-wide targets and confirms RNA-binding activity for HydRA-predicted domains. The HydRA computational framework accelerates construction of a comprehensive RBP catalogue...
Background: Current protein family modeling methods such as profile Hidden Markov Models (pHMM), k-mer based methods, and deep learning-based methods do not provide very accurate function prediction for proteins in the twilight zone, owing to their low sequence similarity to reference proteins with known functions. Results: We present a novel method, EnsembleFam, aimed at better function prediction for proteins in the twilight zone. EnsembleFam extracts the core characteristics of a protein family using similarity and dissimilarity features calculated from sequence homology relations. EnsembleFam trains three separate Support Vector Machine (SVM) classifiers for each family using these features, and an ensemble prediction is made to classify novel proteins into these families. Extensive experiments are conducted using the Clusters of Orthologous Groups (COG) dataset and the G Protein-Coupled Receptor (GPCR) dataset. EnsembleFam not only outperforms state-of-the-art methods on the overall dataset ...
A correction has been published and is appended to both the HTML and PDF versions of this paper. The error has been fixed in the paper.
Reactive astrogliosis is a critical process in neuropathological conditions and neurotrauma. Although it has been suggested that it confers neuroprotective effects, the exact genomic mechanism has not been explored. The prevailing dogma that astrogliosis inhibits axonal regeneration has been challenged by recent findings in rodent models of spinal cord injury, which demonstrate its neuroprotective and axonal regeneration properties. We examined whether these neuroprotective and axonal regeneration potentials can be identified in human spinal cord reactive astrocytes in vitro. Here, reactive astrogliosis was induced with IL1β. Within 24 hours of IL1β induction, astrocytes acquired reactive characteristics. Transcriptome analysis of over 40,000 transcripts of genes and analysis with PFSnet subnetwork revealed upregulation of chemokines and axonal permissive factors including FGF2, BDNF, and NGF. In addition, most genes regulating axonal inhibitory molecules, including ROB...
Oil palm is the most productive oil crop in the world and accounts for 36% of world production. However, the molecular mechanisms of hybrid vigor (or heterosis) between Dura, Pisifera and their hybrid progeny Tenera have not yet been well understood. Here we compared the temporal and spatial compositions of lipids and transcriptomes for two oil-yielding organs, mesocarp and endosperm, from Dura, Pisifera and Tenera. Multiple lipid biosynthesis pathways are highly enriched in all non-additive expression patterns in endosperm, while cytokinin biosynthesis and cell cycle pathways are highly enriched in both endosperm and mesocarp. Compared with parental palms, the high oil content in Tenera was associated with much higher transcript levels of EgWRI1, a homolog of Arabidopsis thaliana WRINKLED1. Among 338 identified genes in lipid synthesis, 207 (61%) were found to contain the WRI1-specific binding AW motif. We further functionally identified EgWRI1-1, one of three EgWRI1 orthologs...
In the past decade, "Big Science" such as the Genome Project has generated an enormous amount of data in the life sciences. Concurrently, the synergy of this project with existing research has quickened the pace of biological discovery. But the major drawback that is beginning to be felt worldwide is the primitive level of organisation in the accumulated data. Without a proper framework or knowledge scaffold on which to hang and interconnect the various bits of data and information, the national knowledge-to-data ratio is declining rapidly. We are trying to offer a solution to this problem by providing a World Wide Web (WWW) interface to Biosoftware, and at the same time have developed a database integration tool that can query heterogeneous, geographically scattered and disparate databases simultaneously. In this report we discuss BioInformatics in general, with specific reference to the BioInformatics Centre (BIC) at the National University of Singapore.