Gene-disease databases

The process of database compilation and curation.
	File:Databasecompilation.png Curated data may pass through a process that runs from practical experience and literature review to web publication of the database.

Pathway homogeneity for individual diseases
	File:Pathwayhomogeneity.png Illustration of the concept that diseases are associated with a wide variety of genes: mean pathway homogeneity values of single diseases and random controls are plotted for four networks, binned by the number of associated gene products per disease.

In bioinformatics, a gene-disease database is a systematized collection of data, typically structured to model aspects of reality, that is used to understand the underlying mechanisms of complex diseases by capturing the many composite interactions between phenotype-genotype relationships and gene-disease mechanisms^[1].

Introduction

Experts in different areas of biology have long tried to understand the molecular mechanisms of diseases in order to design preventive and therapeutic strategies. For some illnesses, it has become apparent that it is not enough to obtain a catalogue of the disease-related genes; one must also uncover how disruptions of molecular networks in the cell give rise to disease phenotypes^[2]. Moreover, even with the unprecedented wealth of information available, obtaining such a catalogue is extremely difficult.

Genetic illnesses are caused by abnormalities in genes or chromosomes. Many genetic diseases are conditions present from before birth. The association of a single gene with a disease is rare, and a genetic disease may or may not be a heritable disorder^[3]. Some genetic diseases are passed down through the parents' genes, but others are caused by new mutations or changes to the DNA. In other cases, the same illness, for instance some forms of carcinoma or melanoma, may stem from an inherited condition in some people, from new mutations in other people, and from non-genetic causes in still other individuals^[4].

There are more than six thousand known single-gene (monogenic) disorders, which occur in about 1 out of every 200 births. As the term suggests, these diseases are caused by a mutation in one gene. By contrast, polygenic disorders are caused by several genes, often in combination with environmental factors^[5]. Examples of diseases with a genetic component include Alzheimer's disease, breast cancer, leukemia, Down syndrome, heart defects, and deafness; cataloguing is therefore needed to sort out all the diseases related to genes.

Concerns

One of the main concerns in biological and biomedical research is to recognise the underlying mechanisms behind these intricate genetic phenotypes. Great effort has been spent on finding the genes related to diseases^[6].

However, increasing evidence indicates that most human diseases cannot be attributed to a single gene but arise from complex interactions among multiple genetic variants and environmental risk factors. Several databases have been developed to store associations between genes and diseases, such as the Comparative Toxicogenomics Database (CTD), Online Mendelian Inheritance in Man (OMIM®), the Genetic Association Database (GAD) and DisGeNET. Each of these databases focuses on different aspects of the phenotype-genotype relationship and, owing to the nature of the curation process, none is complete; to a large extent, however, they complement one another^[8]. Integrative gene-disease databases combine human gene-disease associations from various expert-curated databases with associations derived by text mining, covering Mendelian, complex and environmental diseases^[9].
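The complementarity described above can be illustrated with a minimal sketch in Python. It assumes that each source can be reduced to (gene, disease, source) triples; the gene and disease names are hypothetical placeholders, not records exported from CTD, OMIM or GAD, and the function names are invented for this illustration.

```python
from collections import defaultdict

# Hypothetical toy records: (gene, disease, source database).
# These are placeholders for illustration, not real database content.
records = [
    ("GENE_A", "disease X", "CTD"),
    ("GENE_A", "disease X", "OMIM"),
    ("GENE_B", "disease X", "GAD"),
    ("GENE_B", "disease Y", "GAD"),
    ("GENE_C", "disease Y", "OMIM"),
]


def integrate(records):
    """Group records by (gene, disease) and collect the sources reporting each pair."""
    merged = defaultdict(set)
    for gene, disease, source in records:
        merged[(gene, disease)].add(source)
    return dict(merged)


def unique_to_source(merged, source):
    """Return the pairs that are reported by `source` and by no other source."""
    return sorted(pair for pair, sources in merged.items() if sources == {source})


merged = integrate(records)
for pair, sources in sorted(merged.items()):
    print(pair, sorted(sources))

# Pairs that would be missed if GAD were left out of the integration.
print(unique_to_source(merged, "GAD"))
```

Pairs reported by a single source are the ones that an analysis restricted to the other databases would miss, which is one way of reading the claim that the sources complement one another.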

Types of databases

Essentially, four types of databases are used to create a gene-disease compendium: curated databases, predicted databases, literature databases and integrative databases^[10].

Curated databases

The term curated data refers to information, which may comprise the most sophisticated computational formats for structured data, scientific updates, and curated knowledge, that has been compiled and prepared under the supervision of one or more experts considered qualified to engage in such an activity^[11]. The implication is that the resulting database is of high quality. The contrast is with data that may have been gathered through some automated process, or with little or no expert oversight, and that may therefore be of low quality and possibly untrustworthy^[12]. Two of the most common examples are CTD and UniProt.

CTD

The Comparative Toxicogenomics Database (CTD) promotes understanding of the effects of environmental compounds on human health by integrating data from curated scientific literature to describe biochemical interactions with genes and proteins, and links between diseases and chemicals and between diseases and genes or proteins^[13]. CTD contains curated data defining cross-species chemical–gene/protein interactions and chemical–disease and gene–disease associations, which illuminate the molecular mechanisms underlying variable susceptibility and environmentally influenced diseases. These data offer insights into complex chemical–gene and protein interaction networks. One of the main sources for this database is curated information from OMIM^[14].

UniProt

The Universal Protein Resource (UniProt) is a comprehensive, high-quality and freely accessible resource for protein sequence and annotation data, with many entries derived from genome sequencing projects. It contains a large amount of information about the biological function of proteins derived from the research literature, which can point to a direct connection between gene, protein and disease^[15].

Predicted databases

A predicted database is one based on statistical inference. One particular approach to such inference is known as predictive inference, but prediction can be undertaken within any of the several approaches to statistical inference. Indeed, one description of biostatistics is that it provides a means of transferring knowledge about a sample of a genetic population to the whole population (genomics), and to other related genes or genomes, which is not necessarily the same as prediction over time^[16]. When information is transferred across time, often to specific points in time, the process is known as forecasting. Four of the main examples of databases that can be placed in this category are the Mouse Genome Database (MGD), the Rat Genome Database (RGD), OMIM and Ensembl.

The Mouse Genome Database (MGD)

The Mouse Genome Database (MGD) is the international community resource for integrated genetic, genomic and biological data about the laboratory mouse. MGD provides full annotation of phenotypes and human disease associations for mouse models (genotypes) using terms from the Mammalian Phenotype Ontology and disease names from OMIM®^[17].

RGD

The Rat Genome Database (RGD) is a collaborative effort between leading research institutions involved in rat genetic and genomic research. The rat continues to be used extensively by researchers as a model organism for investigating the biology and pathophysiology of disease. In the past several years, there has been a rapid increase in rat genetic and genomic data^[18]. This explosion of information highlighted the need for a centralized database to collect, manage, and distribute a rat-centric view of these data efficiently to researchers around the world. The Rat Genome Database was created to serve as a repository of rat genetic and genomic data, as well as mapping, strain, and physiological information. It also supports investigators' research efforts by providing tools to search, mine, and make predictions from these data^[19].

OMIM

Supported by the NCBI, Online Mendelian Inheritance in Man (OMIM) is a database that catalogues all known diseases with a genetic component, predicts their relationship to relevant genes in the human genome, and provides references for further research and tools for genomic analysis of a catalogued gene^[20]. OMIM is a comprehensive, authoritative compendium of human genes and genetic phenotypes that is freely available and updated daily. The database has been used as a resource for predicting information relevant to inherited conditions^[21].

Ensembl

Ensembl is one of the largest resources available for genomic and genetic studies. It provides a centralized resource for geneticists, molecular biologists and other researchers studying the genomes of our own species, other vertebrates and model organisms. Ensembl is one of several well-known genome browsers for retrieving genomic and disease information. It imports variation data from a variety of sources and predicts the effects of variants^[22]. For each variation that is mapped to the reference genome, every Ensembl transcript that overlaps the variation is identified. A rule-based approach is then used to predict the effect that each allele of the variation may have on the transcript. A set of consequence terms, defined by the Sequence Ontology (SO), can be assigned to each combination of an allele and a transcript, so each allele of each variation may have a different effect in different transcripts. A variety of tools are used to predict the effects of human mutations in the Ensembl database. One of the most widely used is SIFT, which predicts whether an amino acid substitution is likely to affect protein function based on sequence homology and the physicochemical similarity between the alternative amino acids. The data provided for each amino acid substitution are a score and a qualitative prediction (either 'tolerated' or 'deleterious'). The score is the normalized probability that the amino acid change is tolerated, so scores nearer 0 are more likely to be deleterious. The qualitative prediction is derived from this score: substitutions with a score < 0.05 are called 'deleterious' and all others are called 'tolerated'^[23].
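The thresholding rule described for SIFT can be written down as a short Python sketch. Only the cutoff of 0.05 and the two labels come from the text above; the function name and the range check are assumptions made for this illustration, and the code does not call SIFT or any Ensembl interface.

```python
SIFT_CUTOFF = 0.05  # cutoff stated in the text: scores below it are 'deleterious'


def sift_prediction(score: float) -> str:
    """Map a SIFT score to the qualitative prediction described in the text.

    The score is a normalized probability that the substitution is tolerated,
    so it is assumed here to lie between 0 and 1 inclusive.
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"SIFT score must be between 0 and 1, got {score}")
    return "deleterious" if score < SIFT_CUTOFF else "tolerated"


# Hypothetical scores, chosen only to show both sides of the cutoff.
for score in (0.0, 0.01, 0.05, 0.4, 1.0):
    print(score, sift_prediction(score))
```

Note that a score of exactly 0.05 falls on the 'tolerated' side, because the rule in the text uses a strict inequality.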

Literature databases

Databases of this type summarize books, articles, book reviews, dissertations, and annotations concerning gene-disease associations. Good examples of this type are GAD, LHGDN and BeFree Data.

GAD

The Genetic Association Database (GAD) is an archive of human genetic association studies of complex diseases. GAD is primarily focused on archiving information on common complex human diseases rather than rare Mendelian disorders as found in OMIM. It includes curated summary data extracted from papers published in peer-reviewed journals on candidate gene studies and genome-wide association studies (GWAS)^[24].

LHGDN

The literature-derived human gene-disease network (LHGDN) is a text-mining-derived database focused on extracting and classifying gene-disease associations with respect to several biomolecular conditions. It uses a machine-learning-based algorithm to extract semantic gene-disease relations from a textual source of interest. It is part of the Linked Life Data of LMU in Munich, Germany^[25].

BeFree Data

BeFree Data comprises gene-disease associations extracted from MEDLINE abstracts using the BeFree system. BeFree is composed of a biomedical Named Entity Recognition (BioNER) module, which detects diseases and genes, and a relation extraction module based on morphosyntactic information^[26].
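The two-stage idea (first recognise entities, then relate them) can be conveyed with a deliberately simplified Python sketch. It is not the BeFree system: dictionary lookup stands in for the BioNER module, and sentence-level co-occurrence stands in for the morphosyntactic relation extraction. The vocabularies, the example text and all function names are assumptions made for this illustration.

```python
import re

# Tiny hand-made vocabularies standing in for a real named-entity recogniser.
GENES = {"BRCA1", "APP"}
DISEASES = {"breast cancer", "Alzheimer's disease"}


def split_sentences(text):
    """Split text into sentences at whitespace that follows '.', '!' or '?'."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def find_terms(sentence, vocabulary):
    """Return the vocabulary terms that occur as whole words in the sentence."""
    lowered = sentence.lower()
    return sorted(
        term
        for term in vocabulary
        if re.search(r"\b" + re.escape(term.lower()) + r"\b", lowered)
    )


def cooccurring_pairs(text):
    """Pair every gene with every disease mentioned in the same sentence."""
    pairs = []
    for sentence in split_sentences(text):
        for gene in find_terms(sentence, GENES):
            for disease in find_terms(sentence, DISEASES):
                pairs.append((gene, disease))
    return pairs


# Made-up example text, not a real MEDLINE abstract.
abstract = (
    "Mutations in BRCA1 are associated with breast cancer. "
    "APP was sequenced in all samples. "
    "Alzheimer's disease was not assessed."
)
print(cooccurring_pairs(abstract))
```

Co-occurrence alone cannot tell whether a sentence asserts, negates or merely mentions a relation, which is the gap that a dedicated relation extraction module is meant to close.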

Integrative databases

Databases of this type include Mendelian, complex and environmental diseases in an integrated gene-disease association archive and show that the concept of modularity applies to all of them. They provide a functional analysis of disease-related modules that can yield important new biological insights, which might not be discovered when considering each gene-disease association repository independently. Hence, they present a suitable framework for studying how genetic and environmental factors, such as drugs, contribute to diseases. The best example of this type of database is DisGeNET^[27].

DisGeNET

DisGeNET is a comprehensive gene-disease association database that integrates associations from several sources covering different biomedical aspects of diseases. In particular, it focuses on the current knowledge of human genetic diseases, including Mendelian, complex and environmental diseases. To assess the concept of modularity of human diseases, the database has been used for a systematic study of the emergent properties of human gene-disease networks by means of network topology and functional annotation analysis^[28]. The results indicate a highly shared genetic origin of human diseases and show that functional modules exist for most diseases, including Mendelian, complex and environmental diseases. Moreover, a core set of biological pathways is found to be associated with most human diseases. Similar results are obtained when studying clusters of diseases, and the findings suggest that related diseases might arise due to dysfunction of common biological processes in the cell. The network analysis of this integrated database indicates that data integration is needed to obtain a comprehensive view of the genetic landscape of human diseases and that a genetic origin of complex diseases is much more common than expected^[29].
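The notion of a shared genetic origin can be made concrete with a small Python sketch of a disease-disease projection: two diseases are linked when at least one gene is associated with both. This is only an illustration of the general network idea under that assumption; the associations are hypothetical placeholders, and the code does not reproduce the topology or functional annotation analyses cited above.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical gene-disease associations (placeholders, not DisGeNET content).
associations = [
    ("GENE_A", "disease X"),
    ("GENE_A", "disease Y"),
    ("GENE_B", "disease Y"),
    ("GENE_B", "disease Z"),
    ("GENE_C", "disease Z"),
]


def disease_projection(associations):
    """Link each pair of diseases that shares at least one associated gene.

    Returns a dict mapping (disease_1, disease_2) to the set of shared genes,
    with the two disease names in alphabetical order.
    """
    genes_by_disease = defaultdict(set)
    for gene, disease in associations:
        genes_by_disease[disease].add(gene)

    edges = {}
    for first, second in combinations(sorted(genes_by_disease), 2):
        shared = genes_by_disease[first] & genes_by_disease[second]
        if shared:
            edges[(first, second)] = shared
    return edges


edges = disease_projection(associations)
for (first, second), shared in edges.items():
    print(first, "<->", second, "via", sorted(shared))
```

In this toy input, disease X and disease Z share no gene directly but are both connected to disease Y, the kind of indirect link that becomes visible only when associations from all sources are placed in one network.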

Remarks about the future of gene-disease databases

The completion of the human genome has changed the way the search for disease genes is performed. In the past, the approach was to focus on one or a few genes at a time. Now, projects like DisGeNET exemplify the efforts to systematically analyze all the gene alterations involved in a single disease or in multiple diseases^[30]. The next step is to produce a complete picture of the mechanistic aspects of diseases and to design drugs against them. For that, a combination of two approaches will be needed: a systematic search and an in-depth study of each gene. The future of the field will be defined by new techniques to integrate large bodies of data from different sources and to incorporate functional information into the analysis of large-scale data. The response of bioinformatics to new experimental techniques brings a new perspective to the analysis of experimental data, as demonstrated by the advances in the analysis of information from gene-disease databases and other technologies. This trend is expected to continue, with novel approaches responding to new techniques such as next-generation sequencing. For instance, the availability of large numbers of individual human genomes will promote the development of computational analyses of rare variants, including the statistical mining of their relations to lifestyles, drug interactions and other factors^[31]. Biomedical research will also be driven by our ability to efficiently mine the large body of existing and continuously generated biomedical data. Text-mining techniques, in particular when combined with other molecular data, can provide information about gene mutations and interactions and will become crucial to keeping pace with the exponential growth of data generated in biomedical research. Another field that is benefiting from the advances in mining and integration of molecular, clinical and drug analysis is pharmacogenomics.
In silico studies of the relationships between human variations and their effects on diseases will be key to the development of personalized medicine^[32]. In summary, gene-disease databases have already transformed the search for disease genes and have the potential to become a crucial component of other areas of medical research^[33].

References


[1] A. Bauer-Mehren, "Gene-Disease Network Analysis Reveals Functional Modules in Mendelian, Complex and Environmental Diseases," PLOS One, pp. 1-3, 2011.

[2] American Medical Informatics Association, "American Medical Informatics Association Strategic Plan," August 2011. [Online]. Available: http://www.amia.org/inside/stratplan/. [Accessed 15 October 2014].

[3] B. H. Oti M, "The modular nature of genetic diseases," Clinical Genetics, vol. 71, no. 1, pp. 1-11, 2007.

[4] A. Davis and B. King, "The Comparative Toxicogenomics Database: update 2011," Nucleic Acids Research, vol. 39, no. 1, pp. 1067-1072, 2011.

[5] A. Davis and T. Wiegers, "Text Mining Effectively Scores and Ranks the Literature for Improving Chemical-Gene-Disease Curation at the Comparative Toxicogenomics Database," PLOS One, vol. 8, no. 4, pp. 1-29, 2013.

[6] A. Bauer-Mehren and M. Rautscha, "DisGeNET: a Cytoscape plugin to visualize, integrate, search and analyze gene–disease networks," Bioinformatics, vol. 26, no. 22, pp. 2924-2926, 2010.

[7] K. Brown, "Online Predicted Human Interaction Database," Bioinformatics, vol. 21, no. 9, pp. 2076-2082, 2005.

[8] I. Vogt, "Systematic analysis of gene properties influencing organ system phenotypes in mammalian perturbations," Bioinformatics, vol. 30, no. 21, pp. 3093-3100, 2014.

[9] R. N. Botstein D, "Discovering genotypes underlying human phenotypes: past successes for Mendelian disease, future approaches for complex disease," Nature Genetics, vol. 33, no. 1, pp. 228-237, 2003.

[10] A. Bauer-Mehren, "Gene-Disease Network Analysis Reveals Functional Modules in Mendelian, Complex and Environmental Diseases," PLOS One, pp. 1-3, 2011.

[11] P. Buneman, "Curated Databases," Bibliometrics, vol. 978, no. 1, pp. 152-162, 2008.

[12] P. Buneman, "Curated Databases," Bibliometrics, vol. 978, no. 1, pp. 152-162, 2008.

[13] C. Murphy and A. Davis, "Comparative Toxicogenomics Database: a knowledgebase and discovery tool for chemical–gene–disease networks," Bioinformatics, vol. 37, no. 1, pp. 786-792, 2009.

[14] C. Murphy and A. Davis, "Comparative Toxicogenomics Database: a knowledgebase and discovery tool for chemical–gene–disease networks," Bioinformatics, vol. 37, no. 1, pp. 786-792, 2009.

[15] The UniProt Consortium, "The Universal Protein Resource (UniProt)," Nucleic Acids Research, vol. 36, no. 1, pp. 190-195, 2008.

[16] S. Hunter and P. Jones, "InterPro in 2011: new developments in the family and domain prediction database," Nucleic Acids Research, vol. 10, no. 1, pp. 12-22, 2011.

[17] C. Bult and J. Eppig, "The Mouse Genome Database (MGD): mouse biology and model systems," Nucleic Acids Research, vol. 36, no. 1, pp. 724-728, 2007.

[18] M. Dwinell, E. Worthey and S. M, "The Rat Genome Database 2009: variation, ontologies and pathways," Nucleic Acids Research, vol. 37, no. 1, pp. 744-749, 2009.

[19] M. Dwinell, E. Worthey and S. M, "The Rat Genome Database 2009: variation, ontologies and pathways," Nucleic Acids Research, vol. 37, no. 1, pp. 744-749, 2009.

[20] A. Hamosh, "Online Mendelian Inheritance in Man (OMIM), a knowledgebase of human genes and genetic disorders," Nucleic Acids Research, vol. 33, no. 1, pp. 514-517, 2005.

[21] A. Hamosh, "Online Mendelian Inheritance in Man (OMIM), a knowledgebase of human genes and genetic disorders," Nucleic Acids Research, vol. 33, no. 1, pp. 514-517, 2005.

[22] P. Flicek and M. Ridwan, "Ensembl 2012," Nucleic Acids Research, vol. 40, no. 1, pp. 84-90, 2012.

[23] P. Flicek and M. Ridwan, "Ensembl 2012," Nucleic Acids Research, vol. 40, no. 1, pp. 84-90, 2012.

[24] K. Becker and K. Barnes, "The Genetic Association Database," Nature Genetics, vol. 36, no. 1, pp. 431-432, 2004.

[25] A. Bauer-Mehren, "Gene-Disease Network Analysis Reveals Functional Modules in Mendelian, Complex and Environmental Diseases," PLOS One, pp. 1-3, 2011.

[26] B. A, "Extraction of relations between genes and diseases from text and large-scale data analysis: implications for translational research," Pathology, vol. 1, no. 1, p. 10, 2014.

[27] A. Bauer-Mehren and M. Rautscha, "DisGeNET: a Cytoscape plugin to visualize, integrate, search and analyze gene–disease networks," Bioinformatics, vol. 26, no. 22, pp. 2924-2926, 2010.

[28] N. Queralt, "Publishing DisGeNET as Nanopublications," bioRxiv, vol. 12, no. 1, pp. 10-14, 2014.

[29] A. Bauer-Mehren, "Gene-Disease Network Analysis Reveals Functional Modules in Mendelian, Complex and Environmental Diseases," PLOS One, pp. 1-3, 2011.

[30] B. H. Oti M, "The modular nature of genetic diseases," Clinical Genetics, vol. 71, no. 1, pp. 1-11, 2007.

[31] American Medical Informatics Association, "American Medical Informatics Association Strategic Plan," August 2011. [Online]. Available: http://www.amia.org/inside/stratplan/. [Accessed 15 October 2014].

[32] B. A, "Extraction of relations between genes and diseases from text and large-scale data analysis: implications for translational research," Pathology, vol. 1, no. 1, p. 10, 2014.

[33] A. Bauer-Mehren, "Gene-Disease Network Analysis Reveals Functional Modules in Mendelian, Complex and Environmental Diseases," PLOS One, pp. 1-3, 2011.


Razhiel
