Generating Structural Data Analysis
: 06 Computational Biology
Principal Investigator: Dr. Vibha Dhawan, Distinguished Fellow and Sr. Director
The Energy and Resources Institute (TERI), New Delhi
Computational Biology
Biotechnology
Generating Structural Data & Analysis
Description of Module
Subject Name Biotechnology
Module Id 01
Pre-requisites
Objectives
Keywords
P06-M01: Databases in Biology
What are the major types of databases used in biology for query, search and analysis? How
are they generated?
Biological databases contain information on the sequences and structures of macromolecules
such as DNA, proteins and carbohydrates, and of small molecules, together with textual
descriptions, context-based descriptions, pathways, cellular localization, citations and so on.
A primary database, by definition, contains raw data taken directly from experiments, as
GenBank does, whereas a secondary database contains data extracted and curated from many
primary databases; the two kinds can be linked. A more recent concept is metadata, i.e.
information about the data, which is very useful when analyzing data from different databases
that differ in origin, in being structured or unstructured, and in accuracy. In the biological
sciences, data gathering is not just the collection of information: experiments are designed to
be specific and efficient, so the data need to be collected in an organized manner. Different
types of data are collected; some are structured, like a genome sequence, while others are
unstructured, like fluorescence images of cells.
Sequence databases actually came up in a fragmented manner: as sequencing was done in
different places around the world, the data were gathered, curated and organized by the
community of researchers who were their major users. Since 2000, however, a larger
community has evolved; use of the databases has expanded beyond the data generators, and
most users are now data analyzers. Hence it is worth discussing in detail a few databases that
host genome-associated data. One of these is NCBI RefSeq, which has been discussed in
detail, including a recent publication as a reference. UniProt is a similarly useful database, a
high-quality resource for protein sequence data with functional information. In these
databases, collating the data for a sequence and adding related information to it is called
annotation. Annotation is nowadays automated, so these databases are updated quite
frequently. A bias is observed in the collection, as more genomes have been sequenced for
prokaryotes than for eukaryotes. Some databases are enriched with functional annotation, such
as Swiss-Prot, TrEMBL and PIR, which are collections of information on protein sequences.
Data Format:
The most common file format for representing sequences is FASTA. It is a text-based format in
which a DNA sequence is written with the letters A, T, G and C and the amino acids of a protein by single
letter codes. The top line contains the description and is distinguished from the sequence
lines by a greater-than (">") symbol at the beginning. Sequence lines of at most 80 characters
are recommended. This makes program writing quite easy, so FASTA is very popular for
presenting DNA or protein sequences. GenBank uses a different format, whereas the BED
(Browser Extensible Data) format provides a flexible way to define the data lines that are
displayed in an annotation track. BED lines have three required fields and nine additional
optional fields. Track definition lines can be used to configure the display of the sequences;
the format is mostly used in association with chromosomal positions for genome sequences.
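The two formats described above can be read with very little code. The following is a minimal sketch in Python, not a full parser: the function names, the example records and the feature name are hypothetical, and it assumes well-formed input. The BED reader keeps the three required fields (chromosome, start, end) and passes any optional fields through unchanged.

```python
from typing import Dict, Iterator, List, Tuple


def parse_fasta(text: str) -> Iterator[Tuple[str, str]]:
    """Yield (description, sequence) pairs from FASTA-formatted text."""
    header = None
    chunks: List[str] = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new description line closes the previous record, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:].strip(), []
        elif header is not None:
            # Sequence lines are joined; the 80-character wrap is only layout.
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def parse_bed_line(line: str) -> Dict[str, object]:
    """Split one BED data line into its required fields and optional extras."""
    fields = line.split()
    if len(fields) < 3:
        raise ValueError("a BED line needs at least chrom, start and end")
    return {
        "chrom": fields[0],
        "start": int(fields[1]),  # start position, counted from 0
        "end": int(fields[2]),    # end position, not included in the feature
        "optional": fields[3:],   # up to nine optional fields, left as text
    }


# Hypothetical example data, for illustration only.
example_fasta = """>seq1 hypothetical example record
ATGCATGC
GGCC
>seq2 another hypothetical record
MKV
"""
records = list(parse_fasta(example_fasta))
feature = parse_bed_line("chr1\t100\t200\texample_feature")
print(records)
print(feature)
```

Because the description line is the only line starting with ">", a record boundary can be detected without looking ahead, which is the reason the format is so easy to program against.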
of personnel expert in data acquisition, organization and analysis will be in great demand in
the future.
Structural Database:
A large amount of resources has also been generated by researchers elucidating the
three-dimensional structures of DNA/RNA, proteins and other macromolecules since the
myoglobin structure was published using X-ray crystallography. Until now the major data
generators have been X-ray and neutron crystallography, with a small contribution from the
NMR and modeling communities. In the next decade, however, the data explosion will be due
to electron microscopy, which delivers structures at almost the same order of magnitude of
resolution that X-ray crystallography used to produce; the challenge of handling such
enormous data is looming.
The repository historically named the Protein Data Bank (PDB) has also evolved to take care
of the deposition of three-dimensional structural data from X-ray and NMR. Over the last two
decades much reorganization has happened in this community, and three regions, USA,
Europe and Asia (Japan), now manage the repositories with extensive effort, namely
http://www.wwpdb.org , https://www.rcsb.org/ , https://www.ebi.ac.uk/pdbe and
https://www.pdbj.org/ . This database has developed many validation methods and graphical
displays of structures that are applied while accepting data, which sets it apart. It also
supports a large number of relevant tools for analysis of the deposited structures. One
example cited here is HIV protease, for which many 3D structures are available. It is relevant
to know the quality of these structures before mapping interactions at the active site, and the
tools provided by PDB help with this. It is interesting to note that one can use a colour bar
and apply one's own validation metric before selecting structures for further study. In
addition, PDB provides links to other sequence and functional databases to help researchers.
Some extracted, or secondary, databases are also important for those interested in proteins
and their ligand interactions, which are the main driving force for the drug design community.
A list of such databases is also included here. PDB not only contains protein structures; it
also lets researchers look into the folds available in proteins, characterize them, and study
their organization in higher-order functional arrangements. It has also been seen that the fold
space of proteins tends to saturate, which can motivate researchers to design new proteins
with different functions.
Many important weblinks have been provided for further use. To learn how to characterize the
fold space and its hierarchy, the reference papers along with a little exploration of suitable
links are suggested.
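The idea of screening structures by quality before further study, as described above for a protein with many deposited structures, can be sketched in Python. This is only an illustration under stated assumptions: it reads the resolution from the "REMARK 2 RESOLUTION" line of a legacy PDB-format file header, the entry names and header snippets are hypothetical placeholders rather than real PDB entries, and the 2.0 Angstrom cutoff is an arbitrary choice. The validation reports provided by PDB cover many more metrics than resolution alone.

```python
import re
from typing import Dict, List, Optional

# Matches lines such as "REMARK   2 RESOLUTION.    1.80 ANGSTROMS."
RESOLUTION_RE = re.compile(
    r"^REMARK +2 +RESOLUTION\. +([0-9]+(?:\.[0-9]+)?) +ANGSTROMS",
    re.MULTILINE,
)


def resolution_from_header(pdb_text: str) -> Optional[float]:
    """Return the reported resolution in Angstroms, or None if not given."""
    match = RESOLUTION_RE.search(pdb_text)
    return float(match.group(1)) if match else None


def select_by_resolution(headers: Dict[str, str], cutoff: float) -> List[str]:
    """Keep entries whose reported resolution is at or below the cutoff."""
    selected = []
    for entry_id, text in headers.items():
        resolution = resolution_from_header(text)
        if resolution is not None and resolution <= cutoff:
            selected.append(entry_id)
    return selected


# Hypothetical header snippets; names and values are placeholders.
example_headers = {
    "entry_a": "REMARK   2\nREMARK   2 RESOLUTION.    1.80 ANGSTROMS.\n",
    "entry_b": "REMARK   2\nREMARK   2 RESOLUTION.    3.10 ANGSTROMS.\n",
    "entry_c": "REMARK   2\nREMARK   2 RESOLUTION. NOT APPLICABLE.\n",
}
selected = select_by_resolution(example_headers, cutoff=2.0)
print(selected)
```

Entries without a numeric resolution, such as NMR structures, are skipped rather than rejected on a guessed value; a real selection would treat them with a separate criterion.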
Chemical Databases:
These databases play a very important role in understanding biological interactions, especially
in designing chemicals to interrupt, inhibit or modulate the biological interactions or
reactions that cause disease. There are several types of chemical databases, such as
literature-driven databases, chemical structure databases, databases derived from
crystallography and NMR spectra, reaction databases, etc. A recent addition is ChEMBL,
which provides not only the structures of chemicals but also their bioactivity; this makes it
especially valuable for new compound design. PubChem is another such database, which
integrates bioactivity from experimentally performed assays. Some other useful databases and
their availability are enclosed. The format of a chemical library is not easy to explain, as
more than 150 types of formats are in use. These can be converted into each other by a
program known as “babel” (http://openbabel.org/wiki/Main_Page). Software sold by different
vendors is also available, whereas Babel is open source and freely downloadable. As with
sequence databases, a few formats dominate: the most used 3D format is MDL or SDF, and the
most used line notation is SMILES; both are shown as samples. The preparation of chemical
databases is described in detail along with their application, because most of them are
available in text format as SMILES strings, which may not be suitable for finding interactions
at the receptor site (DOCKING) or for ligand-based novel compound design
(PHARMACOPHORE). These topics will be taken up in detail in the coming modules.
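To make concrete what a SMILES string does and does not carry, here is a minimal Python sketch that counts the heavy (non-hydrogen) atoms in a SMILES string. It is a simplified tokenizer written for illustration, not a validating SMILES parser; the function name is hypothetical, and it assumes the input is already a valid SMILES string. The string lists atoms, bonds, branches and ring closures, but it holds no 3D coordinates, which is why a conversion step to a 3D format such as SDF is needed before docking.

```python
import re
from collections import Counter

# One atom token: a bracket atom, a two-letter halogen, an organic-subset
# atom, or a lowercase aromatic atom. Bonds, digits and parentheses are skipped.
ATOM_RE = re.compile(r"\[[^\]]+\]|Cl|Br|[BCNOPSFI]|[bcnops]")
# Inside brackets: optional isotope number, then the element symbol.
BRACKET_SYMBOL_RE = re.compile(r"\[\d*([A-Z][a-z]?|[a-z]{1,2})")


def heavy_atom_counts(smiles: str) -> Counter:
    """Count non-hydrogen atoms per element in a SMILES string."""
    counts: Counter = Counter()
    for token in ATOM_RE.findall(smiles):
        if token.startswith("["):
            match = BRACKET_SYMBOL_RE.match(token)
            if match is None:
                continue
            symbol = match.group(1)
        else:
            symbol = token
        # Aromatic atoms are written in lowercase; normalise to the element.
        symbol = symbol.capitalize()
        if symbol != "H":
            counts[symbol] += 1
    return counts


ethanol = heavy_atom_counts("CCO")
aspirin = heavy_atom_counts("CC(=O)Oc1ccccc1C(=O)O")
print(dict(ethanol))
print(dict(aspirin))
```

Implicit hydrogens are not written in the string at all, so they are not counted here; a cheminformatics toolkit would add them from valence rules when building a full structure.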
Summary:
In summary, I have discussed the different kinds of databases used in biological research,
namely sequence databases, structural databases and chemical databases. They are stored in
different kinds of formats. The quality of the information we extract from the databases
depends on the method by which the data were generated, the accuracy of the data and the
coverage of the data. One must remember that many of the weblinks have been updated since
my talk and may need to be updated, but the citations in the literature will help to do so. This
field is changing rapidly, and so are the data. Here I have only discussed the developments of
the last 30 years. Most of the databases are organized in simple SQL; however, a newly
evolving concept called the graph database will be the future of biological databases, so that
connectivity between different databases can be easily established and functional relations
can be used for this.
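The connectivity idea behind a graph database can be sketched in plain Python, without any database engine. In this illustration each record is a node identified by a (database, identifier) pair and each cross-reference is an edge; a breadth-first search then finds how two records in different databases are linked. All database names and identifiers below are hypothetical placeholders, and a real graph database would offer its own storage and query language.

```python
from collections import deque
from typing import Dict, List, Optional, Set, Tuple

Node = Tuple[str, str]  # (database name, record identifier)


def add_link(graph: Dict[Node, Set[Node]], a: Node, b: Node) -> None:
    """Record an undirected cross-reference between two records."""
    graph.setdefault(a, set()).add(b)
    graph.setdefault(b, set()).add(a)


def find_path(
    graph: Dict[Node, Set[Node]], start: Node, goal: Node
) -> Optional[List[Node]]:
    """Breadth-first search for the shortest chain of cross-references."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for neighbour in sorted(graph.get(path[-1], ())):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(path + [neighbour])
    return None


# Hypothetical cross-references between placeholder records.
graph: Dict[Node, Set[Node]] = {}
add_link(graph, ("sequence_db", "GENE_X"), ("protein_db", "PROT_X"))
add_link(graph, ("protein_db", "PROT_X"), ("structure_db", "STRUCT_X"))
add_link(graph, ("structure_db", "STRUCT_X"), ("chemical_db", "LIGAND_X"))

path = find_path(graph, ("sequence_db", "GENE_X"), ("chemical_db", "LIGAND_X"))
print(path)
```

In a relational layout the same question would need one join per hop between tables, whereas in a graph layout the hops are simply edges to follow.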