Lecture 5 – Bioinformatics Databases
Dr. Athira B
Asst. Professor, CSE
IIIT Kottayam
Motivation
• A key concept in molecular biology is the flow of information:
DNA → RNA → Protein
• From a data point of view, each level yields its own omics data:
Genomics → Transcriptomics → Proteomics → Metabolomics
• This vast amount of data needs to be stored and organized so that it
can be accessed easily around the globe
Motivation-Human Genome Project
• A landmark global scientific effort whose signature goal was to
generate the first sequence of the human genome (covering almost all
human genes)
• Identified roughly 20,000 to 25,000 protein-coding genes, far fewer
than the 1,00,000 or so estimated before the project
• More than 3 billion base pairs were sequenced
• The goals were:
• Alert patients who are at risk of certain diseases
• Reliably predict the course of a disease
• Precise diagnosis and treatment
• Develop new treatments at the molecular level
• A milestone in biomedical research
• https://www.genome.gov/about-genomics/educational-resources/fact-sheets/human-genome-project
Motivation-Biological Big Data
• Advances in sequencing techniques have generated large amounts of
biological data
• As with humans, genetic data have also been generated for model
organisms:
• Yeast (Saccharomyces cerevisiae)
• Fruit fly (Drosophila melanogaster)
• Nematode worm (Caenorhabditis elegans)
• Western clawed frog (Xenopus tropicalis)
• Mouse (Mus musculus)
• Zebrafish (Danio rerio)
• How should these data be stored so that researchers can retrieve
them easily and efficiently?
Databases
• A database stores and organizes related data for easy retrieval
E.g.: your phone's contact book
• The most common form of database is the relational database (queried
with SQL)
• There are many other kinds of databases: column databases, graph
databases, etc.
• Biological databases store biological data and associated knowledge
• These knowledge bases are fundamental to the survival of science
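The contact-book example can be sketched as a tiny relational database. This is only an illustration using Python's built-in sqlite3 module; the table name, column names, and sample entries are made up for the sketch.

```python
import sqlite3


def build_contact_book(rows):
    """Create an in-memory relational table and load (name, phone) rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE contacts (name TEXT PRIMARY KEY, phone TEXT NOT NULL)"
    )
    conn.executemany("INSERT INTO contacts (name, phone) VALUES (?, ?)", rows)
    conn.commit()
    return conn


def find_phone(conn, name):
    """Return the phone number stored for `name`, or None if there is none."""
    row = conn.execute(
        "SELECT phone FROM contacts WHERE name = ?", (name,)
    ).fetchone()
    return row[0] if row else None


# Hypothetical sample entries, for illustration only.
conn = build_contact_book([("Asha", "555-0101"), ("Ravi", "555-0102")])
print(find_phone(conn, "Ravi"))  # prints 555-0102
```

The same pattern (a table of related records plus a query that retrieves one of them by key) is what larger relational databases do at scale.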
Biological Databases
• Store and handle the staggering volume of biological information
through the establishment and use of computer databases
• Current biological databases use all three types of database
structures: flat files, relational, and object-oriented
• Based on their contents, biological databases can be roughly divided
into three categories: primary databases, secondary databases, and
specialized databases.
Primary Databases
• Contain original biological data. They are archives of raw sequence or
structural data submitted by the scientific community
• Examples: GenBank, the European Molecular Biology Laboratory (EMBL)
database, the Protein Data Bank (PDB), and the DNA Data Bank of
Japan (DDBJ)
Secondary Databases
• Secondary databases contain computationally processed or manually
curated information, based on original information from primary
databases.
• Translated protein sequence databases containing functional
annotation belong to this category
• Example: SWISS-PROT
Specialized Databases
• Specialized databases normally serve a specific research community
or focus on a particular organism
• The content of these databases may be sequences or other types of
information
• Examples include FlyBase, WormBase, AceDB, Microarray Gene
Expression Database, and TAIR
Composite Databases
• Combine a variety of primary databases
• Provide one place to search across different primary databases
Information Retrieval from Biological Databases
• The most popular retrieval systems for biological databases are
Entrez and the Sequence Retrieval System (SRS)
• Queries join a series of keywords using Boolean operators such as
AND, OR, and NOT to indicate relationships between the keywords used
in a search
• Entrez is a biological database retrieval system developed by NCBI
• For a complex search, a user can combine several Boolean operators
in one query
• Online Mendelian Inheritance in Man (OMIM), a non-sequence-based
database of human disease genes and human genetic disorders, is
accessible from Entrez
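To make the Boolean search idea concrete, here is a minimal sketch that only builds a query string and does not contact any server. The helper names (term, all_of, any_of, exclude) are hypothetical, and the bracketed field tag follows the Entrez-style "keyword[Field]" convention; treat the exact tag and keywords as assumptions for illustration.

```python
def term(text, field=None):
    """Quote a multi-word keyword and optionally attach a [Field] tag."""
    quoted = f'"{text}"' if " " in text else text
    return f"{quoted}[{field}]" if field else quoted


def all_of(*terms):
    """Every term must match (AND)."""
    return "(" + " AND ".join(terms) + ")"


def any_of(*terms):
    """At least one term must match (OR)."""
    return "(" + " OR ".join(terms) + ")"


def exclude(query, unwanted):
    """Keep matches of `query` that do not match `unwanted` (NOT)."""
    return f"({query} NOT {unwanted})"


# Human records mentioning BRCA1 or BRCA2, but not pseudogenes.
query = exclude(
    all_of(term("Homo sapiens", "Organism"), any_of(term("BRCA1"), term("BRCA2"))),
    term("pseudogene"),
)
print(query)
# prints (("Homo sapiens"[Organism] AND (BRCA1 OR BRCA2)) NOT pseudogene)
```

The parentheses make the grouping explicit, so the meaning of the query does not depend on how a particular retrieval system orders AND, OR, and NOT.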
GenBank
• GenBank is the most complete collection of annotated nucleic acid
sequence data for almost every organism.
• The content includes genomic DNA, mRNA, cDNA, ESTs, high-throughput
raw sequence data, and sequence polymorphisms
• There is also a GenPept database for protein sequences
GenBank: Sequence Format
Header
• Gives the origin of the sequence, the identification of the organism,
and unique identifiers
• Locus: unique database identifier, followed on the same line by:
• Sequence length and molecule type (DNA or RNA)
• A three-letter division code, e.g. PLN for plant, BCT for bacteria…
• Definition: name of the sequence, name and source of the organism,
and whether the sequence is partial or complete
• Accession number: the number cited in publications
• Version number: identifies the current version, in case the sequence
is revised at a later stage
• Organism: source organism, given with the scientific name of the
species
• Reference: author and title information, contact information
Gene information
• Features: annotation information
• Source: length of the sequence and scientific name of the organism
• Gene: location of the gene in the nucleotide sequence and its name
• CDS: boundaries of the sequence that can be translated into amino
acids. For eukaryotic sequences, the locations of exons are also
given
DNA SEQUENCE
• ORIGIN: the sequence itself; the record ends with two forward slashes (“//”)
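The layout described above can be illustrated with a small parser. This is a simplified sketch, not a full GenBank reader: it splits each line on whitespace instead of using the fixed column positions, ignores continuation lines and the feature table, and runs on a made-up toy record (the locus name, accession, and sequence are hypothetical).

```python
# Made-up toy record laid out like a GenBank flat file (not real data).
TOY_RECORD = """\
LOCUS       TOYSEQ01                  24 bp    DNA     linear   PLN 01-JAN-2000
DEFINITION  Hypothetical toy gene, complete cds.
ACCESSION   TOY00001
VERSION     TOY00001.1
SOURCE      Arabidopsis thaliana
  ORGANISM  Arabidopsis thaliana
FEATURES             Location/Qualifiers
     source          1..24
     CDS             1..24
ORIGIN
        1 atggcttcta aaggtgaaga ataa
//
"""


def parse_genbank(text):
    """Extract a few header fields and the sequence from one record."""
    record = {"sequence": ""}
    in_origin = False
    for line in text.splitlines():
        if line.startswith("//"):  # two slashes end the record
            break
        if in_origin:
            # Drop the leading position number, then join the 10-base blocks.
            record["sequence"] += "".join(line.split()[1:])
            continue
        parts = line.split(None, 1)
        if not parts:
            continue
        keyword = parts[0]
        value = parts[1].strip() if len(parts) > 1 else ""
        if keyword == "LOCUS":
            fields = value.split()
            record["locus"] = fields[0]
            record["length"] = int(fields[1])
            record["molecule"] = fields[3]
            record["division"] = fields[-2]  # three-letter code, e.g. PLN
        elif keyword in ("DEFINITION", "ACCESSION", "VERSION", "ORGANISM"):
            record[keyword.lower()] = value
        elif keyword == "ORIGIN":
            in_origin = True
    return record


rec = parse_genbank(TOY_RECORD)
print(rec["locus"], rec["length"], rec["molecule"], rec["division"])
print(rec["sequence"])
```

For real records, an established parser such as the one in Biopython is a safer choice, since actual GenBank files contain multi-line fields and nested feature qualifiers that this sketch skips.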