Lecture 5 Information Retrieval From Databases

The document discusses various types of biological databases and information retrieval systems. It describes primary, secondary, and derived databases as well as different retrieval systems like Entrez and SRS. It also covers topics like sequence formats, conversion between formats, and addressing issues like redundancy and errors in databases.

Uploaded by

Veer khade

Information retrieval from biological databases & Sequence formats
Lecture-5
by Dr. Aditya Kumar Padhi, Ph.D.
Laboratory for Computational Biology & Biomolecular Design (LCBD),
School of Biochemical Engineering, IIT (BHU)
Contents
• Pitfalls of biological databases

• Information retrieval systems for biological data

• Major types of retrieval systems

• Entrez

• Sequence retrieval system (SRS)

• Alternative sequence formats

• FASTA

• Conversion of sequence formats

• Conclusion
Types of biological databases

Categories: primary databases, secondary databases, derived databases

• Nucleotide sequence databases: (1) NCBI GenBank, (2) DDBJ, (3) EMBL

• Protein sequence databases: (1) Swissprot, (2) PIR, (3) GenePept

• Protein structure databases: (1) PDB, (2) EBI-MSD, (3) MMDB

• Domain and motif databases: (1) Prosite, (2) Blocks, (3) COG

• Structure databases: (1) SCOPe, (2) CATH

• Gene expression databases: (1) GEO, (2) GXD, (3) MGED

• Metabolic pathway databases: (1) KEGG, (2) PathDB, (3) EMP

• Specialized databases: (1) TGI, (2) GSOB, (3) GPCRD
Pitfalls of biological databases
1) Overreliance

• One major drawback associated with biological databases is an overdependence on sequence information and related annotations, without understanding the reliability of the information.

• There are many errors in sequence databases (mostly due to a lack of good-quality sequencing techniques in earlier times).

• These types of errors can be passed on to other databases, causing the propagation of errors.
Pitfalls of biological databases
2) Redundancy

• The second major drawback is the high level of redundancy in the primary sequence databases.

• These errors can also be passed on to other databases, causing the propagation of errors.

• Annotations of genes can also occasionally be false or incomplete.

• Some of these errors cause frameshifts that make whole gene identification difficult or protein
translation impossible.

• There is tremendous duplication of information in the databases, for various reasons.

• Causes include:

(1) repeated submission of identical or overlapping sequences by the same or different authors,
(2) revision of annotations,
(3) poor database management that fails to detect redundancy.
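
The frameshift problem mentioned on this slide can be illustrated with a minimal sketch. The DNA string and the partial codon table below are made up for illustration (only the codons needed here are listed, using their standard genetic code assignments). A single inserted base changes every codon downstream of it:

```python
# Partial codon table: only the codons used in this toy example.
CODONS = {
    "ATG": "M", "GCC": "A", "AAG": "K", "TAA": "*",
    "GGC": "G", "CAA": "Q", "GTA": "V",
}

def translate(dna):
    """Translate DNA in frame 1 until a stop codon or the end of the sequence."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODONS.get(dna[i:i + 3], "X")  # X = codon not in toy table
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)

correct = "ATGGCCAAGTAA"                   # hypothetical error-free sequence
shifted = correct[:3] + "G" + correct[3:]  # one spurious base inserted

print(translate(correct))  # MAK
print(translate(shifted))  # MGQV (stop codon lost, all later residues changed)
```

In the shifted sequence the stop codon is no longer read in frame, which is why a single sequencing error of this kind can make the gene hard to identify.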
Steps taken to eliminate redundancy
By NCBI

• The NCBI has created a non-redundant database, called RefSeq, in which identical sequences
from the same organism and associated sequence fragments are merged into a single entry.

• Protein sequences derived from the same DNA sequences are explicitly linked as related entries.

• Sequence variants from the same organism with very minor differences, which may well be caused
by sequencing errors, are treated as distinctly related entries.

• This carefully curated database can be considered a secondary database.

By SWISS-PROT

• The SWISS-PROT database also has minimal redundancy for protein sequences compared to most
other databases.

• If conflicts exist between various sequencing reports, they are indicated in the feature table of the
corresponding SWISS-PROT entry.
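
The merging idea described for the NCBI non-redundant database can be sketched roughly as follows. This is a toy illustration and not the actual RefSeq procedure: the records, the function name `merge_redundant`, and the use of a plain substring test to detect fragments are all assumptions.

```python
def merge_redundant(records):
    """Collapse identical sequences and fragments from the same organism.

    records: list of (organism, sequence) tuples.
    Returns the entries that survive merging.
    """
    kept = []
    # Process longest sequences first so that fragments are always
    # compared against full-length entries that were already kept.
    for organism, seq in sorted(records, key=lambda r: len(r[1]), reverse=True):
        if any(org == organism and seq in full for org, full in kept):
            continue  # identical to, or a fragment of, a kept entry
        kept.append((organism, seq))
    return kept

records = [
    ("human", "MKTAYIAK"),
    ("human", "MKTAYIAK"),  # repeated submission
    ("human", "TAYI"),      # fragment of the first entry
    ("mouse", "MKTAYIAK"),  # same sequence, different organism
    ("human", "MKTAYIAR"),  # minor variant, kept as a distinct entry
]
print(merge_redundant(records))
```

Only exact duplicates and fragments from the same organism are merged here; the minor variant stays as its own entry, in line with the slide's statement that such variants are treated as distinct but related entries.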
Steps taken to eliminate erroneous
annotations
• Often, the same gene sequence is found under different names resulting in multiple entries.

• Conversely, unrelated genes bearing the same name are found in the databases.

• To alleviate the problem of naming genes, genes and proteins are being reannotated using a common, controlled vocabulary developed to describe a gene or protein.

• The goal is to provide a consistent and unambiguous naming system for all genes and proteins. A
prominent example of such a system is Gene Ontology.
Steps taken to eliminate erroneous
annotations

• Errors in an annotation can be particularly damaging because the large majority of new sequences are
assigned functions based on similarity with sequences in the databases that are already annotated.

• Some of these errors can be corrected at the informatics level by studying the protein domains and
families. However, others eventually have to be corrected using experimental work.
Information retrieval from databases
• The most popular retrieval systems for biological databases are

(1) Entrez

(2) Sequence Retrieval System (SRS)

• These provide access to multiple databases for the retrieval of integrated search results.

• Complex queries in a database often require the use of Boolean operators.

• Join a series of keywords using logical terms such as AND, OR, and NOT to indicate relationships.

o AND means that the search result must contain both words;

o OR means to search for results containing either word or both;

o NOT excludes results containing the word that follows it.

• Most search engines of public biological databases use some form of this Boolean logic.
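
The three operators map directly onto set operations. The sketch below uses a made-up keyword index (the keywords and record IDs are hypothetical) to show what each operator returns:

```python
# Toy inverted index: keyword -> set of record IDs (all values hypothetical).
index = {
    "kinase": {1, 2, 3},
    "human": {2, 3, 4},
    "mouse": {3, 5},
}

both = index["kinase"] & index["human"]     # kinase AND human
either = index["kinase"] | index["human"]   # kinase OR human
without = index["kinase"] - index["mouse"]  # kinase NOT mouse

print(sorted(both))     # [2, 3]
print(sorted(either))   # [1, 2, 3, 4]
print(sorted(without))  # [1, 2]
```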
Entrez
• The NCBI developed and maintains Entrez, a biological database retrieval system.

• It is a gateway that allows text-based searches for a wide variety of data, including annotated genetic sequence information, structural information, and literature.

• The key feature of Entrez is its ability to integrate information, which comes from cross-
referencing between NCBI databases based on preexisting and logical relationships between
individual entries.

• Users do not have to visit multiple databases located in disparate places.

• For example, on a nucleotide sequence page, one can find cross-referencing links to
o the translated protein sequence,
o genome mapping data,
o the related PubMed literature information, and
o protein structures if available.
Entrez
• The Entrez search engine has four main options/features to help narrow the search.

1) Limits: This helps to restrict the search to a subset of a particular database, for example to a particular field (e.g., author or publication date) or a particular type of data (e.g., chloroplast DNA/RNA).

2) Preview/Index: This connects different searches with the Boolean operators and uses a
string of logically connected keywords to perform a new search. The search can also be
limited to a particular search field (e.g., gene name or accession number).

3) History: This provides a record of the previous searches so that the user can review,
revise, or combine the results of earlier searches.

4) Clipboard: This stores search results for later viewing for a limited time. To store
information in the Clipboard, the “Send to Clipboard” function should be used.
PubMed
• One of the databases accessible from Entrez is a biomedical literature database known as PubMed.

• This contains abstracts and in some cases full-text articles from > 5,000 journals.

• For a complex search, a user can use the Boolean operators or a combination of the Limits and Preview/Index features.

• PubMed uses a list of tags for literature searches. The search terms can be specified by the tags, which are joined by Boolean operators.
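
As a rough illustration of tagged search terms joined by Boolean operators, the sketch below assembles a query string. The helper names `tagged` and `combine` and the author name are hypothetical; `[au]` (author) and `[ti]` (title) are PubMed field tags. The code only builds text and does not contact PubMed.

```python
def tagged(term, tag):
    """Attach a field tag to a search term, e.g. angiogenin[ti]."""
    return f"{term}[{tag}]"

def combine(operator, *clauses):
    """Join clauses with a Boolean operator and wrap them in parentheses."""
    return "(" + f" {operator} ".join(clauses) + ")"

title_part = combine("OR", tagged("angiogenin", "ti"), tagged("ribonuclease", "ti"))
query = combine("AND", tagged("Smith J", "au"), title_part)

print(query)  # (Smith J[au] AND (angiogenin[ti] OR ribonuclease[ti]))
```

The parentheses make the grouping explicit, so the OR is evaluated before the AND.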
Alternate sequence formats
FASTA (pronounced "fast-A", short for "FAST-All"):

• In addition to the GenBank format, there are many other sequence formats.

• FASTA is one of the simplest and the most popular sequence formats because it contains plain
sequence information that is readable by many bioinformatics analysis programs.

• It has a single definition line that begins with a greater-than sign (>, sometimes called a right angle bracket) followed by a sequence name. Extra information, such as a protein or nucleotide ID or comments, can follow, separated by a "|" symbol.

>1B1I_1|Chain A|HYDROLASE ANGIOGENIN|Homo sapiens (9606)
QDNSRYTHFLTQHYDAKPQGRDDRYCESIMRRRGLTSPCKDINTFIHGNKRSIKAICENKNG
NPHRENLRISKSSFQVTTCKLHGGSPWPPCQYRATAGFRNVVVACENGLPVHLDQSIFRRP

• The extra information is considered optional and is ignored by sequence analysis programs.

• The plain sequence in standard 1-letter symbols starts on the second line. Each line of sequence data is conventionally limited to 60-80 characters.
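
Because the format is this simple, a reader for it fits in a few lines. The following is a minimal sketch (the function name `parse_fasta` is an assumption, and real files are usually read with an established library) that parses the angiogenin record shown above:

```python
def parse_fasta(text):
    """Yield (definition_line, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []  # drop the leading ">"
        elif header is not None:
            chunks.append(line)  # sequence lines are concatenated
    if header is not None:
        yield header, "".join(chunks)

example = """>1B1I_1|Chain A|HYDROLASE ANGIOGENIN|Homo sapiens (9606)
QDNSRYTHFLTQHYDAKPQGRDDRYCESIMRRRGLTSPCKDINTFIHGNKRSIKAICENKNG
NPHRENLRISKSSFQVTTCKLHGGSPWPPCQYRATAGFRNVVVACENGLPVHLDQSIFRRP
"""

records = list(parse_fasta(example))
header, sequence = records[0]
fields = header.split("|")  # optional extra information
print(fields[0])            # 1B1I_1
print(len(sequence))
```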
Alternate sequence formats
Abstract Syntax Notation One (ASN.1):

• ASN.1 is a data mark-up language with a structure specifically designed for accessing relational databases.

• It describes sequences with each item of information in a sequence record separated by tags so that each sub-portion of the sequence record can be easily added to relational tables and later extracted.

• Though more difficult for people to read, this format makes it easy for computers to filter and parse the data.

• This format also facilitates the transmission and integration of data between databases.
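
The benefit of tagging every item can be sketched without real ASN.1 syntax. In the illustration below a nested Python dictionary stands in for a tagged record (the tag names and values are hypothetical and are not NCBI's actual ASN.1 schema), and each tagged item is flattened into a row that could be loaded into a relational table:

```python
def flatten(record, prefix=""):
    """Turn a nested tagged record into (tag_path, value) rows."""
    rows = []
    for tag, value in record.items():
        path = f"{prefix}.{tag}" if prefix else tag
        if isinstance(value, dict):
            rows.extend(flatten(value, path))  # descend into sub-records
        else:
            rows.append((path, value))
    return rows

# Hypothetical record; a dict is used here in place of ASN.1 text.
record = {
    "id": "TOY0001",
    "descr": {"title": "toy protein", "organism": "Homo sapiens"},
    "seq": {"length": 4, "data": "MKTA"},
}

for row in flatten(record):
    print(row)
```

Each row carries its full tag path, so any sub-portion of the record can be selected or extracted later without re-reading the whole entry.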
GenBank

[Figure: NCBI GenBank/GenPept format showing the three major components of a sequence file.]
Conversion of sequence formats

• In sequence analysis and phylogenetic analysis, there is a frequent need to convert between sequence formats.

• One popular computer program for sequence format conversion is "EMBOSS Seqret".

• It recognizes sequences in almost any format and writes a new file in an alternative format.
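
To show what such a conversion involves, here is a minimal sketch that turns a single GenBank-style flat-file record into FASTA. It is not how Seqret is implemented. The function name `genbank_to_fasta` and the toy record are assumptions, and the sketch handles only a one-line DEFINITION and one record.

```python
def genbank_to_fasta(text, width=60):
    """Convert one GenBank-style record to FASTA text (simplified)."""
    name, definition, residues = "unknown", "", []
    in_origin = False
    for line in text.splitlines():
        if line.startswith("LOCUS"):
            name = line.split()[1]
        elif line.startswith("DEFINITION"):
            definition = line[len("DEFINITION"):].strip()
        elif line.startswith("ORIGIN"):
            in_origin = True  # sequence block starts on the next line
        elif line.startswith("//"):
            break             # end-of-record marker
        elif in_origin:
            # Sequence lines look like "        1 acgtacgtac gtacgtacgt":
            # keep letters only, dropping position numbers and spaces.
            residues.append("".join(ch for ch in line if ch.isalpha()))
    seq = "".join(residues).upper()
    header = f">{name} {definition}".rstrip()
    body = [seq[i:i + width] for i in range(0, len(seq), width)]
    return "\n".join([header] + body) + "\n"

# Hypothetical toy record, not a real GenBank entry.
toy_record = """LOCUS       TOY0001   24 bp    DNA     linear   SYN 01-JAN-2000
DEFINITION  Hypothetical toy record.
ORIGIN
        1 acgtacgtac gtacgtacgt acgt
//
"""

print(genbank_to_fasta(toy_record))
```

The annotation in the header and feature sections is discarded in this direction, which is why converting from FASTA back to a richer format cannot restore it.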
Sequence retrieval system

• SRS is a retrieval system maintained by the EBI, which is comparable to NCBI Entrez.

• It is not as integrated as Entrez but allows the user to query multiple databases simultaneously.

• It also offers direct access to certain sequence analysis applications such as sequence similarity search and Clustal sequence alignment.

• The search results contain the query sequence and sequence annotation as well as links to literature, metabolic pathways, and other biological databases.
Conclusion
• Various solutions exist to correct annotations and reduce redundancy, for example, merging redundant sequences into a single entry or storing highly redundant sequences in a separate database.

• NCBI databases accessible through Entrez are among the most integrated databases.

• Effective information retrieval involves the use of Boolean operators.

• Entrez has additional user-friendly features to help conduct complex searches.

• One can use NCBI-specific field qualifiers to conduct searches.

• To retrieve sequence information from NCBI GenBank, an understanding of the format of GenBank sequence files is necessary.

• FASTA is the most widely used sequence format.

Thank you
