Lecture 5 Information Retrieval From Databases
Information retrieval from biological databases & sequence formats
by
Dr. Aditya Kumar Padhi, Ph.D.
• Entrez
• FASTA
• Conclusion
Types of biological databases
• Sequence databases contain many errors, mostly because good-quality sequencing techniques were lacking in earlier times.
• Some of these errors cause frameshifts that make whole-gene identification difficult or protein
translation impossible.
• These errors can also be passed on to other databases, so that the errors propagate.
• The second major drawback is the high level of redundancy in the primary sequence
databases.
• Causes of redundancy include:
(1) repeated submission of identical or overlapping sequences by the same or different authors,
(2) revision of annotations,
(3) poor database management that fails to detect redundancy.
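The frameshift problem mentioned above can be illustrated with a small sketch. It assumes the standard genetic code and a made-up 15-base sequence; the function name `translate` and the example sequences are illustrative only. A single erroneous extra base changes every codon downstream of it.

```python
from itertools import product

# Standard genetic code, built in the conventional T, C, A, G codon order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate DNA in frame 1 until a stop codon or the end of the sequence."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[i:i + 3]]
        if amino_acid == "*":  # stop codon ends translation
            break
        protein.append(amino_acid)
    return "".join(protein)


# Made-up example sequence: ATG GCC AAG TGG TAA
original = "ATGGCCAAGTGGTAA"
# Simulated sequencing error: one extra "A" inserted after the fourth base.
with_error = original[:4] + "A" + original[4:]

print(translate(original))    # MAKW
print(translate(with_error))  # MDQVV (the stop codon is lost as well)
```

Every residue after the insertion differs, and the stop codon is no longer read in frame, which is why such errors make gene identification and translation unreliable.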
Steps taken to eliminate redundancy
By NCBI:
• The NCBI has created a non-redundant database, called RefSeq, in which identical sequences
from the same organism and associated sequence fragments are merged into a single entry.
• Protein sequences derived from the same DNA sequences are explicitly linked as related entries.
• Sequence variants from the same organism with very minor differences, which may well be caused
by sequencing errors, are treated as distinctly related entries.
By SWISS-PROT:
• The SWISS-PROT database also has minimal redundancy for protein sequences compared to most
other databases.
• If conflicts exist between various sequencing reports, they are indicated in the feature table of the
corresponding SWISS-PROT entry.
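The merging idea described above can be sketched as follows. This is not the procedure NCBI actually uses for RefSeq; it is a minimal illustration that assumes each record is an (accession, organism, sequence) tuple, with hypothetical accession numbers, and that "identical" means the same sequence ignoring letter case.

```python
from collections import defaultdict


def merge_redundant(records):
    """Merge records with the same organism and identical sequence into one entry."""
    grouped = defaultdict(list)
    for accession, organism, sequence in records:
        # Identical sequences from the same organism share one key.
        grouped[(organism, sequence.upper())].append(accession)
    return [
        {"organism": organism, "sequence": sequence, "accessions": sorted(accessions)}
        for (organism, sequence), accessions in grouped.items()
    ]


# Hypothetical submissions: the first two are the same human sequence.
records = [
    ("ACC001", "Homo sapiens", "ATGGCC"),
    ("ACC002", "Homo sapiens", "atggcc"),
    ("ACC003", "Mus musculus", "ATGGCC"),
]

for entry in merge_redundant(records):
    print(entry["organism"], entry["accessions"])
```

The mouse record stays separate even though its sequence is identical, because merging is restricted to sequences from the same organism.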
Steps taken to eliminate erroneous annotations
• Often, the same gene sequence is found under different names resulting in multiple entries.
• Conversely, unrelated genes bearing the same name are found in the databases.
• To alleviate the naming problem, genes and proteins are being reannotated using a common,
controlled vocabulary to describe each gene or protein.
• The goal is to provide a consistent and unambiguous naming system for all genes and proteins. A
prominent example of such a system is Gene Ontology.
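A controlled vocabulary can be pictured as a lookup from every name in circulation to one approved name. The sketch below is only a toy synonym table with hypothetical gene names; Gene Ontology itself is a much richer system of structured terms, and nothing here reflects its actual data or interface.

```python
# Hypothetical controlled vocabulary: each approved name maps to the
# synonyms under which the same gene has been submitted.
APPROVED_NAMES = {
    "geneA": ["gene-a", "GA1", "alpha factor"],
    "geneB": ["gene-b", "GB7"],
}


def build_lookup(vocabulary):
    """Map every approved name and synonym (lower-cased) to the approved name."""
    lookup = {}
    for approved, synonyms in vocabulary.items():
        for name in [approved, *synonyms]:
            lookup[name.lower()] = approved
    return lookup


LOOKUP = build_lookup(APPROVED_NAMES)


def normalize(name):
    """Return the approved name for a submitted name, or None if it is unknown."""
    return LOOKUP.get(name.strip().lower())


print(normalize("GA1"))           # geneA
print(normalize("Alpha Factor"))  # geneA
print(normalize("unknown"))       # None
```

Names that map to nothing are returned as None rather than guessed, so they can be flagged for manual curation.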
• Errors in an annotation can be particularly damaging because the large majority of new sequences are
assigned functions based on similarity with sequences in the databases that are already annotated.
• Some of these errors can be corrected at the informatics level by studying the protein domains and
families. However, others eventually have to be corrected using experimental work.
Information retrieval from databases
• The most popular retrieval systems for biological databases are
(1) Entrez
(2) Sequence Retrieval System (SRS)
• These provide access to multiple databases for the retrieval of integrated search results.
• To search effectively, join a series of keywords using the Boolean operators AND, OR, and NOT to indicate the relationships between them.
o AND means that the search result must contain both words;
o OR means that the search result may contain either or both words;
o NOT excludes results that contain the word following it.
• Most search engines of public biological databases use some form of this Boolean logic.
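The Boolean operators above correspond directly to set operations on the records that match each keyword. The following sketch assumes a tiny made-up collection of record descriptions; the record IDs and the helper name `matching` are illustrative.

```python
# Made-up records: ID -> description text.
RECORDS = {
    "R1": "human insulin gene",
    "R2": "mouse insulin gene",
    "R3": "human hemoglobin gene",
}


def matching(keyword):
    """Return the set of record IDs whose description contains the keyword."""
    return {
        record_id
        for record_id, text in RECORDS.items()
        if keyword.lower() in text.lower().split()
    }


# AND: both words must be present (set intersection).
print(sorted(matching("human") & matching("insulin")))  # ['R1']
# OR: either word may be present (set union).
print(sorted(matching("human") | matching("insulin")))  # ['R1', 'R2', 'R3']
# NOT: exclude records containing the second word (set difference).
print(sorted(matching("human") - matching("insulin")))  # ['R3']
```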
Entrez
• The NCBI developed and maintains Entrez, a biological database retrieval system.
• It is a gateway that allows text-based searches of a wide variety of data, including annotated
genetic sequence information, structural information, and literature.
• The key feature of Entrez is its ability to integrate information, which comes from cross-
referencing between NCBI databases based on preexisting and logical relationships between
individual entries.
• For example,
o on a nucleotide sequence page, one can find cross-referencing links to the
• Entrez provides several options for refining a search:
1) Limits: This helps to restrict the search to a subset of a particular database, for example a
particular field (e.g., the field for author or publication date) or a particular type of data
(e.g., chloroplast DNA/RNA).
2) Preview/Index: This connects different searches with the Boolean operators and uses a
string of logically connected keywords to perform a new search. The search can also be
limited to a particular search field (e.g., gene name or accession number).
3) History: This provides a record of the previous searches so that the user can review,
revise, or combine the results of earlier searches.
4) Clipboard: This stores search results for later viewing for a limited time. To store
information in the Clipboard, the “Send to Clipboard” function should be used.
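Field-restricted, Boolean-connected searches like those described above can also be written as a single query string. The sketch below builds a request URL for NCBI's E-utilities `esearch` service, which is a programmatic route into Entrez and is not covered in this lecture. The helper name `build_esearch_url` and the example search terms are assumptions for illustration; the code only constructs the URL and does not contact the server.

```python
from urllib.parse import urlencode

# Base address of the E-utilities search service.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(db, field_terms, retmax=20):
    """Build an esearch URL from (field, value) pairs joined with AND."""
    # Each term is limited to one search field, e.g. insulin[Gene Name].
    term = " AND ".join(f"{value}[{field}]" for field, value in field_terms)
    query = urlencode({"db": db, "term": term, "retmax": retmax})
    return f"{ESEARCH_URL}?{query}"


url = build_esearch_url(
    "nucleotide",
    [("Organism", "Homo sapiens"), ("Gene Name", "insulin")],
    retmax=5,
)
print(url)
```

The same pattern extends to OR and NOT by changing the joining operator; any real use should follow NCBI's usage guidelines for request rates.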
PubMed
• One of the databases accessible from Entrez is a biomedical literature database known as PubMed.
FASTA
• In addition to the GenBank format, there are many other sequence formats.
• FASTA is one of the simplest and the most popular sequence formats because it contains plain
sequence information that is readable by many bioinformatics analysis programs.
• It has a single definition line that begins with a right angle bracket (>) followed by a sequence name.
Extra information, such as a protein or nucleotide ID or comments, can follow the name, with the items
separated by a “|” symbol.
• The extra information is optional and is ignored by sequence analysis programs.
• The plain sequence, in standard one-letter symbols, starts on the second line. Each line of sequence data
is usually limited to 60-80 characters.
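The FASTA layout described above is simple enough to read and write with a few lines of code. The sketch below is a minimal reader and writer, not a replacement for an established library; the function names and the example header `seq1|ID0001|example comment` are hypothetical.

```python
def parse_fasta(text):
    """Return a list of (definition_line, sequence) pairs from FASTA text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new definition line closes the previous record.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def format_fasta(header, sequence, width=60):
    """Write one record with the sequence wrapped at `width` characters per line."""
    lines = [f">{header}"]
    lines += [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return "\n".join(lines) + "\n"


# Hypothetical record: name, ID, and comment separated by "|".
text = format_fasta("seq1|ID0001|example comment", "ACGT" * 40)
header, sequence = parse_fasta(text)[0]
name = header.split("|")[0]
print(name, len(sequence))  # seq1 160
```

A 160-base sequence wrapped at 60 characters gives three sequence lines (60, 60, and 40 bases) under the definition line.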
Alternate sequence formats
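Abstract Syntax Notation One (ASN.1):
• A data description language that NCBI uses to store and exchange sequence records.
Format conversion:
• A conversion program such as Readseq recognizes sequences in
almost any format and writes a
new file in an alternative format.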
Sequence retrieval system
• The search results contain the query sequence and sequence annotation as
well as links to literature, metabolic pathways, and other biological
databases.
Conclusion
• Various solutions exist to correct annotations and reduce redundancy, for example, merging
redundant sequences into a single entry or storing highly redundant sequences in a
separate database.
• NCBI databases accessible through Entrez are among the most integrated databases.