Bioinformatics
What is Bioinformatics?
NIH – definitions
What is Bioinformatics? - Research, development,
and application of computational tools and
approaches for expanding the use of biological,
medical, behavioral, and health data, including the
means to acquire, store, organize, archive, analyze,
or visualize such data.
NSF – mission statement
The present bottlenecks in bioinformatics include the education of
biologists in the use of advanced computing tools, the recruitment
of computer scientists into this evolving field, the limited
availability of developed databases of biological information, and
the need for more efficient and intelligent search engines for
complex databases.
Molecular Bioinformatics
Molecular Bioinformatics involves the use
of computational tools to discover new
information in complex data sets (from the
one-dimensional information of DNA through
the two-dimensional information of RNA and
the three-dimensional information of proteins,
to the four-dimensional information of
evolving living systems).
Bioinformatics (Oxford English Dictionary):
The field of science in which biology, computer science, and
information technology merge into a single discipline.
Biologists
collect molecular data:
DNA & Protein sequences,
gene expression, etc.
Bioinformaticians
Study biological questions by
analyzing molecular data
Computer scientists
(+Mathematicians, Statisticians, etc.)
Develop tools, software, and algorithms
to store and analyze the data.
Some biological background….
A biologist
The hereditary information of all living organisms, with the exception of some viruses,
is carried by deoxyribonucleic acid (DNA) molecules.
2 purines: adenine (A) and guanine (G). 2 pyrimidines: cytosine (C) and thymine (T).
Circular genome
Central dogma: DNA makes RNA makes Protein
Modified dogma: DNA makes DNA and RNA; RNA makes DNA, RNA, and Protein
Amino acids - The protein building blocks
Any region of the DNA sequence can, in principle, code for six different amino acid
sequences, because any one of three different reading frames can be used to
interpret each of the two strands.
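The six reading frames can be illustrated with a minimal Python sketch. It assumes the standard genetic code; the function names (`reverse_complement`, `translate`, `six_frames`) are illustrative choices, and the test fragment is a short stretch copied from the sequence shown on the whole-genome slide.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order ('*' marks a stop codon).
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the opposite strand, read 5' to 3'."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate complete codons only; unknown codons become 'X'."""
    seq = seq.upper()
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frames(seq):
    """Translate three frames on each of the two strands."""
    frames = {}
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(s[offset:])
    return frames


fragment = "ATGGTGCTGTCTCCTGCCGACAAGACC"
for name, protein in six_frames(fragment).items():
    print(name, protein)  # the +1 frame reads MVLSPADKT
```

Only one of the six translations is usually biologically meaningful; the others tend to be interrupted by stop codons.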
Protein folding
A human Hemoglobin:
What does it all look like on a computer monitor?
A cDNA sequence
A cDNA sequence (reading frame)
A protein sequence
And, a whole genome…
ACTCTTCTGGTCCCCACAGACTCAGAGAGAACCCACCATGGTGCTGTCTCCTGCCGACAAGACCAACGTCAAGGCCGCCTGG
GGTAAGGTCGGCGCGCACGCTGGCGAGTATGGTGCGGAGGCCCTGGAGAGGATGTTCCTGTCCTTCCCCACCACCAAGACCT
ACTTCCCGCACTTCGACCTGAGCCACGGCTCTGCCCAGGTTAAGGGCCACGGCAAGAAGGTGGCCGACGCGCTGACCAACGC
CGTGGCGCACGTGGACGACATGCCCAACGCGCTGTCCGCCCTGAGCGACCTGCACGCGCACAAGCTTCGGGTGGACCCGGTC
AACTTCAAGCTCCTAAGCCACTGCCTGCTGGTGACCCTGGCCGCCCACCTCCCCGCCGAGTTCACCCCTGCGGTGCACGCCT
CCCTGGACAAGTTCCTGGCTTCTGTGAGCACCGTGCTGACCTCCAAATACCGTTAAGCTGGAGCCTCGGTGGCCATGCTTCT
TGCCCCTTGGGCCTCCCCCCAGCCCCTCCTCCCCTTCCTGCACCCGTACCCCCGTGGTCTTTGAATAAAGTCTGAGTGGGCG
GCACTCTTCTGGTCCCCACAGACTCAGAGAGAACCCACCATGGTGCTGTCTCCTGCCGACAAGACCAACGTCAAGGCCGCCT
GGGGTAAGGTCGGCGCGCACGCTGGCGAGTATGGTGCGGAGGCCCTGGAGAGGATGTTCCTGTCCTTCCCCACCACCAAGAC
CTACTTCCCGCACTTCGACCTGAGCCACGGCTCTGCCCAGGTTAAGGGCCACGGCAAGAAGGTGGCCGACGCGCTGACCAAC
GCCGTGGCGCACGTGGACGACATGCCCAACGCGCTGTCCGCCCTGAGCGACCTGCACGCGCACAAGCTTCGGGTGGACCCGG
TCAACTTCAAGCTCCTAAGCCACTGCCTGCTGGTGACCCTGGCCGCCCACCTCCCCGCCGAGTTCACCCCTGCGGTGCACGC
CTCCCTGGACAAGTTCCTGGCTTCTGTGAGCACCGTGCTGACCTCCAAATACCGTTAAGCTGGAGCCTCGGTGGCCATGCTT
CTTGCCCCTTGGGCCTCCCCCCAGCCCCTCCTCCCCTTCCTGCACCCGTACCCCCGTGGTCTTTGAATAAAGTCTGAGTGGG
CGGCACTCTTCTGGTCCCCACAGACTCAGAGAGAACCCACCATGGTGCTGTCTCCTGCCGACAAGACCAACGTCAAGGCCGC
CTGGGGTAAGGTCGGCGCGCACGCTGGCGAGTATGGTGCGGAGGCCCTGGAGAGGATGTTCCTGTCCTTCCCCACCACCAAG
ACCTACTTCCCGCACTTCGACCTGAGCCACGGCTCTGCCCAGGTTAAGGGCCACGGCAAGAAGGTGGCCGACGCGCTGACCA
ACGCCGTGGCGCACGTGGACGACATGCCCAACGCGCTGTCCGCCCTGAGCGACCTGCACGCGCACAAGCTTCGGGTGGACCC
GGTCAACTTCAAGCTCCTAAGCCACTGCCTGCTGGTGACCCTGGCCGCCCACCTCCCCGCCGAGTTCACCCCTGCGGTGCAC
GCCTCCCTGGACAAGTTCCTGGCTTCTGTGAGCACCGTGCTGACCTCCAAATACCGTTAAGCTGGAGCCTCGGTGGCCATGC
TTCTTGCCCCTTGGGCCTCCCCCCAGCCCCTCCTCCCCTTCCTGCACCCGTACCCCCGTGGTCTTTGAATAAAGTCTGAGTG
GGCGGCGCCGTGGCGCACGTGGACGACATGCCCAACGCGCTGTCCGCCCTGAGCGACCTGCACGCGCACAAGCTTCGGGTGG
ACCCGGTCAACTTCAAGCTCCTAAGCCACTGCCTGCTGGTGACCCTGGCCGCCCACCTCCCCGCCGAGTTCACCCCTGCGGT
GCACGCCTCCCTGGACAAGTTCCTGGCTTCTGTGAGCACCGTGCTGACCTCCAAATACCGTTAAGCTGGAGCCTCGGTGGCC
ATGCTTCTTGCCCCTTGGGCCTCCCCCCAGCCCCTCCTCCCCTTCCTGCACCCGTACCCCCGTGGTCTTTGAATAAAGTCTG
AGTGGGCGGCACTCTTCTGGTCCCCACAGACTCAGAGAGAACCCACCATGGTGCTGTCTCCTGCCGACAAGACCAACGTCAA
GGCCGCCTGGGGTAAGGTCGGCGCGCACGCTGGCGAGTATGGTGCGGAGGCCCTGGAGAGGATGTTCCTGTCCTTCCCCACC
ACCAAGACCTACTTCCCGCACTTCGACCTGAGCCACGGCTCTGCCCAGGTTAAGGGCCACGGCAAGAAGGTGGCCG...
How big are whole genomes?
What do we actually do with bioinformatics?
Sequence assembly
(next generation sequencing)
Genome annotation
Molecular evolution
Origins and evolutionary genomics of the 2009 swine-origin H1N1 influenza A epidemic
Smith et al. (2009) Nature 459, 1122-1125
Analysis of gene expression
Toledo and Bardot (2009) Nature 460, 466-467
Protein structure prediction
Protein docking
Luscombe, Greenbaum, Gerstein (2001)
From DNA to Genome (timeline, 1955 to 1995)
• Watson and Crick DNA model
• Sanger sequences insulin protein
• Dayhoff’s Atlas
• Sequence alignment
• ARPANET (early Internet)
• PDB (Protein Data Bank)
• Sanger dideoxy DNA sequencing
• PCR (Polymerase Chain Reaction)
• GenBank database
• NCBI
• SWISS-PROT database
• FASTA
• Human Genome Initiative
• BLAST
• EBI
In 1965, Dayhoff gathered all the available
sequence data to create the first bioinformatic
database (Atlas of Protein Sequence and
Structure).
Complete Genomes
as of August 2011:
Eukaryotes 37
Prokaryotes 1708
Total 1745
What can we do with sequences and other types of molecular information?
Open reading frames
Functional sites
Annotation
Structure, function
CCTGACAAATTCGACGTGCGGCATTGCATGCAGACGTGCATG
CGTGCAAATAATCAATGTGGACTTTTCTGCGATTATGGAAGAA
CTTTGTTACGCGTTTTTGTCATGGCTTTGGTCCCGCTTTGTTC
AGAATGCTTTTAATAAGCGGGGTTACCGGTTTGGTTAGCGAGA
AGAGCCAGTAAAAGACGCAGTGACGGAGATGTCTGATG CAA
TAT GGA CAA TTG GTT TCT TCT CTG AAT ......
.............. TGAAAAACGTA
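Finding open reading frames in a raw sequence like the one above can be sketched as follows. This is a simplified illustration, not a gene finder: it scans only the forward strand, treats ATG as the only start codon, and uses a hypothetical helper name (`find_orfs`). The toy sequence is built from the codons shown on the slide (ATG CAA TAT GGA CAA TTG) followed by a stop codon that was added for the example.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=2):
    """Return (start, end, orf) tuples for ATG...stop runs on the forward strand.

    Positions are 0-based; `end` is exclusive and includes the stop codon.
    `min_codons` counts codons before the stop.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first start codon opens the ORF
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, seq[start:i + 3]))
                start = None  # look for the next ORF in this frame
    return orfs


# Two flanking bases, six sense codons from the slide, an added TGA stop, two trailing bases.
toy = "GGATGCAATATGGACAATTGTGAAA"
for start, end, orf in find_orfs(toy):
    print(start, end, orf)  # 2 23 ATGCAATATGGACAATTGTGA
```

A fuller version would also scan the reverse complement, giving all six frames.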
Annotated features: promoter, TF binding site, transcription start site
CCTGACAAATTCGACGTGCGGCATTGCATGCAGACGTGCATG
CGTGCAAATAATCAATGTGGACTTTTCTGCGATTATGGAAGAA
CTTTGTTACGCGTTTTTGTCATGGCTTTGGTCCCGCTTTGTTC
AGAATGCTTTTAATAAGCGGGGTTACCGGTTTGGTTAGCGAGA
AGAGCCAGTAAAAGACGCAGTGACGGAGATGTCTGATG CAA
TAT GGA CAA TTG GTT TCT TCT CTG AAT .................................
Comparative genomics
• Identifying orthologs
• Inferences on structure and function
• Comparing functional sites
• Inferences on regulatory networks
Similarity profiles
Xenopus CGSHLVEALYLVCGDRGFFYYPKIKRDIEQ
Bos CGSHLVEALYLVCGERGFFYTPKARREVEG
***************:***** ** :*::*
Xenopus AQVNGPQDNELDG-MQFQPQEYQKMKRGIV
Bos PQVG---ALELAGGPGAGGLEGPPQKRGIV
.**. ** * * *****
Xenopus EQCCHSTCSLFQLENYCN
Bos EQCCASVCSLYQLENYCN
**** *.***:*******
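The similarity between two aligned sequences is often summarized as percent identity. The sketch below computes it for the last block of the Xenopus/Bos alignment shown above; the function name and the choice to count every alignment column in the denominator are assumptions made for illustration, since several conventions exist.

```python
def percent_identity(a, b):
    """Return (matches, columns, percent) for two aligned, equal-length strings.

    Columns where both sequences have a gap ('-') are ignored;
    a residue aligned to a gap counts as a mismatch.
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    columns = [(x, y) for x, y in zip(a, b) if not (x == "-" and y == "-")]
    matches = sum(1 for x, y in columns if x == y and x != "-")
    return matches, len(columns), 100.0 * matches / len(columns)


xenopus = "EQCCHSTCSLFQLENYCN"
bos = "EQCCASVCSLYQLENYCN"
matches, columns, pct = percent_identity(xenopus, bos)
print(f"{matches}/{columns} identical ({pct:.1f}%)")  # 15/18 identical (83.3%)
```

The three mismatching columns are the ones not marked with an asterisk under the alignment.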
Ultraconserved Elements in the
Human Genome
Gill Bejerano, Michael Pheasant, Igor Makunin, Stuart
Stephen, W. James Kent, John S. Mattick, & David Haussler
(Science 2004. 304:1321-1325)
There are 481 segments longer than 200 base pairs (bp) that
are absolutely conserved (100% identity with no insertions or
deletions) between orthologous regions of the human, rat, and
mouse genomes. Nearly all of these segments are also
conserved in the chicken and dog genomes, with an average of
95 and 99% identity, respectively. Many are also significantly
conserved in fish. These ultraconserved elements of the human
genome are most often located either overlapping exons in
genes involved in RNA processing or in introns or nearby genes
involved in the regulation of transcription and development.
Junk is real!
Functional genomics
Genome-wide profiling of:
• mRNA levels
• Protein levels
Co-expression of genes and/or proteins
Identifying protein-protein interactions
Networks of interactions
Understanding the function of genes and other
parts of the genome
Structural genomics
Assign structure to all proteins encoded in a genome
Biological
databases
Database or databank?
Initially
• Databank (in UK)
• Database (in the USA)
Solution
• The abbreviation db
What is a Database?
Accession number: 1
First Name: Amos
Last Name: Bairoch
Course: Pottery 2000; Pottery 2001;
//
Accession number: 2
First Name: Dan
Last Name: Graur
Course: Pottery 2000; Pottery 2001; Ballet 2001; Ballet 2002;
//
Accession number: 3
First Name: John
Last Name: Travolta
Course: Ballet 2001; Ballet 2002;
//
• Easy to manage: all the entries are visible at the same time!
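A flat file of this kind is simple to read programmatically. The following sketch parses "Key: value" lines into one dictionary per entry, using "//" as the entry terminator as in the example above. The function name, the lower-casing of field names, and the splitting of the Course field on semicolons are choices made for this illustration.

```python
def parse_flat_file(text):
    """Parse 'Key: value' lines into a list of dicts; '//' ends an entry."""
    records, current = [], {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line == "//":
            if current:
                records.append(current)
            current = {}
            continue
        key, _, value = line.partition(":")
        current[key.strip().lower()] = value.strip()
    if current:  # tolerate a missing final terminator
        records.append(current)
    return records


def courses(record):
    """Split the semicolon-separated Course field into a clean list."""
    return [c.strip() for c in record.get("course", "").split(";") if c.strip()]


sample = """\
Accession number: 1
First Name: Amos
Last Name: Bairoch
Course: Pottery 2000; Pottery 2001;
//
Accession number: 3
First Name: John
Last Name: Travolta
Course: Ballet 2001; Ballet 2002;
//
"""

for rec in parse_flat_file(sample):
    print(rec["accession number"], rec["last name"], courses(rec))
```

Answering a question such as "who takes Ballet 2001?" requires scanning every entry, which is one motivation for the relational layout.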
Database: a « relational » example
Relational database (« table file »):
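The table shown on the original slide is not reproduced here, so the following is only a sketch of how the same people and courses could be stored relationally, using Python's built-in sqlite3 module. The table and column names (`person`, `course`, `enrollment`) are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.executescript("""
CREATE TABLE person (
    accession  INTEGER PRIMARY KEY,
    first_name TEXT,
    last_name  TEXT
);
CREATE TABLE course (
    course_id INTEGER PRIMARY KEY,
    title     TEXT UNIQUE
);
CREATE TABLE enrollment (
    accession INTEGER REFERENCES person(accession),
    course_id INTEGER REFERENCES course(course_id),
    PRIMARY KEY (accession, course_id)
);
""")

conn.executemany("INSERT INTO person VALUES (?, ?, ?)", [
    (1, "Amos", "Bairoch"), (2, "Dan", "Graur"), (3, "John", "Travolta"),
])
conn.executemany("INSERT INTO course VALUES (?, ?)", [
    (1, "Pottery 2000"), (2, "Pottery 2001"),
    (3, "Ballet 2001"), (4, "Ballet 2002"),
])
conn.executemany("INSERT INTO enrollment VALUES (?, ?)", [
    (1, 1), (1, 2),                    # Bairoch: both pottery courses
    (2, 1), (2, 2), (2, 3), (2, 4),    # Graur: all four courses
    (3, 3), (3, 4),                    # Travolta: both ballet courses
])

# Each course title is stored once; a join answers "who takes Ballet 2001?"
rows = conn.execute("""
    SELECT p.first_name, p.last_name
    FROM person p
    JOIN enrollment e ON e.accession = p.accession
    JOIN course c ON c.course_id = e.course_id
    WHERE c.title = ?
    ORDER BY p.last_name
""", ("Ballet 2001",)).fetchall()
print(rows)  # [('Dan', 'Graur'), ('John', 'Travolta')]
```

Compared with the flat file, each fact is stored in one place, and queries across entries no longer require reading every record.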
Some databases in the field of molecular biology…
EBI:
http://www.ebi.ac.uk/
DDBJ:
http://www.ddbj.nig.ac.jp/
Literature Databases:
Bookshelf: A collection of searchable biomedical books linked to
PubMed.
A search by subject: “mitochondrion evolution”
Type in a Query term
• Enter your search words in the
query box and hit the “Go” button
http://www.ncbi.nlm.nih.gov/entrez/query/static/help/helpdoc.html#Searching
The Syntax …
1. Boolean operators: AND, OR, NOT must be entered in
UPPERCASE (e.g., promoters OR response elements). The default
is AND.
3. Quotation marks: The term inside the quotation marks is read as one
phrase (e.g., “public health” differs from public health, which will
also include articles on public latrines and their effect on health
workers).
4. Asterisk: Extends the search to all terms that start with the letters
before the asterisk. For example, dia* will include such terms as
diaphragm, dial, and diameter.
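The effect of these rules can be illustrated on a tiny in-memory list of titles. This is a toy model of the matching semantics only; it does not reproduce how Entrez actually indexes or searches records, and the titles and helper names are invented for the example.

```python
import re


def words(text):
    """Lower-case word tokens of a text."""
    return re.findall(r"[a-z0-9]+", text.lower())


def has_term(text, term):
    return term.lower() in words(text)


def has_phrase(text, phrase):
    """True if the words of `phrase` occur consecutively in `text`."""
    w, p = words(text), words(phrase)
    return any(w[i:i + len(p)] == p for i in range(len(w) - len(p) + 1))


def has_prefix(text, prefix):
    """Models the asterisk: any word starting with `prefix`."""
    return any(word.startswith(prefix.lower()) for word in words(text))


titles = [
    "Public health policy in cities",
    "Public latrines and the health of workers",
    "Diaphragm function in mammals",
    "Dial-up networks",
]

# public health (implicit AND): both words anywhere in the title
both = [t for t in titles if has_term(t, "public") and has_term(t, "health")]
# "public health": the two words as one phrase
phrase = [t for t in titles if has_phrase(t, "public health")]
# dia*: any word beginning with "dia"
wildcard = [t for t in titles if has_prefix(t, "dia")]
# health NOT latrines
negated = [t for t in titles if has_term(t, "health") and not has_term(t, "latrines")]

print(len(both), len(phrase), len(wildcard), len(negated))  # 2 1 2 1
```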
Refine the Query
• Often a search finds too many (or too few) sequences, so you
can go back and try again with more (or fewer) keywords in
your query
• The “History” feature allows you to combine any of your past
queries.
• The “Limits” feature allows you to limit a query to specific
organisms, sequences submitted during a specific period of
time, etc.
• [Many other features are designed to search for literature in
MEDLINE]
Related Items
You can search for a text term in sequence annotations or in
MEDLINE abstracts, and find all articles, DNA, and protein
sequences that mention that term.
Then from any article or sequence, you can move to "related
articles" or "related sequences".
• Relationships between sequences are computed with BLAST
• Relationships between articles are computed with MeSH terms
(shared keywords)
• Relationships between DNA and protein sequences rely on accession
numbers
• Relationships between sequences and MEDLINE articles rely on both
shared keywords and the mention of accession numbers in the articles.
A search by authors: “Esser” [au] AND “martin” [au]
A search by title word: “Wolbachia pipientis” [ti]
Database Search Strategies
• General search principles - not limited
to sequence (or to biology).
• Start with broad keywords and narrow
the search using more specific terms.
• Try variants of spelling, numbers, etc.
• Search many databases.
• Be persistent!!
Searching PubMed
• Unstructured searches
– Automatic term mapping
• Structured searches
– Tags, e.g. [au], [ta], [dp], [ti]
– Boolean operators, e.g. AND, OR, NOT, ()
• Additional features
– Subsets, limits
– Clipboard, history
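Structured searches can also be composed programmatically. The sketch below builds tagged query strings and a search URL for NCBI's E-utilities `esearch` endpoint, which accepts `db`, `term`, and `retmax` parameters. The helper names are hypothetical, and the code only constructs the URL; it sends no request.

```python
from urllib.parse import urlencode, urlparse, parse_qs

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def tag(term, field):
    """Attach a PubMed field tag, e.g. tag('Esser', 'au') -> 'Esser[au]'."""
    return f"{term}[{field}]"


def phrase(text):
    """Wrap a multi-word term in quotation marks so it is read as one phrase."""
    return f'"{text}"'


def esearch_url(query, db="pubmed", retmax=20):
    """Build (but do not send) an esearch request URL."""
    return ESEARCH + "?" + urlencode({"db": db, "term": query, "retmax": retmax})


by_authors = tag("Esser", "au") + " AND " + tag("martin", "au")
by_title = tag(phrase("Wolbachia pipientis"), "ti")

print(by_authors)  # Esser[au] AND martin[au]
print(by_title)    # "Wolbachia pipientis"[ti]
print(esearch_url(by_authors))
```

`urlencode` takes care of escaping spaces, brackets, and quotation marks, so the query can be written exactly as it would be typed into the search box.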
Start working:
Search PubMed
1. cuban cigars
2. cuban OR cigars
3. “cuban cigars”
4. cuba* cigar*
5. (cuba* cigar*) NOT smok*
6. Fidel Castro
7. “fidel castro”
8. #6 NOT #7
“Details” and “History” in
PubMed
The OMIM (Online Mendelian
Inheritance in Man)
MIM Number Prefixes
* gene with known sequence
+ gene with known sequence and phenotype
# phenotype description, molecular basis known
% Mendelian phenotype or locus, molecular basis unknown
no prefix: other, mainly phenotypes with suspected Mendelian basis
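The prefix table translates directly into a small lookup. In this sketch the descriptions are taken from the list above, while the function name and the example MIM numbers are placeholders invented for illustration, not real OMIM entries.

```python
MIM_PREFIXES = {
    "*": "gene with known sequence",
    "+": "gene with known sequence and phenotype",
    "#": "phenotype description, molecular basis known",
    "%": "Mendelian phenotype or locus, molecular basis unknown",
}
NO_PREFIX = "other, mainly phenotypes with suspected Mendelian basis"


def describe_mim(entry):
    """Split an entry such as '#123456' into (number, meaning of its prefix)."""
    entry = entry.strip()
    if entry and entry[0] in MIM_PREFIXES:
        return entry[1:], MIM_PREFIXES[entry[0]]
    return entry, NO_PREFIX


# Placeholder numbers, not real OMIM records.
for example in ("*123456", "#123457", "123458"):
    print(describe_mim(example))
```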
Searching OMIM
• Search Fields
– Name of trait, e.g., hypertension
– Cytogenetic location, e.g., 1p31.6
– Inheritance, e.g., autosomal dominant
– Gene, e.g., coagulation factor VIII
OMIM search tags
Start working:
Search OMIM
Online Literature databases
Online Glossaries
Bioinformatics :
http://www.geocities.com/bioinformaticsweb/glossary.html
http://big.mcw.edu/
Genomics:
http://www.geocities.com/bioinformaticsweb/genomicglossary.html
Molecular Evolution:
http://workshop.molecularevolution.org/resources/glossary/
Biology dictionary:
http://www.biology-online.org/dictionary/satellite_cells
What is Google Scholar?
Use Google Scholar to find articles from a
wide variety of academic publishers,
professional societies, preprint repositories
and universities, as well as scholarly articles
available across the web.
Google Scholar orders your search results by how relevant they are to
your query, so the most useful references should appear at the top of
the page.
This relevance ranking takes into account the full text of each article,
the article's author, the publication in which the article appeared, and
how often it has been cited in scholarly literature.
What other DATA can we retrieve from the record?
5. Google Book Search
Start working:
6. Web of Science
http://apps.webofknowledge.com.ezproxy.lib.uh.edu/WOS_GeneralSearch_input.do?product=WOS&search_mode=GeneralSearch&SID=4FB7LbbLgDMhG9fDiLh&preferencesSaved=