Gene Finding and HMMs: 6.096 - Algorithms For Computational Biology - Lecture 7
Lecture 1 - Introduction
Lecture 2 - Hashing and BLAST
Lecture 3 - Combinatorial Motif Finding
Lecture 4 - Statistical Motif Finding
Lecture 5 - Sequence alignment and Dynamic Programming
Lecture 6 - RNA structure and Context Free Grammars
Lecture 7 - Gene finding and Hidden Markov Models
Challenges in Computational Biology
[Figure: overview map of challenges, including: genome assembly; comparative genomics (database lookup of short sequences such as TCATGCTAT, TCGTGATAA, TGAGGATAT, TTATCATAT, TTATGATTT); evolutionary theory; RNA folding (RNA transcript); gene expression analysis; cluster discovery (Gibbs sampling); protein network analysis]
Outline
• Computational model
– Simple Markov Models
– Hidden Markov Models
[Figure: a 4-state HMM with states A+, C+, G+, T+, transition probabilities aAT, aAC, aGT, aGC, …, and an emission table showing that each state emits only its own letter, e.g. A+ emits A with probability 1 and C, G, T with probability 0]
Output: only the emitted symbols are observable by the system, not the underlying random walk between states -> “hidden”
• Training set: a set of DNA sequences w/ known CpG islands
• Derive two Markov chain models:
– ‘+’ model: from the CpG islands
– ‘-’ model: from the remainder of sequence
[Figure: a 4-state Markov chain over A, C, G, T with transition probabilities aAT, aAC, aGT, aGC, …]
• Transition probabilities for each model:

a+st = c+st / Σt' c+st' , where c+st is the number of times letter t followed letter s inside the CpG islands

a-st = c-st / Σt' c-st' , where c-st is the number of times letter t followed letter s outside the CpG islands

‘+’ model:
+     A     C     G     T
A   .180  .274  .426  .120
C   .171  .368  .274  .188
G   .161  .339  .375  .125
T   .079  .355  .384  .182
(e.g. .274 is the probability of C following A inside a CpG island)
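As a concrete illustration of these estimators, here is a minimal Python sketch; the function name and the toy training fragments are made up for the example, and real input would be the labeled sequences described above.

from collections import defaultdict

def transition_probs(sequences):
    """Estimate a_st = c_st / sum_t' c_st', where c_st counts how often
    letter t immediately follows letter s in the training sequences."""
    counts = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        for s, t in zip(seq, seq[1:]):
            counts[s][t] += 1
    return {s: {t: c / sum(row.values()) for t, c in row.items()}
            for s, row in counts.items()}

# Toy usage (hypothetical fragments, not the lecture's training set):
plus_model = transition_probs(["CGCGCGGC", "GCGCCGCG"])    # from CpG islands
minus_model = transition_probs(["ATATTAAT", "TTATAATA"])   # from the remainder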
Using Markov Models for CpG classification
Q1: Given a short sequence x, does it come from a CpG island? (Yes/No question)
• To use these models for discrimination, calculate the log-odds ratio:

S(x) = log [ P(x | model+) / P(x | model-) ] = Σi=1..L log [ a+(xi-1, xi) / a-(xi-1, xi) ]
[Histogram: distribution of length-normalized S(x) scores, x-axis from -0.4 to 0.4; CpG islands fall mostly above 0, non-CpG regions mostly below 0]
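A minimal scoring sketch using the ‘+’ table from the slide; the ‘-’ table would be estimated the same way from non-island sequence (its values are not reproduced here), so MINUS below is a stand-in.

import math

# '+' transition table from the slide (row = from-letter, column = to-letter)
PLUS = {
    "A": {"A": .180, "C": .274, "G": .426, "T": .120},
    "C": {"A": .171, "C": .368, "G": .274, "T": .188},
    "G": {"A": .161, "C": .339, "G": .375, "T": .125},
    "T": {"A": .079, "C": .355, "G": .384, "T": .182},
}

def log_odds(x, plus, minus):
    """S(x) = sum over adjacent letter pairs of log(a+ / a-);
    S(x) > 0 suggests x comes from a CpG island."""
    return sum(math.log(plus[s][t] / minus[s][t]) for s, t in zip(x, x[1:]))

# log_odds("CGCG", PLUS, MINUS), with MINUS estimated from non-island DNA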
Using Markov Models for CpG classification
[Figure: 8-state HMM with ‘+’ states A+, C+, G+, T+ and ‘-’ states A-, C-, G-, T-; each state emits only its own letter, with probability 1]
• Emission probabilities are distinct for the ‘+’ and the ‘-’ states
– Infer the most likely set of states giving rise to the observed emissions
-> ‘Paint’ the sequence with + and - states
Finding most likely state path
[Trellis: for each position of the observed sequence CGCG, a column containing all eight states A+ … T+ and A- … T-, with start and end states; a path through the trellis picks one state per position]
• Known observations: CGCG
• Known sequence path: C+, G-, C-, G+
Probability of given path p & observations x
• Known observations: CGCG
• Known sequence path: C+, G-, C-, G+
[Figure: the path start -> C+ -> G- -> C- -> G+ -> end, with transitions a0,C+, aC+,G-, aG-,C-, aC-,G+, aG+,0]

P(x, p) = a0,C+ · eC+(C) · aC+,G- · eG-(G) · aG-,C- · eC-(C) · aC-,G+ · eG+(G) · aG+,0

Since every state emits its own letter with probability 1, this is simply the product of the transition probabilities along the path.
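A small sketch of this computation; the nested-dict representation of a (transitions) and e (emissions) and the begin/end labels are assumptions of the example, not notation from the lecture.

def path_probability(path, obs, a, e, begin="0", end=None):
    """P(x, p) = a_{0,p1} * prod_i [ e_{p_i}(x_i) * a_{p_i,p_{i+1}} ],
    times a_{pN,0} if the model has an explicit end state."""
    p = a[begin][path[0]]
    for i, (state, sym) in enumerate(zip(path, obs)):
        p *= e[state][sym]                 # emission factor (1 in the CpG HMM)
        if i + 1 < len(path):
            p *= a[state][path[i + 1]]     # transition to the next state
    if end is not None:
        p *= a[path[-1]][end]              # a_{G+,0} in the slide's example
    return p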
1. Evaluation
GIVEN
a HMM M, and a sequence x,
FIND
Prob[ x | M ]
2. Decoding
GIVEN
a HMM M, and a sequence x,
FIND
the sequence π of states that maximizes P[ x, π | M ]
3. Learning
GIVEN
a HMM M, with unspecified transition/emission probabilities, and a sequence x,
FIND
parameters θ = (ei(.), aij) that maximize P[ x | θ ]
Problem 1: Decoding
GIVEN x = x1x2……xN
FIND π* = argmaxπ P[ x, π ]
[Trellis: K states per position, across x1 x2 x3 … xN]
We can use dynamic programming!
Define Vk(i) = probability of the most likely sequence of states ending at state πi = k, given observations x1…xi.
What is Vl(i+1)?
From definition,
Vl(i+1) = el(xi+1) × maxk akl Vk(i)
Input: x = x1……xN
Initialization:
V0(0) = 1 (0 is the imaginary first position)
Vk(0) = 0, for all k > 0
Iteration:
Vj(i) = ej(xi) × maxk akj Vk(i-1)
Ptrj(i) = argmaxk akj Vk(i-1)
Termination:
P(x, π*) = maxk Vk(N)
Traceback:
πN* = argmaxk Vk(N)
πi-1* = Ptrπi*(i)
The Viterbi Algorithm
[DP matrix: rows = states 1…K, columns = positions x1 x2 x3 … xN; cell (j, i) stores Vj(i), filled left to right]
Time:
O(K2N)
Space:
O(KN)
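The following Python sketch implements this recurrence in log space (anticipating the practical detail on the next slide); the dict-based HMM representation and the begin-state label "0" are assumptions of the example.

import math

def viterbi(x, states, a, e, begin="0"):
    """Return (log P(x, pi*), pi*); assumes all needed a[k][j], e[j][sym]
    are nonzero (e.g. after adding pseudocounts)."""
    V = [{k: math.log(a[begin][k]) + math.log(e[k][x[0]]) for k in states}]
    ptr = []
    for i in range(1, len(x)):
        col, back = {}, {}
        for j in states:
            best = max(states, key=lambda k: V[-1][k] + math.log(a[k][j]))
            back[j] = best
            col[j] = V[-1][best] + math.log(a[best][j]) + math.log(e[j][x[i]])
        ptr.append(back)
        V.append(col)
    last = max(states, key=lambda k: V[-1][k])   # termination: argmax_k V_k(N)
    path = [last]
    for back in reversed(ptr):                   # traceback through the pointers
        path.append(back[path[-1]])
    return V[-1][last], path[::-1]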
Viterbi Algorithm – a practical detail
x = 123456123456…123456 626364656…1626364656
π = FFF…………………...F LLL………………………...L
For very long sequences the products of probabilities underflow; in practice the recurrence is computed with logarithms:
Vl(i) = log el(xi) + maxk [ Vk(i-1) + log akl ]
A couple of questions
Given a sequence x, say x = 12341623162616364616234161221341,
• What is the probability that x was generated by the model?
• What is the most likely state at a given position i?
We start by calculating the forward probability
fl(i) = P(x1…xi, πi = l)
Initialization:
f0(0) = 1
fk(0) = 0, for all k > 0
Iteration:
fl(i) = el(xi) Σk fk(i-1) akl
Termination:
P(x) = Σk fk(N) ak0
where ak0 is the probability that the terminating state is k (usually = a0k)
Relation between Forward and Viterbi
VITERBI
Initialization:  V0(0) = 1;  Vk(0) = 0, for all k > 0
Iteration:       Vj(i) = ej(xi) maxk Vk(i-1) akj
Termination:     P(x, π*) = maxk Vk(N)

FORWARD
Initialization:  f0(0) = 1;  fk(0) = 0, for all k > 0
Iteration:       fl(i) = el(xi) Σk fk(i-1) akl
Termination:     P(x) = Σk fk(N) ak0
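A sketch of the Forward recurrence under the same assumed dict representation as the Viterbi sketch above; probabilities are kept unscaled for clarity, so this underflows on long sequences (real implementations scale each column or work in log space).

def forward(x, states, a, e, begin="0", end=None):
    """P(x) = sum_k f_k(N) a_k0, with f_l(i) = e_l(x_i) sum_k f_k(i-1) a_kl."""
    f = {k: a[begin][k] * e[k][x[0]] for k in states}
    for sym in x[1:]:
        prev = f
        f = {l: e[l][sym] * sum(prev[k] * a[k][l] for k in states) for l in states}
    end = end or {k: 1.0 for k in states}   # a_k0; taken as 1 if no end state is modeled
    return sum(f[k] * end[k] for k in states)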
Motivation for the Backward Algorithm
We want to compute
P(πi = k | x),
the probability distribution of the i-th state, given x.
We start by computing
P(πi = k, x) = P(x1…xi, πi = k) P(xi+1…xN | πi = k) = fk(i) bk(i)
where the backward probability is defined as
bk(i) = P(xi+1…xN | πi = k)
= Σl el(xi+1) akl Σπi+2…πN P(xi+2, …, xN, πi+2, …, πN | πi+1 = l)
= Σl el(xi+1) akl bl(i+1)

Initialization: bk(N) = ak0, for all k
Iteration:      bk(i) = Σl el(xi+1) akl bl(i+1)
Termination:    P(x) = Σl a0l el(x1) bl(1)
What is the running time and space required for Forward and Backward?
Time: O(K2N)
Space: O(KN)
P(πi = k | x) = fk(i) bk(i) / P(x)
Posterior decoding: at each position choose the state k maximizing P(πi = k | x)
– The resulting path can differ from the Viterbi path π* – Why?
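The sketch below combines the Forward and Backward passes to compute these posteriors; it uses the same assumed representation as before and unscaled probabilities, so it is only meant for short sequences.

def posterior(x, states, a, e, begin="0", end=None):
    """For each position i, return the distribution P(pi_i = k | x) = f_k(i) b_k(i) / P(x)."""
    N = len(x)
    end = end or {k: 1.0 for k in states}
    # Forward pass: f[i][k] = f_k(i)
    f = [{k: a[begin][k] * e[k][x[0]] for k in states}]
    for i in range(1, N):
        f.append({l: e[l][x[i]] * sum(f[-1][k] * a[k][l] for k in states) for l in states})
    px = sum(f[-1][k] * end[k] for k in states)
    # Backward pass: b_k(N) = a_k0, then b_k(i) = sum_l e_l(x_{i+1}) a_kl b_l(i+1)
    b = [dict(end)]
    for i in range(N - 2, -1, -1):
        b.insert(0, {k: sum(e[l][x[i + 1]] * a[k][l] * b[0][l] for l in states)
                     for k in states})
    return [{k: f[i][k] * b[i][k] / px for k in states} for i in range(N)]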
Examples:
GIVEN: a genomic region x = x1…x1,000,000 where we have good (experimental) annotations of the CpG islands
GIVEN: the porcupine genome; we don't know how frequent the CpG islands are there, nor do we know their composition
GIVEN: 10,000 rolls of the casino player, but we don’t see when he
changes dice
Given x = x1…xN
for which the true π = π1…πN is known,
Define:
Akl = # of times the k->l transition occurs in π
Ek(b) = # of times state k in π emits b in x
The maximum-likelihood estimators are:
akl = Akl / Σl' Akl'        ek(b) = Ek(b) / Σc Ek(c)
Case 1. When the right answer is known
Drawback:
Given little data, there may be overfitting:
P(x|θ) is maximized, but θ is unreasonable
0 probabilities – VERY BAD
Example:
Given 10 casino rolls, we observe
x = 2, 1, 5, 6, 1, 2, 3, 6, 2, 3
π = F, F, F, F, F, F, F, F, F, F
Then:
aFF = 1; aFL = 0
eF(1) = eF(3) = eF(6) = .2;
eF(2) = .3; eF(4) = 0; eF(5) = .1
Pseudocounts
Solution for small training sets: add pseudocounts to the counts
Akl = (# times the k->l transition occurs in π) + rkl
Ek(b) = (# times state k emits b in x) + rk(b)
Reasonable pseudocounts encode prior belief (e.g. large rF(b) if we strongly believe the die is fair, smaller rL(b) for the loaded die); a sketch of these estimators follows below
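A compact Python sketch of the Case 1 estimators with pseudocounts; the function name, the uniform pseudocount arguments, and the dict layout are assumptions of the example.

from collections import Counter

def estimate_params(x, pi, states, alphabet, r_trans=1.0, r_emit=1.0):
    """a_kl = (A_kl + r) / sum_l' (A_kl' + r); e_k(b) = (E_k(b) + r) / sum_c (E_k(c) + r)."""
    A = Counter(zip(pi, pi[1:]))   # A_kl: observed k -> l transitions
    E = Counter(zip(pi, x))        # E_k(b): symbol b emitted while in state k
    a = {k: {l: (A[(k, l)] + r_trans) / sum(A[(k, m)] + r_trans for m in states)
             for l in states} for k in states}
    e = {k: {b: (E[(k, b)] + r_emit) / sum(E[(k, c)] + r_emit for c in alphabet)
             for b in alphabet} for k in states}
    return a, e

# With zero pseudocounts this reproduces the overfit casino estimates above
# (eF(4) = 0, aFL = 0) and divides by zero for states never visited;
# positive pseudocounts keep every probability nonzero.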
Case 2. When the right answer is unknown
Given x = x1…xN
for which the true π = π1…πN is unknown,
Idea:
• Estimate our "best guess" for Akl and Ek(b) using the current parameters, update the parameters accordingly, and repeat
To estimate Akl, count the expected number of k->l transitions:
P(πi = k, πi+1 = l | x) = fk(i) akl el(xi+1) bl(i+1) / P(x)
So, Akl = Σi fk(i) akl el(xi+1) bl(i+1) / P(x)
Similarly,
Ek(b) = (1/P(x)) Σ{i : xi = b} fk(i) bk(i)
Initialization:
Pick the best-guess for model parameters
(or arbitrary)
Iteration:
1. Forward
2. Backward
3. Calculate Akl, Ek(b)
4. Calculate new model parameters akl, ek(b)
5. Calculate new log-likelihood P(x | θ)
Until P(x | θ) does not change much
Time Complexity:
# iterations × O(K2N)
Initialization: Same as Baum-Welch
Iteration:
1. Perform Viterbi, to find π*
2. Calculate Akl, Ek(b) according to π* + pseudocounts
3. Calculate the new parameters akl, ek(b)
Until convergence
Notes:
– Convergence is guaranteed – Why?
– Does not maximize P(x | θ)
– In general, worse performance than Baum-Welch
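A sketch of this Viterbi-training loop, reusing the viterbi() and estimate_params() sketches above; the begin-state row a["0"] is kept fixed, and all names come from those sketches rather than the lecture.

def viterbi_training(x, states, alphabet, a, e, max_iter=50):
    """Alternate hard decoding (pi*) with Case 1 re-estimation + pseudocounts.
    Stops when pi* repeats; optimizes P(x, pi*) rather than P(x | theta)."""
    prev = None
    for _ in range(max_iter):
        _, path = viterbi(x, states, a, e)                     # 1. find pi*
        if path == prev:                                       # pi* unchanged -> converged
            break
        new_a, e = estimate_params(x, path, states, alphabet)  # 2-3. counts + pseudocounts
        a = {**a, **new_a}                                     # keep begin-state row a["0"]
        prev = path
    return a, e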
How to Build an HMM
• General Scheme:
– Architecture/topology design
– Learning/Training:
• Training Datasets
• Parameter Estimation
– Recognition/Classification:
• Testing Datasets
• Performance Evaluation
Parameter Estimation for HMMs (Case 1)
• Case 1: All the paths/labels in the set of training
sequences are known:
– Use the Maximum Likelihood (ML) estimators for:
akl = Akl / Σl' Akl'        ek(x) = Ek(x) / Σx' Ek(x')
[Figure: transition between states qi and qj]
HMM-based Gene Finding
[Figure: gene structure: begin sequence -> translation start -> initial exon -> donor splice site -> intron -> acceptor splice site -> … -> translation stop -> end sequence]
• J5' – 5' UTR
• EI – Initial Exon
• E – Exon, Internal Exon
• I – Intron
• EF – Final Exon
• ES – Single Exon
• J3' – 3' UTR
Genscan Overview
• N – intergenic region
• P – promoter
• F – 5' untranslated region
• Esngl – single exon (intronless) (translation start -> stop codon)
• Einit – initial exon (translation start -> donor splice site)
• Ek – phase k internal exon (acceptor splice site -> donor splice site)
• Eterm – terminal exon (acceptor splice site -> stop codon)
• Ik – phase k intron: 0 – between codons; 1 – after the first base of a codon; 2 – after the second base of a codon
• T – 3' UTR
• A – poly-A signal
[State diagram: forward-strand states E0+, E1+, E2+ with introns I0+, I1+, I2+, flanked by Einit+ and Eterm+; Esngl+ (single-exon gene); F+ (5' UTR), T+ (3' UTR), P+ (promoter), A+ (poly-A signal)]
Accuracy Measures
[Confusion table: Actual (coding / non-coding) vs. Predicted (coding / non-coding), giving counts TP, FP, FN, TN]

Sn = TP / (TP+FN)
Sp = TP / (TP+FP)
CC = (TP*TN - FN*FP) / ((TP+FN)*(TN+FP)*(TP+FP)*(TN+FN))^(1/2)
AC = (1/2) * ( TP/(TP+FN) + TP/(TP+FP) + TN/(TN+FP) + TN/(TN+FN) ) - 1
Figure by MIT OCW.
• Sensitivity (Sn): fraction of actual coding regions that are correctly predicted as coding
• Specificity (Sp): fraction of the prediction that is actually correct
• Correlation Coefficient (CC): combined measure of Sensitivity & Specificity; range: -1 (always wrong) -> +1 (always right)
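These measures are straightforward to compute directly from the four counts; a small sketch (the counts in the usage line are toy values, illustrative only):

import math

def accuracy_measures(tp, fp, tn, fn):
    """Nucleotide-level Sn, Sp, CC and AC as defined on this slide."""
    sn = tp / (tp + fn)                                   # sensitivity
    sp = tp / (tp + fp)                                   # specificity (as defined here)
    cc = (tp * tn - fn * fp) / math.sqrt(
        (tp + fn) * (tn + fp) * (tp + fp) * (tn + fn))    # correlation coefficient
    ac = 0.5 * (tp / (tp + fn) + tp / (tp + fp)
                + tn / (tn + fp) + tn / (tn + fn)) - 1    # approximate correlation
    return sn, sp, cc, ac

print(accuracy_measures(tp=80, fp=20, tn=90, fn=10))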
Test Datasets
• Gene Number
usually approximately correct, but may not be exact
• Organism
primarily for human/vertebrate seqs; possibly lower accuracy for non-vertebrates. ‘Glimmer’ & ‘GeneMark’ for prokaryotic or yeast seqs
– seqs > 200kb were discarded; mRNA seqs and seqs containing pseudogenes were excluded