Gene Finding and HMMs: 6.096 - Algorithms For Computational Biology - Lecture 7
Lecture 1 - Introduction
Lecture 2 - Hashing and BLAST
Lecture 3 - Combinatorial Motif Finding
Lecture 4 - Statistical Motif Finding
Lecture 5 - Sequence alignment and Dynamic Programming
Lecture 6 - RNA structure and Context Free Grammars
Lecture 7 - Gene finding and Hidden Markov Models
Challenges in Computational Biology
[Figure: overview map of challenges, including: genome assembly; comparative genomics (database lookup of short sequences such as TCATGCTAT, TCGTGATAA, TGAGGATAT, TTATCATAT, TTATGATTT); evolutionary theory; RNA folding (RNA transcript); gene expression analysis; cluster discovery (Gibbs sampling); protein network analysis]
Outline
• Computational model
– Simple Markov Models
– Hidden Markov Models
[Figure: a 4-state HMM with states A+, C+, G+, T+, transition probabilities aAT, aAC, aGT, aGC, …, and an emission table showing that each state emits only its own letter, e.g. A+ emits A with probability 1 and C, G, T with probability 0]
Output: only the emitted symbols are observable by the system, not the underlying random walk between states -> “hidden”
• Training set: a set of DNA sequences w/ known CpG islands
• Derive two Markov chain models:
– ‘+’ model: from the CpG islands
– ‘-’ model: from the remainder of sequence
[Figure: a 4-state Markov chain over A, C, G, T with transition probabilities aAT, aAC, aGT, aGC, …]
• Transition probabilities for each model:

a+st = c+st / Σt' c+st' , where c+st is the number of times letter t followed letter s inside the CpG islands

a-st = c-st / Σt' c-st' , where c-st is the number of times letter t followed letter s outside the CpG islands

‘+’ model:
+     A     C     G     T
A   .180  .274  .426  .120
C   .171  .368  .274  .188
G   .161  .339  .375  .125
T   .079  .355  .384  .182
(e.g. .274 is the probability of C following A inside a CpG island)
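As a concrete illustration of these estimators, here is a minimal Python sketch; the function name and the toy training fragments are made up for the example, and real input would be the labeled sequences described above.

from collections import defaultdict

def transition_probs(sequences):
    """Estimate a_st = c_st / sum_t' c_st', where c_st counts how often
    letter t immediately follows letter s in the training sequences."""
    counts = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        for s, t in zip(seq, seq[1:]):
            counts[s][t] += 1
    return {s: {t: c / sum(row.values()) for t, c in row.items()}
            for s, row in counts.items()}

# Toy usage (hypothetical fragments, not the lecture's training set):
plus_model = transition_probs(["CGCGCGGC", "GCGCCGCG"])    # from CpG islands
minus_model = transition_probs(["ATATTAAT", "TTATAATA"])   # from the remainder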
Using Markov Models for CpG classification
Q1: Given a short sequence x, does it come from a CpG island? (Yes/No question)
• To use these models for discrimination, calculate the log-odds ratio:

S(x) = log [ P(x | model+) / P(x | model-) ] = Σi=1..L log [ a+(xi-1, xi) / a-(xi-1, xi) ]
[Histogram: distribution of length-normalized S(x) scores, x-axis from -0.4 to 0.4; CpG islands fall mostly above 0, non-CpG regions mostly below 0]
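A minimal scoring sketch using the ‘+’ table from the slide; the ‘-’ table would be estimated the same way from non-island sequence (its values are not reproduced here), so MINUS below is a stand-in.

import math

# '+' transition table from the slide (row = from-letter, column = to-letter)
PLUS = {
    "A": {"A": .180, "C": .274, "G": .426, "T": .120},
    "C": {"A": .171, "C": .368, "G": .274, "T": .188},
    "G": {"A": .161, "C": .339, "G": .375, "T": .125},
    "T": {"A": .079, "C": .355, "G": .384, "T": .182},
}

def log_odds(x, plus, minus):
    """S(x) = sum over adjacent letter pairs of log(a+ / a-);
    S(x) > 0 suggests x comes from a CpG island."""
    return sum(math.log(plus[s][t] / minus[s][t]) for s, t in zip(x, x[1:]))

# log_odds("CGCG", PLUS, MINUS), with MINUS estimated from non-island DNA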
Using Markov Models for CpG classification
[Figure: 8-state HMM with ‘+’ states A+, C+, G+, T+ and ‘-’ states A-, C-, G-, T-; each state emits only its own letter, with probability 1]
• Emission probabilities are distinct for the ‘+’ and the ‘-’ states
– Infer the most likely set of states giving rise to the observed emissions
-> ‘Paint’ the sequence with + and - states
Finding most likely state path
[Trellis: for each position of the observed sequence CGCG, a column containing all eight states A+ … T+ and A- … T-, with start and end states; a path through the trellis picks one state per position]
• Known observations: CGCG
• Known sequence path: C+, G-, C-, G+
Probability of given path p & observations x
• Known observations: CGCG
• Known sequence path: C+, G-, C-, G+
[Figure: the path start -> C+ -> G- -> C- -> G+ -> end, with transitions a0,C+, aC+,G-, aG-,C-, aC-,G+, aG+,0]

P(x, p) = a0,C+ · eC+(C) · aC+,G- · eG-(G) · aG-,C- · eC-(C) · aC-,G+ · eG+(G) · aG+,0

Since every state emits its own letter with probability 1, this is simply the product of the transition probabilities along the path.
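A small sketch of this computation; the nested-dict representation of a (transitions) and e (emissions) and the begin/end labels are assumptions of the example, not notation from the lecture.

def path_probability(path, obs, a, e, begin="0", end=None):
    """P(x, p) = a_{0,p1} * prod_i [ e_{p_i}(x_i) * a_{p_i,p_{i+1}} ],
    times a_{pN,0} if the model has an explicit end state."""
    p = a[begin][path[0]]
    for i, (state, sym) in enumerate(zip(path, obs)):
        p *= e[state][sym]                 # emission factor (1 in the CpG HMM)
        if i + 1 < len(path):
            p *= a[state][path[i + 1]]     # transition to the next state
    if end is not None:
        p *= a[path[-1]][end]              # a_{G+,0} in the slide's example
    return p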
1. Evaluation
GIVEN
a HMM M, and a sequence x,
FIND
Prob[ x | M ]
2. Decoding
GIVEN
a HMM M, and a sequence x,
FIND
the sequence π of states that maximizes P[ x, π | M ]
3. Learning
GIVEN
a HMM M, with unspecified transition/emission probabilities, and a sequence x,
FIND
parameters θ = (ei(.), aij) that maximize P[ x | θ ]
Problem 1: Decoding
GIVEN x = x1x2……xN
FIND π* = argmaxπ P[ x, π ]
[Trellis: K states per position, across x1 x2 x3 … xN]
We can use dynamic programming!
Define Vk(i) = probability of the most likely sequence of states ending at state πi = k, given observations x1…xi.
What is Vl(i+1)?
From definition,
Vl(i+1) = el(xi+1) × maxk akl Vk(i)
Input: x = x1……xN
Initialization:
V0(0) = 1 (0 is the imaginary first position)
Vk(0) = 0, for all k > 0
Iteration:
Vj(i) = ej(xi) × maxk akj Vk(i-1)
Ptrj(i) = argmaxk akj Vk(i-1)
Termination:
P(x, π*) = maxk Vk(N)
Traceback:
πN* = argmaxk Vk(N)
πi-1* = Ptrπi*(i)
The Viterbi Algorithm
[DP matrix: rows = states 1…K, columns = positions x1 x2 x3 … xN; cell (j, i) stores Vj(i), filled left to right]
Time:
O(K2N)
Space:
O(KN)
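The following Python sketch implements this recurrence in log space (anticipating the practical detail on the next slide); the dict-based HMM representation and the begin-state label "0" are assumptions of the example.

import math

def viterbi(x, states, a, e, begin="0"):
    """Return (log P(x, pi*), pi*); assumes all needed a[k][j], e[j][sym]
    are nonzero (e.g. after adding pseudocounts)."""
    V = [{k: math.log(a[begin][k]) + math.log(e[k][x[0]]) for k in states}]
    ptr = []
    for i in range(1, len(x)):
        col, back = {}, {}
        for j in states:
            best = max(states, key=lambda k: V[-1][k] + math.log(a[k][j]))
            back[j] = best
            col[j] = V[-1][best] + math.log(a[best][j]) + math.log(e[j][x[i]])
        ptr.append(back)
        V.append(col)
    last = max(states, key=lambda k: V[-1][k])   # termination: argmax_k V_k(N)
    path = [last]
    for back in reversed(ptr):                   # traceback through the pointers
        path.append(back[path[-1]])
    return V[-1][last], path[::-1]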
Viterbi Algorithm – a practical detail
x = 123456123456…123456 626364656…1626364656
π = FFF…………………...F LLL………………………...L
For very long sequences the products of probabilities underflow; in practice the recurrence is computed with logarithms:
Vl(i) = log el(xi) + maxk [ Vk(i-1) + log akl ]
A couple of questions
Given a sequence x, say x = 12341623162616364616234161221341,
• What is the probability that x was generated by the model?
• What is the most likely state at a given position i?
We start by calculating the forward probability
fl(i) = P(x1…xi, πi = l)
Initialization:
f0(0) = 1
fk(0) = 0, for all k > 0
Iteration:
fl(i) = el(xi) Σk fk(i-1) akl
Termination:
P(x) = Σk fk(N) ak0
where ak0 is the probability that the terminating state is k (usually = a0k)
Relation between Forward and Viterbi
VITERBI
Initialization:  V0(0) = 1;  Vk(0) = 0, for all k > 0
Iteration:       Vj(i) = ej(xi) maxk Vk(i-1) akj
Termination:     P(x, π*) = maxk Vk(N)

FORWARD
Initialization:  f0(0) = 1;  fk(0) = 0, for all k > 0
Iteration:       fl(i) = el(xi) Σk fk(i-1) akl
Termination:     P(x) = Σk fk(N) ak0
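A sketch of the Forward recurrence under the same assumed dict representation as the Viterbi sketch above; probabilities are kept unscaled for clarity, so this underflows on long sequences (real implementations scale each column or work in log space).

def forward(x, states, a, e, begin="0", end=None):
    """P(x) = sum_k f_k(N) a_k0, with f_l(i) = e_l(x_i) sum_k f_k(i-1) a_kl."""
    f = {k: a[begin][k] * e[k][x[0]] for k in states}
    for sym in x[1:]:
        prev = f
        f = {l: e[l][sym] * sum(prev[k] * a[k][l] for k in states) for l in states}
    end = end or {k: 1.0 for k in states}   # a_k0; taken as 1 if no end state is modeled
    return sum(f[k] * end[k] for k in states)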
Motivation for the Backward Algorithm
We want to compute
P(πi = k | x),
the probability distribution of the i-th state, given x.
We start by computing
P(πi = k, x) = P(x1…xi, πi = k) P(xi+1…xN | πi = k) = fk(i) bk(i)
where the backward probability is defined as
bk(i) = P(xi+1…xN | πi = k)
= Σl el(xi+1) akl Σπi+2…πN P(xi+2, …, xN, πi+2, …, πN | πi+1 = l)
= Σl el(xi+1) akl bl(i+1)

Initialization: bk(N) = ak0, for all k
Iteration:      bk(i) = Σl el(xi+1) akl bl(i+1)
Termination:    P(x) = Σl a0l el(x1) bl(1)
What is the running time and space required for Forward and Backward?
Time: O(K2N)
Space: O(KN)
P(πi = k | x) = fk(i) bk(i) / P(x)
Posterior decoding: at each position choose the state k maximizing P(πi = k | x)
– The resulting path can differ from the Viterbi path π* – Why?
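The sketch below combines the Forward and Backward passes to compute these posteriors; it uses the same assumed representation as before and unscaled probabilities, so it is only meant for short sequences.

def posterior(x, states, a, e, begin="0", end=None):
    """For each position i, return the distribution P(pi_i = k | x) = f_k(i) b_k(i) / P(x)."""
    N = len(x)
    end = end or {k: 1.0 for k in states}
    # Forward pass: f[i][k] = f_k(i)
    f = [{k: a[begin][k] * e[k][x[0]] for k in states}]
    for i in range(1, N):
        f.append({l: e[l][x[i]] * sum(f[-1][k] * a[k][l] for k in states) for l in states})
    px = sum(f[-1][k] * end[k] for k in states)
    # Backward pass: b_k(N) = a_k0, then b_k(i) = sum_l e_l(x_{i+1}) a_kl b_l(i+1)
    b = [dict(end)]
    for i in range(N - 2, -1, -1):
        b.insert(0, {k: sum(e[l][x[i + 1]] * a[k][l] * b[0][l] for l in states)
                     for k in states})
    return [{k: f[i][k] * b[i][k] / px for k in states} for i in range(N)]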
Examples:
GIVEN: a genomic region x = x1…x1,000,000 where we have good (experimental) annotations of the CpG islands
GIVEN: the porcupine genome; we don't know how frequent the CpG islands are there, nor do we know their composition
GIVEN: 10,000 rolls of the casino player, but we don’t see when he
changes dice
Given x = x1…xN
for which the true π = π1…πN is known,
Define:
Akl = # of times the k->l transition occurs in π
Ek(b) = # of times state k in π emits b in x
The maximum-likelihood estimators are:
akl = Akl / Σl' Akl'        ek(b) = Ek(b) / Σc Ek(c)
Case 1. When the right answer is known
Drawback:
Given little data, there may be overfitting:
P(x|θ) is maximized, but θ is unreasonable
0 probabilities – VERY BAD
Example:
Given 10 casino rolls, we observe
x = 2, 1, 5, 6, 1, 2, 3, 6, 2, 3
π = F, F, F, F, F, F, F, F, F, F
Then:
aFF = 1; aFL = 0
eF(1) = eF(3) = eF(6) = .2;
eF(2) = .3; eF(4) = 0; eF(5) = .1
Pseudocounts
Solution for small training sets: add pseudocounts to the counts
Akl = (# times the k->l transition occurs in π) + rkl
Ek(b) = (# times state k emits b in x) + rk(b)
Reasonable pseudocounts encode prior belief (e.g. large rF(b) if we strongly believe the die is fair, smaller rL(b) for the loaded die); a sketch of these estimators follows below
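A compact Python sketch of the Case 1 estimators with pseudocounts; the function name, the uniform pseudocount arguments, and the dict layout are assumptions of the example.

from collections import Counter

def estimate_params(x, pi, states, alphabet, r_trans=1.0, r_emit=1.0):
    """a_kl = (A_kl + r) / sum_l' (A_kl' + r); e_k(b) = (E_k(b) + r) / sum_c (E_k(c) + r)."""
    A = Counter(zip(pi, pi[1:]))   # A_kl: observed k -> l transitions
    E = Counter(zip(pi, x))        # E_k(b): symbol b emitted while in state k
    a = {k: {l: (A[(k, l)] + r_trans) / sum(A[(k, m)] + r_trans for m in states)
             for l in states} for k in states}
    e = {k: {b: (E[(k, b)] + r_emit) / sum(E[(k, c)] + r_emit for c in alphabet)
             for b in alphabet} for k in states}
    return a, e

# With zero pseudocounts this reproduces the overfit casino estimates above
# (eF(4) = 0, aFL = 0) and divides by zero for states never visited;
# positive pseudocounts keep every probability nonzero.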
Case 2. When the right answer is unknown
Given x = x1…xN
for which the true π = π1…πN is unknown,
Idea:
• Estimate our "best guess" for Akl and Ek(b) using the current parameters, update the parameters accordingly, and repeat
To estimate Akl, count the expected number of k->l transitions:
P(πi = k, πi+1 = l | x) = fk(i) akl el(xi+1) bl(i+1) / P(x)
So, Akl = Σi fk(i) akl el(xi+1) bl(i+1) / P(x)
Similarly,
Ek(b) = (1/P(x)) Σ{i : xi = b} fk(i) bk(i)
Initialization:
Pick the best-guess for model parameters
(or arbitrary)
Iteration:
1. Forward
2. Backward
3. Calculate Akl, Ek(b)
4. Calculate new model parameters akl, ek(b)
5. Calculate new log-likelihood P(x | θ)
Until P(x | θ) does not change much
Time Complexity:
# iterations × O(K2N)
Initialization: Same as Baum-Welch
Iteration:
1. Perform Viterbi, to find π*
2. Calculate Akl, Ek(b) according to π* + pseudocounts
3. Calculate the new parameters akl, ek(b)
Until convergence
Notes:
– Convergence is guaranteed – Why?
– Does not maximize P(x | θ)
– In general, worse performance than Baum-Welch
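A sketch of this Viterbi-training loop, reusing the viterbi() and estimate_params() sketches above; the begin-state row a["0"] is kept fixed, and all names come from those sketches rather than the lecture.

def viterbi_training(x, states, alphabet, a, e, max_iter=50):
    """Alternate hard decoding (pi*) with Case 1 re-estimation + pseudocounts.
    Stops when pi* repeats; optimizes P(x, pi*) rather than P(x | theta)."""
    prev = None
    for _ in range(max_iter):
        _, path = viterbi(x, states, a, e)                     # 1. find pi*
        if path == prev:                                       # pi* unchanged -> converged
            break
        new_a, e = estimate_params(x, path, states, alphabet)  # 2-3. counts + pseudocounts
        a = {**a, **new_a}                                     # keep begin-state row a["0"]
        prev = path
    return a, e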
How to Build an HMM
• General Scheme:
– Architecture/topology design
– Learning/Training:
• Training Datasets
• Parameter Estimation
– Recognition/Classification:
• Testing Datasets
• Performance Evaluation
Parameter Estimation for HMMs (Case 1)
• Case 1: All the paths/labels in the set of training
sequences are known:
– Use the Maximum Likelihood (ML) estimators for:
akl = Akl / Σl' Akl'        ek(x) = Ek(x) / Σx' Ek(x')
[Figure: transition between states qi and qj]
HMM-based Gene Finding
[Figure: gene structure: begin sequence -> translation start -> initial exon -> donor splice site -> intron -> acceptor splice site -> … -> translation stop -> end sequence]
• J5' – 5' UTR
• EI – Initial Exon
• E – Exon, Internal Exon
• I – Intron
• EF – Final Exon
• ES – Single Exon
• J3' – 3' UTR
Genscan Overview
• N – intergenic region
• P – promoter
• F – 5' untranslated region
• Esngl – single exon (intronless) (translation start -> stop codon)
• Einit – initial exon (translation start -> donor splice site)
• Ek – phase k internal exon (acceptor splice site -> donor splice site)
• Eterm – terminal exon (acceptor splice site -> stop codon)
• Ik – phase k intron: 0 – between codons; 1 – after the first base of a codon; 2 – after the second base of a codon
• T – 3' UTR
• A – poly-A signal
[State diagram: forward-strand states E0+, E1+, E2+ with introns I0+, I1+, I2+, flanked by Einit+ and Eterm+; Esngl+ (single-exon gene); F+ (5' UTR), T+ (3' UTR), P+ (promoter), A+ (poly-A signal)]
Accuracy Measures
[Confusion table: Actual (coding / non-coding) vs. Predicted (coding / non-coding), giving counts TP, FP, FN, TN]

Sn = TP / (TP+FN)
Sp = TP / (TP+FP)
CC = (TP*TN - FN*FP) / ((TP+FN)*(TN+FP)*(TP+FP)*(TN+FN))^(1/2)
AC = (1/2) * ( TP/(TP+FN) + TP/(TP+FP) + TN/(TN+FP) + TN/(TN+FN) ) - 1
Figure by MIT OCW.
• Sensitivity (Sn): fraction of actual coding regions that are correctly predicted as coding
• Specificity (Sp): fraction of the prediction that is actually correct
• Correlation Coefficient (CC): combined measure of Sensitivity & Specificity; range: -1 (always wrong) -> +1 (always right)
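These measures are straightforward to compute directly from the four counts; a small sketch (the counts in the usage line are toy values, illustrative only):

import math

def accuracy_measures(tp, fp, tn, fn):
    """Nucleotide-level Sn, Sp, CC and AC as defined on this slide."""
    sn = tp / (tp + fn)                                   # sensitivity
    sp = tp / (tp + fp)                                   # specificity (as defined here)
    cc = (tp * tn - fn * fp) / math.sqrt(
        (tp + fn) * (tn + fp) * (tp + fp) * (tn + fn))    # correlation coefficient
    ac = 0.5 * (tp / (tp + fn) + tp / (tp + fp)
                + tn / (tn + fp) + tn / (tn + fn)) - 1    # approximate correlation
    return sn, sp, cc, ac

print(accuracy_measures(tp=80, fp=20, tn=90, fn=10))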
Test Datasets
• Gene Number
usually approximately correct, but may not be exact
• Organism
primarily for human/vertebrate seqs; possibly lower accuracy for non-vertebrates. ‘Glimmer’ & ‘GeneMark’ for prokaryotic or yeast seqs
– seqs > 200kb were discarded; mRNA seqs and seqs containing pseudogenes were excluded