Smoothing N-gram Models
K.A.S.H. Kulathilake
B.Sc.(Sp.Hons.)IT, MCS, Mphil, SEDA(UK)
Smoothing
• What do we do with words that are in our vocabulary (they are not
unknown words) but appear in a test set in an unseen context (for
example they appear after a word they never appeared after in
training)?
• To keep a language model from assigning zero probability to these
unseen events, we’ll have to shave off a bit of probability mass from
some more frequent events and give it to the events we’ve never
seen.
• This modification is called smoothing or discounting.
• There are a variety of ways to do smoothing:
– Add-1 smoothing
– Add-k smoothing
– Good-Turing Discounting
– Stupid backoff
– Kneser-Ney smoothing and many more
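• To make the zero-probability problem concrete, here is a minimal sketch (the toy corpus and function names are illustrative, not from the slides): an unsmoothed bigram MLE assigns probability 0 to an in-vocabulary word that appears after a history it never followed in training.
```python
from collections import Counter

# Toy training corpus (illustrative only; not the slides' data).
train = [["i", "want", "chinese", "food"],
         ["i", "want", "to", "eat"]]

unigrams = Counter(w for sent in train for w in sent)
bigrams = Counter((w1, w2) for sent in train for w1, w2 in zip(sent, sent[1:]))

def p_mle(word, prev):
    """Unsmoothed bigram MLE: C(prev word) / C(prev)."""
    return bigrams[(prev, word)] / unigrams[prev]

print(p_mle("to", "want"))    # seen bigram "want to"   -> 0.5
print(p_mle("food", "want"))  # unseen bigram "want food" -> 0.0 (the zero-probability problem)
```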
Laplace Smoothing / Add 1 Smoothing
• The simplest way to do smoothing is to add one to all
the bigram counts, before we normalize them into
probabilities.
• All the counts that used to be zero will now have a
count of 1, the counts of 1 will be 2, and so on.
• This algorithm is called Laplace smoothing.
• Laplace smoothing does not perform well enough to be
used in modern N-gram models, but it usefully
introduces many of the concepts that we see in other
smoothing algorithms, gives a useful baseline, and is
also a practical smoothing algorithm for other tasks like
text classification.
Laplace Smoothing / Add 1 Smoothing (Cont…)
• Let's start with the application of Laplace smoothing to unigram probabilities.
• Recall that the unsmoothed maximum likelihood estimate of the unigram probability of the word w_i is its count c_i normalized by the total number of word tokens N:
$P(w_i) = \frac{c_i}{N}$
• Laplace smoothing merely adds one to each count.
• Since there are V words in the vocabulary and each one was incremented, we also need to adjust the denominator to take into account the extra V observations:
$P_{\text{Laplace}}(w_i) = \frac{c_i + 1}{N + V}$
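• As a minimal sketch of the two formulas above (the toy corpus and function names are my own, not the slides'), the unigram MLE and its add-one counterpart can be written directly:
```python
from collections import Counter

tokens = "i want to eat chinese food i want to spend".split()  # illustrative corpus
counts = Counter(tokens)
N = len(tokens)   # total word tokens
V = len(counts)   # vocabulary size (word types)

def p_mle(w):
    """Unsmoothed unigram MLE: c_i / N."""
    return counts[w] / N

def p_laplace(w):
    """Add-one (Laplace) unigram estimate: (c_i + 1) / (N + V)."""
    return (counts[w] + 1) / (N + V)

print(p_mle("to"), p_laplace("to"))        # seen word: probability shrinks slightly
print(p_mle("lunch"), p_laplace("lunch"))  # unseen word: 0.0 -> small non-zero mass
```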
Laplace Smoothing / Add 1 Smoothing (Cont…)
• Instead of changing both the numerator and denominator, it is convenient to describe how a smoothing algorithm affects the numerator, by defining an adjusted count C*.
• This adjusted count is easier to compare directly with the MLE counts and can be turned into a probability like an MLE count by normalizing by N.
• To define this count, since we are only changing the numerator, in addition to adding 1 we'll also need to multiply by a normalization factor $\frac{N}{N+V}$:
$c_i^{*} = (c_i + 1)\frac{N}{N + V}$
Laplace Smoothing / Add 1 Smoothing (Cont…)
• A related way to view smoothing is as discounting (lowering) some non-zero counts in order to get the probability mass that will be assigned to the zero counts.
• Thus, instead of referring to the discounted counts c*, we might describe a smoothing algorithm in terms of a relative discount d_c, the ratio of the discounted counts to the original counts:
$d_c = \frac{c^{*}}{c}$
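• A small sketch of the adjusted count and relative discount defined above; V = 1446 is the vocabulary size used later in the slides, while the token count N = 10000 and the example count of 10 are assumed for illustration:
```python
def adjusted_count(c_i, N, V):
    """Laplace adjusted count: c* = (c_i + 1) * N / (N + V)."""
    return (c_i + 1) * N / (N + V)

def relative_discount(c_star, c):
    """Relative discount d_c = c* / c, the ratio of discounted to original count."""
    return c_star / c

# Assumed numbers: a word seen 10 times in a corpus of N = 10000 tokens,
# with V = 1446 word types (V is the slides' vocabulary size).
c_star = adjusted_count(10, 10000, 1446)
print(round(c_star, 2))                         # ~9.61: the count is discounted
print(round(relative_discount(c_star, 10), 2))  # ~0.96
```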
Laplace Smoothing / Add 1 Smoothing (Cont…)
• Let's smooth our Berkeley Restaurant Project bigrams, starting from the original bigram count table.
Laplace Smoothing / Add 1 Smoothing (Cont…)
• Recall that normal bigram probabilities are computed by normalizing each row of counts by the unigram count:
$P(w_n \mid w_{n-1}) = \frac{C(w_{n-1}w_n)}{C(w_{n-1})}$
• For add-one smoothed bigram counts, we need to augment the unigram count by the number of total word types in the vocabulary V:
$P^{*}_{\text{Laplace}}(w_n \mid w_{n-1}) = \frac{C(w_{n-1}w_n) + 1}{C(w_{n-1}) + V}$
• Thus, each of the unigram counts given in the previous table will need to be augmented by V = 1446.
• Ex: $P^{*}_{\text{Laplace}}(\text{to} \mid \text{I}) = \frac{C(\text{I, to}) + 1}{C(\text{I}) + V} = \frac{0 + 1}{2500 + 1446} = \frac{1}{3946} \approx 0.000253$
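• A minimal sketch of the add-one bigram estimate, reproducing the worked example above with the counts the slide gives (C(I, to) = 0, C(I) = 2500, V = 1446); the function name is illustrative:
```python
def p_laplace_bigram(c_bigram, c_prev, V):
    """Add-one bigram estimate: (C(w_{n-1} w_n) + 1) / (C(w_{n-1}) + V)."""
    return (c_bigram + 1) / (c_prev + V)

# P*_Laplace(to | I) with the counts used on the slide.
print(p_laplace_bigram(c_bigram=0, c_prev=2500, V=1446))  # ≈ 0.000253
```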
Laplace Smoothing / Add 1 Smoothing (Cont…)
• The following table shows the add-one smoothed probabilities for the bigrams:
Laplace Smoothing / Add 1 Smoothing (Cont…)
• It is often convenient to reconstruct the count matrix so we can see how much a smoothing algorithm has changed the original counts.
• These adjusted counts can be computed by using the following equation:
$C^{*}(w_{n-1}w_n) = \frac{[C(w_{n-1}w_n) + 1] \times C(w_{n-1})}{C(w_{n-1}) + V}$
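• A small sketch of the reconstruction equation above; the example counts are illustrative assumptions, and only V = 1446 comes from the slides:
```python
def adjusted_bigram_count(c_bigram, c_prev, V):
    """C*(w_{n-1} w_n) = (C(w_{n-1} w_n) + 1) * C(w_{n-1}) / (C(w_{n-1}) + V)."""
    return (c_bigram + 1) * c_prev / (c_prev + V)

# Illustrative numbers only: a bigram seen 5 times whose history word
# occurs 1000 times, with the slides' vocabulary size V = 1446.
print(round(adjusted_bigram_count(5, 1000, 1446), 2))  # ≈ 2.45
```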
Laplace Smoothing / Add 1 Smoothing (Cont…)
• The following table shows the reconstructed counts:
Laplace Smoothing / Add 1 Smoothing (Cont…)
• Note that add-one smoothing has made a very big change to the
counts.
• C(want to) changed from 608 to 238!
• We can see this in probability space as well: P(to|want) decreases
from .66 in the unsmoothed case to .26 in the smoothed case.
• Looking at the discount d (the ratio between new and old counts)
shows us how strikingly the counts for each prefix word have been
reduced; the discount for the bigram want to is .39, while the
discount for Chinese food is .10, a factor of 10!
• The sharp change in counts and probabilities occurs because too
much probability mass is moved to all the zeros.
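• The numbers quoted above can be re-derived in a few lines, assuming the unigram count C(want) = 927 from the (unreproduced) Berkeley Restaurant Project count table; treat that count as an assumption, while C(want to) = 608 and V = 1446 come from the slides:
```python
c_want_to, c_want, V = 608, 927, 1446  # C(want) is assumed; the others are from the slides

p_unsmoothed = c_want_to / c_want                       # ≈ 0.66
p_smoothed   = (c_want_to + 1) / (c_want + V)           # ≈ 0.26
c_star       = (c_want_to + 1) * c_want / (c_want + V)  # ≈ 238
discount     = c_star / c_want_to                       # ≈ 0.39

print(round(p_unsmoothed, 2), round(p_smoothed, 2), round(c_star), round(discount, 2))
```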
Add-K Smoothing
• One alternative to add-one smoothing is to move a bit
less of the probability mass from the seen to the
unseen events.
• Instead of adding 1 to each count, we add a fractional
count k (.5? .05? .01?).
• This algorithm is therefore called add-k smoothing:
$P^{*}_{\text{Add-}k}(w_n \mid w_{n-1}) = \frac{C(w_{n-1}w_n) + k}{C(w_{n-1}) + kV}$
• Add-k smoothing requires that we have a method for
choosing k; this can be done, for example, by
optimizing on a development set (devset).
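• A minimal sketch of the add-k estimate and of choosing k by minimizing perplexity on a devset; the toy corpus, candidate k values, and helper names are illustrative assumptions, not a prescribed recipe:
```python
import math
from collections import Counter

def train_counts(sentences):
    """Collect history (unigram) and bigram counts."""
    uni, bi = Counter(), Counter()
    for sent in sentences:
        for w1, w2 in zip(sent, sent[1:]):
            uni[w1] += 1
            bi[(w1, w2)] += 1
    return uni, bi

def p_add_k(w, prev, uni, bi, V, k):
    """Add-k estimate: (C(prev w) + k) / (C(prev) + k*V)."""
    return (bi[(prev, w)] + k) / (uni[prev] + k * V)

def perplexity(sentences, uni, bi, V, k):
    """Perplexity of held-out data under the add-k bigram model."""
    log_prob, n = 0.0, 0
    for sent in sentences:
        for w1, w2 in zip(sent, sent[1:]):
            log_prob += math.log(p_add_k(w2, w1, uni, bi, V, k))
            n += 1
    return math.exp(-log_prob / n)

# Toy data (illustrative only).
train = [["i", "want", "to", "eat"], ["i", "want", "chinese", "food"]]
dev   = [["i", "want", "to", "eat", "chinese", "food"]]
uni, bi = train_counts(train)
V = len(set(w for s in train for w in s))

# Pick the k with the lowest devset perplexity.
best_k = min([1.0, 0.5, 0.05, 0.01], key=lambda k: perplexity(dev, uni, bi, V, k))
print(best_k)
```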
Good-Turing Discounting
• The basic insight of Good-Turing smoothing is to re-estimate the amount of
probability mass to assign to N-grams with zero or low counts by looking at the
number of N-grams with higher counts.
• In other words, we examine Nc, the number of N-grams that occur c times.
• We refer to the number of N-grams that occur c times as the frequency of
frequency c.
• So applying the idea to smoothing the joint probability of bigrams, N_0 is the number of bigrams b of count 0, N_1 the number of bigrams with count 1, and so on:
$N_c = \sum_{b:\,c(b)=c} 1$
• The Good-Turing estimate gives a smoothed count c* based on the set of N_c for all c, as follows:
$c^{*} = (c + 1)\frac{N_{c+1}}{N_c}$
https://www.youtube.com/watch?v=z1bq4C8hFEk
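• A small sketch of the frequency-of-frequencies table N_c and the Good-Turing adjusted count; the toy bigram counts are illustrative, and practical implementations additionally smooth the N_c values for large c, which is not shown here:
```python
from collections import Counter

# Toy bigram counts (illustrative only).
bigram_counts = Counter({("i", "want"): 3, ("want", "to"): 2,
                         ("to", "eat"): 1, ("chinese", "food"): 1})

# N_c: the number of bigrams that occur exactly c times ("frequency of frequency c").
N = Counter(bigram_counts.values())   # here: N_1 = 2, N_2 = 1, N_3 = 1

def good_turing_count(c):
    """Good-Turing adjusted count: c* = (c + 1) * N_{c+1} / N_c (undefined if N_c = 0)."""
    return (c + 1) * N[c + 1] / N[c]

print(good_turing_count(1))  # 2 * N_2 / N_1 = 2 * 1 / 2 = 1.0
```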
Backoff and Interpolation
• In the backoff model, like the deleted interpolation model,
we build an N-gram model based on an (N-1)-gram model.
• The difference is that in backoff, if we have non-zero
trigram counts, we rely solely on the trigram counts and
don’t interpolate the bigram and unigram counts at all.
• We only ‘back off’ to a lower-order N-gram if we have zero
evidence for a higher-order N-gram.
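• The backoff idea can be sketched as a simplified, stupid-backoff-style scorer (one of the methods listed earlier) rather than a properly normalized backoff such as Katz backoff; all counts and the 0.4 backoff weight are illustrative assumptions:
```python
from collections import Counter

# Toy counts (illustrative only).
trigrams = Counter({("i", "want", "to"): 2})
bigrams  = Counter({("i", "want"): 3, ("want", "to"): 2, ("want", "chinese"): 1})
unigrams = Counter({"i": 3, "want": 3, "to": 2, "chinese": 1})
total_tokens = sum(unigrams.values())

def backoff_score(w3, w1, w2, alpha=0.4):
    """Score w3 given (w1, w2): use the trigram if it was seen, otherwise
    back off to the bigram, then to the unigram (stupid-backoff style)."""
    if trigrams[(w1, w2, w3)] > 0:
        return trigrams[(w1, w2, w3)] / bigrams[(w1, w2)]
    if bigrams[(w2, w3)] > 0:
        return alpha * bigrams[(w2, w3)] / unigrams[w2]
    return alpha * alpha * unigrams[w3] / total_tokens

print(backoff_score("to", "i", "want"))       # trigram seen: 2/3
print(backoff_score("chinese", "i", "want"))  # unseen trigram: backs off to the bigram
```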
Reference
• https://www.youtube.com/watch?v=z1bq4C8hFEk