The document discusses various techniques for smoothing N-gram language models, including Laplace smoothing, Good-Turing discounting, and backoff models. Laplace smoothing adds one to all counts to address unseen events. Good-Turing smoothing estimates probabilities for low counts based on the frequency of higher counts. Backoff models use the highest-order N-gram available and fall back to a lower order only when the higher-order count is zero.
2. Smoothing
• What do we do with words that are in our vocabulary (they are not
unknown words) but appear in a test set in an unseen context (for
example they appear after a word they never appeared after in
training)?
• To keep a language model from assigning zero probability to these
unseen events, we’ll have to shave off a bit of probability mass from
some more frequent events and give it to the events we’ve never
seen.
• This modification is called smoothing or discounting.
• There are a variety of ways to do smoothing:
– Add-1 smoothing
– Add-k smoothing
– Good-Turing Discounting
– Stupid backoff
– Kneser-Ney smoothing and many more
3. Laplace Smoothing / Add 1 Smoothing
• The simplest way to do smoothing is to add one to all
the bigram counts, before we normalize them into
probabilities.
• All the counts that used to be zero will now have a
count of 1, the counts of 1 will be 2, and so on.
• This algorithm is called Laplace smoothing.
• Laplace smoothing does not perform well enough to be
used in modern N-gram models, but it usefully
introduces many of the concepts that we see in other
smoothing algorithms, gives a useful baseline, and is
also a practical smoothing algorithm for other tasks like
text classification.
4. Laplace Smoothing / Add 1 Smoothing
(Cont…)
• Let’s start with the application of Laplace smoothing to
unigram probabilities.
• Recall that the unsmoothed maximum likelihood estimate
of the unigram probability of the word $w_i$ is its count $c_i$
normalized by the total number of word tokens $N$:
$$P(w_i) = \frac{c_i}{N}$$
• Laplace smoothing merely adds one to each count.
• Since there are V words in the vocabulary and each one
was incremented, we also need to adjust the denominator
to take into account the extra V observations.
$$P_{\text{Laplace}}(w_i) = \frac{c_i + 1}{N + V}$$
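As a rough illustration of the formula above (not part of the original slides), a minimal Python sketch of add-one unigram estimation could look like the following; the function name and toy data are made up:

```python
from collections import Counter

def laplace_unigram_probs(tokens, vocab):
    """Add-one smoothed unigram probabilities: P(w) = (c(w) + 1) / (N + V)."""
    counts = Counter(tokens)
    N = len(tokens)   # total number of word tokens
    V = len(vocab)    # vocabulary size
    return {w: (counts[w] + 1) / (N + V) for w in vocab}

# A word with zero training count still gets a small non-zero probability:
vocab = {"i", "want", "to", "eat", "food"}
probs = laplace_unigram_probs(["i", "want", "to", "eat"], vocab)
print(probs["food"])  # (0 + 1) / (4 + 5) ≈ 0.111
```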
5. Laplace Smoothing / Add 1 Smoothing
(Cont…)
• Instead of changing both the numerator and
denominator, it is convenient to describe how a
smoothing algorithm affects the numerator, by defining
an adjusted count C*.
• This adjusted count is easier to compare directly with
the MLE counts and can be turned into a probability
like an MLE count by normalizing by N.
• To define this count, since we are only changing the
numerator, in addition to adding 1 we'll also need to
multiply by a normalization factor N/(N + V):
$$c_i^* = (c_i + 1)\frac{N}{N + V}$$
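Continuing the illustrative sketch from above (again, not from the slides), the adjusted counts can be computed the same way; dividing each c* by N recovers the Laplace probabilities of the previous slide:

```python
from collections import Counter

def laplace_adjusted_counts(tokens, vocab):
    """Adjusted counts c* = (c + 1) * N / (N + V); c* / N gives P_Laplace(w)."""
    counts = Counter(tokens)
    N = len(tokens)
    V = len(vocab)
    return {w: (counts[w] + 1) * N / (N + V) for w in vocab}
```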
6. Laplace Smoothing / Add 1 Smoothing
(Cont…)
• A related way to view smoothing is as discounting
(lowering) some non-zero counts in order to get
the probability mass that will be assigned to the
zero counts.
• Thus, instead of referring to the discounted
counts $c^*$, we might describe a smoothing
algorithm in terms of a relative discount $d_c$, the
ratio of the discounted counts to the original
counts:
$$d_c = \frac{c^*}{c}$$
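For instance (anticipating the bigram example later in these slides), C(want to) falls from 608 to an adjusted count of 238, giving a relative discount of
$$d_c = \frac{238}{608} \approx 0.39$$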
9. Laplace Smoothing / Add 1 Smoothing
(Cont…)
• Recall that normal bigram probabilities are computed by
normalizing each row of counts by the unigram count:
$$P(w_n \mid w_{n-1}) = \frac{C(w_{n-1}w_n)}{C(w_{n-1})}$$
• For add-one smoothed bigram counts, we need to augment the
unigram count by the number of total word types in the vocabulary
V:
$$P^*_{\text{Laplace}}(w_n \mid w_{n-1}) = \frac{C(w_{n-1}w_n) + 1}{C(w_{n-1}) + V}$$
• Thus, each of the unigram counts given in the previous table will
need to be augmented by V = 1446.
• Ex:
$$P^*_{\text{Laplace}}(\text{to} \mid \text{I}) = \frac{C(\text{I to}) + 1}{C(\text{I}) + V} = \frac{0 + 1}{2500 + 1446} = \frac{1}{3946} = 0.000253$$
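A minimal Python sketch of the smoothed bigram estimate, reproducing the example above with its stated counts (the function name and data structures are illustrative):

```python
from collections import Counter

def laplace_bigram_prob(bigram_counts, unigram_counts, V, w_prev, w):
    """Add-one smoothed bigram probability:
    P*(w | w_prev) = (C(w_prev w) + 1) / (C(w_prev) + V)."""
    return (bigram_counts[(w_prev, w)] + 1) / (unigram_counts[w_prev] + V)

bigram_counts = Counter()              # C(I to) = 0: the bigram never occurred
unigram_counts = Counter({"I": 2500})  # C(I) = 2500, as in the example
print(laplace_bigram_prob(bigram_counts, unigram_counts, 1446, "I", "to"))
# -> 1 / 3946 ≈ 0.000253
```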
10. Laplace Smoothing / Add 1 Smoothing
(Cont…)
• The following table shows the add-one smoothed
probabilities for the bigrams:
11. Laplace Smoothing / Add 1 Smoothing
(Cont…)
• It is often convenient to reconstruct the count
matrix so we can see how much a smoothing
algorithm has changed the original counts.
• These adjusted counts can be computed
using the following equation:
$$c^*(w_{n-1}w_n) = \frac{[C(w_{n-1}w_n) + 1] \times C(w_{n-1})}{C(w_{n-1}) + V}$$
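For example, with the counts used earlier (C(I to) = 0, C(I) = 2500, V = 1446), the reconstructed count of the unseen bigram is
$$c^*(\text{I to}) = \frac{(0 + 1) \times 2500}{2500 + 1446} = \frac{2500}{3946} \approx 0.63$$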
12. Laplace Smoothing / Add 1 Smoothing
(Cont…)
• The following table shows the reconstructed
counts:
13. Laplace Smoothing / Add 1 Smoothing
(Cont…)
• Note that add-one smoothing has made a very big change to the
counts.
• C(want to) changed from 608 to 238!
• We can see this in probability space as well: P(to|want) decreases
from .66 in the unsmoothed case to .26 in the smoothed case.
• Looking at the discount d (the ratio between new and old counts)
shows us how strikingly the counts for each prefix word have been
reduced; the discount for the bigram want to is .39, while the
discount for Chinese food is .10, a factor of 10!
• The sharp change in counts and probabilities occurs because too
much probability mass is moved to all the zeros.
14. Add-K Smoothing
• One alternative to add-one smoothing is to move a bit
less of the probability mass from the seen to the
unseen events.
• Instead of adding 1 to each count, we add a fractional
count k (.5? .05? .01?).
• This algorithm is therefore called add-k smoothing.
$$P^*_{\text{Add-}k}(w_n \mid w_{n-1}) = \frac{C(w_{n-1}w_n) + k}{C(w_{n-1}) + kV}$$
• Add-k smoothing requires that we have a method for
choosing k; this can be done, for example, by
optimizing on a development set (devset).
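A sketch of the add-k estimate, following directly from the formula above; the default k = 0.05 is arbitrary and would normally be tuned on a devset as the slide notes:

```python
def add_k_bigram_prob(bigram_counts, unigram_counts, V, w_prev, w, k=0.05):
    """Add-k smoothed bigram probability:
    P*(w | w_prev) = (C(w_prev w) + k) / (C(w_prev) + k * V).
    bigram_counts and unigram_counts are collections.Counter objects,
    so unseen events simply have count 0."""
    return (bigram_counts[(w_prev, w)] + k) / (unigram_counts[w_prev] + k * V)
```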
15. Good-Turing Discounting
• The basic insight of Good-Turing smoothing is to re-estimate the amount of
probability mass to assign to N-grams with zero or low counts by looking at the
number of N-grams with higher counts.
• In other words, we examine $N_c$, the number of N-grams that occur c times.
• We refer to the number of N-grams that occur c times as the frequency of
frequency c.
• So applying the idea to smoothing the joint probability of bigrams, $N_0$ is the
number of bigrams b of count 0, $N_1$ the number of bigrams with count 1, and so
on:
$$N_c = \sum_{b:\, c(b) = c} 1$$
• The Good-Turing estimate gives a smoothed count $c^*$ based on the set of $N_c$ for all
c, as follows:
$$c^* = (c + 1)\frac{N_{c+1}}{N_c}$$
https://www.youtube.com/watch?v=z1bq4C8hFEk
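A direct, deliberately naive Python transcription of the Good-Turing formula (not from the slides); real implementations typically smooth the N_c values themselves, since N_{c+1} may be zero for large c:

```python
from collections import Counter

def good_turing_counts(ngram_counts):
    """Good-Turing smoothed counts c* = (c + 1) * N_{c+1} / N_c,
    where N_c is the number of N-grams occurring exactly c times."""
    freq_of_freq = Counter(ngram_counts.values())  # N_c for every observed c
    smoothed = {}
    for ngram, c in ngram_counts.items():
        if freq_of_freq[c + 1] > 0:
            smoothed[ngram] = (c + 1) * freq_of_freq[c + 1] / freq_of_freq[c]
        else:
            smoothed[ngram] = c  # no N-grams with count c + 1; keep the raw count
    return smoothed
```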
16. Backoff and Interpolation
• In the backoff model, like the deleted interpolation model,
we build an N-gram model based on an (N-1)-gram model.
• The difference is that in backoff, if we have non-zero
trigram counts, we rely solely on the trigram counts and
don’t interpolate the bigram and unigram counts at all.
• We only ‘back off’ to a lower-order N-gram if we have zero
evidence for a higher-order N-gram.
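A minimal sketch of the backoff idea for trigrams, in the spirit of the "stupid backoff" scheme listed earlier (alpha is a fixed backoff weight, with 0.4 the commonly cited value); note that the resulting scores are not true, normalized probabilities:

```python
def backoff_score(tri, bi, uni, N, w1, w2, w3, alpha=0.4):
    """Use the trigram estimate when its count is non-zero; otherwise back off
    to the bigram, and then to the unigram, applying the weight alpha each time.
    tri, bi and uni are collections.Counter objects, so unseen events count as 0;
    N is the total number of word tokens."""
    if tri[(w1, w2, w3)] > 0:
        return tri[(w1, w2, w3)] / bi[(w1, w2)]
    if bi[(w2, w3)] > 0:
        return alpha * bi[(w2, w3)] / uni[w2]
    return alpha * alpha * uni[w3] / N
```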