Natural Language Processing CS 1462: N-Grams and Conventional Language Models
CS 1462
1
Some slides borrowed from Carl Sable
N-grams and N-gram Models
2
N-grams and N-gram Models
3
Applications of N-gram Models
4
Corpora
Obtaining counts of things in a natural language typically relies on a corpus (the plural is
corpora)
For example, the Brown corpus is a million-word collection of 500 written texts from
different genres (newspaper, fiction, non-fiction, academic, etc.)
The Brown corpus was assembled at Brown University in 1963-1964
The Brown corpus has 61,805 wordform types, 37,851 lemma types, 1 million wordform
tokens
An example for spoken English is the Switchboard corpus of telephone conversations
between strangers, collected in the early 1990s
The Switchboard corpus contains 2,430 conversations averaging 6 minutes each
The textbook distinguishes the number of types, or distinct words, in a corpus
(a.k.a. the vocabulary size) from the number of tokens in the corpus (i.e., the total count
of wordforms)
The Switchboard corpus has about 20,000 wordform types and about 3 million wordform
tokens (both audio files and transcribed text files with time alignment and other
metadata are available)
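As a concrete illustration, here is a minimal Python sketch of the type/token distinction; the tiny string stands in for a real corpus (Brown, Switchboard transcripts, etc.), and a simple lowercase/whitespace tokenizer is assumed purely for illustration:
```python
# Minimal sketch of the type/token distinction; `text` stands in for a real
# corpus, and a lowercase/whitespace tokenizer is assumed for illustration.
from collections import Counter

text = "the cat sat on the mat and the dog sat too"
tokens = text.lower().split()
counts = Counter(tokens)

print("tokens (total wordforms):", len(tokens))   # 11
print("types (vocabulary size): ", len(counts))   # 8
```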
The previous edition of the textbook mentions a study by Gale and Church estimating
that the vocabulary size grows with at least the square root of the number of tokens
In later topics, we will learn about the Penn Treebank (which Cooper now also has access
to!)
5
Words Revisited
6
Estimating N-gram Probabilities
Recall that an N-gram model computes the probabilities of each possible final token
of an N-gram given the previous tokens
One simple method for computing such N-gram probabilities (the probability of a
word given some history) is to take a simple ratio of counts
Example:
P("the" | "its water is so transparent that") = C("its water is so transparent that the") / C("its water is so transparent that")
The book suggests using a web search engine to compute an estimate for this
example
However, the result of doing this would give you an estimate close to 1, because
many of the hits relate to this example from the book!
Book: "While this method of estimating probabilities directly from counts works fine
in many cases, it turns out that even the web isn't big enough to give us good
estimates in most cases."
In general, the bigger N is, the more data you need to achieve reasonable N-gram
estimates
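Here is a minimal Python sketch of this count-ratio estimate; the toy corpus and helper functions are purely illustrative, and a real estimate would need a very large corpus:
```python
# Minimal sketch: estimating P(word | history) as a ratio of counts.
# The toy corpus below is purely illustrative; a real estimate needs a
# very large corpus (and even the web may not be big enough).
tokens = ("<s> its water is so transparent that the fish are visible </s> "
          "<s> its water is so transparent that you can see the bottom </s>").split()

def count_sequence(tokens, seq):
    """Count how often the word sequence `seq` occurs in `tokens`."""
    n = len(seq)
    return sum(1 for i in range(len(tokens) - n + 1) if tokens[i:i + n] == seq)

def ratio_estimate(tokens, history, word):
    """P(word | history) estimated as C(history word) / C(history)."""
    denom = count_sequence(tokens, history)
    return count_sequence(tokens, history + [word]) / denom if denom else 0.0

history = "its water is so transparent that".split()
print(ratio_estimate(tokens, history, "the"))   # 0.5 in this toy corpus
```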
7
The Chain Rule
8
The Chain Rule
9
Applying N-gram Models
11
Maximum Likelihood Estimates
The simplest way to learn bigram, or more general N-gram, probabilities from a
training corpus is called maximum likelihood estimation (MLE)
For bigrams of words, the equation that can be used is:
P(w_n | w_{n-1}) = C(w_{n-1} w_n) / Σ_w C(w_{n-1} w) = C(w_{n-1} w_n) / C(w_{n-1})
The example we saw earlier is a specific example for a larger N-gram
More generally, MLE sets the parameters of a model in such a way as to maximize
the probability of observations
It is possible to derive the formula above, for bigram models, given this definition
For more general N-grams, the MLE formula becomes:
P(w_n | w_{n-N+1:n-1}) = C(w_{n-N+1:n-1} w_n) / C(w_{n-N+1:n-1})
When computing probabilities for N-grams, it is important that your training set
reflect the nature of the data that you are likely to see later (this is true for
parameter estimation and ML in general)
If you do not know in advance what future data will look like, you may want to use
training data that mixes different genres (e.g., newspaper, fiction, telephone
conversation, web pages, etc.)
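Here is a minimal Python sketch of bigram MLE; the toy sentences and function names are hypothetical, and the <s>/</s> padding is explained on the next slide:
```python
# Minimal sketch of bigram MLE; the toy sentences are hypothetical, and real
# training would use a large corpus matched to the target domain.
from collections import Counter, defaultdict

sentences = [
    "i want to eat".split(),
    "i want chinese food".split(),
]

unigram_counts = Counter()
bigram_counts = defaultdict(Counter)
for sent in sentences:
    words = ["<s>"] + sent + ["</s>"]        # start/end tokens (see next slide)
    for prev, curr in zip(words, words[1:]):
        unigram_counts[prev] += 1            # count of prev as a bigram's first word
        bigram_counts[prev][curr] += 1       # count of the bigram (prev, curr)

def bigram_prob(prev, curr):
    """MLE estimate of P(curr | prev) = C(prev curr) / C(prev)."""
    return bigram_counts[prev][curr] / unigram_counts[prev] if unigram_counts[prev] else 0.0

print(bigram_prob("i", "want"))    # 1.0 in this toy corpus
print(bigram_prob("want", "to"))   # 0.5
```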
12
Start-of-sentence & End-of-sentence Tokens
Although the MLE formula may seem simple and intuitive, there are a few things I want to
point out
First, as pointed out in the textbook, the final simplification occurs because the sum of all
bigram counts that start with a given word must be equal to the unigram count for that word
This may seem to ignore the possibility that w_{n-1} matches the final word of the sequence we are
using to compute statistics
However, as the current draft of the textbook points out, for things to work out appropriately,
we must include both start-of-sentence tokens ("<s>") and end-of-sentence tokens ("</s>")
It is common to process one sentence at a time in NLP; if we are not doing this, we could more
generally use special start and end symbols at the start and end of each document or span of
text being considered
With these extra symbols, w_{n-1} (the second-to-last word in a bigram or other N-gram) will never
be the same as the final word of the sequence being used to compute probabilities
These extra symbols will ensure that the probability estimates for all possible words ending an
N-gram, starting with some particular sequence, add up to one
Additionally, this avoids the need to use a unigram probability at the start of the sequence when
we compute its probability
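Continuing the hypothetical sketch from the previous slide, we can check this sum-to-one property directly:
```python
# Continuing the previous (hypothetical) sketch: with <s>/</s> padding,
# the MLE estimates for all possible next words after a given word sum to 1.
for prev in unigram_counts:
    total = sum(bigram_prob(prev, curr) for curr in bigram_counts[prev])
    print(prev, total)   # prints 1.0 for every word, including <s>
```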
13
Example of Bigram Estimates
14
Bigram example
The bigram counts for "i want" and "want to" are large, whereas the count for "spend to" is very small
15
Normalization
The bigram probability is the count of the two words occurring together divided by the count of the first word
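For example, with purely hypothetical counts C("i want") = 800 and C("i") = 2,500:
P("want" | "i") = C("i want") / C("i") = 800 / 2,500 = 0.32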
16
Natural Language Generation
17
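The examples on the next slides come from models trained on real corpora; as a rough illustration of the standard approach, here is a minimal sketch that samples one word at a time from the hypothetical bigram counts built earlier (not necessarily the exact procedure used to produce the slides' examples):
```python
# Minimal sketch of generation by sampling from a bigram model, reusing the
# hypothetical bigram_counts from the earlier MLE sketch; the Shakespeare and
# WSJ examples simply use much larger training corpora (and possibly larger N).
import random

def generate(max_words=20):
    prev, output = "<s>", []
    for _ in range(max_words):
        choices = list(bigram_counts[prev].keys())
        weights = list(bigram_counts[prev].values())
        curr = random.choices(choices, weights=weights)[0]  # sample next word
        if curr == "</s>":
            break
        output.append(curr)
        prev = curr
    return " ".join(output)

print(generate())
```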
NLG: Trained on Shakespeare
18
NLG: Trained on the Wall Street Journal
19
Language Models
One way to evaluate the performance of a language model (the best way, according
to the textbook) is to embed it in another application (e.g., a machine translation
system) and evaluate its performance
This approach is called extrinsic evaluation; extrinsic evaluation can involve a lot
of time and effort
Alternatively, intrinsic evaluation uses a metric to evaluate the quality of a model
(in this case, a language model) independent of any application
Often, improvements using an intrinsic measure correlate with improvements using
an extrinsic measure
An intrinsic evaluation is based on a test set
Ideally, the test set should not be looked at before the evaluation, and the model
being evaluated should only be applied to the test set once
As mentioned earlier, if we need to tune hyperparameters, we can use a validation
set (that can be used multiple times)
In principle, the way to evaluate a language model with intrinsic evaluation is to
calculate the probability assigned by the model to the test set (although we will soon
discuss pragmatic complications related to this)
This can be compared to probabilities assigned to the test set by other language
models (in theory, better language models will assign higher probabilities)
22
Log Probabilities
One issue is that if many probabilities are multiplied together, the result
would likely be stored as 0 due to finite precision and numerical underflow
We therefore add log probabilities instead
The base doesn’t really matter, as long as it is consistent, but let's assume it
is e
Also recall that the log of a product is equal to the sum of the log of its parts
Using log probabilities avoids 0’s that may otherwise result due to precision
issues
Additionally, it makes computation quicker, since we are doing
additions instead of multiplications
The precomputed probability estimates for N-grams can be stored as log
probabilities directly
Therefore, there is no need to compute the logarithms when evaluating the
language model
If we want to report an actual probability at the end, we can raise e to the
power of the final log probability; e.g.,
p1 * p2 * p3 * p4 = e^(log p1 + log p2 + log p3 + log p4)
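A minimal Python sketch of this, using made-up per-word probabilities:
```python
# Minimal sketch: adding log probabilities instead of multiplying probabilities.
import math

probs = [0.1, 0.002, 0.3, 0.05]              # made-up per-word probabilities
log_prob = sum(math.log(p) for p in probs)   # sum of logs (base e)
print(log_prob)                              # the quantity we actually store
print(math.exp(log_prob))                    # equals 0.1 * 0.002 * 0.3 * 0.05
```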
23
Normalization
The bigram probability is the count of the two words occurring together divided by the count of the first word
24
Perplexity
perplexity(W) = P(w_1 w_2 ... w_N)^(-1/N) = ( Π_{i=1}^{N} 1 / P(w_i | w_1 ... w_{i-1}) )^(1/N)
The higher the probability of the word sequence, the lower the
perplexity
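A minimal sketch of the perplexity computation, using made-up per-word conditional probabilities and working in log space:
```python
# Minimal sketch of perplexity from made-up conditional word probabilities,
# computed in log space to avoid underflow on long sequences.
import math

word_probs = [0.2, 0.1, 0.05, 0.3]           # P(w_i | w_1 ... w_{i-1}) for each word
N = len(word_probs)
log_prob = sum(math.log(p) for p in word_probs)
perplexity = math.exp(-log_prob / N)         # P(W)^(-1/N)
print(perplexity)                            # higher probabilities -> lower perplexity
```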
25
Perplexity Example
26
Smoothing
MLE produces poor estimates when counts are small due to sparse data,
especially when counts are zero
When evaluating a language model on a dataset, if any N-gram occurs that
never occurred in the training set, it would be assigned 0 probability (and
the log probability would be undefined)
Smoothing techniques are modifications that address poor estimates due
to sparse data
Laplace smoothing (e.g., add-one smoothing) is a simple method that adds
a constant (e.g., 1) to all counts
For example, for unigram probability estimates, we have: P_Laplace(w_i) = (c_i + 1) / (N + V),
where N is the total number of word tokens and V is the vocabulary size
We can view smoothing as discounting (lowering) some non-zero counts in
order to obtain the probability mass that will be assigned to the zero counts
The changes due to add-one smoothing are too big in practice
You could instead add a fraction less than one (this is a more general form of
Laplace smoothing), but choosing an appropriate fraction is difficult (in
practice, it can be chosen based on a development set)
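A minimal sketch of add-one smoothed bigram estimates, reusing the hypothetical counts from the earlier MLE sketch:
```python
# Minimal sketch of add-one (Laplace) smoothed bigram estimates, reusing the
# hypothetical unigram_counts / bigram_counts from the earlier MLE sketch.
vocab = set(unigram_counts) | {w for c in bigram_counts.values() for w in c}
V = len(vocab)                               # vocabulary size

def laplace_bigram_prob(prev, curr):
    """P(curr | prev) with add-one smoothing: (C(prev curr) + 1) / (C(prev) + V)."""
    return (bigram_counts[prev][curr] + 1) / (unigram_counts[prev] + V)

print(laplace_bigram_prob("i", "chinese"))   # nonzero even though the bigram is unseen
```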
27
Smoothing
28
Smoothing
29
Smoothing
30
Backoff and Interpolation
32