NLP Exp03


Department of Computer Engineering

Semester: B.E. Semester VII – Computer Engineering
Subject: NLP Lab
Subject Professor In-charge: Prof. Suja Jayachandran
Assisting Teachers: Prof. Suja Jayachandran
Student Name: Sapana Angad Survase
Roll Number: 20102B2005
Grade and Subject Teacher's Signature:

Experiment Number: 3
Experiment Title: To study the N-gram model to calculate bigrams from a given corpus and calculate the probability of a sentence.
Resources / Apparatus Required: Hardware: Computer System; Web IDE: Google Colab; Programming language: Python

Description: A combination of words forms a sentence. However, such a formation is meaningful only when the words are arranged in some order.

E.g.: Sit I car in the

Such a sentence is not grammatically acceptable. However, some perfectly grammatical sentences can be nonsensical too!

E.g.: Colorless green ideas sleep furiously

One easy way to handle such unacceptable sentences is by assigning probabilities to strings of words, i.e., how likely the sentence is.

Probability of a sentence

If we consider each word occurring in its correct location as an independent event, the probability of the sentence is:

P(w(1), w(2), ..., w(n-1), w(n))

Using the chain rule:

P(w(1), w(2), ..., w(n)) = P(w(1)) * P(w(2) | w(1)) * P(w(3) | w(1) w(2)) * ... * P(w(n) | w(1) w(2) ... w(n-1))
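
In code, the chain-rule product looks like the sketch below. This is only an illustration: the conditional probabilities in condProb are made-up values for a hypothetical corpus, not estimates from real data.

# Illustrative chain-rule product with made-up (hypothetical) probabilities.
# Each entry maps (history, word) to an assumed conditional probability.
condProb = {
    ((), 'i'): 0.2,                     # P(i)
    (('i',), 'read'): 0.3,              # P(read | i)
    (('i', 'read'), 'a'): 0.4,          # P(a | i read)
    (('i', 'read', 'a'), 'book'): 0.5,  # P(book | i read a)
}

def chain_rule_probability(words):
  probability = 1.0
  for k in range(len(words)):
    history = tuple(words[:k])                    # every word seen so far
    probability *= condProb[(history, words[k])]
  return probability

print(chain_rule_probability(['i', 'read', 'a', 'book']))  # 0.2 * 0.3 * 0.4 * 0.5 = 0.012

Note that each factor conditions on the entire history seen so far, which is what makes the full chain-rule calculation so long.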

Bigrams

We can avoid this very long calculation by approximating that the probability of a given word depends only on its immediately preceding word rather than on the entire history. This assumption is called the Markov assumption, and such a model is called a Markov model; with a one-word history it gives bigrams. Bigrams can be generalized to the n-gram, which looks at (n-1) words in the past. A bigram is a first-order Markov model.

Therefore, P(w(1), w(2), ..., w(n-1), w(n)) = P(w(2) | w(1)) * P(w(3) | w(2)) * ... * P(w(n) | w(n-1))

We use an (eos) tag to mark the beginning and end of a sentence.

A bigram table for a given corpus can be generated and used as a lookup table for calculating the probability of sentences.
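
As a minimal sketch (separate from the full program given later), such a lookup table can be built from bigram and unigram counts, with P(w2 | w1) estimated as count(w1 w2) / count(w1):

from collections import Counter

# Tokenized corpus with (eos) markers, as in the example below
tokens = 'eos you book a flight eos i read a book eos you read eos'.split()

unigramCounts = Counter(tokens)                  # count(w1)
bigramCounts = Counter(zip(tokens, tokens[1:]))  # count(w1 w2)

def bigram_probability(w1, w2):
  # Maximum-likelihood estimate of P(w2 | w1)
  return bigramCounts[(w1, w2)] / unigramCounts[w1]

print(bigram_probability('you', 'read'))  # count('you read') / count('you') = 1/2 = 0.5
print(bigram_probability('read', 'a'))    # count('read a') / count('read') = 1/2 = 0.5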

E.g.: Corpus - (eos) You book a flight (eos) I read a book (eos) You read (eos)

Bigram Table:

P((eos) you read a book (eos))
= P(you|eos) * P(read|you) * P(a|read) * P(book|a) * P(eos|book)
= 0.33 * 0.5 * 0.5 * 0.5 * 0.5
= 0.020625
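
Expressed in code, this calculation is just a product of lookups into the bigram table. The probabilities below are the rounded values used in the worked example above, entered by hand rather than computed:

# Bigram probabilities from the worked example above (rounded values, entered by hand)
table = {('eos', 'you'): 0.33, ('you', 'read'): 0.5, ('read', 'a'): 0.5,
         ('a', 'book'): 0.5, ('book', 'eos'): 0.5}

sentence = ['eos', 'you', 'read', 'a', 'book', 'eos']
probability = 1.0
for w1, w2 in zip(sentence, sentence[1:]):  # consecutive word pairs
  probability *= table[(w1, w2)]
print(probability)  # 0.33 * 0.5 * 0.5 * 0.5 * 0.5 = 0.020625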

Program:

import re

bigramProbability = []
uniqueWords = []

def preprocess(corpus):
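  # Mark sentence boundaries: prepend 'eos' and turn each '.' into an ' eos' token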
  corpus = 'eos ' + corpus.lower()
  corpus = corpus.replace('.', ' eos')
  return corpus

def generate_tokens(corpus):
  corpus = re.sub(r'[^a-zA-Z0-9\s]', ' ', corpus)
  tokens = [token for token in corpus.split(" ") if token != ""]
  return tokens

def generate_word_counts(wordList):
  wordCount = {}
  for word in wordList:
    if word not in wordCount:
      wordCount.update({word: 1})
    else:
      wordCount[word] += 1
  return wordCount

def generate_ngrams(tokens):
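  # Pair each token with its successor and join the pairs into 'w1 w2' bigram strings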
  ngrams = zip(*[tokens[i:] for i in range(2)])
  return [" ".join(ngram) for ngram in ngrams]

def print_probability_table():
  print('\nBigram Probability Table:\n')
  for word in uniqueWords:
    print('\t', word, end = ' ')
  print()
  for i in range(len(uniqueWords)):
    print(uniqueWords[i], end = ' ')
    probabilities = bigramProbability[i]
    for probability in probabilities:
      print('\t', probability, end = ' ')
    print()

def generate_bigram_table(corpus):
  corpus = preprocess(corpus)
  tokens = generate_tokens(corpus)
  wordCount = generate_word_counts(tokens)
  uniqueWords.extend(list(wordCount.keys()))
  bigrams = generate_ngrams(tokens)
  print(bigrams)
  for firstWord in uniqueWords:
    probabilityList = []
    for secondWord in uniqueWords:
      bigram = firstWord + ' ' + secondWord
      probability = bigrams.count(bigram) / wordCount[firstWord]
      probabilityList.append(probability)
    bigramProbability.append(probabilityList)
  print_probability_table()

def get_probability(sentence):
  corpus = preprocess(sentence)
  tokens = generate_tokens(corpus)
  probability = 1
  for token in range(len(tokens) -1):
    firstWord = tokens[token]
    secondWord = tokens[token + 1]
    pairProbability = bigramProbability[uniqueWords.index(firstWord)][uniqueWords.index(secondWord)]
    print('Probability: {1} | {0} = {2}'.format(firstWord, secondWord, pairProbability))
    probability *= pairProbability
  print('Probability of Sentence:', probability)

corpus = 'You book a flight. I read a book. You read.'
example = 'You read a book.'

print('Corpus:', corpus)
generate_bigram_table(corpus)

print('\nSentence:', example)
get_probability(example)

Output:

Conclusion:

• An n-gram model is a type of probabilistic language model for predicting the next item in a sequence in the form of an (n − 1)-order Markov model.
• n-gram models are now widely used in probability, communication theory, computational linguistics (for instance, statistical natural language processing), computational biology (for instance, biological sequence analysis), and data compression.
• Two benefits of n-gram models (and the algorithms that use them) are simplicity and scalability.
• With a larger n, a model can store more context, with a well-understood space–time trade-off, enabling small experiments to scale up efficiently.
