NLP Exp03


Department of Computer Engineering

Semester: B.E. Semester VII – Computer Engineering
Subject: NLP Lab
Subject Professor In-charge: Prof. Suja Jayachandran
Assisting Teachers: Prof. Suja Jayachandran
Student Name: Sapana Angad Survase
Roll Number: 20102B2005
Grade and Subject Teacher's Signature:

Experiment Number: 3
Experiment Title: To study the N-gram model to calculate bigrams from a given corpus and calculate the probability of a sentence.
Resources / Apparatus Required: Hardware: Computer System; Web IDE: Google Colab; Programming language: Python

Description: A combination of words forms a sentence. However, such a formation is meaningful only when the words are arranged in some order.

E.g.: Sit I car in the

Such a sentence is not grammatically acceptable. However, some perfectly grammatical sentences can be nonsensical too!

E.g.: Colorless green ideas sleep furiously

One easy way to handle such unacceptable sentences is by assigning probabilities to strings of words, i.e., how likely the sentence is.

Probability of a sentence

If we consider each word occurring in its correct location as an independent event, the probability of the sentence is:

P(w(1), w(2), ..., w(n-1), w(n))

Using the chain rule:

P(w(1), w(2), ..., w(n)) = P(w(1)) * P(w(2) | w(1)) * P(w(3) | w(1) w(2)) * ... * P(w(n) | w(1) w(2) ... w(n-1))
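
In code, the chain-rule product looks like the sketch below. This is only an illustration: the conditional probabilities in condProb are made-up values for a hypothetical corpus, not estimates from real data.

# Illustrative chain-rule product with made-up (hypothetical) probabilities.
# Each entry maps (history, word) to an assumed conditional probability.
condProb = {
    ((), 'i'): 0.2,                     # P(i)
    (('i',), 'read'): 0.3,              # P(read | i)
    (('i', 'read'), 'a'): 0.4,          # P(a | i read)
    (('i', 'read', 'a'), 'book'): 0.5,  # P(book | i read a)
}

def chain_rule_probability(words):
  probability = 1.0
  for k in range(len(words)):
    history = tuple(words[:k])                    # every word seen so far
    probability *= condProb[(history, words[k])]
  return probability

print(chain_rule_probability(['i', 'read', 'a', 'book']))  # 0.2 * 0.3 * 0.4 * 0.5 = 0.012

Note that each factor conditions on the entire history seen so far, which is what makes the full chain-rule calculation so long.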

Bigrams

We can avoid this very long calculation by approximating that the probability of a given word depends only on its immediately preceding word rather than on the entire history. This assumption is called the Markov assumption, and such a model is called a Markov model; with a one-word history it gives bigrams. Bigrams can be generalized to the n-gram, which looks at (n-1) words in the past. A bigram is a first-order Markov model.

Therefore, P(w(1), w(2), ..., w(n-1), w(n)) = P(w(2) | w(1)) * P(w(3) | w(2)) * ... * P(w(n) | w(n-1))

We use an (eos) tag to mark the beginning and end of a sentence.

A bigram table for a given corpus can be generated and used as a lookup table for calculating the probability of sentences.
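
As a minimal sketch (separate from the full program given later), such a lookup table can be built from bigram and unigram counts, with P(w2 | w1) estimated as count(w1 w2) / count(w1):

from collections import Counter

# Tokenized corpus with (eos) markers, as in the example below
tokens = 'eos you book a flight eos i read a book eos you read eos'.split()

unigramCounts = Counter(tokens)                  # count(w1)
bigramCounts = Counter(zip(tokens, tokens[1:]))  # count(w1 w2)

def bigram_probability(w1, w2):
  # Maximum-likelihood estimate of P(w2 | w1)
  return bigramCounts[(w1, w2)] / unigramCounts[w1]

print(bigram_probability('you', 'read'))  # count('you read') / count('you') = 1/2 = 0.5
print(bigram_probability('read', 'a'))    # count('read a') / count('read') = 1/2 = 0.5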

E.g.: Corpus - (eos) You book a flight (eos) I read a book (eos) You read (eos)

Bigram Table:

P((eos) you read a book (eos))
= P(you|eos) * P(read|you) * P(a|read) * P(book|a) * P(eos|book)
= 0.33 * 0.5 * 0.5 * 0.5 * 0.5
= 0.020625
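
Expressed in code, this calculation is just a product of lookups into the bigram table. The probabilities below are the rounded values used in the worked example above, entered by hand rather than computed:

# Bigram probabilities from the worked example above (rounded values, entered by hand)
table = {('eos', 'you'): 0.33, ('you', 'read'): 0.5, ('read', 'a'): 0.5,
         ('a', 'book'): 0.5, ('book', 'eos'): 0.5}

sentence = ['eos', 'you', 'read', 'a', 'book', 'eos']
probability = 1.0
for w1, w2 in zip(sentence, sentence[1:]):  # consecutive word pairs
  probability *= table[(w1, w2)]
print(probability)  # 0.33 * 0.5 * 0.5 * 0.5 * 0.5 = 0.020625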

Program:

import re

bigramProbability = []
uniqueWords = []

def preprocess(corpus):
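  # Mark sentence boundaries: prepend 'eos' and turn each '.' into an ' eos' token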
  corpus = 'eos ' + corpus.lower()
  corpus = corpus.replace('.', ' eos')
  return corpus

def generate_tokens(corpus):
  corpus = re.sub(r'[^a-zA-Z0-9\s]', ' ', corpus)
  tokens = [token for token in corpus.split(" ") if token != ""]
  return tokens

def generate_word_counts(wordList):
  wordCount = {}
  for word in wordList:
    if word not in wordCount:
      wordCount.update({word: 1})
    else:
      wordCount[word] += 1
  return wordCount

def generate_ngrams(tokens):
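  # Pair each token with its successor and join the pairs into 'w1 w2' bigram strings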
  ngrams = zip(*[tokens[i:] for i in range(2)])
  return [" ".join(ngram) for ngram in ngrams]

def print_probability_table():
  print('\nBigram Probability Table:\n')
  for word in uniqueWords:
    print('\t', word, end = ' ')
  print()
  for i in range(len(uniqueWords)):
    print(uniqueWords[i], end = ' ')
    probabilities = bigramProbability[i]
    for probability in probabilities:
      print('\t', probability, end = ' ')
    print()

def generate_bigram_table(corpus):
  corpus = preprocess(corpus)
  tokens = generate_tokens(corpus)
  wordCount = generate_word_counts(tokens)
  uniqueWords.extend(list(wordCount.keys()))
  bigrams = generate_ngrams(tokens)
  print(bigrams)
  for firstWord in uniqueWords:
    probabilityList = []
    for secondWord in uniqueWords:
      bigram = firstWord + ' ' + secondWord
      probability = bigrams.count(bigram) / wordCount[firstWord]
      probabilityList.append(probability)
    bigramProbability.append(probabilityList)
  print_probability_table()

def get_probability(sentence):
  corpus = preprocess(sentence)
  tokens = generate_tokens(corpus)
  probability = 1
  for token in range(len(tokens) -1):
    firstWord = tokens[token]
    secondWord = tokens[token + 1]
    pairProbability = bigramProbability[uniqueWords.index(firstWord)][uniqueWords.index(secondWord)]
    print('Probability: {1} | {0} = {2}'.format(firstWord, secondWord, pairProbability))
    probability *= pairProbability
  print('Probability of Sentence:', probability)

corpus = 'You book a flight. I read a book. You read.'
example = 'You read a book.'

print('Corpus:', corpus)
generate_bigram_table(corpus)

print('\nSentence:', example)
get_probability(example)

Output:

Conclusion:

• An n-gram model is a type of probabilistic language model for predicting the next item in a sequence in the form of an (n − 1)-order Markov model.
• n-gram models are now widely used in probability, communication theory, computational linguistics (for instance, statistical natural language processing), computational biology (for instance, biological sequence analysis), and data compression.
• Two benefits of n-gram models (and the algorithms that use them) are simplicity and scalability.
• With a larger n, a model can store more context, with a well-understood space–time trade-off, enabling small experiments to scale up efficiently.
