
INSTITUTE OF TECHNICAL EDUCATION

AND RESEARCH
(SOA Deemed to be University)

Lab Assignment Report


Information Retrieval (CSE 4053)

Submitted By
Name: SANDEEP BEHERA
Registration No.: 1841012480
Branch: CSE
Semester: 7th    Section: D
Contents

Sl. No.   Name of the assignment              Signature

1         Assignment 1 / practice questions
2         Assignment 2
3         Assignment 3
4         Assignment 4
NAME - ROHAN PANDA

IR LAB ASSIGNMENT 1 / PRACTICE QUESTIONS

REGD NO - 1841012123

BRANCH AND SEC - CSE 'D'

#Q1 FIND THE GREATEST AMONG THREE NUMBERS


num1 = 10
num2 = 14
num3 = 12

# uncomment the following lines to take the three numbers from the user
#num1 = float(input("Enter first number: "))
#num2 = float(input("Enter second number: "))
#num3 = float(input("Enter third number: "))

if (num1 >= num2) and (num1 >= num3):
    largest = num1
elif (num2 >= num1) and (num2 >= num3):
    largest = num2
else:
    largest = num3

print("The largest number is", largest)

The largest number is 14

#Q2 Program to display the Fibonacci sequence up to n-th term

nterms = int(input("How many terms? "))

# first two terms


n1, n2 = 0, 1
count = 0

# check if the number of terms is valid


if nterms <= 0:
    print("Please enter a positive integer")
# if there is only one term, return n1
elif nterms == 1:
    print("Fibonacci sequence upto", nterms, ":")
    print(n1)
# generate fibonacci sequence
else:
    print("Fibonacci sequence:")
    while count < nterms:
        print(n1)
        nth = n1 + n2
        # update values
        n1 = n2
        n2 = nth
        count += 1

How many terms? 3


Fibonacci sequence:
0
1
1

#Q3 print your name 10 times


n = 1
while n <= 10:
    print("Rohan")
    n = n + 1
Rohan
Rohan
Rohan
Rohan
Rohan
Rohan
Rohan
Rohan
Rohan
Rohan

#Q4 solve the equation s=ut+1/2at^2 for values u= 0.2 t =5 a=3


u = 0.2
t = 5
a = 3
s = u*t + 1/2*a*t*t
print(s)

38.5

#q5 Initialize two matrices and perform the following operations over them using a switch: transpose, addition, subtraction, multiplication, division
rows = int(input("Enter the Number of rows : "))
column = int(input("Enter the Number of Columns: "))

print("Enter the elements of First Matrix:")
matrix_a = [[int(input()) for i in range(column)] for j in range(rows)]
print("First Matrix is: ")
for n in matrix_a:
    print(n)

print("Enter the elements of Second Matrix:")
matrix_b = [[int(input()) for i in range(column)] for j in range(rows)]
for n in matrix_b:
    print(n)

result = [[0 for i in range(column)] for j in range(rows)]
for i in range(rows):
    for j in range(column):
        result[i][j] = matrix_a[i][j] + matrix_b[i][j]

def switch():
    print("Press 1 for Addition \nPress 2 for Subtraction \nPress 3 for Multiplication \nPress 4 for Division")
    option = int(input("Enter your option: "))
    if option == 1:
        # note: `+` on nested lists concatenates the rows, which is what
        # the printed "Addition" result below shows; `-`, `*` and `/`
        # are not defined for lists and raise TypeError
        result = matrix_a + matrix_b
        print("Addition : ", result)
    elif option == 2:
        result = matrix_a - matrix_b
        print("Subtraction : ", result)
    elif option == 3:
        result = matrix_a * matrix_b
        print("Multiplication : ", result)
    elif option == 4:
        result = matrix_a / matrix_b
        print("Division : ", result)
    else:
        print("Invalid Value")

switch()

Enter the Number of rows : 2


Enter the Number of Columns: 2
Enter the elements of First Matrix:
2
2
3
4
First Matrix is:
[2, 2]
[3, 4]
Enter the elements of Second Matrix:
4
5
6
7
[4, 5]
[6, 7]
Press 1 for Addition
Press 2 for Subtraction
Press 3 for Multiplication
Press 4 for Division
Enter your option: 1
Addition : [[2, 2], [3, 4], [4, 5], [6, 7]]
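Since `+` on Python lists concatenates them (as the printed "Addition" result above shows), the element-wise versions of these operations need nested loops. A minimal sketch; the `elementwise` helper name is illustrative, not part of the assignment code:

```python
# Element-wise operations on two equal-sized matrices (lists of lists).
# op selects the operation, following the menu above: 1 add, 2 subtract,
# 3 multiply, 4 divide.
def elementwise(matrix_a, matrix_b, op):
    rows, cols = len(matrix_a), len(matrix_a[0])
    result = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            if op == 1:
                result[i][j] = matrix_a[i][j] + matrix_b[i][j]
            elif op == 2:
                result[i][j] = matrix_a[i][j] - matrix_b[i][j]
            elif op == 3:
                result[i][j] = matrix_a[i][j] * matrix_b[i][j]
            elif op == 4:
                result[i][j] = matrix_a[i][j] / matrix_b[i][j]
    return result

print(elementwise([[2, 2], [3, 4]], [[4, 5], [6, 7]], 1))  # [[6, 7], [9, 11]]
```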
#Q6 Create a list and perform operations on it: insert at a particular index, append
List = ["Are", "You", "stupid"]
print(List)
List.insert(3, "idiot")
print(List)
List.append("Mad")
print(List)

['Are', 'You', 'stupid']


['Are', 'You', 'stupid', 'idiot']
['Are', 'You', 'stupid', 'idiot', 'Mad']

#Q7 Split a string into substrings
text = 'Ms Dhoni is my inspiration'
print(text.split())

['Ms', 'Dhoni', 'is', 'my', 'inspiration']

#q8 Write a Python program to count the number of strings where the string length is 2 or more and the first and last characters are the same
def match_words(words):
    ctr = 0
    for word in words:
        if len(word) > 1 and word[0] == word[-1]:
            ctr += 1
    return ctr

print(match_words(['abc', 'xyz', 'aba', '1221']))

#Q9 Write a Python program to get a list, sorted in increasing order by the last element in each tuple from a gi
def last(n):
    return n[-1]

def sort_list_last(tuples):
return sorted(tuples, key=last)

print(sort_list_last([(2, 5), (1, 2), (4, 4), (2, 3), (2, 1)]))
[(2, 1), (1, 2), (2, 3), (4, 4), (2, 5)]

#Q10 Write a Python program to generate and print a list of the first and last 5 elements where the values are squares of numbers between 1 and 20 (both included)
def printValues():
    l = list()
    for i in range(1, 21):
        l.append(i**2)
    print(l[:5])
    print(l[-5:])

printValues()

[1, 4, 9, 16, 25]


[256, 289, 324, 361, 400]

#Q11 Write a Python program to convert a list of multiple integers into a single integer.
L = [11, 33, 50]
print("Original List: ",L)
x = int("".join(map(str, L)))
print("Single Integer: ", x)

Original List: [11, 33, 50]


Single Integer: 113350
#Q12 Write a Python program to sort (ascending and descending) a dictionary by value.
import operator
d = {1: 2, 3: 4, 4: 3, 2: 1, 0: 0}
print('Original dictionary : ',d)
sorted_d = sorted(d.items(), key=operator.itemgetter(1))
print('Dictionary in ascending order by value : ', sorted_d)
sorted_d = dict(sorted(d.items(), key=operator.itemgetter(1), reverse=True))
print('Dictionary in descending order by value : ', sorted_d)

Original dictionary : {1: 2, 3: 4, 4: 3, 2: 1, 0: 0}


Dictionary in ascending order by value : [(0, 0), (2, 1), (1, 2), (4, 3), (3, 4)]
Dictionary in descending order by value : {3: 4, 4: 3, 1: 2, 2: 1, 0: 0}

#Q13 Write a Python script to generate and print a dictionary that contains a number (between 1 and n) in the f
n=int(input("Input a number ")) d = dict()

for x in range(1,n+1): d[x]=x*x

print(d)

Input a number 7
{1: 1, 2: 4, 3: 9, 4: 16, 5: 25, 6: 36, 7: 49}

#Q14 Write a Python program to iterate over dictionaries using for loops.
d = {'Red': 1, 'Green': 2, 'Blue': 3}
for color_key, value in d.items():
    print(color_key, 'corresponds to', value)

Red corresponds to 1
Green corresponds to 2
Blue corresponds to 3

#Q15 Write a Python program to create and display all combinations of letters, selecting each letter from a different key in a dictionary
import itertools
d = {'1': ['a', 'b'], '2': ['c', 'd']}
for combo in itertools.product(*[d[k] for k in sorted(d.keys())]):
    print(''.join(combo))

ac
ad
bc
bd

#Q16 Write a Python program to unzip a list of tuples into individual lists.
l = [(1,2), (3,4), (8,9)]
print(list(zip(*l)))

[(1, 3, 8), (2, 4, 9)]

#Q17 Write a Python program to remove an empty tuple(s) from a list of tuples
L = [(), (), ('',), ('a', 'b'), ('a', 'b', 'c'), ('d')]
# note: ('d') is a string, not a one-element tuple, hence 'd' in the output
L = [t for t in L if t]
print(L)

[('',), ('a', 'b'), ('a', 'b', 'c'), 'd']

#Q18 Write a Python function that accepts a string and calculates the number of upper case letters and lower case letters
def string_test(s):
    d = {"UPPER_CASE": 0, "LOWER_CASE": 0}
    for c in s:
        if c.isupper():
            d["UPPER_CASE"] += 1
        elif c.islower():
            d["LOWER_CASE"] += 1
        else:
            pass
    print("Original String : ", s)
    print("No. of Upper case characters : ", d["UPPER_CASE"])
    print("No. of Lower case Characters : ", d["LOWER_CASE"])

string_test('Heyy My Rowdy Boys And Girls')

Original String : Heyy My Rowdy Boys And Girls


No. of Upper case characters : 6
No. of Lower case Characters : 17



NAME ROHAN PANDA

IR LAB ASSIGNMENT 2

REGD NO- 1841012123

BRANCH AND SEC - CSE 'D'

# qno 1 Initialize the following term-document incidence matrix and process the query
# Brutus AND Caesar AND NOT Calpurnia

doc = ["Antony & Cleopatra", "Julius Caesar", "The Tempest", "Hamlet", "Othello", "Macbeth"]
fin = ["Antony", "Brutus", "Caesar", "Calpurnia", "Cleopatra", "Mercy", "Worser"]
mat = []
final_matrix = []
mat.append([1, 1, 0, 0, 0, 1])   # Antony
mat.append([1, 1, 0, 1, 0, 0])   # Brutus
mat.append([1, 1, 0, 1, 1, 1])   # Caesar
mat.append([0, 1, 0, 0, 0, 0])   # Calpurnia
mat.append([1, 0, 0, 0, 0, 0])   # Cleopatra
mat.append([1, 0, 1, 1, 1, 1])   # Mercy
mat.append([1, 0, 1, 1, 1, 0])   # Worser
for x in mat:
    print(x)
for x in range(0, 6):
    if mat[1][x] == 1 and mat[2][x] == 1 and mat[3][x] != 1:
        final_matrix.append(1)
    else:
        final_matrix.append(0)
print("The Matrix is : ")
print(final_matrix)
print("Final document is : ")
for x in range(6):
    if final_matrix[x] == 1:
        print(doc[x])
#qno 2 Given the four documents
#Doc1="Breakthrough drug for Schizophernia"
#Doc2="New Schizophemia drug"
#Doc3="New approach for treatment of Schizophernia"
#Doc4="New hopes for Schizophernia patients"
# GENERATE THE TERM DOCUMENT INCIDENCE MATRIX
Doc1 = "Breakthrough drug for Schizophernia"
Doc2 = "New Schizophemia drug"
Doc3 = "New approach for treatment of Schizophernia"
Doc4 = "New hopes for Schizophernia patients"
d1 = Doc1.split(" ")
d2 = Doc2.split(" ")
d3 = Doc3.split(" ")
d4 = Doc4.split(" ")
words = []
for i in d1:
    words.append(i)
for i in d2:
    words.append(i)
for i in d3:
    words.append(i)
for i in d4:
    words.append(i)
dset = set(words)
words = list(dset)
words = sorted(words, key=str.lower)

mat = []
for x in range(len(words)):
    temp = []
    temp.append(words[x])
    if Doc1.find(words[x]) == -1:
        temp.append(0)
    else:
        temp.append(1)
    if Doc2.find(words[x]) == -1:
        temp.append(0)
    else:
        temp.append(1)
    if Doc3.find(words[x]) == -1:
        temp.append(0)
    else:
        temp.append(1)
    if Doc4.find(words[x]) == -1:
        temp.append(0)
    else:
        temp.append(1)
    mat.append(temp)
print("Term Document Incidence Matrix is :\n")
print('{:<15s}{:^8s}{:^8s}{:^8s}{:^8s}'.format("", "Doc1", "Doc2", "Doc3", "Doc4"))
for x in mat:
    print('{:<15s}{:^8d}{:^8d}{:^8d}{:^8d}'.format(x[0], x[1], x[2], x[3], x[4]))

Term Document Incidence Matrix is :

Doc1 Doc2 Doc3 Doc4


approach 0 0 1 0
Breakthrough 1 0 0 0
drug 1 1 0 0
for 1 0 1 1
hopes 0 0 0 1
New 0 1 1 1
of 0 0 1 0
patients 0 0 0 1
Schizophemia 0 1 0 0
Schizophernia 1 0 1 1
treatment 0 0 1 0

#q3 Construct the inverted index for the document given below
Doc1 = "Breakthrough drug for Schizophernia"
Doc2 = "New Schizophemia drug"
Doc3 = "New approach for treatment of Schizophernia"
Doc4 = "New hopes for Schizophernia patients"
d1 = Doc1.split(" ")
d2 = Doc2.split(" ")
d3 = Doc3.split(" ")
d4 = Doc4.split(" ")
words = []
for i in d1:
    words.append(i)
for i in d2:
    words.append(i)
for i in d3:
    words.append(i)
for i in d4:
    words.append(i)
dset = set(words)
words = list(dset)
words = sorted(words, key=str.lower)

mat = []
for x in range(len(words)):
    temp = []
    temp.append(words[x])
    if Doc1.find(words[x]) != -1:
        temp.append("1")
    if Doc2.find(words[x]) != -1:
        temp.append("2")
    if Doc3.find(words[x]) != -1:
        temp.append("3")
    if Doc4.find(words[x]) != -1:
        temp.append("4")
    mat.append(temp)
print("Inverted Index is :\n")
for x in mat:
print(x)

Inverted Index is :

['approach', '3']
['Breakthrough', '1']
['drug', '1', '2']
['for', '1', '3', '4']
['hopes', '4']
['New', '2', '3', '4']
['of', '3']
['patients', '4']
['Schizophemia', '2']
['Schizophernia', '1', '3', '4']
['treatment', '3']

#q4 Construct the sorting-based inverted index for the following

Doc1 = "Breakthrough drug for Schizophernia"
Doc2 = "New Schizophemia drug"
Doc3 = "New approach for treatment of Schizophernia"
Doc4 = "New hopes for Schizophernia patients"
d1 = Doc1.split(" ")
d2 = Doc2.split(" ")
d3 = Doc3.split(" ")
d4 = Doc4.split(" ")
words = []
for i in d1:
    words.append(i)
for i in d2:
    words.append(i)
for i in d3:
    words.append(i)
for i in d4:
    words.append(i)
dset = set(words)
words = list(dset)
words = sorted(words, key=str.lower)

mat = []
for x in range(len(words)):
    temp = []
    temp.append(words[x])
    c = 0
    if Doc1.find(words[x]) != -1:
        temp.append(1)
        c = c + 1
    if Doc2.find(words[x]) != -1:
        temp.append(2)
        c = c + 1
    if Doc3.find(words[x]) != -1:
        temp.append(3)
        c = c + 1
    if Doc4.find(words[x]) != -1:
        temp.append(4)
        c = c + 1
    temp[0] = temp[0] + " [" + str(c) + "] -> "
    mat.append(temp)
print("Sorted Inverted Index is :\n")
for x in mat:
    for y in x:
        print(y, end=" ")
    print()

Sorted Inverted Index is :

approach [1] -> 3


Breakthrough [1] -> 1
drug [2] -> 1 2
for [3] -> 1 3 4
hopes [1] -> 4
New [3] -> 2 3 4
of [1] -> 3
patients [1] -> 4
Schizophemia [1] -> 2
Schizophernia [3] -> 1 3 4
treatment [1] -> 3

#q5 Process the query Brutus AND Calpurnia using the intersect algorithm
brutus = [1,2,4,11,31,45,173,174]
calpurnia = [2,31,54,101]

def intersect(brutus, calpurnia):
    ans = []
    b = 0
    c = 0
    while b < len(brutus) and c < len(calpurnia):
        if brutus[b] == calpurnia[c]:
            ans.append(brutus[b])
            b = b + 1
            c = c + 1
        elif brutus[b] < calpurnia[c]:
            b = b + 1
        else:
            c = c + 1
    return ans

print(intersect(brutus,calpurnia))
[2, 31]
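The same two-pointer merge extends to other Boolean operators. A sketch for a "brutus AND NOT calpurnia" merge on the same posting lists (the `and_not` function name is illustrative, not from the assignment):

```python
def and_not(p1, p2):
    # Walk both sorted posting lists, keeping docIDs of p1 absent from p2.
    ans, i, j = [], 0, 0
    while i < len(p1):
        if j >= len(p2) or p1[i] < p2[j]:
            ans.append(p1[i])   # docID only in p1
            i += 1
        elif p1[i] == p2[j]:
            i += 1              # docID in both: exclude it
            j += 1
        else:
            j += 1              # advance p2 past smaller docIDs
    return ans

print(and_not([1, 2, 4, 11, 31, 45, 173, 174], [2, 31, 54, 101]))
# [1, 4, 11, 45, 173, 174]
```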
#q6 Implement the intersection algorithm by using skip pointers for the two posting lists
T1 = [8, 16, 19, 23, 25, 28, 43, 71, 81]
T2 = [8, 41, 57, 60, 71]
ANSWER = []
i, ip, j, jp, k = 0, 0, 0, 0, 3
while i < len(T1) and j < len(T2):
    if T1[i] == T2[j]:
        ANSWER.append(T1[i])
        i += 1
        ip += 1
        j += 1
        jp += 1
    elif T1[i] < T2[j]:
        ip = ip + k
        if ip < len(T1) and T1[ip] <= T2[j]:
            while ip < len(T1) and T1[ip] <= T2[j]:
                i = ip
                ip = ip + k
        else:
            i += 1
    else:
        jp = jp + k
        if jp < len(T2) and T2[jp] <= T1[i]:
            while jp < len(T2) and T2[jp] <= T1[i]:
                j = jp
                jp = jp + k
        else:
            j += 1
print("ANSWER =", ANSWER)

ANSWER = [8, 71]

#q7 Implement the intersection algorithm by using skip pointers for the above specified lists (consider k as the square root of each posting list's length)
import math
T1 = [8, 16, 19, 23, 25, 28, 43, 71, 81]
T2 = [8, 41, 57, 60, 71]
ANSWER = []
i, ip, j, jp, k1, k2 = 0, 0, 0, 0, int(math.sqrt(len(T1))), int(math.sqrt(len(T2)))
while i < len(T1) and j < len(T2):
    if T1[i] == T2[j]:
        ANSWER.append(T1[i])
        i += 1
        ip += 1
        j += 1
        jp += 1
    elif T1[i] < T2[j]:
        ip = ip + k1
        if ip < len(T1) and T1[ip] <= T2[j]:
            while ip < len(T1) and T1[ip] <= T2[j]:
                i = ip
                ip = ip + k1
        else:
            i += 1
    else:
        jp = jp + k2
        if jp < len(T2) and T2[jp] <= T1[i]:
            while jp < len(T2) and T2[jp] <= T1[i]:
                j = jp
                jp = jp + k2
        else:
            j += 1
print("ANSWER =", ANSWER)

ANSWER = [8, 71]



NAME - ROHAN PANDA

IR LAB ASSIGNMENT 3

REGD NO - 1841012123

BRANCH AND SECTION - CSE 'D'

#q1 Find the permuterms of a given term of the dictionary, i.e., "hello", and store the posting list of each permuterm to
# retrieve its corresponding term.
str = "hello"
str1 = "$"
permuterm = []
for i in range(len(str) + 1):
    res = str[i:] + str1 + str[:i]
    permuterm.append(res)
print("All Permuterms of string", str, "are:")
print(permuterm)
print()

wild = "hel*o"
wildcard = []
wildcard.append(str1 + wild)
k = 0
for i in reversed(range(len(wild))):
    k += 1
    res = wild[i:] + str1 + wild[:i]
    wildcard.append(res)
    if wildcard[k][-1] == "*":
        break
print("Trailing Wildcard Query : ", wildcard[len(wildcard) - 1])

All Permuterms of string hello are:


['hello$', 'ello$h', 'llo$he', 'lo$hel', 'o$hell', '$hello']

Trailing Wildcard Query : o$hel*
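The rotations above are what make the lookup work: rotate the wildcard query so the `*` is trailing, then prefix-match against the stored permuterms. A small sketch, assuming a query with a single `*` (the `permuterm_lookup` helper name is illustrative):

```python
# Sketch: answering a wildcard query with a permuterm index.
# index maps each rotation (e.g. 'llo$he') back to its source term.
def permuterm_lookup(query, index):
    rotated = query + "$"
    # rotate until the '*' is at the end (assumes exactly one '*')
    while not rotated.endswith("*"):
        rotated = rotated[-1] + rotated[:-1]
    prefix = rotated[:-1]          # drop the trailing '*'
    return {t for rot, t in index.items() if rot.startswith(prefix)}

term = "hello"
index = {term[i:] + "$" + term[:i]: term for i in range(len(term) + 1)}
print(permuterm_lookup("hel*o", index))   # 'hel*o$' rotates to 'o$hel*'
```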

# q2 Find possible bigram/trigram for the term “hello”.


str = "hello"
str1 = "$" + str + "$"
bigram = []
for i in range(len(str1) - 1):
    bigram.append(str1[i:i+2])
print("Bigrams for '" + str + "' are : ", bigram)
trigram = []
for i in range(len(str1) - 2):
    trigram.append(str1[i:i+3])
print("Trigrams for '" + str + "' are : ", trigram)
#q3 Construct a bigram index using the k-gram indexing method by considering a few terms.
str = input("Enter a string :")
str1 = "$" + str + "$"
bigram = []
for i in range(len(str1) - 1):
    bigram.append(str1[i:i+2])
print("Bigrams for '" + str + "' are : ", bigram)
bigramx = []
for x in range(len(bigram)):
    bigramx.append(bigram[x].replace("$", ""))
print("Bigramx : ", bigramx)
words = ["mace", "madden", "among", "amortize", "along", "moon"]
words.sort()
print("Dictionary is: ", words)
list = []
for x in bigramx:
    list.append([])
c = 0
for x in range(len(bigramx)):
    list[c].append(bigram[x] + " -> ")
    for y in words:
        if y.find(bigramx[x]) != -1:
            list[c].append(" " + y + " ")
    c = c + 1
for x in list:
    for y in x:
        print(y, end=" ")
    print()

Enter a string :spiderman


Bigrams for 'spiderman' are : ['$s', 'sp', 'pi', 'id', 'de', 'er', 'rm', 'ma', 'an', 'n$']
Bigramx : ['s', 'sp', 'pi', 'id', 'de', 'er', 'rm', 'ma', 'an', 'n']
Dictionary is: ['along', 'among', 'amortize', 'mace', 'madden', 'moon']
$s ->
sp ->
pi ->
id ->
de -> madden
er ->
rm ->
ma -> mace madden
an ->
n$ -> along among madden moon
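A bigram index like the one printed above answers wildcard queries by intersecting the posting lists of the query's k-grams and then post-filtering, since k-gram overlap alone can admit false positives. A sketch on the same dictionary (the `kgram_lookup` helper name is illustrative):

```python
import re

# Build the bigram index: each padded 2-gram maps to the set of words containing it.
words = ["mace", "madden", "among", "amortize", "along", "moon"]
index = {}
for w in words:
    padded = "$" + w + "$"
    for i in range(len(padded) - 1):
        index.setdefault(padded[i:i+2], set()).add(w)

def kgram_lookup(query):
    padded = "$" + query + "$"
    # keep only grams without the wildcard, e.g. 'm*n' -> ['$m', 'n$']
    grams = [padded[i:i+2] for i in range(len(padded) - 1) if "*" not in padded[i:i+2]]
    candidates = set(words)
    for g in grams:
        candidates &= index.get(g, set())
    # post-filter with the wildcard turned into a regex
    pattern = re.compile("^" + query.replace("*", ".*") + "$")
    return sorted(w for w in candidates if pattern.match(w))

print(kgram_lookup("m*n"))   # ['madden', 'moon']
```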

#q4 Calculate Jaccard’s co-efficient score between “november” and “december”using tri-gram index method.
import nltk
from nltk.util import ngrams

def intersection(lst1, lst2):
    lst3 = [value for value in lst1 if value in lst2]
    return lst3

def union(lst1, lst2):
    final_list = list(set(lst1) | set(lst2))
    return final_list

p = list(ngrams("november", 3))  # november trigrams
q = list(ngrams("december", 3))  # december trigrams
jaccard_coefficient_score = float(len(intersection(p, q)) / len(union(p, q)))
print(jaccard_coefficient_score)

0.3333333333333333
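The same score can be computed without nltk by building the character-trigram sets directly (helper names here are illustrative):

```python
# Jaccard coefficient on character trigrams:
# |trigrams(a) & trigrams(b)| / |trigrams(a) | trigrams(b)|
def trigrams(s):
    return {s[i:i+3] for i in range(len(s) - 2)}

def jaccard(a, b):
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb)

# 'november' and 'december' share {emb, mbe, ber} out of 9 distinct trigrams
print(jaccard("november", "december"))   # 0.3333333333333333
```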

#q5 Calculate edit distance between “cat” and “catcat”.


a = "cat"
b = "catcat"
ls = lambda a, b: len(b) if not a else len(a) if not b \
else min(ls(a[1:],b[1:]) + (a[0]!=b[0]), ls(a[1:],b) + 1,
ls(a,b[1:]) + 1)
## The Result
print(ls(a,b))
print(ls(b,a))

3
3
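The recursive lambda above recomputes subproblems exponentially; the standard Wagner-Fischer dynamic program gives the same distances in O(mn) time. A sketch (the `edit_distance` name is illustrative):

```python
# Wagner-Fischer edit distance: d[i][j] is the distance between a[:i] and b[:j].
def edit_distance(a, b):
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i                     # delete all of a[:i]
    for j in range(n + 1):
        d[0][j] = j                     # insert all of b[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(d[i-1][j] + 1,                     # deletion
                          d[i][j-1] + 1,                     # insertion
                          d[i-1][j-1] + (a[i-1] != b[j-1]))  # substitution
    return d[m][n]

print(edit_distance("cat", "catcat"))   # 3, matching the recursive result
```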

#q6 Write a program to generate soundex code for a term.


list = ["Spiderman", "Ironman", "Captain_america", "Thor", "Hulk", "Hawkeye"]

# uses get_soundex() as defined in q7 below
print("NAME\t\tSOUNDEX")
for name in list:
    print("%s\t\t%s" % (name, get_soundex(name)))
NAME SOUNDEX
Spiderman S136
Ironman I655
Captain_america C135
Thor T600
Hulk H420
Hawkeye H200

#q7 Write a program to find two differently spelled proper nouns #whose soundex codes are the same.
def get_soundex(name): name = name.upper() soundex = "" soundex += name[0]
dictionary = {"BFPV": "1", "CGJKQSXZ":"2", "DT":"3", "L":"4", "MN":"5","R":"6", "AEIOUHWY":"."}
for char in name[1:]:
for key in dictionary.keys():
if char in key:
code = dictionary[key]
if code != soundex[-1]: soundex += code
soundex = soundex.replace(".", "") soundex = soundex[:4].ljust(4, "0") return soundex
print("Two differently spelled proper nouns whose soundex codes are the same are: ") l = ["Google", "Goggle"]
print("NAME\t\tSOUNDEX")
for name in l:
print("%s\t\t%s" % (name, get_soundex(name)))

Two differently spelled proper nouns whose soundex codes are the same are:
NAME		SOUNDEX
Google G240
Goggle G240

# q8 Write a program to find two phonetically similar proper nouns whose soundex codes are different.
def get_soundex(name):
    name = name.upper()
    soundex = ""
    soundex += name[0]
    dictionary = {"BFPV": "1", "CGJKQSXZ": "2", "DT": "3", "L": "4", "MN": "5", "R": "6", "AEIOUHWY": "."}
    for char in name[1:]:
        for key in dictionary.keys():
            if char in key:
                code = dictionary[key]
                if code != soundex[-1]:
                    soundex += code
    soundex = soundex.replace(".", "")
    soundex = soundex[:4].ljust(4, "0")
    return soundex

print("Two phonetically similar proper nouns whose soundex codes are different are: ")
l = ["Chebyshev", "Tchebycheff"]
print("NAME\t\t\tSOUNDEX")
for name in l:
    print("%s\t\t%s" % (name, get_soundex(name)))



NAME - ROHAN PANDA

IR LAB ASSIGNMENT 4

REGD NO - 1841012123

BRANCH AND SECTION - CSE 'D'

#Q1 Write a program to calculate distinct terms present within a term collection using Heap's law.

# AIR CONDITIONING
#
# My telephone receiver slams down on its cradle. I'm
# upset. I am soaked to the skin, sweat runs from my brow.
# The air conditioner that I so naively entrusted to the
# Yellow Pages Repair shop is delayed another two weeks.
#
# I could have it back tomorrow, I was told, if I happen
# to have a compressor relief control valve sensor assembly,
# part number 3B25189927.4A, in my pocket. The repairman is
# a funny fellow.
#
# Very funny.
#
# "Its a bit stuffy in here," my secretary says, in an
# attempt to explain her entering my office. This is
# obvious of course as nary a breeze wafts through the
# three-foot square hole in my wall that appeared in
# synchronization with the air conditioner's disappearance.
# She goes to the thermostat, checks the temperature, and
# adjusts its setting for the fourth time this morning.
# Shaking my head in frustration, I again try to decipher
# the overdue report that is now blurred into illegibility
# by my sweat.
#
# An excellent typist, she's the best secretary I've
# ever had. Completely fulfilling her secretarial duties,
# she otherwise keeps to herself. Although I am by nature a
# curious man, personal matters between us have never been
# discussed. However, with the increase in temperature, her
# attire has of late become remarkable as to its increasing
# skimpiness.
#
# As to the hole in my wall, I have attempted to fill it
# with wadded papers and rags and such. This has proven
# ineffective, no thanks to the active flocks of nesting
# pigeons in the neighborhood.
#
# Last spring I received a bill from the local office
# supply. It was rather badly smeared, but I did notice
# something about furniture. A bill from the local office
# supply shop recently gave me a clue about my secretary's
# personal life.
#
# Her more recent change to now quite revealing attire
# confirms my suspicions.
#
# She obviously spends every non-working hour in
# thorough personal exploration of all things culinary.
#
# In desperation, I reach for the phone.

import re
import string
import nltk
import matplotlib.pyplot as pyplot
import numpy as np
from nltk.tokenize import word_tokenize
from operator import itemgetter
from random import shuffle            # for shuffling
from scipy.optimize import curve_fit

def heaps_formula(n, k, bt):
    return k * (n ** bt)

def tokenizer(l):
    # give the file name here
    f = open(l, 'r')
    strl = f.read()
    list_of_words = nltk.wordpunct_tokenize(strl)
    return list_of_words

def heap_law(tokens_):
    token_sample = [tokens_[0:i] for i in range(0, len(tokens_))]
    type_samples = [len(set(j)) for j in token_sample]
    x = [len(tokens_[0:i]) for i in range(0, len(tokens_))]
    y = type_samples
    # parameter order: function that returns y, x, observed y
    parameters = curve_fit(heaps_formula, x, y)
    k, bt = parameters[0]
    print("Complete Document link:")
    print("Document Size(without punctuation):", len(tokens))
    print("Estimated k value:", k)
    print("Estimated beta:", bt)
    print("Estimated number of distinct words in the document as per heaps law:", round(k * (len(tokens) ** bt)))
    print("Heaps law plot:")
    # x must be an array for element-wise exponentiation
    pyplot.plot(x, k * (np.array(x) ** bt), label="frequency of words")
    pyplot.legend()
    pyplot.show()

tokens = tokenizer(r"E:\data\heapslaw.txt")
tokens = [x for x in tokens if not re.fullmatch('[' + string.punctuation + ']+', x)]
heap_law(tokens)

Complete Document link:http://textfiles.com/stories/7oldsamr.txt


Document Size(without punctuation): 334
Estimated k value: 1.770000015972175
Estimated beta: 0.827011930952926
Estimated number of distinct words in the document as per heaps law: 216
Heaps law plot:
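As a quick sanity check of the fit, Heaps' law V = k * n^beta can be evaluated directly with the estimated parameters and the token count printed above:

```python
# Heaps' law predicts vocabulary size V = k * n**beta for n tokens,
# using the k and beta estimated by curve_fit above.
k, beta, n = 1.770000015972175, 0.827011930952926, 334
print(round(k * n ** beta))   # 216, matching the printed estimate
```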

#Q2 Consider any text document, pre-process it (steps like tokenization, stop word removal, stemming) and calculate the TF-IDF score.
#1 TOKENIZATION
import nltk
from nltk.tokenize import word_tokenize

doc_trump = "Mr Trump became president after winning the political election. Though he lost the support of some republican friends, Trump is friends with President Putin"
nltk_tokens = word_tokenize(doc_trump)
print(nltk_tokens)

['Mr', 'Trump', 'became', 'president', 'after', 'winning', 'the', 'political', 'election', '.', 'Though', 'he', 'lost', 'the', 'support', 'of', 'some', 'republican', 'friends', ',', 'Trump', 'is', 'friends', 'with', 'President', 'Putin']

#2 STOP WORD REMOVAL


from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

doc_trump = """Mr trump became president after winning the political election,
Though he lost the support of some republican friends, Trump is friends with President Putin"""
stop_words = set(stopwords.words('english'))
word_tokens = word_tokenize(doc_trump)

filtered_sentence = [w for w in word_tokens if not w.lower() in stop_words]

filtered_sentence = []
for w in word_tokens:
    if w not in stop_words:
        filtered_sentence.append(w)

print(word_tokens)
print(filtered_sentence)
['Mr', 'trump', 'became', 'president', 'after', 'winning', 'the', 'political', 'election', ',', 'Though', 'he', 'lost', 'the', 'support', 'of', 'some', 'republican', 'friends', ',', 'Trump', 'is', 'friends', 'with', 'President', 'Putin']
['Mr', 'trump', 'became', 'president', 'winning', 'political', 'election', ',', 'Though', 'lost', 'support', 'republican', 'friends', ',', 'Trump', 'friends', 'President', 'Putin']

#3 STEMMING
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

ps = PorterStemmer()
doc_trump = "Mr Trump became president after winning the political election Though he lost the support of some republican friends, Trump is friends with President Putin"
words = word_tokenize(doc_trump)

for w in words:
print(w, " : ", ps.stem(w))

Mr : mr
Trump : trump
became : becam
president : presid
after : after
winning : win
the : the
political : polit
election : elect
Though : though
he : he
lost : lost
the : the
support : support
of : of
some : some
republican : republican
friends : friend
, : ,
Trump : trump
is : is
friends : friend
with : with
President : presid
Putin : putin

#4 CALCULATE TF-IDF SCORE

from sklearn.feature_extraction.text import TfidfVectorizer

# assign documents
doc_trump = "Mr. Trump became president after winning the political election. Though he lost the support of some republican friends, Trump is friends with President Putin"
doc_election = "President Trump says Putin had no political interference is the election outcome. He says it was a witchhunt by political parties. He claimed President Putin is a friend who had nothing to do with the election"
doc_putin = "Post elections, Vladimir Putin became President of Russia. President Putin had served as the Prime Minister earlier in his political career"

# merge documents into a single corpus
string = [doc_trump, doc_election, doc_putin]

# create object
tfidf = TfidfVectorizer()

# get tf-idf values
result = tfidf.fit_transform(string)

# get idf values
print('\nidf values:')
for ele1, ele2 in zip(tfidf.get_feature_names(), tfidf.idf_):
    print(ele1, ':', ele2)

# get indexing
print('\nWord indexes:')
print(tfidf.vocabulary_)

# display tf-idf values
print('\ntf-idf value:')
print(result)

# in matrix form
print('\ntf-idf values in matrix form:')
print(result.toarray())
idf values:
after : 1.6931471805599454
as : 1.6931471805599454
became : 1.2876820724517808
by : 1.6931471805599454
career : 1.6931471805599454
claimed : 1.6931471805599454
do : 1.6931471805599454
earlier : 1.6931471805599454
election : 1.2876820724517808
elections : 1.6931471805599454
friend : 1.6931471805599454
friends : 1.6931471805599454
had : 1.2876820724517808
he : 1.2876820724517808
his : 1.6931471805599454
in : 1.6931471805599454
interference : 1.6931471805599454
is : 1.2876820724517808
it : 1.6931471805599454
lost : 1.6931471805599454
minister : 1.6931471805599454
mr : 1.6931471805599454
no : 1.6931471805599454
nothing : 1.6931471805599454
of : 1.2876820724517808
outcome : 1.6931471805599454
parties : 1.6931471805599454
political : 1.0
post : 1.6931471805599454
president : 1.0
prime : 1.6931471805599454
putin : 1.0
republican : 1.6931471805599454
russia : 1.6931471805599454
says : 1.6931471805599454
served : 1.6931471805599454
some : 1.6931471805599454
support : 1.6931471805599454
the : 1.0
though : 1.6931471805599454
to : 1.6931471805599454
trump : 1.2876820724517808
vladimir : 1.6931471805599454
was : 1.6931471805599454
who : 1.6931471805599454
winning : 1.6931471805599454
witchhunt : 1.6931471805599454
with : 1.2876820724517808

Word indexes:
{'mr': 21, 'trump': 41, 'became': 2, 'president': 29, 'after': 0, 'winning': 45, 'the': 38, 'political': 27, 'election': 8, 'though': 39, 'he': 13, 'lost': 19, 'support': 37, 'of': 24, 'some': 36, 'republican': 32, 'friends': 11, 'is': 17, 'with': 47, 'putin': 31, 'says': 34, 'had': 12, 'no': 22, 'interference': 16, 'outcome': 25, 'it': 18, 'was': 43, 'witchhunt': 46, 'by': 3, 'parties': 26, 'claimed': 5, 'friend': 10, 'who': 44, 'nothing': 23, 'to': 40, 'do': 6, 'post': 28, 'elections': 9, 'vladimir': 42, 'russia': 33, 'served': 35, 'as': 1, 'prime': 30, 'minister': 20, 'earlier': 7, 'in': 15, 'his': 14, 'career': 4}

tf-idf value:
(0, 31) 0.12805554413157658
(0, 47) 0.164894828456289
(0, 17) 0.164894828456289
(0, 11) 0.4336337670028971
(0, 32) 0.21681688350144854
(0, 36) 0.21681688350144854
(0, 24) 0.164894828456289
(0, 37) 0.21681688350144854
(0, 19) 0.21681688350144854
(0, 13) 0.164894828456289
(0, 39) 0.21681688350144854
(0, 8) 0.164894828456289
(0, 27) 0.12805554413157658
(0, 38) 0.25611108826315315
(0, 45) 0.21681688350144854
(0, 0) 0.21681688350144854
(0, 29) 0.25611108826315315
(0, 2) 0.164894828456289
(0, 41) 0.329789656912578
(0, 21) 0.21681688350144854
(1, 6) 0.17151768417912747
(1, 40) 0.17151768417912747
(1, 23) 0.17151768417912747
(1, 44) 0.17151768417912747
(1, 10) 0.17151768417912747
(1, 27) 0.20260221456046645
(1, 38) 0.20260221456046645
(1, 29) 0.20260221456046645
(1, 41) 0.1304436197642709
(2, 4) 0.24095705423107247
(2, 14) 0.24095705423107247
(2, 15) 0.24095705423107247
(2, 7) 0.24095705423107247
(2, 20) 0.24095705423107247
(2, 30) 0.24095705423107247
(2, 1) 0.24095705423107247
(2, 35) 0.24095705423107247
(2, 33) 0.24095705423107247
(2, 42) 0.24095705423107247
(2, 9) 0.24095705423107247
(2, 28) 0.24095705423107247
(2, 12) 0.18325405052000932
(2, 31) 0.28462623568423023
(2, 24) 0.18325405052000932
(2, 27) 0.14231311784211512
(2, 38) 0.14231311784211512
(2, 29) 0.28462623568423023
(2, 2) 0.18325405052000932

tf-idf values in matrix form:


[[0.21681688 0. 0.16489483 0. 0. 0.
0. 0. 0.16489483 0. 0. 0.43363377
0. 0.16489483 0. 0. 0. 0.16489483
0. 0.21681688 0. 0.21681688 0. 0.
0.16489483 0. 0. 0.12805554 0. 0.25611109
0. 0.12805554 0.21681688 0. 0. 0.
0.21681688 0.21681688 0.25611109 0.21681688 0. 0.32978966
0. 0. 0. 0.21681688 0. 0.16489483]
[0. 0. 0. 0.17151768 0. 0.17151768
0.17151768 0. 0.26088724 0. 0.17151768 0.
0.26088724 0.26088724 0. 0. 0.17151768 0.26088724
0.17151768 0. 0. 0. 0.17151768 0.17151768
0. 0.17151768 0.17151768 0.20260221 0. 0.20260221
0. 0.20260221 0. 0. 0.34303537 0.
0. 0. 0.20260221 0. 0.17151768 0.13044362
0. 0.17151768 0.17151768 0. 0.17151768 0.13044362]
[0. 0.24095705 0.18325405 0. 0.24095705 0.
0. 0.24095705 0. 0.24095705 0. 0.
0.18325405 0. 0.24095705 0.24095705 0. 0.
0. 0. 0.24095705 0. 0. 0.
0.18325405 0. 0. 0.14231312 0.24095705 0.28462624
0.24095705 0.28462624 0. 0.24095705 0. 0.24095705
0. 0. 0.14231312 0. 0. 0.
0.24095705 0. 0. 0. 0. 0. ]]

#Q3 Write a program to calculate cosine similarity between any two text documents.
# Define the documents
doc_trump = "Mr. Trump became president after winning the political election. Though he lost the support of some republican friends, Trump is friends with President Putin"
doc_election = "President Trump says Putin had no political interference is the election outcome. He says it was a witchhunt by political parties. He claimed President Putin is a friend who had nothing to do with the election"
doc_putin = "Post elections, Vladimir Putin became President of Russia. President Putin had served as the Prime Minister earlier in his political career"
documents = [doc_trump, doc_election, doc_putin]

# Scikit Learn
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

# Create the Document Term Matrix
count_vectorizer = CountVectorizer(stop_words='english')
count_vectorizer = CountVectorizer()
sparse_matrix = count_vectorizer.fit_transform(documents)

# OPTIONAL: Convert Sparse Matrix to Pandas Dataframe if you want to see the word frequencies.
doc_term_matrix = sparse_matrix.todense()
df = pd.DataFrame(doc_term_matrix,
                  columns=count_vectorizer.get_feature_names(),
                  index=['doc_trump', 'doc_election', 'doc_putin'])
df
              after  as  became  by  career  claimed  do  earlier  election  elections  ...  the  though  to  trump  vladimir  was  who  winning
doc_trump         1   0       1   0       0        0   0        0         1          0  ...    2       1   0      2         0    0    0        1
doc_election      0   0       0   1       0        1   1        0         2          0  ...    2       0   1      1         0    1    1        0
doc_putin         0   1       1   0       1        0   0        1         0          1  ...    1       0   0      0         1    0    0        0

3 rows × 48 columns
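The document-term matrix above is the input to cosine similarity, but the notebook output ends before the similarity is computed. A minimal sketch of the measure itself, on toy count vectors standing in for two rows of the real 48-column matrix (the `cosine` helper name is illustrative):

```python
import math

# cosine(u, v) = (u . v) / (|u| * |v|)
def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(cosine([2, 1, 0, 2], [2, 0, 1, 1]))   # 2/sqrt(6) ≈ 0.8165
```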

