NLP Assignment Anand1

This document contains sample code and questions for an assignment on natural language processing (NLP). It includes code snippets and explanations for tasks like finding collocations in text, converting lists to strings and splitting strings into lists of words, finding word indices, computing word vocabularies across sentences, extracting word slices from text, finding words by length or characteristics, looping through words and applying conditions, defining functions for vocabulary size and word percentage frequency. The assignment covers a range of basic NLP concepts and techniques.

Uploaded by

naman
Copyright
© All Rights Reserved

Assignment 1-NLP

Anand Prakash Singh


Coe2-101503027

Q1: Find the collocations in text5

text5.collocations()

sorted(set([i for i in text5 if i.startswith('b')]))

Q2: Define a variable my_sent to be a list of words. Convert my_sent into a string and then split it back
into a list of words.

>>> my_sent = ['Anand', 'Prakash']
>>> a = ' '.join(my_sent)
>>> a
'Anand Prakash'
>>> a.split(' ')
['Anand', 'Prakash']
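
The round trip above can be sketched as a short script. The word list is taken from the answer; the comparison of split(' ') with split() on a string containing two spaces is an added illustration with a hypothetical input, not part of the original answer.

```python
my_sent = ['Anand', 'Prakash']

# Join the words into one string, separated by single spaces.
joined = ' '.join(my_sent)

# Split the string back into a list of words.
words = joined.split(' ')

# split(' ') yields an empty string between consecutive spaces,
# whereas split() with no argument collapses runs of whitespace.
spaced = 'Anand  Prakash'        # two spaces (hypothetical input)
by_space = spaced.split(' ')
by_whitespace = spaced.split()

print(words, by_space, by_whitespace)
```

For text with irregular spacing, split() with no argument is usually the safer choice.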

Q3: Find the index of the word sunset in text9.

>>> text9.index('sunset')

629
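
As a rough sketch, assuming the index method of an NLTK text behaves like the built-in list.index, the lookup returns the position of the first occurrence and raises ValueError when the word is absent. The token list below is a hypothetical stand-in for text9.

```python
# Hypothetical stand-in for text9.
tokens = ['the', 'sunset', 'was', 'red', 'at', 'sunset']

# index() returns the position of the first occurrence only.
first = tokens.index('sunset')

# A missing word raises ValueError, so guard the lookup when unsure.
try:
    missing = tokens.index('sunrise')
except ValueError:
    missing = -1

print(first, missing)
```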

Q4: Compute the vocabulary of the sentences sent1 ... sent8

running = set(sent1)

running.update(sent2, sent3, sent4, sent5, sent6, sent7, sent8)

running = set([w.lower() for w in running])

sorted(list(running))
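
The same idea can be sketched without NLTK. The three short sentences below are hypothetical stand-ins for sent1 ... sent8; the sketch lowercases each token as it is added, so the set never holds two case variants of the same word.

```python
# Hypothetical stand-ins for sent1 ... sent8.
sentences = [
    ['Call', 'me', 'Ishmael', '.'],
    ['The', 'family', 'of', 'Dashwood', '.'],
    ['call', 'the', 'family'],
]

# Lowercase every token, then collect the distinct tokens in one set.
vocab = set()
for sent in sentences:
    vocab.update(w.lower() for w in sent)

vocabulary = sorted(vocab)
print(vocabulary)
```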

Q5: What is the difference between the following two lines?
>>> sorted(set([w.lower() for w in text1]))
>>> sorted([w.lower() for w in set(text1)])

>>> sorted(set([w.lower() for w in text1]))

Here every word is first converted to lowercase and then the set is built, so the result contains no
repeated words.

>>> sorted([w.lower() for w in set(text1)])

Here the set is built first, so it keeps the uppercase and lowercase forms of a word as distinct items.
After lowercasing, these forms become identical, so the result can contain repeated words.

For example, using a string, whose items are single characters rather than words:

>>> a = 'aaaaaa bbbbbbbbnnnnnnllllKKKKaAAAasSS mmmmmm'
>>> sorted([w.lower() for w in set(a)])
[' ', 'a', 'a', 'b', 'k', 'l', 'm', 'n', 's', 's']
>>> sorted(set([w.lower() for w in a]))
[' ', 'a', 'b', 'k', 'l', 'm', 'n', 's']
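
The same contrast can be shown at word level. This is a minimal sketch in which the token list is a hypothetical stand-in for text1.

```python
# Hypothetical stand-in for text1.
words = ['The', 'the', 'THE', 'whale', 'Whale']

# Lowercase first, then deduplicate: each word appears once.
lower_then_set = sorted(set(w.lower() for w in words))

# Deduplicate first, then lowercase: case variants survive as repeats.
set_then_lower = sorted(w.lower() for w in set(words))

print(lower_then_set)
print(set_then_lower)
```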

Q6: Write the slice expression that extracts the last two words of text2

text2[-2:]

In [1]: text2[-2:]

Out[1]: ['THE', 'END']

Q7: Find all the four-letter words in the Chat Corpus (text5). With the help of a frequency distribution
(FreqDist), show these words in decreasing order of frequency

# Count only the four-letter words, not every token in text5.
f = FreqDist([word for word in text5 if len(word) == 4])

# Sort the (word, count) pairs by count, highest first.
sorted(f.items(), key=lambda pair: pair[1], reverse=True)
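
The ranking step can be sketched with the standard library alone. The snippet uses collections.Counter in place of FreqDist (in NLTK 3, FreqDist is a subclass of Counter), and the token list is a hypothetical stand-in for text5.

```python
from collections import Counter

# Hypothetical stand-in for text5.
tokens = ['that', 'this', 'that', 'lol', 'what', 'that', 'this', 'hi']

# Keep only four-letter tokens, then count them.
counts = Counter(w for w in tokens if len(w) == 4)

# most_common() returns (word, count) pairs in decreasing order of count.
ranked = counts.most_common()
print(ranked)
```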

Q8: Use a combination of for and if statements to loop over the words of the movie script for Monty
Python and the Holy Grail (text6) and print all the uppercase words

# Collect the distinct uppercase words, then print each one.
all_uppers = set([w for w in text6 if w.isupper()])

for i in all_uppers:
    print(i)
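
Since the question asks for a combination of for and if, a variant with the test inside the loop might look like the sketch below. The token list is a hypothetical stand-in for text6, and the seen set is an added detail so that each word prints only once.

```python
# Hypothetical stand-in for text6.
tokens = ['SCENE', '1', ':', 'KING', 'ARTHUR', ':', 'Whoa', 'there', '!', 'KING']

seen = set()
uppercase_words = []
for w in tokens:
    # isupper() is False for digits and punctuation, so only words pass.
    if w.isupper() and w not in seen:
        seen.add(w)
        uppercase_words.append(w)
        print(w)
```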

Q9: Write expressions for finding all words in text6 that meet the following conditions:
a. Ending in ize
b. Containing the letter z
c. Containing the sequence of letters pt
d. All lowercase letters except for an initial capital (i.e., titlecase)

a. Ending in ize

In [1]: [w for w in text6 if len(w) > 4 and w[-3:] == 'ize']

Out[1]: []

b. Containing the letter z

In [1]: list(set([w for w in text6 if w.lower().find('z') != -1]))

Out[1]:

['zhiv',

'zone',

'frozen',

'amazes',

'zoo',

'zoop',

'zoosh',

'AMAZING',

'ZOOT',

'Zoot',

'Fetchez']

c. Containing the sequence of letters pt

In [1]: list(set([w for w in text6 if w.lower().find('pt') != -1]))

Out[1]:

['Chapter',
'temptress',

'temptation',

'excepting',

'Thppt',

'Thppppt',

'Thpppt',

'ptoo',

'Thpppppt',

'aptly',

'empty']

d. All lowercase letters except for an initial capital (i.e., titlecase)

list(set([w for w in text6 if w[0].isupper() and w[1:].islower()]))
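
The four conditions can also be written with the string methods endswith and istitle and the in operator. This is a sketch on a hypothetical token list, not the output for text6. Note that istitle() accepts a single capital letter such as 'A', whereas the slice-based test above rejects it because an empty string is not lowercase.

```python
# Hypothetical stand-in for text6.
tokens = ['Arthur', 'realize', 'frozen', 'Zoot', 'empty', 'KNIGHT', 'size', 'A']

ends_in_ize = [w for w in tokens if w.endswith('ize')]    # a
has_z = [w for w in tokens if 'z' in w.lower()]           # b
has_pt = [w for w in tokens if 'pt' in w.lower()]         # c
titlecase = [w for w in tokens if w.istitle()]            # d

# The slice-based test from the answer, for comparison.
titlecase_slice = [w for w in tokens if w[0].isupper() and w[1:].islower()]

print(ends_in_ize, has_z, has_pt, titlecase, titlecase_slice)
```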

Q10: Define sent to be the list of words ['she', 'sells', 'sea', 'shells', 'by', 'the', 'sea', 'shore']. Now write
code to perform the following tasks:
a. Print all words beginning with sh.
b. Print all words longer than four characters.

In [1]: [w for w in sent if w[0:2] == 'sh']

Out[1]: ['she', 'shells', 'shore']
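
The answer above covers part a only. A minimal sketch of both parts, using the list given in the question, could be:

```python
sent = ['she', 'sells', 'sea', 'shells', 'by', 'the', 'sea', 'shore']

# a. Words beginning with "sh".
starts_with_sh = [w for w in sent if w.startswith('sh')]

# b. Words longer than four characters.
longer_than_four = [w for w in sent if len(w) > 4]

for w in starts_with_sh:
    print(w)
for w in longer_than_four:
    print(w)
```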

Q11: What does the following Python code do? sum([len(w) for w in text1]) Can you use it to work out
the average word length of a text?

It returns the sum of the lengths of all "words" (tokens) in text1.

Yes, we can use it: dividing this total by the number of tokens, len(text1), gives the average word length.

avg_w_len = sum([len(w) for w in text1]) / float(len(text1))
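
A worked instance of the calculation, on a hypothetical four-token list instead of text1, is sketched below. The float() call keeps the division exact under Python 2; in Python 3 the / operator already performs true division.

```python
# Hypothetical stand-in for text1.
tokens = ['Call', 'me', 'Ishmael', '.']

# 4 + 2 + 7 + 1 characters in total.
total_chars = sum(len(w) for w in tokens)

# Average length = total characters / number of tokens.
avg_w_len = total_chars / float(len(tokens))
print(total_chars, avg_w_len)
```

Punctuation tokens count as "words" here, which pulls the average down.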

Q12: Define a function called vocab_size(text) that has a single parameter for the text, and which
returns the vocabulary size of the text.

def vocab_size(text):
    # Lowercase first so that 'The' and 'the' count as one word.
    distinct = set([w.lower() for w in text])
    return len(distinct)

Q13: Define a function percent(word, text) that calculates how often a given word occurs in a text and
expresses the result as a percentage.

def percent(word, text):
    total = len(text)
    occurs = text.count(word)
    # float() avoids integer division under Python 2.
    return 100 * occurs / float(total)
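
A usage sketch for percent follows. The token list is hypothetical, and the guard for an empty text is an added assumption; the answer above does not handle that case.

```python
def percent(word, text):
    total = len(text)
    # Assumption: report 0.0 for an empty text instead of dividing by zero.
    if total == 0:
        return 0.0
    return 100 * text.count(word) / float(total)

# Hypothetical stand-in for a text.
tokens = ['the', 'cat', 'saw', 'the', 'dog']

share = percent('the', tokens)    # 2 of 5 tokens
empty_share = percent('the', [])
print(share, empty_share)
```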
