Natural Language Processing
import nltk
Step 1: Tokenize text
The topic of a Wikipedia article will be used to show how this algorithm works. The article chosen is the one on Linux (https://en.wikipedia.org/wiki/Linux), so the algorithm is expected to return as a topic the word Linux, or at least something related.
The module urllib.request is used to read the web page and obtain the raw HTML content.
import urllib.request
response = urllib.request.urlopen('https://en.wikipedia.org/wiki/Linux')
html = response.read()
print(html)
b'<!DOCTYPE html>\n<html class="client-nojs" lang="en" dir="ltr">\n<head>\n<meta charset="UTF-8"/>\n<title>Linux - Wikipedia</title>\n...
Notice the b character at the beginning of the output. It indicates that this is a bytes object (binary text). To strip the HTML tags and perform a general cleaning of the raw text, we use the BeautifulSoup library, as shown in the code below.
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, "html5lib")
text = soup.get_text(strip=True)
print(text[:100])
Linux - Wikipediadocument.documentElement.className=document.documentElement.className.replace(/(^|\
Convert the text into tokens by using the split() method in Python, which splits a string on whitespace. Notice that up to this point nltk is not being used.
tokens = [t for t in text.split()]
print (tokens)
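The frequency table shown next can be built with nltk's FreqDist class; the following is a minimal sketch of that step, assuming pandas is used for display (mirroring the code applied later to the cleaned tokens).
import nltk
import pandas as pd
# Count the occurrences of each token
freq = nltk.FreqDist(tokens)
# Display the 20 most frequent tokens as a sorted table
df = pd.DataFrame.from_dict(freq, orient='index')
df.sort_values(by=0, ascending=False).head(20)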
the 465
of 269
and 219
Linux 197
on 193
to 166
a 159
for 113
original 110
in 100
is 99
with 71
as 66
operating 57
from 57
system 52
software 51
that 51
distributions 50
also 48
Notice that, understandably, words which are not very meaningful by themselves but important to create links in sentences are the ones with the highest frequencies. It would be hard to infer the topic of the text from the above table due to the presence of many such so-called “stop words,” i.e., words that are important to create links and bring sense, but not relevant to extract the topic of the text. Fortunately, the nltk library has a stopwords functionality to remove such words. We use it to remove them from the original text, so they won’t appear in the frequency count table.
from nltk.corpus import stopwords
nltk.download('stopwords')
clean_tokens = tokens[:]
for token in tokens:
    if token in stopwords.words('english'):
        clean_tokens.remove(token)
With the new list of tokens without stop words (clean_tokens), build the frequency table again and visualize the words which appear with the highest frequency in the text.
freq = nltk.FreqDist(clean_tokens)
df = pd.DataFrame.from_dict(freq, orient='index')
df.sort_values(by=0, ascending=False).head(20)
Linux 197
original 110
operating 57
system 52
software 51
distributions 50
also 48
fromthe 42
Archived 41
originalon 40
use 32
used 31
kernel 29
July 27
RetrievedJune 26
desktop 25
GNU 24
December 24
distribution 24
September 23
Now the most frequent word is “Linux,” which is exactly the expected topic. The next words are “original,” “operating” and “system,” which give some more information about what the article is about. That is a very simple, if not the simplest, NLP algorithm one can use.
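To recap, the whole topic-extraction procedure can be condensed into a short script. This is a minimal sketch of the steps above, not a listing from the original text, assuming the same Wikipedia article and the html5lib parser.
import urllib.request
import nltk
from bs4 import BeautifulSoup
from nltk.corpus import stopwords
nltk.download('stopwords')
# Fetch the article and strip the HTML tags
html = urllib.request.urlopen('https://en.wikipedia.org/wiki/Linux').read()
text = BeautifulSoup(html, 'html5lib').get_text(strip=True)
# Tokenize on whitespace and drop English stop words
stops = set(stopwords.words('english'))
tokens = [t for t in text.split() if t not in stops]
# The most frequent remaining token is taken as the topic
print(nltk.FreqDist(tokens).most_common(1))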
NLTK can also divide a text into sentences, through the sent_tokenize function.
from nltk.tokenize import sent_tokenize
mytext = "Good morning, Alex, how are you? I hope everything is well. Tomorrow will be a nice day, see you buddy."
print(sent_tokenize(mytext))
['Good morning, Alex, how are you?', 'I hope everything is well.', 'Tomorrow will be a nice day, see you buddy.']
To use this algorithm, it is necessary to download the PunktSentenceTokenizer, which is part of the nltk.tokenize.punkt module. This is done through the command: nltk.download('punkt').
The whole text, as a single string, is divided into sentences by nltk recognizing the sentence-ending dots. Notice how the following text, which incorporates the title “Mr.” (with a dot), is still correctly divided.
import nltk
from nltk.tokenize import sent_tokenize
nltk.download('punkt')
mytext = "Good morning, Mr. Alex, how are you? I hope everything is well. Tomorrow will be a nice day."
print(sent_tokenize(mytext))
['Good morning, Mr. Alex, how are you?', 'I hope everything is well.', 'Tomorrow will be a nice day.']
Similarly, words can be tokenized by using the word_tokenize functionality from the nltk library. Notice the result as it is applied to the sentence presented above.
import nltk
from nltk.tokenize import word_tokenize
nltk.download('punkt')
mytext = "Good morning, Mr. Alex, how are you? I hope everything is well. Tomorrow will be a nice day."
print(word_tokenize(mytext))
['Good', 'morning', ',', 'Mr.', 'Alex', ',', 'how', 'are', 'you', '?', 'I', 'hope', 'everything', 'is', 'well', '.', 'Tomorrow', 'will', 'be', 'a', 'nice', 'day', '.']
Notice how this algorithm recognizes that the dot in “Mr.” belongs to the word, keeping it as a single token instead of splitting the period off.
The NLTK library does not work only with the English natural language. In the following example, the sentence tokenizer is used to split a text in Portuguese.
import nltk
from nltk.tokenize import sent_tokenize
nltk.download('punkt')
mytext = "Bom dia, Sr. Alex, como o senhor está? Espero que esteja bem. Amanhã será um ótimo dia."
print(sent_tokenize(mytext))
['Bom dia, Sr. Alex, como o senhor está?', 'Espero que esteja bem.', 'Amanhã será um ótimo dia.']
Notice that the tokenizer did not split at the abbreviation “Sr.” (meaning Mr. in English). For non-English text, the language can also be passed explicitly to sent_tokenize, as shown below.
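A minimal sketch of the explicit form; the language argument selects the pre-trained Punkt model for that language, which ships with the punkt download.
from nltk.tokenize import sent_tokenize
mytext = "Bom dia, Sr. Alex, como o senhor está? Espero que esteja bem."
# Select the pre-trained Portuguese Punkt model explicitly
print(sent_tokenize(mytext, language='portuguese'))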
NLTK also gives access to WordNet, a lexical database which can be used to obtain definitions and usage examples of words. The synsets function returns the list of synonym sets (synsets) of a word, and each synset provides a definition and example sentences.
import nltk
from nltk.corpus import wordnet
nltk.download('wordnet')
syn = wordnet.synsets('patient')
print(syn[0].definition())
print(syn[0].examples())
The output of the above code is,
a person who requires medical care
['the number of emergency patients has grown rapidly']
To obtain synonyms of a word using WordNet, one can use the synsets(word_to_be_inquired) function from the wordnet module and collect the name of each lemma, as shown in the code below.
from nltk.corpus import wordnet
synonyms = []
for syn in wordnet.synsets('car'):
    for lemma in syn.lemmas():
        synonyms.append(lemma.name())
print(synonyms)
['car', 'auto', 'automobile', 'machine', 'motorcar', 'car', 'railcar', 'railway_car', 'railroad_car', 'car', 'gondola', 'car', 'elevator_car', 'cable_car', 'car']
In a similar way, antonyms can be retrieved by doing a slight modification
of the code above.
from nltk.corpus import wordnet
antonyms = []
for syn in wordnet.synsets('beautiful'):
    for l in syn.lemmas():
        if l.antonyms():
            antonyms.append(l.antonyms()[0].name())
print(antonyms)
['ugly']
Stemming is the process of reducing a word to its root form (stem). NLTK implements the classic Porter stemming algorithm through the PorterStemmer class.
from nltk.stem import PorterStemmer
print(PorterStemmer().stem('building'))
The result is,
build
Another stemming algorithm worth mentioning is the Lancaster stemming algorithm. The two algorithms produce slightly different results for some words, as illustrated in the sketch below.
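A minimal sketch of that difference; the word “maximum” is an assumed example, not taken from the original text.
from nltk.stem import PorterStemmer, LancasterStemmer
# Assumed example word on which the two stemmers disagree
print(PorterStemmer().stem('maximum'))     # maximum
print(LancasterStemmer().stem('maximum'))  # maxim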
NLTK also supports stemming of languages other than English, using
the SnowballStemmer class. The supported languages can be visualized by
checking the SnowballStemmer.languages property.
from nltk.stem import SnowballStemmer
print(SnowballStemmer.languages)
('arabic', 'danish', 'dutch', 'english', 'finnish', 'french', 'german', 'hungarian', 'italian', 'norwegian', 'porter', 'portuguese', 'romanian', 'russian', 'spanish', 'swedish')
The code below shows an example usage of SnowballStemmer to stem a word from a non-English language, in this case Portuguese.
from nltk.stem import SnowballStemmer
portuguese_stemmer = SnowballStemmer('portuguese')
print(portuguese_stemmer.stem("trabalhando"))
trabalh
A related technique is lemmatization, which reduces a word to an actual dictionary word (its lemma) rather than a possibly truncated root. NLTK provides it through the WordNetLemmatizer class, which relies on the wordnet corpus.
import nltk
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')
lemmatizer = WordNetLemmatizer()
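A minimal usage sketch; the word “playing” and the pos='v' argument (treat the word as a verb) are assumptions for illustration.
# Assumed example: lemmatize 'playing' as a verb
print(lemmatizer.lemmatize('playing', pos='v'))
The output is play, an actual English word rather than a truncated stem.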