
Natural Language Processing


 Fundamentals of Machine Learning using Python

In essence, Natural Language Processing (NLP) is a technology used to
transform human natural language into something understandable for
machines. Though it may appear simple, it is actually a complex task when
one considers the different languages worldwide, sentence structures,
grammar, etc. This chapter provides a simple introduction to NLP and how
it works.

17.1. DEFINITION OF NATURAL LANGUAGE PROCESSING

A field of artificial intelligence, NLP provides algorithms to aid the
interaction between computers and humans using natural language. The
objective of these algorithms is to read, decipher, understand and make sense
of human language in a valuable manner.
The majority of techniques used in NLP derive from machine learning.
The following steps illustrate a general human-machine interaction
using natural language.
Step 1: A human talks to the machine;
Step 2: The audio is captured by the machine;
Step 3: The machine converts the audio to text;
Step 4: The text is processed;
Step 5: The machine output data is converted to audio;
Step 6: The audio output simulates the machine talking.

17.2. USAGE OF NLP


The following are some common applications of NLP:
Language translation services (Google Translate, Bing Translate);
Grammar checking and correction in word processors (Microsoft
Word, Grammarly);
Interactive Voice Response (IVR) services (e.g., call centers);
Personal Assistant applications (OK Google, Siri, Cortana).
Introduction to Natural Language Processing 

17.3. OBSTACLES IN NLP


The nature of human language presents a natural difficulty for Natural
Language Processing. This is due to the fact that human language is composed
of complex structures. It may carry emotional traits, such as sarcasm, to
pass certain information. Besides that, there are literally thousands of human
languages worldwide, and each language requires a different treatment.
The rules used to pass information to machines may have different
levels of abstraction. For example, making sarcasm
understandable requires a high-level rule with elevated abstraction. On the
other hand, the meaning of item plurality can be easily interpreted in many
languages by the addition of an “s” at the end of the word.
According to Garbade (2018), a comprehensive understanding of
human language is only possible by understanding how words and
concepts are linked, and how such links are used to deliver the message.

17.4. TECHNIQUES USED IN NLP


The two main techniques used in NLP are syntactic and semantic analysis.
Syntactic analysis refers to the way words are arranged in a sentence so
that it is grammatically correct and makes sense. In NLP, this concept is
used to assess how well natural language agrees with grammatical rules.
To do so, algorithms apply grammatical rules to groups of words and obtain
meaning from them.
The following is a list of syntax techniques that are currently used:
Lemmatization: Reduction of the inflected forms of a word into a
unique form for easy analysis.
Morphological segmentation: Division of words into individual
units (morphemes).
Word segmentation: Division of large continuous text into
smaller units.
Part-of-speech (POS) tagging: Identification of the part of speech of each
word (such as verb, adjective, etc.).
Parsing: Grammatical analysis of a sentence.
Sentence boundary disambiguation: Also known as sentence
breaking, consists of detecting sentence boundaries in text.
Stemming: Reduction of words containing affixes to their root form.
Semantic analysis is used to retrieve the meaning of a text.
In machine learning, it involves using algorithms that extract the meaning of
words and the structure of sentences.
The following is a list of semantic techniques that are currently used:
Named entity recognition (NER): Identification and grouping
of words according to their meaning. Examples are names of people
and names of places.
Word sense disambiguation: Retrieval of the meaning of a word based
on the context. One example in English is the word “spirit,”
which may indicate an alcoholic drink or a supernatural entity.
Natural language generation: Expression of semantic intentions in
human language. Requires a database to store all the relevant
semantic intentions that the machine may express.
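Word sense disambiguation, as described in the list above, can be illustrated with a simplified Lesk-style word overlap: the sense whose description shares the most words with the context wins. The sketch below is only an assumption-laden illustration; the two sense descriptions are hypothetical and written by hand, and the function name disambiguate is not part of any library.

```python
# Hypothetical, hand-written sense descriptions for the word "spirit".
SENSES = {
    "alcoholic drink": "strong distilled alcoholic drink served in a bar or glass",
    "supernatural entity": "supernatural being such as a ghost that haunts a place",
}

def disambiguate(context, senses=SENSES):
    """Return the sense whose description shares the most words with the context."""
    context_words = set(context.lower().split())
    scores = {
        sense: len(context_words & set(description.split()))
        for sense, description in senses.items()
    }
    # max() keeps the first sense in case of a tie.
    return max(scores, key=scores.get)

print(disambiguate("he ordered a glass of spirit at the bar"))
print(disambiguate("people say a spirit haunts the old house"))
```

With these descriptions, the first sentence is matched to the drink and the second to the supernatural entity. A practical system would rely on a lexical database and remove stop words before counting overlaps.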

17.5. NLP LIBRARIES


There is currently a great variety of Natural Language Processing libraries.
Some of the most popular are (Ebrahim (2017)):
Natural language toolkit (NLTK) in Python;
Apache OpenNLP;
Stanford NLP suite;
Gate NLP library.
Of the libraries above, NLTK is by far the most popular. According to
Ebrahim (2017), it is easy to learn and use, and is probably the easiest NLP
library.

17.6. PROGRAMMING EXERCISE: SUBJECT/TOPIC EXTRACTION USING NLP

In this exercise, the NLTK library is used to retrieve the topic/subject of a given
text. In summary, this simple algorithm consists of preprocessing the
text, extracting the words with the highest frequency, and assuming that these words
refer to the main topic of the page. For instance, for a text about World War
II, we would expect to retrieve the words “war,” “conflict” or something
similar, which indicates that the text talks about war.
To start, import nltk in a Python terminal.
import nltk
Step 1: Tokenize text
A Wikipedia article will be used to show how this algorithm
works. The article chosen is the one about Linux (https://en.wikipedia.org/wiki/Linux), so
the algorithm is expected to return the word Linux, or at least
something related, as the topic.
The module urllib.request is used to read the web page and obtain the raw
HTML content.
import urllib.request

response = urllib.request.urlopen('https://en.wikipedia.org/wiki/Linux')
html = response.read()
print(html)
b'<!DOCTYPE html>\n<html class="client-nojs" lang="en" dir="ltr">\n<head>\n<meta charset="UTF-8"/>\n<title>Linux – Wikipedia</title>\n...
Notice the b character at the beginning of the html output. It indicates
that the content is a bytes object rather than a text string. To remove the HTML tags and do a general cleaning of the
raw text we use the BeautifulSoup library, as shown in the code below.
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "html5lib")
text = soup.get_text(strip=True)
print(text[:100])
Linux – Wikipediadocument.documentElement.className=document.documentElement.className.replace(/(^|\
Convert the text into tokens by using the split() method in Python, which
splits a string at whitespace. Notice that up to this point nltk is not being
used.
tokens = text.split()
print(tokens)
['Linux', '-', 'Wikipediadocument.documentElement.className=document.documentElement.className.replace(/(^|\\s)client-nojs(\\s|$)/,"$1client-js$2");...
With the words already separated, one can use the FreqDist() class
from the nltk library to count the frequency of words in a list.
freq = nltk.FreqDist(tokens)
We create a table using the pandas library to visualize the frequency of
words. For better readability, we sort the frequencies in descending order
(highest to lowest) using the sort_values(..., ascending=False) method
of the pandas DataFrame. The code below prints the 20 words with the highest
frequencies in the text.
import pandas as pd

df = pd.DataFrame.from_dict(freq, orient='index')
df.sort_values(by=0, ascending=False).head(20)

the 465
of 269
and 219
Linux 197
on 193
to 166
a 159
for 113
original 110
in 100
is 99
with 71
as 66
operating 57
from 57
system 52
software 51
that 51
distributions 50
also 48
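The same kind of frequency count can be sketched with the Python standard library alone, since nltk's FreqDist builds on collections.Counter. The token list below is a hypothetical stand-in for the article tokens, used only to keep the example self-contained.

```python
from collections import Counter

# Hypothetical mini token list standing in for the article tokens.
sample_tokens = "the kernel of Linux is the core of the Linux system".split()

freq = Counter(sample_tokens)

# most_common(n) returns the n most frequent (token, count) pairs,
# ordered from the highest count to the lowest.
top_three = freq.most_common(3)
print(top_three)
```

For a quick look at the most frequent words, most_common() may be enough; the pandas table remains convenient when the counts need further sorting or filtering.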
Notice that, understandably, the words with the highest frequencies are ones
that are not very meaningful on their own but are important for linking sentences together.
One would have to look further down the table above to learn more about the
topic, because of the presence of many so-called “stop words,” i.e., words that are
important to create links and bring sense, but are not relevant for extracting the topic
of the text. Fortunately, the nltk library has functionality to remove such “stop
words.” The first step is to download the list of stop words
for the language of interest (English in this case).


from nltk.corpus import stopwords

nltk.download('stopwords')
stopwords.words('english')[:10]
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data] Package stopwords is already up-to-date!
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]

The output above shows the first 10 English stop words returned by the
stopwords functionality. We use this list to remove such words from the original
text, so they won’t appear in the frequency count table.
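# Build the stop-word set once; membership tests on a set are fast
stop_words = set(stopwords.words('english'))

# Keep only the tokens that are not stop words
clean_tokens = [token for token in tokens if token not in stop_words]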
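One detail worth keeping in mind is that the NLTK English stop-word list is lowercase, so an exact comparison lets tokens such as “The” or “of,” slip through. The sketch below shows one possible way to compare tokens after ignoring case and surrounding punctuation. The small stop-word set and the function name remove_stop_words are assumptions made for illustration; in practice the set would come from stopwords.words('english').

```python
import string

# Illustrative stop-word subset (an assumption, not the full NLTK list).
STOP_WORDS = {"the", "of", "and", "a", "is", "on", "to"}

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop tokens that are stop words once case and edge punctuation are ignored."""
    kept = []
    for token in tokens:
        normalized = token.strip(string.punctuation).lower()
        if normalized and normalized not in stop_words:
            kept.append(token)
    return kept

sample = "The kernel is the core of Linux, and Linux is free.".split()
print(remove_stop_words(sample))
```

The original tokens are kept unchanged in the result; only the comparison uses the normalized form.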
With the new list of tokens without stop words (clean_tokens), rebuild
the frequency table and view the words that appear with the highest
frequency in the text.
freq = nltk.FreqDist(clean_tokens)
df = pd.DataFrame.from_dict(freq, orient='index')
df.sort_values(by=0, ascending=False).head(20)

Linux 197
original 110
operating 57
system 52
software 51
distributions 50
also 48
fromthe 42
Archived 41
originalon 40
use 32
used 31
kernel 29
July 27
RetrievedJune 26
desktop 25
GNU 24
December 24
distribution 24
September 23

As expected, “Linux” is now the most frequent word. The
next words are “original,” “operating” and “system,” which gives us some
information about what the article is about. This is a very simple, if not the
simplest, NLP algorithm one can use.
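The table also contains fused tokens such as “fromthe” and “originalon.” These appear to come from text in adjacent HTML elements being stripped and then joined without any separator. The sketch below reproduces the effect with the standard library html.parser module on a hypothetical HTML snippet; the class TextExtractor and the function extract_text are illustrative names. BeautifulSoup's get_text() accepts a separator argument that plays the same role as the separator parameter here.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect the stripped text of every text node in the document."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        stripped = data.strip()
        if stripped:
            self.chunks.append(stripped)

def extract_text(html, separator=""):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return separator.join(parser.chunks)

# Hypothetical snippet imitating a citation in the article.
snippet = "<cite>Archived from <a>the original</a> on July 3</cite>"

fused = extract_text(snippet)
spaced = extract_text(snippet, separator=" ")
print(fused)
print(spaced)
```

Joining the pieces with a space keeps “from” and “the” as separate tokens, so the later split() step sees real words.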

17.7. TEXT TOKENIZATION USING NLTK


Tokenization means splitting a text or a sentence into smaller parts. A text
cannot be processed without this step. In the exercise above, the text was
divided into tokens using the built-in Python split() method. In this part you
will see how NLTK can be used for that task.
Paragraphs can be tokenized into sentences, and sentences into words,
depending on the problem. NLTK provides both
functionalities (sentence tokenizer and word tokenizer).
Consider the text below.
“Good morning, Alex, how are you? I hope everything is well. Tomorrow
will be a nice day, see you buddy.”
The above text can be tokenized into sentences using the sent_tokenize
functionality from nltk.
import nltk
nltk.download('punkt')

from nltk.tokenize import sent_tokenize

mytext = "Good morning, Alex, how are you? I hope everything is well. Tomorrow will be a nice day, see you buddy."

print(sent_tokenize(mytext))
['Good morning, Alex, how are you?', 'I hope everything is well.', 'Tomorrow will be a nice day, see you buddy.']
To use this algorithm, it is necessary to download the data for the
PunktSentenceTokenizer, which is part of the nltk.tokenize.punkt module.
This is done through the command nltk.download('punkt').
The whole text, as a single string, is divided into sentences based on
the punctuation marks that nltk recognizes as sentence endings. Notice how the following text, which incorporates the title
“Mr.” (with a period), is still correctly divided.
import nltk
nltk.download('punkt')

from nltk.tokenize import sent_tokenize

mytext = "Good morning, Mr. Alex, how are you? I hope everything is well. Tomorrow will be a nice day."

print(sent_tokenize(mytext))
['Good morning, Mr. Alex, how are you?', 'I hope everything is well.', 'Tomorrow will be a nice day.']
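To see why the “Mr.” case is not trivial, it may help to compare a naive splitter with one that knows a few abbreviations. The sketch below uses only the standard re module. The abbreviation set and both function names are assumptions for illustration; the Punkt tokenizer relies on a trained model rather than a fixed list like this one.

```python
import re

# Illustrative abbreviation list (an assumption, far from exhaustive).
ABBREVIATIONS = {"mr.", "mrs.", "dr.", "sr."}

def naive_split(text):
    """Split after every '.', '!' or '?' that is followed by whitespace."""
    return re.split(r"(?<=[.!?])\s+", text.strip())

def split_with_abbreviations(text):
    """Like naive_split, but glue a chunk ending in a known abbreviation to the next one."""
    sentences, current = [], []
    for chunk in naive_split(text):
        if not chunk:
            continue
        current.append(chunk)
        last_word = chunk.split()[-1].lower()
        if last_word not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:
        sentences.append(" ".join(current))
    return sentences

mytext = ("Good morning, Mr. Alex, how are you? I hope everything is well. "
          "Tomorrow will be a nice day.")

print(naive_split(mytext))
print(split_with_abbreviations(mytext))
```

The naive version breaks the first sentence after “Mr.” and returns four pieces, while the second version returns the three expected sentences.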
Similarly, words can be tokenized by using the word_tokenize
functionality from the nltk library. Notice the result when it is applied to the
text presented above.
import nltk
nltk.download('punkt')

from nltk.tokenize import word_tokenize

mytext = "Good morning, Mr. Alex, how are you? I hope everything is well. Tomorrow will be a nice day."

print(word_tokenize(mytext))
['Good', 'morning', ',', 'Mr.', 'Alex', ',', 'how', 'are', 'you', '?', 'I', 'hope', 'everything', 'is', 'well', '.', 'Tomorrow', 'will', 'be', 'a', 'nice', 'day', '.']
Notice how this algorithm recognizes that the period belongs to the
abbreviation “Mr.” and therefore does not split it off.
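For comparison, a purely pattern-based word tokenizer can be sketched with the standard re module. This is an illustrative assumption, not how word_tokenize is implemented, and the name simple_word_tokenize is hypothetical. Unlike the NLTK output above, it detaches the period from the abbreviation.

```python
import re

def simple_word_tokenize(text):
    # \w+ matches a run of letters, digits or underscores;
    # [^\w\s] matches a single punctuation character.
    return re.findall(r"\w+|[^\w\s]", text)

tokens = simple_word_tokenize("Good morning, Mr. Alex, how are you?")
print(tokens)
```

Here “Mr” and “.” come out as two separate tokens, which is the kind of case a dedicated tokenizer is designed to handle.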
The NLTK library does not work only with English. In
the following example, the sentence tokenizer is used to split a text in
Portuguese.
import nltk
nltk.download('punkt')

from nltk.tokenize import sent_tokenize

mytext = "Bom dia, Sr. Alex, como o senhor está? Espero que esteja bem. Amanhã será um ótimo dia."

print(sent_tokenize(mytext))
['Bom dia, Sr. Alex, como o senhor está?', 'Espero que esteja bem.', 'Amanhã será um ótimo dia.']
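Note that sent_tokenize does not detect the language automatically: it uses
its English model unless another one is selected through the language
argument, for example sent_tokenize(mytext, language='portuguese'). In
the example above, the text is still not split at the word “Sr.”
(meaning Mr. in English).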

17.8. SYNONYMS FROM WORDNET


According to Ebrahim (2017), WordNet is a database created and
maintained exclusively for natural language processing. It contains sets of
synonyms with their definitions. The following code shows an example of
retrieving the definition and usage examples of a word using WordNet.
import nltk
nltk.download('wordnet')  # the WordNet data must be available locally

from nltk.corpus import wordnet

syn = wordnet.synsets("patient")

print(syn[0].definition())
print(syn[0].examples())
The output of the above code is:
a person who requires medical care
['the number of emergency patients has grown rapidly']
To obtain synonyms for a word using WordNet, one can use the
synsets(word_to_be_inquired) function from the wordnet module and collect
the name of each lemma, as shown in the code below.
from nltk.corpus import wordnet

synonyms = []

for syn in wordnet.synsets('Car'):
    for lemma in syn.lemmas():
        synonyms.append(lemma.name())

print(synonyms)
['car', 'auto', 'automobile', 'machine', 'motorcar', 'car', 'railcar', 'railway_car', 'railroad_car', 'car', 'gondola', 'car', 'elevator_car', 'cable_car', 'car']
In a similar way, antonyms can be retrieved with a slight modification
to the code above.
from nltk.corpus import wordnet
antonyms = []

for syn in wordnet.synsets("beautiful"):
    for lemma in syn.lemmas():
        if lemma.antonyms():
            antonyms.append(lemma.antonyms()[0].name())

print(antonyms)
['ugly']
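The synonym list printed earlier in this section repeats “car” several times, because the same lemma name belongs to more than one synset. A possible clean-up step, sketched below with plain Python, removes duplicates while keeping the original order and replaces underscores with spaces. The list is copied from that output so the example runs without WordNet.

```python
# Synonym list copied from the output shown earlier in this section.
synonyms = [
    'car', 'auto', 'automobile', 'machine', 'motorcar', 'car', 'railcar',
    'railway_car', 'railroad_car', 'car', 'gondola', 'car', 'elevator_car',
    'cable_car', 'car',
]

# dict.fromkeys keeps the first occurrence of each name, in order.
unique_synonyms = [name.replace('_', ' ') for name in dict.fromkeys(synonyms)]
print(unique_synonyms)
```

Using set(synonyms) would also remove the duplicates, but it would not preserve the order in which WordNet returned them.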

17.9. STEMMING WORDS WITH NLTK


Stemming consists of obtaining the root of a word containing affixes.
For example, the stem of “building” is “build.” This is a common technique
used by search engines when indexing pages. In this way, even when people
write different versions of the same word, all of them are stemmed, thus
converging to the same root word.
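The indexing idea can be sketched with a deliberately crude suffix-stripping function. This is not the Porter algorithm or any NLTK API; the suffix list and the names crude_stem and build_index are assumptions made only to show how different word forms end up under the same key.

```python
from collections import defaultdict

# Illustrative suffixes only; real stemmers apply many more rules.
SUFFIXES = ("ing", "ed", "s")

def crude_stem(word):
    """Strip the first matching suffix if at least three letters remain."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def build_index(words):
    """Group the original words under their crude stem."""
    index = defaultdict(set)
    for word in words:
        index[crude_stem(word)].add(word)
    return dict(index)

index = build_index(["build", "building", "builds", "walked", "walking"])
print(index)
```

A search for any of the three forms of “build” would be stemmed to the same key and therefore reach the same index entry.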
Among the different algorithms available for stemming, the Porter stemming
algorithm is one of the most widely used. NLTK implements this algorithm in the
class PorterStemmer, as shown in the code below.
from nltk.stem import PorterStemmer

print(PorterStemmer().stem('building'))
The result is:
build
Another stemming algorithm worth mentioning is the Lancaster stemming
algorithm. The two algorithms give slightly different results for
some words.
NLTK also supports stemming of languages other than English, using
the SnowballStemmer class. The supported languages can be listed by
checking the SnowballStemmer.languages property.
from nltk.stem import SnowballStemmer
print(SnowballStemmer.languages)
('arabic', 'danish', 'dutch', 'english', 'finnish', 'french', 'german', 'hungarian', 'italian', 'norwegian', 'porter', 'portuguese', 'romanian', 'russian', 'spanish', 'swedish')
The code below shows an example usage of SnowballStemmer to stem a word
from a non-English language, in this case Portuguese.
from nltk.stem import SnowballStemmer

portuguese_stemmer = SnowballStemmer('portuguese')

print(portuguese_stemmer.stem("trabalhando"))
trabalh

17.10. LEMMATIZATION USING NLTK


Similar to stemming, lemmatization returns the root of a word. The
difference is that, while stemming may not always return a real word
as the root, lemmatization always returns a real word. By default the word is treated as a noun,
but a different part of speech can be selected with the pos argument
of the lemmatize function of WordNetLemmatizer, as shown in the
following example.

import nltk
nltk.download('wordnet')

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

print(lemmatizer.lemmatize('painting', pos="v"))
paint
Notice that a word with 8 letters is reduced to 5 letters, a compression of
almost 40%. In many cases, the overall text compression can reach 50 to 60%
by using lemmatization.
Other than a verb (pos = "v"), the lemmatization result can be a noun (pos =
"n"), an adjective (pos = "a") or an adverb (pos = "r").
