Python Web Scraping - Dealing with Text
https://www.tutorialspoint.com/python_web_scraping/python_web_scraping_dealing_with_text.htm
In the previous chapter, we saw how to deal with the videos and images that we obtain as part of web scraped content. In this chapter, we are going to deal with text analysis using a Python library and will learn about it in detail.
Introduction
You can perform text analysis by using the Python library called Natural Language Toolkit (NLTK). Before proceeding to the concepts of NLTK, let us understand the relation between text analysis and web scraping.
Analyzing the words in a text can tell us which words are important, which words are unusual, and how words are grouped. This analysis eases the task of web scraping.
Installing NLTK
If you are using Anaconda, then a conda package for NLTK can be installed by using the following command −
conda install -c anaconda nltk
After installing NLTK, import it with the following command −
import nltk
Now, NLTK data can be downloaded with the help of the following command −
nltk.download()
Installation of all available packages of NLTK will take some time, but it is always recommended to install all the
packages.
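If a project needs only specific datasets, they can also be downloaded individually. For example, the following commands fetch the ‘punkt’ tokenizer models and the ‘wordnet’ corpus, both of which are used later in this chapter −
nltk.download('punkt')
nltk.download('wordnet')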
Installing Other Necessary Packages
gensim − A robust semantic modeling library which is useful for many applications. It can be installed by the following command −
pip install gensim
pattern − Used to make the gensim package work properly. It can be installed by the following command −
pip install pattern
Tokenization
The process of breaking the given text into smaller units called tokens is called tokenization. These tokens can be words, numbers or punctuation marks. Tokenization is also called word segmentation.
Example
The NLTK module provides different packages for tokenization. We can use these packages as per our requirement. Some of the packages are described here −
sent_tokenize package − This package divides the input text into sentences. You can use the following command to import this package −
from nltk.tokenize import sent_tokenize
word_tokenize package − This package divides the input text into words. You can use the following command to import this package −
from nltk.tokenize import word_tokenize
WordPunctTokenizer package − This package divides the input text into words, treating punctuation marks as separate tokens. You can use the following command to import this package −
from nltk.tokenize import WordPunctTokenizer
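As a minimal sketch, the following code shows how these three tokenizers behave on a sample sentence (the sentence is illustrative, and sent_tokenize as well as word_tokenize require the ‘punkt’ data downloaded earlier) −
from nltk.tokenize import sent_tokenize, word_tokenize, WordPunctTokenizer

text = "Web scraping isn't hard. NLTK makes text analysis easy!"

# Split into sentences.
print(sent_tokenize(text))
# ["Web scraping isn't hard.", 'NLTK makes text analysis easy!']

# Split into words; contractions become separate tokens such as 'is' and "n't".
print(word_tokenize(text))
# ['Web', 'scraping', 'is', "n't", 'hard', '.', 'NLTK', 'makes', 'text', 'analysis', 'easy', '!']

# Split into alphabetic and non-alphabetic tokens, so the apostrophe stands alone.
print(WordPunctTokenizer().tokenize(text))
# ['Web', 'scraping', 'isn', "'", 't', 'hard', '.', 'NLTK', 'makes', 'text', 'analysis', 'easy', '!']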
Stemming
In any language, there are different forms of a word. A language includes lots of variation due to grammatical reasons. For example, consider the words democracy, democratic, and democratization. For machine learning as well as for web scraping projects, it is important for machines to understand that these different words have the same base form. Hence it can be useful to extract the base forms of the words while analyzing the text.
This can be achieved by stemming, which may be defined as the heuristic process of extracting the base forms of words by chopping off their endings.
The NLTK module provides different packages for stemming. We can use these packages as per our requirement. Some of these packages are described here −
PorterStemmer package − Porter’s algorithm is used by this Python stemming package to extract the base form. You can use the following command to import this package −
from nltk.stem.porter import PorterStemmer
For example, after giving the word ‘writing’ as the input to this stemmer, the output would be the word ‘write’
after stemming.
LancasterStemmer package − Lancaster’s algorithm is used by this Python stemming package to extract the base form. You can use the following command to import this package −
from nltk.stem.lancaster import LancasterStemmer
For example, after giving the word ‘writing’ as the input to this stemmer, the output would be the word ‘writ’ after stemming.
SnowballStemmer package − Snowball’s algorithm is used by this Python stemming package to extract the base form. You can use the following command to import this package −
from nltk.stem.snowball import SnowballStemmer
For example, after giving the word ‘writing’ as the input to this stemmer, the output would be the word ‘write’ after stemming.
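As a minimal sketch, the following code compares the three stemmers on the word ‘writing’; note that SnowballStemmer takes the language as an argument −
from nltk.stem.porter import PorterStemmer
from nltk.stem.lancaster import LancasterStemmer
from nltk.stem.snowball import SnowballStemmer

word = 'writing'
print(PorterStemmer().stem(word))             # write
print(LancasterStemmer().stem(word))          # writ
print(SnowballStemmer('english').stem(word))  # write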
Lemmatization
Another way to extract the base form of words is lemmatization, which normally aims to remove inflectional endings by using vocabulary and morphological analysis. The base form of any word after lemmatization is called a lemma.
WordNetLemmatizer package − It will extract the base form of the word depending upon whether it is used as a noun or as a verb. You can use the following command to import this package −
from nltk.stem import WordNetLemmatizer
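A minimal sketch, assuming the ‘wordnet’ corpus has been downloaded via nltk.download() −
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize('words'))             # word (treated as a noun by default)
print(lemmatizer.lemmatize('writing', pos='v'))  # write (treated as a verb)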
Chunking
Chunking, which means dividing the data into small chunks, is one of the important processes in natural language processing; it is used to identify parts of speech and short phrases like noun phrases. Chunking labels the tokens, and with the help of the chunking process we can get the structure of the sentence.
Example
In this example, we are going to implement noun-phrase (NP) chunking by using the NLTK Python module. NP chunking is a category of chunking which finds the noun phrase chunks in the sentence.
We need to follow the steps given below for implementing noun-phrase chunking −
In the first step, we will define the grammar for chunking. It consists of the rules which we need to follow. Then, we will create a chunk parser, which parses the grammar and gives the output.
import nltk
Next, we need to define the sentence. Here, DT is the determiner, VBP the verb, JJ the adjective, IN the preposition and NN the noun.
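For example, a tagged sentence can be defined as a list of (word, tag) tuples; the sentence below is an illustrative assumption −
sentence = [("a", "DT"), ("clever", "JJ"), ("fox", "NN"), ("was", "VBP"),
   ("jumping", "VBP"), ("over", "IN"), ("the", "DT"), ("lazy", "JJ"), ("dog", "NN")]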
grammar = "NP:{<DT>?<JJ>*<NN>}"
Now, the next line of code will define a parser for parsing the grammar.
parser_chunking = nltk.RegexpParser(grammar)
output = parser_chunking.parse(sentence)
With the help of the following code, we can draw our output in the form of a tree.
output.draw()
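If a graphical display is not available, the chunked output can also be printed as a bracketed tree −
print(output)
For the illustrative sentence above, this prints something like (S (NP a/DT clever/JJ fox/NN) was/VBP jumping/VBP over/IN (NP the/DT lazy/JJ dog/NN)), with the two noun phrases grouped under NP.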
Bag of Words (BoW) Model: Extracting and Converting the Text into Numeric Form
Bag of Words (BoW), a useful model in natural language processing, is basically used to extract features from text. After extracting the features from the text, they can be used in modeling with machine learning algorithms, because raw text cannot be used directly in ML applications.
Initially, the model extracts a vocabulary from all the words in the document. Later, using a document-term matrix, it builds a model. In this way, the BoW model represents the document as a bag of words only; the order and structure of the words are discarded.
Example
Consider the following two sentences −
Sentence 1 − This is an example of bag of words model.
Sentence 2 − We can extract features by using bag of words model.
By considering these two sentences, we have the following 14 distinct words −
This, is, an, example, bag, of, words, model, we, can, extract, features, by, using
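A minimal sketch of building this vocabulary, assuming the scikit-learn library is available (its CountVectorizer class implements the BoW model), is shown below; running it prints the mapping shown in the output −
from sklearn.feature_extraction.text import CountVectorizer

sentences = [
   "This is an example of bag of words model.",
   "We can extract features by using bag of words model."
]
vectorizer = CountVectorizer()
# Learn the vocabulary and build the document-term matrix.
vectorizer.fit_transform(sentences)
print(vectorizer.vocabulary_)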
Output
{
'this': 10, 'is': 7, 'an': 0, 'example': 4, 'of': 9,
'bag': 1, 'words': 13, 'model': 8, 'we': 12, 'can': 3,
'extract': 5, 'features': 6, 'by': 2, 'using': 11
}
Topic Modeling: Identifying Patterns in Text Data
Topic modeling identifies hidden patterns in text data. It is useful in tasks such as the following −
Text Classification − Classification can be improved by topic modeling because it groups similar words together rather than using each word separately as a feature.
Recommender Systems − We can build recommender systems by using similarity measures.
The following algorithms can be used for implementing topic modeling −
Latent Dirichlet Allocation (LDA) − It is one of the most popular algorithms that uses probabilistic graphical models for implementing topic modeling.
Latent Semantic Analysis (LSA) or Latent Semantic Indexing (LSI) − It is based upon linear algebra and uses the concept of SVD (Singular Value Decomposition) on the document-term matrix.
Non-Negative Matrix Factorization (NMF) − It is also based upon linear algebra, like LDA.
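As a rough sketch of topic modeling in practice, the gensim library installed earlier can train a small LDA model; the documents and parameters below are illustrative assumptions −
from gensim import corpora, models

# Illustrative tokenized documents; in practice these would be scraped and tokenized text.
documents = [
   ["web", "scraping", "extracts", "text", "from", "pages"],
   ["topic", "modeling", "finds", "hidden", "patterns", "in", "text"],
   ["scraping", "collects", "raw", "text", "data", "from", "pages"],
]

dictionary = corpora.Dictionary(documents)               # map each word to an integer id
corpus = [dictionary.doc2bow(doc) for doc in documents]  # BoW representation of each document

# Train an LDA model with two topics and print the top words per topic.
lda = models.LdaModel(corpus, num_topics=2, id2word=dictionary, passes=10)
for topic in lda.print_topics():
    print(topic)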