Python Web Scraping - Dealing with Text
https://www.tutorialspoint.com/python_web_scraping/python_web_scraping_dealing_with_text.htm
In the previous chapter, we saw how to deal with the videos and images that we obtain as part of web scraped content. In this chapter, we are going to deal with text analysis using a Python library and will learn about it in detail.
Introduction
You can perform text analysis by using the Python library called Natural Language Toolkit (NLTK). Before proceeding to the concepts of NLTK, let us understand the relation between text analysis and web scraping.
Analyzing the words in a text can tell us which words are important, which words are unusual, and how words are grouped. This analysis eases the task of web scraping.
Installing NLTK
If you are using Anaconda, then a conda package for NLTK can be installed by using the following command −
conda install -c anaconda nltk
After installing NLTK, import it with the following command −
import nltk
Now, NLTK data can be downloaded with the help of the following command −
nltk.download()
Installation of all available packages of NLTK will take some time, but it is always recommended to install all the
packages.
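If a project needs only specific datasets, they can also be downloaded individually. For example, the following commands fetch the ‘punkt’ tokenizer models and the ‘wordnet’ corpus, both of which are used later in this chapter −
nltk.download('punkt')
nltk.download('wordnet')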
Installing Other Necessary Packages
gensim − A robust semantic modeling library which is useful for many applications. It can be installed by the following command −
pip install gensim
pattern − Used to make the gensim package work properly. It can be installed by the following command −
pip install pattern
Tokenization
The process of breaking the given text into smaller units called tokens is called tokenization. These tokens can be words, numbers or punctuation marks. Tokenization is also called word segmentation.
Example
The NLTK module provides different packages for tokenization. We can use these packages as per our requirement. Some of the packages are described here −
sent_tokenize package − This package divides the input text into sentences. You can use the following command to import this package −
from nltk.tokenize import sent_tokenize
word_tokenize package − This package divides the input text into words. You can use the following command to import this package −
from nltk.tokenize import word_tokenize
WordPunctTokenizer package − This package divides the input text into words, treating punctuation marks as separate tokens. You can use the following command to import this package −
from nltk.tokenize import WordPunctTokenizer
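As a minimal sketch, the following code shows how these three tokenizers behave on a sample sentence (the sentence is illustrative, and sent_tokenize as well as word_tokenize require the ‘punkt’ data downloaded earlier) −
from nltk.tokenize import sent_tokenize, word_tokenize, WordPunctTokenizer

text = "Web scraping isn't hard. NLTK makes text analysis easy!"

# Split into sentences.
print(sent_tokenize(text))
# ["Web scraping isn't hard.", 'NLTK makes text analysis easy!']

# Split into words; contractions become separate tokens such as 'is' and "n't".
print(word_tokenize(text))
# ['Web', 'scraping', 'is', "n't", 'hard', '.', 'NLTK', 'makes', 'text', 'analysis', 'easy', '!']

# Split into alphabetic and non-alphabetic tokens, so the apostrophe stands alone.
print(WordPunctTokenizer().tokenize(text))
# ['Web', 'scraping', 'isn', "'", 't', 'hard', '.', 'NLTK', 'makes', 'text', 'analysis', 'easy', '!']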
Stemming
In any language, there are different forms of a word. A language includes lots of variation due to grammatical reasons. For example, consider the words democracy, democratic, and democratization. For machine learning as well as for web scraping projects, it is important for machines to understand that these different words have the same base form. Hence it can be useful to extract the base forms of the words while analyzing the text.
This can be achieved by stemming, which may be defined as the heuristic process of extracting the base forms of words by chopping off their endings.
The NLTK module provides different packages for stemming. We can use these packages as per our requirement. Some of these packages are described here −
PorterStemmer package − Porter’s algorithm is used by this Python stemming package to extract the base form. You can use the following command to import this package −
from nltk.stem.porter import PorterStemmer
For example, after giving the word ‘writing’ as the input to this stemmer, the output would be the word ‘write’
after stemming.
LancasterStemmer package − Lancaster’s algorithm is used by this Python stemming package to extract the base form. You can use the following command to import this package −
from nltk.stem.lancaster import LancasterStemmer
For example, after giving the word ‘writing’ as the input to this stemmer, the output would be the word ‘writ’ after stemming.
SnowballStemmer package − Snowball’s algorithm is used by this Python stemming package to extract the base form. You can use the following command to import this package −
from nltk.stem.snowball import SnowballStemmer
For example, after giving the word ‘writing’ as the input to this stemmer, the output would be the word ‘write’ after stemming.
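As a minimal sketch, the following code compares the three stemmers on the word ‘writing’; note that SnowballStemmer takes the language as an argument −
from nltk.stem.porter import PorterStemmer
from nltk.stem.lancaster import LancasterStemmer
from nltk.stem.snowball import SnowballStemmer

word = 'writing'
print(PorterStemmer().stem(word))             # write
print(LancasterStemmer().stem(word))          # writ
print(SnowballStemmer('english').stem(word))  # write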
Lemmatization
Another way to extract the base form of words is lemmatization, which normally aims to remove inflectional endings by using vocabulary and morphological analysis. The base form of any word after lemmatization is called a lemma.
WordNetLemmatizer package − It will extract the base form of the word depending upon whether it is used as a noun or as a verb. You can use the following command to import this package −
from nltk.stem import WordNetLemmatizer
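A minimal sketch, assuming the ‘wordnet’ corpus has been downloaded via nltk.download() −
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize('words'))             # word (treated as a noun by default)
print(lemmatizer.lemmatize('writing', pos='v'))  # write (treated as a verb)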
Chunking
Chunking, which means dividing the data into small chunks, is one of the important processes in natural language processing; it is used to identify parts of speech and short phrases like noun phrases. Chunking labels the tokens, and with the help of the chunking process we can get the structure of the sentence.
Example
In this example, we are going to implement noun-phrase (NP) chunking by using the NLTK Python module. NP chunking is a category of chunking which finds the noun phrase chunks in the sentence.
We need to follow the steps given below for implementing noun-phrase chunking −
In the first step, we will define the grammar for chunking. It consists of the rules which we need to follow. Then, we will create a chunk parser, which parses the grammar and gives the output.
import nltk
Next, we need to define the sentence. Here, DT is the determiner, VBP the verb, JJ the adjective, IN the preposition and NN the noun.
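For example, a tagged sentence can be defined as a list of (word, tag) tuples; the sentence below is an illustrative assumption −
sentence = [("a", "DT"), ("clever", "JJ"), ("fox", "NN"), ("was", "VBP"),
   ("jumping", "VBP"), ("over", "IN"), ("the", "DT"), ("lazy", "JJ"), ("dog", "NN")]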
grammar = "NP:{<DT>?<JJ>*<NN>}"
Now, the next line of code will define a parser for parsing the grammar.
parser_chunking = nltk.RegexpParser(grammar)
output = parser_chunking.parse(sentence)
With the help of the following code, we can draw our output in the form of a tree.
output.draw()
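If a graphical display is not available, the chunked output can also be printed as a bracketed tree −
print(output)
For the illustrative sentence above, this prints something like (S (NP a/DT clever/JJ fox/NN) was/VBP jumping/VBP over/IN (NP the/DT lazy/JJ dog/NN)), with the two noun phrases grouped under NP.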
Bag of Words (BoW) Model: Extracting and Converting the Text into Numeric Form
Bag of Words (BoW), a useful model in natural language processing, is basically used to extract features from text. After extracting the features from the text, they can be used in modeling with machine learning algorithms, because raw text cannot be used directly in ML applications.
Initially, the model extracts a vocabulary from all the words in the document. Later, using a document-term matrix, it builds a model. In this way, the BoW model represents the document as a bag of words only; the order and structure of the words are discarded.
Example
Consider the following two sentences −
Sentence 1 − This is an example of bag of words model.
Sentence 2 − We can extract features by using bag of words model.
By considering these two sentences, we have the following 14 distinct words −
This, is, an, example, bag, of, words, model, we, can, extract, features, by, using
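A minimal sketch of building this vocabulary, assuming the scikit-learn library is available (its CountVectorizer class implements the BoW model), is shown below; running it prints the mapping shown in the output −
from sklearn.feature_extraction.text import CountVectorizer

sentences = [
   "This is an example of bag of words model.",
   "We can extract features by using bag of words model."
]
vectorizer = CountVectorizer()
# Learn the vocabulary and build the document-term matrix.
vectorizer.fit_transform(sentences)
print(vectorizer.vocabulary_)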
Output
{
'this': 10, 'is': 7, 'an': 0, 'example': 4, 'of': 9,
'bag': 1, 'words': 13, 'model': 8, 'we': 12, 'can': 3,
'extract': 5, 'features': 6, 'by': 2, 'using': 11
}
Topic Modeling: Identifying Patterns in Text Data
Topic modeling identifies hidden patterns in text data. It is useful in tasks such as the following −
Text Classification − Classification can be improved by topic modeling because it groups similar words together rather than using each word separately as a feature.
Recommender Systems − We can build recommender systems by using similarity measures.
The following algorithms can be used for implementing topic modeling −
Latent Dirichlet Allocation (LDA) − It is one of the most popular algorithms that uses probabilistic graphical models for implementing topic modeling.
Latent Semantic Analysis (LSA) or Latent Semantic Indexing (LSI) − It is based upon linear algebra and uses the concept of SVD (Singular Value Decomposition) on the document-term matrix.
Non-Negative Matrix Factorization (NMF) − It is also based upon linear algebra, like LDA.
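As a rough sketch of topic modeling in practice, the gensim library installed earlier can train a small LDA model; the documents and parameters below are illustrative assumptions −
from gensim import corpora, models

# Illustrative tokenized documents; in practice these would be scraped and tokenized text.
documents = [
   ["web", "scraping", "extracts", "text", "from", "pages"],
   ["topic", "modeling", "finds", "hidden", "patterns", "in", "text"],
   ["scraping", "collects", "raw", "text", "data", "from", "pages"],
]

dictionary = corpora.Dictionary(documents)               # map each word to an integer id
corpus = [dictionary.doc2bow(doc) for doc in documents]  # BoW representation of each document

# Train an LDA model with two topics and print the top words per topic.
lda = models.LdaModel(corpus, num_topics=2, id2word=dictionary, passes=10)
for topic in lda.print_topics():
    print(topic)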