Module 2.4: Text Processing


Course Code: CSA3002

MACHINE LEARNING ALGORITHMS

Course Type: LPC – 2-2-3


Course Objectives
• The objective of the course is to familiarize learners with
the concepts of machine learning algorithms and to develop
skills through experiential learning techniques.
Course Outcomes
At the end of the course, students should be able to
1. Understand training and testing of datasets using machine
learning techniques.
2. Apply optimization and parameter-tuning techniques to machine
learning algorithms.
3. Apply a machine learning model to solve various problems using
machine learning algorithms.
4. Apply machine learning algorithms to create models.
Text Processing Techniques
• Text processing, also known as text manipulation or text analysis, is
the computational process of extracting, transforming, and interpreting
textual data.
• It involves various techniques and methods to work with unstructured
text, enabling the extraction of meaningful information, patterns, and
insights from text documents.
• Text processing is a fundamental component of natural language
processing (NLP) and is used in a wide range of applications,
including information retrieval, sentiment analysis, text classification,
and text mining.
Text processing can be used for a variety of tasks, including:

• Sentiment analysis: Identifying the sentiment of a piece of text, such
as whether it is positive, negative, or neutral.
• Topic modeling: Identifying the main topics of a piece of text.
• Named entity recognition: Identifying named entities in a piece of text,
such as people, places, and organizations.
• Machine translation: Translating a piece of text from one language to
another.
• Spam filtering: Identifying and filtering out spam emails.
• Text summarization: Generating a summary of a piece of text.
Text processing is used in a variety of
industries, including:
• Customer service: To analyze customer feedback and identify areas for
improvement.
• Marketing: To analyze social media posts and other forms of customer
data to better understand customer needs and preferences.
• Finance: To analyze financial news and reports to identify trends and
investment opportunities.
• Healthcare: To analyze patient records and clinical trials data to
improve the diagnosis and treatment of diseases.
• Security: To analyze network traffic and other security-related data to
identify and prevent cyberattacks.
Example: Sentiment Analysis
• Step 1: Text Preprocessing Before performing any text analysis, it's crucial to
preprocess the text. Text preprocessing typically involves the following steps:
• a. Tokenization: Divide the text into words, phrases, or sentences (tokens). For
example, the sentence "I love this product" would be tokenized into ["I", "love",
"this", "product"].
• b. Lowercasing: Convert all the text to lowercase to ensure consistent
comparison. For instance, "Love" and "love" should be treated the same.
• c. Removing Punctuation: Eliminate punctuation marks like periods, commas,
and exclamation marks.
• d. Stopword Removal: Remove common words like "and," "the," "is" that don't
carry much meaning.
• e. Lemmatization or Stemming: Reduce words to their base form. For instance,
"running," "ran," and "runs" can be reduced to "run."
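• The preprocessing steps above can be sketched as a small pipeline. This is a minimal pure-Python illustration; in practice NLTK's word_tokenize, its stopwords corpus, and a real lemmatizer would be used, and the stopword list and suffix-stripping stemmer below are toy stand-ins.

```python
import re
import string

# Toy stopword list (a real system would use a full corpus, e.g. NLTK's).
STOPWORDS = {"i", "a", "an", "and", "the", "is", "this", "it"}

def simple_stem(word):
    # Crude suffix stripping; a real lemmatizer handles irregular forms.
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

def preprocess(text):
    # a. Tokenization: words and single punctuation marks become tokens
    tokens = re.findall(r"\w+|[^\w\s]", text)
    # b. Lowercasing for consistent comparison
    tokens = [t.lower() for t in tokens]
    # c. Removing punctuation
    tokens = [t for t in tokens if t not in string.punctuation]
    # d. Stopword removal
    tokens = [t for t in tokens if t not in STOPWORDS]
    # e. Stemming each remaining token
    return [simple_stem(t) for t in tokens]

print(preprocess("I love this product!"))  # ['love', 'product']
```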
• Step 2: Sentiment Analysis Now that we have preprocessed the text, let's perform
sentiment analysis on a sample sentence: "I love this product."
• a. Feature Extraction: Convert the preprocessed text into a numerical
representation. A common method is to use a bag-of-words or term frequency-
inverse document frequency (TF-IDF) representation. In this case, we'd represent
the sentence as a vector, where "I," "love," and "product" are features.
• b. Sentiment Classification: Train a machine learning model, like a support vector
machine (SVM) or a deep learning model, on a labeled dataset to classify text into
sentiments (e.g., positive, negative, or neutral). In this case, "I love this product"
would likely be classified as positive.
• c. Prediction: Apply the trained model to the sample sentence. It predicts that the
sentiment is positive.
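• Steps a–c can be sketched in a few lines. The bag-of-words feature extraction below follows the method described; the tiny sentiment lexicon is a hypothetical stand-in for a trained model such as an SVM, used only so the sketch runs end to end.

```python
def bag_of_words(tokens, vocabulary):
    # a. Feature extraction: one count per vocabulary word.
    return [tokens.count(word) for word in vocabulary]

# Hypothetical stand-in for a trained classifier: a toy sentiment lexicon.
POSITIVE = {"love", "great", "excellent"}
NEGATIVE = {"hate", "poor", "terrible"}

def classify(tokens):
    # b/c. Classification and prediction via a simple lexicon score.
    score = (sum(1 for t in tokens if t in POSITIVE)
             - sum(1 for t in tokens if t in NEGATIVE))
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

tokens = ["i", "love", "this", "product"]
print(bag_of_words(tokens, ["love", "product", "hate"]))  # [1, 1, 0]
print(classify(tokens))                                   # positive
```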
• Step 3: Post-processing Depending on the use case, you might want to perform
additional post-processing on the sentiment analysis results. This could include
summarizing the results, presenting them visually, or making business decisions
based on the analysis.
• Example Output: In this example, the output of sentiment analysis
for the sentence "I love this product" would be "positive." This
information can be used by companies to gauge customer opinions and
make data-driven decisions regarding their products.
• Text processing is a fundamental aspect of NLP, and it can be adapted
for a wide range of applications beyond sentiment analysis, including
language translation, chatbots, information retrieval, and more. It
plays a critical role in harnessing the power of textual data in various
domains.
What is Tokenizing?
• It may be defined as the process of breaking up a piece of text into
smaller parts, such as sentences and words.
• These smaller parts are called tokens. For example, a word is a token
in a sentence, and a sentence is a token in a paragraph.
Example: Tokenizing a Sentence
Consider the sentence: "Natural language processing is fascinating!"

• Step 1: Tokenization. Tokenization involves splitting this sentence
into its constituent tokens. In this case, the tokens would be individual
words:
• "Natural"
• "language"
• "processing"
• "is"
• "fascinating"
• "!"
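• The split above can be reproduced with a short regular expression (one possible tokenization scheme; NLTK's tokenizers apply more elaborate rules):

```python
import re

sentence = "Natural language processing is fascinating!"
# \w+ matches a run of word characters; [^\w\s] matches one punctuation mark.
tokens = re.findall(r"\w+|[^\w\s]", sentence)
print(tokens)
# ['Natural', 'language', 'processing', 'is', 'fascinating', '!']
```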
• Step 2: Importance of Tokenization
• Now, let's discuss why tokenization is important:
• Text Analysis: Tokenization is the foundation for text analysis in NLP. It allows
computers to work with individual words, which are the building blocks for various
tasks like sentiment analysis, machine translation, and text classification.
• Text Preprocessing: Before any text analysis, text data is typically preprocessed.
Tokenization is often the first step in this process, followed by converting all tokens to
lowercase, removing punctuation, and eliminating common stopwords (e.g., "and,"
"the," "is").
• Feature Extraction: In many NLP tasks, text data is represented as a numerical vector,
where each dimension corresponds to a token (word). Tokenization is the initial step in
converting text into these numerical representations, such as bag-of-words or TF-IDF
vectors.
• N-grams: Tokenization can also be used to create n-grams, which are sequences of
n adjacent words. For example, the bigram "natural language" or the trigram
"processing is fascinating" could be created by tokenizing the sentence accordingly.
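• N-gram construction is a sliding window over the token list, as in this short sketch (NLTK also provides nltk.ngrams for the same purpose):

```python
def ngrams(tokens, n):
    # Slide a window of size n over the token list and join each window.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["natural", "language", "processing", "is", "fascinating"]
print(ngrams(tokens, 2))  # bigrams, including 'natural language'
print(ngrams(tokens, 3))  # trigrams, including 'processing is fascinating'
```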
• Step 3: Tokenization in Real Applications
• Tokenization appears in many real-world applications:
• Search Engines: Tokenization is used by search engines to break down
user queries into individual keywords, making it easier to find relevant
web pages.
• Social Media Analysis: Companies use tokenization to understand
trends on social media. For example, analyzing tweets and posts about a
product launch can reveal public sentiment.
• Language Translation: In machine translation, tokenization is the first
step in breaking down the source and target language texts into units for
translation.
• Chatbots: Chatbots use tokenization to understand and respond to user
messages. The input message is tokenized to identify keywords and
phrases for generating a relevant response.
Implementation in Python
• NLTK package
• nltk.tokenize is the package provided by the NLTK module to achieve
tokenization.
• The word_tokenize module is used for basic word tokenization.
• Import the TreebankWordTokenizer class to implement the word
tokenizer algorithm:
• from nltk.tokenize import TreebankWordTokenizer
• Next, create an instance of the TreebankWordTokenizer class as follows:
• Tokenizer_wrd = TreebankWordTokenizer()
Complete code
• import nltk
• from nltk.tokenize import TreebankWordTokenizer
• Tokenizer_wrd = TreebankWordTokenizer()
• Tokenizer_wrd.tokenize('Welcome to presidency university
bangalore karnataka.')

• Output
• ['Welcome', 'to', 'presidency', 'university', 'bangalore', 'karnataka', '.']
Example 2
• import nltk
• from nltk.tokenize import word_tokenize
• word_tokenize("won't")

• Output
• ['wo', "n't"]

• Hint: nltk.download('punkt')
WordPunctTokenizer Class
An alternative word tokenizer that splits all punctuation into separate
tokens.
• # An alternative word tokenizer that splits all punctuation into
separate tokens.
• from nltk.tokenize import WordPunctTokenizer
• tokenizer = WordPunctTokenizer()
• tokenizer.tokenize(" I can't allow you to go home early")

• Output
• ['I', 'can', "'", 't', 'allow', 'you', 'to', 'go', 'home', 'early']
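• WordPunctTokenizer is a regular-expression tokenizer built on the pattern \w+|[^\w\s]+, so the same split can be reproduced with the standard re module (shown here so the behavior can be checked without downloading NLTK data):

```python
import re

# Runs of word characters, or runs of punctuation, become separate tokens,
# which is why "can't" splits into 'can', "'", 't'.
tokens = re.findall(r"\w+|[^\w\s]+", "I can't allow you to go home early")
print(tokens)
# ['I', 'can', "'", 't', 'allow', 'you', 'to', 'go', 'home', 'early']
```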
