Module 2.4: Text Processing


Course Code: CSA3002

MACHINE LEARNING ALGORITHMS

Course Type: LPC – 2-2-3


Course Objectives
• The objective of the course is to familiarize learners with
the concepts of machine learning algorithms and to develop
skills through experiential learning techniques.
Course Outcomes
At the end of the course, students should be able to
1. Understand training and testing of datasets using machine
learning techniques.
2. Apply optimization and parameter-tuning techniques to machine
learning algorithms.
3. Apply a machine learning model to solve various problems using
machine learning algorithms.
4. Apply machine learning algorithms to create models.
Text Processing Techniques
• Text processing, also known as text manipulation or text analysis, is
the computational process of extracting, transforming, and interpreting
textual data.
• It involves various techniques and methods to work with unstructured
text, enabling the extraction of meaningful information, patterns, and
insights from text documents.
• Text processing is a fundamental component of natural language
processing (NLP) and is used in a wide range of applications,
including information retrieval, sentiment analysis, text classification,
and text mining.
Text processing can be used for a variety of tasks, including:

• Sentiment analysis: Identifying the sentiment of a piece of text, such
as whether it is positive, negative, or neutral.
• Topic modeling: Identifying the main topics of a piece of text.
• Named entity recognition: Identifying named entities in a piece of text,
such as people, places, and organizations.
• Machine translation: Translating a piece of text from one language to
another.
• Spam filtering: Identifying and filtering out spam emails.
• Text summarization: Generating a summary of a piece of text.
Text processing is used in a variety of
industries, including:
• Customer service: To analyze customer feedback and identify areas for
improvement.
• Marketing: To analyze social media posts and other forms of customer
data to better understand customer needs and preferences.
• Finance: To analyze financial news and reports to identify trends and
investment opportunities.
• Healthcare: To analyze patient records and clinical trials data to
improve the diagnosis and treatment of diseases.
• Security: To analyze network traffic and other security-related data to
identify and prevent cyberattacks.
Example: Sentiment Analysis
• Step 1: Text Preprocessing Before performing any text analysis, it's crucial to
preprocess the text. Text preprocessing typically involves the following steps:
• a. Tokenization: Divide the text into words, phrases, or sentences (tokens). For
example, the sentence "I love this product" would be tokenized into ["I", "love",
"this", "product"].
• b. Lowercasing: Convert all the text to lowercase to ensure consistent
comparison. For instance, "Love" and "love" should be treated the same.
• c. Removing Punctuation: Eliminate punctuation marks like periods, commas,
and exclamation marks.
• d. Stopword Removal: Remove common words like "and," "the," "is" that don't
carry much meaning.
• e. Lemmatization or Stemming: Reduce words to their base form. For instance,
"running," "ran," and "runs" can be reduced to "run."
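• The preprocessing steps above can be sketched as a small pipeline. This is a minimal pure-Python illustration; in practice NLTK's word_tokenize, its stopwords corpus, and a real lemmatizer would be used, and the stopword list and suffix-stripping stemmer below are toy stand-ins.

```python
import re
import string

# Toy stopword list (a real system would use a full corpus, e.g. NLTK's).
STOPWORDS = {"i", "a", "an", "and", "the", "is", "this", "it"}

def simple_stem(word):
    # Crude suffix stripping; a real lemmatizer handles irregular forms.
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

def preprocess(text):
    # a. Tokenization: words and single punctuation marks become tokens
    tokens = re.findall(r"\w+|[^\w\s]", text)
    # b. Lowercasing for consistent comparison
    tokens = [t.lower() for t in tokens]
    # c. Removing punctuation
    tokens = [t for t in tokens if t not in string.punctuation]
    # d. Stopword removal
    tokens = [t for t in tokens if t not in STOPWORDS]
    # e. Stemming each remaining token
    return [simple_stem(t) for t in tokens]

print(preprocess("I love this product!"))  # ['love', 'product']
```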
• Step 2: Sentiment Analysis Now that we have preprocessed the text, let's perform
sentiment analysis on a sample sentence: "I love this product."
• a. Feature Extraction: Convert the preprocessed text into a numerical
representation. A common method is to use a bag-of-words or term frequency-
inverse document frequency (TF-IDF) representation. In this case, we'd represent
the sentence as a vector, where "I," "love," and "product" are features.
• b. Sentiment Classification: Train a machine learning model, like a support vector
machine (SVM) or a deep learning model, on a labeled dataset to classify text into
sentiments (e.g., positive, negative, or neutral). In this case, "I love this product"
would likely be classified as positive.
• c. Prediction: Apply the trained model to the sample sentence. It predicts that the
sentiment is positive.
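• Steps a–c can be sketched in a few lines. The bag-of-words feature extraction below follows the method described; the tiny sentiment lexicon is a hypothetical stand-in for a trained model such as an SVM, used only so the sketch runs end to end.

```python
def bag_of_words(tokens, vocabulary):
    # a. Feature extraction: one count per vocabulary word.
    return [tokens.count(word) for word in vocabulary]

# Hypothetical stand-in for a trained classifier: a toy sentiment lexicon.
POSITIVE = {"love", "great", "excellent"}
NEGATIVE = {"hate", "poor", "terrible"}

def classify(tokens):
    # b/c. Classification and prediction via a simple lexicon score.
    score = (sum(1 for t in tokens if t in POSITIVE)
             - sum(1 for t in tokens if t in NEGATIVE))
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

tokens = ["i", "love", "this", "product"]
print(bag_of_words(tokens, ["love", "product", "hate"]))  # [1, 1, 0]
print(classify(tokens))                                   # positive
```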
• Step 3: Post-processing Depending on the use case, you might want to perform
additional post-processing on the sentiment analysis results. This could include
summarizing the results, presenting them visually, or making business decisions
based on the analysis.
• Example Output: In this example, the output of sentiment analysis
for the sentence "I love this product" would be "positive." This
information can be used by companies to gauge customer opinions and
make data-driven decisions regarding their products.
• Text processing is a fundamental aspect of NLP, and it can be adapted
for a wide range of applications beyond sentiment analysis, including
language translation, chatbots, information retrieval, and more. It
plays a critical role in harnessing the power of textual data in various
domains.
What is Tokenizing?
• It may be defined as the process of breaking up a piece of text into
smaller parts, such as sentences and words.
• These smaller parts are called tokens. For example, a word is a token
in a sentence, and a sentence is a token in a paragraph.
Example: Tokenizing a Sentence
Consider the sentence: "Natural language processing is fascinating!"

• Step 1: Tokenization. Tokenization involves splitting this sentence
into its constituent tokens. In this case, the tokens would be individual
words:
• "Natural"
• "language"
• "processing"
• "is"
• "fascinating"
• "!"
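• The split above can be reproduced with a short regular expression (one possible tokenization scheme; NLTK's tokenizers apply more elaborate rules):

```python
import re

sentence = "Natural language processing is fascinating!"
# \w+ matches a run of word characters; [^\w\s] matches one punctuation mark.
tokens = re.findall(r"\w+|[^\w\s]", sentence)
print(tokens)
# ['Natural', 'language', 'processing', 'is', 'fascinating', '!']
```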
• Step 2: Importance of Tokenization
• Now, let's discuss why tokenization is important:
• Text Analysis: Tokenization is the foundation for text analysis in NLP. It allows
computers to work with individual words, which are the building blocks for various
tasks like sentiment analysis, machine translation, and text classification.
• Text Preprocessing: Before any text analysis, text data is typically preprocessed.
Tokenization is often the first step in this process, followed by converting all tokens to
lowercase, removing punctuation, and eliminating common stopwords (e.g., "and,"
"the," "is").
• Feature Extraction: In many NLP tasks, text data is represented as a numerical vector,
where each dimension corresponds to a token (word). Tokenization is the initial step in
converting text into these numerical representations, such as bag-of-words or TF-IDF
vectors.
• N-grams: Tokenization can also be used to create n-grams, which are sequences of
n adjacent words. For example, the bigram "natural language" or the trigram
"processing is fascinating" could be created by tokenizing the sentence accordingly.
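• N-gram construction is a sliding window over the token list, as in this short sketch (NLTK also provides nltk.ngrams for the same purpose):

```python
def ngrams(tokens, n):
    # Slide a window of size n over the token list and join each window.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["natural", "language", "processing", "is", "fascinating"]
print(ngrams(tokens, 2))  # bigrams, including 'natural language'
print(ngrams(tokens, 3))  # trigrams, including 'processing is fascinating'
```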
• Step 3: Tokenization in Real Applications
• Tokenization appears in many real-world applications:
• Search Engines: Tokenization is used by search engines to break down
user queries into individual keywords, making it easier to find relevant
web pages.
• Social Media Analysis: Companies use tokenization to understand
trends on social media. For example, analyzing tweets and posts about a
product launch can reveal public sentiment.
• Language Translation: In machine translation, tokenization is the first
step in breaking down the source and target language texts into units for
translation.
• Chatbots: Chatbots use tokenization to understand and respond to user
messages. The input message is tokenized to identify keywords and
phrases for generating a relevant response.
Implementation in Python
• NLTK package
• nltk.tokenize is the package provided by the NLTK module to achieve
tokenization.
• The word_tokenize module is used for basic word tokenization.
• Import the TreebankWordTokenizer class to implement the word
tokenizer algorithm:
• from nltk.tokenize import TreebankWordTokenizer
• Next, create an instance of the TreebankWordTokenizer class as follows:
• Tokenizer_wrd = TreebankWordTokenizer()
Complete code
• import nltk
• from nltk.tokenize import TreebankWordTokenizer
• Tokenizer_wrd = TreebankWordTokenizer()
• Tokenizer_wrd.tokenize('Welcome to presidency university
bangalore karnataka.')

• Output
• ['Welcome', 'to', 'presidency', 'university', 'bangalore', 'karnataka', '.']
Example 2
• import nltk
• from nltk.tokenize import word_tokenize
• word_tokenize("won't")

• Output
• ['wo', "n't"]

• Hint: nltk.download('punkt')
WordPunctTokenizer Class
An alternative word tokenizer that splits all punctuation into separate
tokens.
• # An alternative word tokenizer that splits all punctuation into
separate tokens.
• from nltk.tokenize import WordPunctTokenizer
• tokenizer = WordPunctTokenizer()
• tokenizer.tokenize(" I can't allow you to go home early")

• Output
• ['I', 'can', "'", 't', 'allow', 'you', 'to', 'go', 'home', 'early']
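• WordPunctTokenizer is a regular-expression tokenizer built on the pattern \w+|[^\w\s]+, so the same split can be reproduced with the standard re module (shown here so the behavior can be checked without downloading NLTK data):

```python
import re

# Runs of word characters, or runs of punctuation, become separate tokens,
# which is why "can't" splits into 'can', "'", 't'.
tokens = re.findall(r"\w+|[^\w\s]+", "I can't allow you to go home early")
print(tokens)
# ['I', 'can', "'", 't', 'allow', 'you', 'to', 'go', 'home', 'early']
```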
