Python and NLP Notes

Unit 1: Python Text and NLP Basics

1.1 Introduction to Python Text Basics

Python offers numerous tools for handling and manipulating text. The most basic of these are
string operations, but for more advanced tasks, we use libraries like re for regular expressions
and Spacy for NLP-specific functions.

Basic String Operations

Operation     | Example Code                                      | Output
Lowercasing   | text = "HELLO"; text.lower()                      | 'hello'
Splitting     | text = "Hello, World!"; text.split(',')           | ['Hello', ' World!']
Concatenation | a = "Hello"; b = "World"; c = a + " " + b         | 'Hello World'
Replacing     | text = "I am happy"; text.replace('happy', 'sad') | 'I am sad'

String operations in Python are efficient for simple text processing tasks, such as breaking a
sentence into words, converting text to lowercase, or replacing substrings.
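
For instance, these operations are often chained together in a single cleanup pass; the following is a minimal sketch (the sample sentence is made up for illustration):

text = "Python makes TEXT processing EASY."
text = text.lower()                    # 'python makes text processing easy.'
text = text.replace("easy", "simple")  # 'python makes text processing simple.'
tokens = text.split()                  # ['python', 'makes', 'text', 'processing', 'simple.']
print(tokens)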

File Handling in Python

Operation         | Description                                         | Example Code              | Output
Reading a File    | Reads the entire content of the file.               | file.read()               | Contents of file
Reading Lines     | Reads the file line by line and returns a list.     | file.readlines()          | List of lines in file
Writing to a File | Writes data to the file (overwrites existing data). | file.write("Hello World") | -

Example:

with open('example.txt', 'r', encoding='utf-8') as file:
    content = file.read()

print(content)  # Output: contents of 'example.txt'
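
The write operation from the table works the same way; a minimal sketch is shown below (the file name is just an example). Note that mode 'w' overwrites any existing content, while 'a' appends to it.

with open('example.txt', 'w', encoding='utf-8') as file:
    file.write("Hello World")  # replaces whatever the file contained before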

1.2 Working with PDFs

PDF Text Extraction

Python libraries such as PyPDF2 and pdfminer.six are commonly used to extract text from
PDF documents.

Library      | Description                         | Example Code                        | Output Example
PyPDF2       | Simple library to extract PDF text. | pdf_reader.getPage(0).extractText() | Text from page 1 of PDF
pdfminer.six | More powerful for text extraction.  | extract_text('file.pdf')            | Complete text from PDF

Example:

import PyPDF2

# Note: recent PyPDF2 releases rename these calls to PdfReader,
# reader.pages[0] and extract_text(); the older names below work on PyPDF2 1.x/2.x.
with open('sample.pdf', 'rb') as pdf_file:
    reader = PyPDF2.PdfFileReader(pdf_file)
    text = reader.getPage(0).extractText()
    print(text)  # Outputs text from the first page
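
The table also mentions pdfminer.six, which was not shown above. A minimal sketch using its high-level extract_text() helper follows; 'sample.pdf' is assumed to be a local file.

from pdfminer.high_level import extract_text

# Extract the complete text of the PDF in one call
text = extract_text('sample.pdf')
print(text)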

1.3 Introduction to Regular Expressions (Regex)

Regular expressions allow us to define complex patterns to search, match, or manipulate text.
Python’s re module provides a variety of functions to work with regex.
Regex Operation       | Description                                     | Example Code                        | Output
Finding Patterns      | Find all occurrences of a pattern in text.      | re.findall(r'\d+', 'User123 data')  | ['123']
Substituting Patterns | Replace parts of the text that match a pattern. | re.sub(r'\d+', 'ID', 'User123')     | 'UserID'
Shorthand Classes     | Predefined classes for matching.                | re.findall(r'\w+', 'Text 123')      | ['Text', '123']
Character Ranges      | Matches a range of characters.                  | re.findall(r'[A-Z]', 'Hello World') | ['H', 'W']

Common Regex Patterns

Pattern | Meaning                               | Example | Matches
\d      | Any digit (0-9)                       | \d+     | "123" from "User123"
\w      | Any word character (a-z, A-Z, 0-9, _) | \w+     | "User123"
\s      | Any whitespace character              | \s+     | " " (space)
[a-z]   | Any lowercase letter                  | [a-z]+  | "text" from "text"

Example: Removing Special Characters


python

import re

text = "Hello! Welcome to NLP 101."
clean_text = re.sub(r'[^A-Za-z\s]', '', text)  # Removes anything that is not a letter or space
print(clean_text)  # Output: "Hello Welcome to NLP "

1.4 Preprocessing using Regex

Preprocessing is the crucial first step in any NLP pipeline, ensuring that the data is cleaned and
normalized before being fed into algorithms.
Preprocessing Task        | Regex Pattern / Operation           | Example                          | Output
Remove URLs               | re.sub(r'http\S+', '', text)        | "Visit http://example.com"       | "Visit "
Remove Special Characters | re.sub(r'[^A-Za-z0-9\s]', '', text) | "Hello, World!"                  | "Hello World"
Extract Email Addresses   | re.findall(r'\S+@\S+', text)        | "Contact me at john@example.com" | ["john@example.com"]
Replace Digits            | re.sub(r'\d+', 'NUM', text)         | "My number is 12345"             | "My number is NUM"

Example: Replacing Digits

text = "The price is 123 dollars."
clean_text = re.sub(r'\d+', 'NUM', text)
print(clean_text)  # Output: "The price is NUM dollars."

1.5 Introduction to Natural Language Processing (NLP)

NLP involves enabling machines to understand, interpret, and generate human language. It
combines computer science, linguistics, and machine learning techniques.

Key Applications of NLP

Application         | Description                                           | Example
Chatbots            | Automate customer service and support                 | Virtual assistants like Siri, Alexa
Sentiment Analysis  | Classifying the sentiment of text (positive/negative) | Analyzing movie reviews
Machine Translation | Translating text between languages                    | Google Translate
Speech Recognition  | Converting spoken language to text                    | Speech-to-text in Google Docs

Challenges in NLP

Challenge | Description                                                | Example
Ambiguity | Words or sentences with multiple meanings.                 | "The bank is on the river bank." (bank as financial institution or riverbank)
Variety   | Variations in language use across dialects, regions, etc. | British English vs. American English: colour vs. color

1.6 Role of Machine Learning in NLP

Modern NLP relies heavily on machine learning, particularly deep learning, to automatically
detect patterns in language. The following models are popular in NLP:

Model                          | Description                                                                 | Example
Bag of Words (BoW)             | Text represented as a bag of individual words, ignoring order.             | 'I love NLP' → {'I': 1, 'love': 1, 'NLP': 1}
TF-IDF                         | Weighting scheme where frequent but less important words are down-weighted. | 'I love NLP' → weighted matrix
RNN (Recurrent Neural Network) | Models sequences and dependencies in text.                                 | Used for machine translation or text generation
Transformers                   | Advanced model that captures global context across sentences.              | Used in GPT, BERT for tasks like summarization

Bag of Words Example

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(["I love NLP", "I love programming"])
print(X.toarray())  # Output: BoW matrix
# Note: the default tokenizer drops single-character tokens such as "I"
1.7 Spacy Basics

Spacy is a popular NLP library in Python, known for its efficiency and ease of use. Key features
include tokenization, part-of-speech tagging, and named entity recognition.

Tokenization

Tokenization refers to splitting text into words or sentences.

Operation             | Example Code                           | Output
Word Tokenization     | tokens = [token.text for token in doc] | ['I', 'love', 'NLP']
Sentence Tokenization | sentences = list(doc.sents)            | ['I love NLP.', 'It is amazing.']

Example: Tokenization

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Natural Language Processing is exciting!")
tokens = [token.text for token in doc]
print(tokens)  # Output: ['Natural', 'Language', 'Processing', 'is', 'exciting', '!']

1.8 Stemming, Lemmatization, Stop Words

Operation     | Description                                         | Example Code                                  | Output
Stemming      | Reducing words to their root form.                  | stemmer.stem("running")                       | 'run'
Lemmatization | Converting words to their base dictionary form.     | [token.lemma_ for token in doc]               | ['run', 'be']
Stop Words    | Common words that can be removed during processing. | [token for token in doc if not token.is_stop] | List of non-stop words
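
The stemmer object in the table is not defined anywhere above, and Spacy itself does not include a stemmer. A minimal sketch using NLTK's PorterStemmer (one common choice, not the only one) together with Spacy stop-word filtering:

from nltk.stem import PorterStemmer

# NLTK's PorterStemmer supplies the stemmer.stem() call used in the table above
stemmer = PorterStemmer()
print(stemmer.stem("running"))  # Output: 'run'

# Stop-word removal with Spacy, reusing the nlp object loaded earlier
doc = nlp("The children are playing in the park.")
non_stop = [token.text for token in doc if not token.is_stop]
print(non_stop)  # e.g. ['children', 'playing', 'park', '.']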

Example: Lemmatization

doc = nlp("The children are playing.")


lemmas = [token.lemma_ for token in doc]
print(lemmas) # Output: ['the', 'child', 'be', 'play', '.']

1.9 Phrase Matching and Vocabulary

Phrase matching is used to search for multi-word expressions in text, which are often significant
in NLP tasks like entity recognition or keyword extraction.

Example: Phrase Matching

from spacy.matcher import PhraseMatcher

matcher = PhraseMatcher(nlp.vocab)
patterns = [nlp(text) for text in ["machine learning", "natural language processing"]]
matcher.add("TechTerms", patterns)  # in spaCy 2.x: matcher.add("TechTerms", None, *patterns)

doc = nlp("I love machine learning and natural language processing.")
matches = matcher(doc)
for match_id, start, end in matches:
    print(doc[start:end].text)  # Output: 'machine learning', 'natural language processing'

Unit 2: Part of Speech Tagging and Named Entity Recognition (NER)

2.1 Part of Speech Tagging (POS)

POS Tagging is the process of labeling each word in a sentence with its respective part of
speech, such as noun, verb, adjective, etc. POS tagging is a fundamental part of many NLP
tasks, including syntactic parsing and word-sense disambiguation.

POS Tagging in Spacy


● Spacy automatically assigns POS tags using its built-in model, which labels words with
their grammatical roles.

import spacy

nlp = spacy.load("en_core_web_sm")
sentence = "Apple is looking at buying a U.K. startup."
doc = nlp(sentence)

for token in doc:
    print(f"{token.text} -> {token.pos_} ({token.tag_})")

Common POS Tags

POS Tag | Full Form      | Example | Description
NOUN    | Noun           | startup | A person, place, thing, or idea
VERB    | Verb           | buying  | Action or state of being
ADJ     | Adjective      | big     | Describes a noun
PROPN   | Proper Noun    | U.K.    | Specific names of people, places
ADV     | Adverb         | quickly | Modifies a verb, adjective, or adverb
AUX     | Auxiliary Verb | is      | Helps form different tenses

Example Output:

Apple -> PROPN (NNP)
is -> AUX (VBZ)
looking -> VERB (VBG)
at -> ADP (IN)
buying -> VERB (VBG)
a -> DET (DT)
U.K. -> PROPN (NNP)
startup -> NOUN (NN)

POS Tagging vs Named Entity Recognition


Feature   | POS Tagging                                          | Named Entity Recognition (NER)
Purpose   | Labels words as nouns, verbs, etc.                   | Identifies proper nouns and classifies them (e.g., person, organization)
Example   | Verb (run), Noun (book)                              | Person (John), Organization (Google), Location (Paris)
Use Cases | Syntactic parsing, understanding sentence structure  | Identifying named entities in text for information extraction

2.2 Named Entity Recognition (NER)

Named Entity Recognition (NER) identifies and classifies named entities mentioned in the text
into predefined categories, such as persons, organizations, locations, dates, etc.

Example of NER in Spacy

doc = nlp("Apple is looking at buying a startup in the U.K.")


for ent in doc.ents:
print(ent.text, ent.label_)

NER Labels and Their Meanings

Entity Label | Full Form           | Example       | Description
PERSON       | Person              | Elon Musk     | Recognizes people's names
ORG          | Organization        | Apple, Google | Recognizes corporate organizations
GPE          | Geopolitical Entity | U.K., Germany | Recognizes countries, cities, states
DATE         | Date                | July 2020     | Recognizes dates
MONEY        | Monetary Value      | $500          | Recognizes currency values


Example Output:
Apple -> ORG
U.K. -> GPE

Comparison of POS Tagging and NER

Feature      | POS Tagging                             | Named Entity Recognition (NER)
Purpose      | Assigns part-of-speech labels to tokens | Identifies and categorizes named entities
Example      | Verb (run), Noun (city)                 | Person (Elon Musk), Organization (Apple), GPE (U.K.)
Applications | Language structure analysis             | Information extraction, named entity categorization

2.3 Sentence Segmentation

Sentence segmentation is the process of splitting text into individual sentences. It is a critical
step in NLP for understanding sentence boundaries and structure.

Example of Sentence Segmentation in Spacy

text = "Hello! How are you? I'm doing well."


doc = nlp(text)

for sent in doc.sents:


print(sent.text)

Example Output:

Hello!
How are you?
I'm doing well.

Techniques for Sentence Segmentation


Technique  | Description                                                                                | Example
Rule-based | Uses punctuation and specific markers (e.g., periods, question marks) to split sentences. | Split based on . or ?
ML-based   | Uses machine learning models to learn sentence boundaries.                                | Models trained on annotated corpora to detect sentence ends.
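
A minimal sketch of the rule-based technique from the table, using a plain regular expression (a real segmenter would also handle abbreviations such as "U.K." or "Dr.", which this simple pattern does not):

import re

text = "Hello! How are you? I'm doing well."
sentences = re.split(r'(?<=[.!?])\s+', text)  # split after ., ! or ? followed by whitespace
print(sentences)  # Output: ['Hello!', 'How are you?', "I'm doing well."]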

2.4 Text Modeling using the Bag of Words Model

The Bag of Words (BoW) model represents text data as a collection of words, ignoring grammar
and word order but maintaining frequency counts of each word.

Bag of Words Example


python

from sklearn.feature_extraction.text import CountVectorizer

corpus = ["I love NLP", "NLP is amazing", "I love programming"]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)
print(X.toarray())  # Outputs the BoW matrix

Bag of Words Matrix

Sentence           | I | love | NLP | is | amazing | programming
I love NLP         | 1 | 1    | 1   | 0  | 0       | 0
NLP is amazing     | 0 | 0    | 1   | 1  | 1       | 0
I love programming | 1 | 1    | 0   | 0  | 0       | 1

Advantages and Limitations of Bag of Words

Advantages                                      | Limitations
Simple and easy to implement                    | Ignores word order
Works well for simple text classification tasks | Does not capture the semantic meaning of words
2.5 Text Modeling using the TF-IDF Model

TF-IDF (Term Frequency-Inverse Document Frequency) is an advanced text representation
model that weighs terms based on their frequency in a document and their inverse frequency in
the entire corpus. This reduces the weight of common terms like "the" and "is."

TF-IDF Formula:

● Term Frequency (TF) = (number of occurrences of the word in a document) / (total number of words in the document)
● Inverse Document Frequency (IDF) = log(total number of documents / number of documents containing the word)
● TF-IDF = TF * IDF
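
A minimal sketch computing these formulas by hand for a toy corpus (note that scikit-learn's TfidfVectorizer uses a smoothed IDF and L2 normalization, so its numbers will differ from this plain version):

import math

docs = [["i", "love", "nlp"], ["nlp", "is", "amazing"], ["i", "love", "programming"]]

def tf(term, doc):
    return doc.count(term) / len(doc)

def idf(term, docs):
    containing = sum(1 for d in docs if term in d)
    return math.log(len(docs) / containing)

print(tf("nlp", docs[1]) * idf("nlp", docs))          # appears in 2 of 3 docs -> lower weight
print(tf("amazing", docs[1]) * idf("amazing", docs))  # appears in only 1 doc -> higher weight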

TF-IDF Example in Python

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["I love NLP", "NLP is amazing", "I love programming"]  # same corpus as in the BoW example
tfidf_vectorizer = TfidfVectorizer()
X_tfidf = tfidf_vectorizer.fit_transform(corpus)
print(X_tfidf.toarray())  # Outputs the TF-IDF matrix

Comparison of BoW and TF-IDF

Model        | Description                                             | Use Case
Bag of Words | Represents text as a collection of word frequencies.    | Simple text classification tasks.
TF-IDF       | Weighs words by frequency and importance in the corpus. | Better for tasks where word significance matters, like information retrieval.

2.6 Understanding the N-Gram Model

An N-Gram is a contiguous sequence of n items (words, characters, etc.) from a given text.
N-Grams capture local context by analyzing adjacent words or characters.
Types of N-Grams

N-Gram Type   | Example
Unigram (n=1) | "I", "love", "NLP"
Bigram (n=2)  | "I love", "love NLP"
Trigram (n=3) | "I love NLP", "love NLP courses"

Example: Generating Bigrams

from sklearn.feature_extraction.text import CountVectorizer

corpus = ["I love NLP", "NLP is amazing", "I love programming"]  # same corpus as above
bigram_vectorizer = CountVectorizer(ngram_range=(2, 2))
X_bigram = bigram_vectorizer.fit_transform(corpus)
print(bigram_vectorizer.get_feature_names_out())  # Outputs the list of bigrams

N-Gram Applications

● Unigrams: Often used in simple text classification tasks.
● Bigrams/Trigrams: Useful in language models where word context is important (e.g., machine translation, speech recognition).

Example N-Gram Usage:

Sentence: "I love NLP"
Unigrams: ["I", "love", "NLP"]
Bigrams: ["I love", "love NLP"]

2.7 Latent Semantic Analysis (LSA)

Latent Semantic Analysis (LSA) is a technique in natural language processing that helps
discover the underlying structure of relationships between terms and documents. LSA reduces
the dimensionality of text data by transforming it into a lower-dimensional space using Singular
Value Decomposition (SVD). This technique is useful for text clustering, topic modeling, and
document similarity.

Steps in LSA:

1. Construct the Term-Document Matrix (using BoW or TF-IDF).
2. Apply Singular Value Decomposition (SVD) to decompose the matrix into three matrices: U, Σ, and V.
3. Reduce the dimensionality by selecting the top k components from the decomposition.

Formula:

A = U Σ V^T

Where:

● A is the original term-document matrix.
● U is the matrix representing terms.
● Σ is the diagonal matrix of singular values.
● V^T is the matrix representing documents.

Example: Applying LSA in Python


python

from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["The dog barked at the mailman.",
          "The cat meowed at the dog.",
          "The mailman ran from the dog."]

# Convert corpus into a TF-IDF matrix
tfidf = TfidfVectorizer()
X = tfidf.fit_transform(corpus)

# Perform SVD (LSA)
svd = TruncatedSVD(n_components=2)
X_lsa = svd.fit_transform(X)

# Output the LSA-reduced matrix
print(X_lsa)

LSA Applications:

● Topic Modeling: Identifying the underlying topics in a collection of documents (see the sketch below).
● Information Retrieval: Improving search engine performance by finding documents with similar meanings.
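
Continuing the example above, the topics themselves can be inspected through svd.components_: each row gives the weight of every vocabulary term in one latent component. A minimal sketch:

import numpy as np

# List the three highest-weighted terms for each latent component (topic)
terms = tfidf.get_feature_names_out()
for i, component in enumerate(svd.components_):
    top_terms = [terms[j] for j in np.argsort(component)[::-1][:3]]
    print(f"Topic {i}: {top_terms}")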
2.8 Word Synonyms and Antonyms using NLTK

In NLP, synonyms are words that have similar meanings, while antonyms are words with
opposite meanings. The NLTK (Natural Language Toolkit) provides a built-in lexical database
called WordNet to fetch synonyms and antonyms for any word.

Example: Finding Synonyms and Antonyms with NLTK


python

from nltk.corpus import wordnet

# Synonyms for "happy"
synonyms = []
for syn in wordnet.synsets("happy"):
    for lemma in syn.lemmas():
        synonyms.append(lemma.name())

# Antonyms for "happy"
antonyms = []
for syn in wordnet.synsets("happy"):
    for lemma in syn.lemmas():
        if lemma.antonyms():
            antonyms.append(lemma.antonyms()[0].name())

print(f"Synonyms: {set(synonyms)}")
print(f"Antonyms: {set(antonyms)}")

Example Output:

Synonyms: {'felicitous', 'glad', 'happy'}
Antonyms: {'unhappy'}

Applications of Synonyms and Antonyms in NLP:

● Thesaurus generation.
● Word-sense disambiguation.
● Improving semantic search.
2.9 Word Negation Tracking

Word Negation Tracking refers to identifying and understanding negation in a sentence. Words
like "not", "never", "no", or "none" can drastically change the meaning of a sentence. Handling
negations is crucial for tasks like sentiment analysis or intent recognition.

Example: Negation Handling


python

import nltk
from nltk.tokenize import word_tokenize  # nltk.download('punkt') may be required the first time

def negate_sentence(sentence):
    tokens = word_tokenize(sentence)
    negation = False
    result = []

    for token in tokens:
        if token.lower() in ["not", "never", "no"]:
            negation = True
            result.append(token)  # keep the negation word itself unchanged
        elif token in [".", "!", "?"]:
            negation = False
            result.append(token)
        else:
            result.append("NOT_" + token if negation else token)

    return " ".join(result)

# Example sentence
sentence = "I am not happy with the service."
negated_sentence = negate_sentence(sentence)
print(negated_sentence)  # Output: 'I am not NOT_happy NOT_with NOT_the NOT_service .'

Applications of Negation Tracking:

● Sentiment Analysis: Identifying positive and negative opinions more accurately.
● Intent Recognition: Understanding when users are making negative statements.

Unit 3: Text Classification and Text Summarization

3.1 Text Classification

Text classification is the process of assigning labels or categories to a piece of text based on its
content. This is widely used in tasks like sentiment analysis, spam detection, and topic
classification.

Steps in Text Classification:

1. Get the Data: Collect or import the dataset.
2. Data Preprocessing: Clean and preprocess the text (remove punctuation, stop words, etc.).
3. Transform into BoW/TF-IDF Model: Convert the text into a vector representation.
4. Train the Model: Use classification algorithms like Logistic Regression, SVM, Naive Bayes, etc.
5. Test the Model: Evaluate the model's performance using metrics like accuracy, precision, recall.

Example: Text Classification with Logistic Regression


python

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Sample data
corpus = ["I love this product!", "This is the worst experience!",
          "Absolutely fantastic service!", "Terrible customer support."]
labels = [1, 0, 1, 0]  # 1 = positive, 0 = negative

# Preprocessing: TF-IDF vectorization
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)

# Split data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.25, random_state=42)

# Train Logistic Regression classifier
classifier = LogisticRegression()
classifier.fit(X_train, y_train)

# Predict and evaluate
y_pred = classifier.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, y_pred)}")

Common Text Classification Algorithms:

Algorithm               | Description                                                  | Example Use Case
Logistic Regression     | A simple linear model for binary classification.             | Spam vs non-spam
Naive Bayes             | A probabilistic classifier based on Bayes' theorem.          | Sentiment analysis
Support Vector Machines | A robust linear classifier that finds the decision boundary. | Fake news detection
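
The table mentions Naive Bayes, which was not shown above. A minimal sketch that swaps it in on the same TF-IDF features and train/test split from the Logistic Regression example:

from sklearn.naive_bayes import MultinomialNB

# Reuse X_train, X_test, y_train, y_test from the example above
nb_classifier = MultinomialNB()
nb_classifier.fit(X_train, y_train)
print(f"Naive Bayes accuracy: {nb_classifier.score(X_test, y_test)}")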

3.2 Text Summarization

Text summarization is the process of reducing the length of a document while preserving its key
information. There are two main types of text summarization: Extractive Summarization and
Abstractive Summarization.

● Extractive Summarization: Selects key sentences or phrases directly from the original
text.
● Abstractive Summarization: Generates new sentences that summarize the content
(like how humans summarize text).

Steps in Extractive Summarization:

1. Fetch and Preprocess the Data: Get the document and clean it.
2. Tokenization: Split the text into sentences.
3. Build a Histogram: Calculate the frequency of each word.
4. Calculate Sentence Scores: Score each sentence based on the significance of its
words.
5. Select Sentences: Choose top N sentences for the summary.

Example: Extractive Summarization using NLTK


python

from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
from collections import defaultdict

# Sample text
text = """Natural Language Processing is an exciting field of Artificial Intelligence.
It enables machines to understand and process human language. It is
widely used in chatbots, language translation, and many other applications."""

# Step 1: Tokenize sentences
sentences = sent_tokenize(text)

# Step 2: Preprocess words
stop_words = set(stopwords.words('english'))
word_frequencies = defaultdict(int)

for word in word_tokenize(text):
    if word.lower() not in stop_words and word.isalpha():
        word_frequencies[word.lower()] += 1

# Step 3: Calculate sentence scores
sentence_scores = {}
for sentence in sentences:
    for word in word_tokenize(sentence):
        if word.lower() in word_frequencies:
            if sentence not in sentence_scores:
                sentence_scores[sentence] = word_frequencies[word.lower()]
            else:
                sentence_scores[sentence] += word_frequencies[word.lower()]

# Step 4: Select top sentences for summary
summary = sorted(sentence_scores, key=sentence_scores.get, reverse=True)[:2]
print("Summary:", " ".join(summary))

Applications of Text Summarization:

● News Summarization: Condensing lengthy news articles.
● Research Papers: Generating brief abstracts for long papers.
● Automatic Meeting Notes: Summarizing meeting transcripts for key points.

Unit 4: Semantics and Sentiment Analysis

4.1 Introduction to Semantics and Sentiment Analysis

● Semantics in NLP deals with the meaning and interpretation of words, phrases,
sentences, and larger units of text. It helps understand context, disambiguate word
meanings, and identify relationships between entities.
● Sentiment Analysis focuses on determining the emotional tone behind a body of text,
identifying whether it is positive, negative, or neutral.

4.2 Semantics and Word Vectors

Word vectors (also known as word embeddings) are numerical representations of words in a
high-dimensional space. Words that share similar contexts in a corpus tend to be closer in this
vector space. Word vectors enable semantic analysis by capturing relationships like:

● Synonymy: Words with similar meanings.
● Analogy: Word relationships (e.g., "king" is to "queen" as "man" is to "woman").

Common Word Embedding Models:

● Word2Vec: Converts words into dense vectors by training on large corpora using two
architectures: Continuous Bag of Words (CBOW) and Skip-Gram.
● GloVe (Global Vectors for Word Representation): Generates word embeddings by
factorizing word co-occurrence matrices.
● FastText: Builds on Word2Vec, but models sub-word information, making it better for
handling rare words.

Word2Vec Example:
python

import gensim
from gensim.models import Word2Vec

# Example corpus
sentences = [["I", "love", "NLP"], ["NLP", "is", "fun"], ["I", "enjoy", "learning", "NLP"]]

# Train a Word2Vec model
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, workers=4)

# Get word vector for "NLP"
print(model.wv['NLP'])

# Finding most similar words to "NLP"
print(model.wv.most_similar('NLP'))

Example Output (Most Similar Words to 'NLP'):

[('love', 0.832), ('learning', 0.810), ('fun', 0.750)]

Analogy Example:
python

# Analogy: "man" is to "king" as "woman" is to ? (expected answer: "queen")
# Note: this requires a model trained on a large corpus; the toy corpus above
# does not contain these words.
print(model.wv.most_similar(positive=['king', 'woman'], negative=['man'], topn=1))
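
FastText, mentioned in the list of embedding models above, can be trained the same way with gensim. A minimal sketch reusing the toy sentences (the out-of-vocabulary word is made up for illustration):

from gensim.models import FastText

# FastText models sub-word n-grams, so it can build a vector
# even for a word it never saw during training.
ft_model = FastText(sentences, vector_size=100, window=5, min_count=1)
print(ft_model.wv['NLP'])   # vector for a known word
print(ft_model.wv['NLPs'])  # vector composed from sub-word n-grams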

4.3 Sentiment Analysis Overview

Sentiment Analysis is the task of analyzing a piece of text to determine the underlying
sentiment or opinion. It can classify text as positive, negative, or neutral. Sentiment analysis is
widely used in:

● Product reviews: Identifying customer opinions on products.
● Social media: Analyzing user feedback and reactions.
● Customer service: Gauging user satisfaction from responses.

Common Approaches for Sentiment Analysis:


Method           | Description                                                     | Example
Rule-Based       | Uses predefined lists of positive/negative words and rules     | Lexicon-based (SentiWordNet)
Machine Learning | Classifies sentiment using machine learning algorithms         | Logistic Regression, Naive Bayes
Deep Learning    | Uses neural networks to automatically learn sentiment patterns | LSTM, RNNs, Transformers

Sentiment Score Example (Lexicon-Based):

Each word is assigned a sentiment score based on its polarity (positive or negative).

● "I love this product!" → Positive sentiment (words like "love" have positive polarity).
● "The service was terrible." → Negative sentiment (words like "terrible" have negative
polarity).
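
A minimal sketch of lexicon-based scoring with a tiny, made-up word list (real systems use large curated lexicons such as SentiWordNet or VADER):

# Hypothetical mini-lexicon mapping words to polarity scores
lexicon = {"love": 1.0, "great": 0.8, "terrible": -1.0, "bad": -0.7}

def lexicon_score(text):
    words = text.lower().replace("!", "").replace(".", "").split()
    return sum(lexicon.get(word, 0.0) for word in words)

print(lexicon_score("I love this product!"))       # 1.0 -> positive
print(lexicon_score("The service was terrible."))  # -1.0 -> negative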

4.4 Sentiment Analysis with NLTK

The Natural Language Toolkit (NLTK) provides tools for simple sentiment analysis. The
VADER (Valence Aware Dictionary for Sentiment Reasoning) model, built into NLTK, is
commonly used for lexicon-based sentiment analysis.

Example: Sentiment Analysis using NLTK's VADER


python

from nltk.sentiment.vader import SentimentIntensityAnalyzer

# Initialize VADER sentiment analyzer
# (nltk.download('vader_lexicon') may be required the first time)
analyzer = SentimentIntensityAnalyzer()

# Example sentence
sentence = "I love this product, but the delivery was terrible."

# Calculate sentiment scores
sentiment_scores = analyzer.polarity_scores(sentence)
print(sentiment_scores)  # Outputs a dictionary of scores
Example Output:

{'neg': 0.344, 'neu': 0.493, 'pos': 0.163, 'compound': -0.1531}

● neg: Negative sentiment score
● neu: Neutral sentiment score
● pos: Positive sentiment score
● compound: Overall sentiment (ranges from -1 to +1, where -1 is very negative and +1 is very positive)

4.5 Sentiment Analysis Movie Review Project

Objective: Classify movie reviews as positive or negative.

Steps for the Sentiment Analysis Project:

1. Data Collection: Use the IMDb movie review dataset, which contains reviews labeled as
positive or negative.
2. Preprocessing: Clean the text by removing stop words, punctuation, and performing
tokenization.
3. Feature Extraction: Convert text into numerical format using Bag of Words or TF-IDF.
4. Training the Model: Train a machine learning model such as Logistic Regression or
Naive Bayes.
5. Evaluating the Model: Evaluate the model using metrics such as accuracy, precision,
recall, and F1 score.

Example Project Pipeline:


python

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Sample movie review data
data = pd.read_csv('IMDB_Dataset.csv')
X = data['review']
y = data['sentiment'].map({'positive': 1, 'negative': 0})

# Preprocessing: Vectorize text using TF-IDF
tfidf = TfidfVectorizer(stop_words='english', max_features=5000)
X_tfidf = tfidf.fit_transform(X)

# Split into training and test set
X_train, X_test, y_train, y_test = train_test_split(X_tfidf, y, test_size=0.2, random_state=42)

# Train the Logistic Regression model
classifier = LogisticRegression()
classifier.fit(X_train, y_train)

# Predict on the test set and evaluate
y_pred = classifier.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, y_pred)}")

4.6 Twitter Sentiment Analysis

Twitter Sentiment Analysis focuses on analyzing the sentiment of tweets in real-time. Since
tweets are brief and often contain informal language, they pose unique challenges for NLP. This
project involves fetching live tweets, processing them, and predicting their sentiment.

Steps for Twitter Sentiment Analysis:

1. Set up the Twitter Application: Create a Twitter developer account and get access
tokens and API keys.
2. Fetch Real-Time Tweets: Use the tweepy library to fetch tweets based on specific
hashtags or keywords.
3. Preprocessing the Tweets: Clean tweets by removing URLs, hashtags, mentions, and
special characters.
4. Predicting Sentiment: Load a pre-trained sentiment analysis model (e.g., TF-IDF and
Logistic Regression) to classify the sentiment of each tweet.
5. Visualizing Results: Plot the distribution of sentiments (positive, negative, neutral).

Example: Fetching Tweets with Tweepy and Sentiment Analysis


python

import tweepy
from nltk.sentiment.vader import SentimentIntensityAnalyzer

# Set up Tweepy API with your credentials
auth = tweepy.OAuthHandler('API_KEY', 'API_SECRET_KEY')
auth.set_access_token('ACCESS_TOKEN', 'ACCESS_SECRET')
api = tweepy.API(auth)

# Fetch tweets based on a keyword
# (in Tweepy v4+, api.search was renamed to api.search_tweets)
tweets = api.search(q="product review", lang="en", count=100)

# Initialize VADER sentiment analyzer
analyzer = SentimentIntensityAnalyzer()

# Analyze sentiment of each tweet
for tweet in tweets:
    text = tweet.text
    sentiment = analyzer.polarity_scores(text)
    print(f"Tweet: {text} | Sentiment: {sentiment['compound']}")

Applications of Twitter Sentiment Analysis:

● Brand monitoring: Understanding public sentiment about a product or service.
● Political sentiment analysis: Gauging public opinion on political issues.
● Market research: Analyzing customer feedback to improve products.

Visualizing Twitter Sentiment Results

Once the tweets are processed and classified, you can plot the sentiment distribution using
matplotlib or seaborn to visualize whether the majority of tweets are positive, negative, or
neutral.

Key Concepts Recap and Differences


Concept Definition Example Tools/Techniques

Word Vectors High-dimensional Word2Vec, GloVe Word2Vec, FastText


numeric
representations of
words.
Sentiment Classifies text based Positive or negative NLTK (VADER), Machine
Analysis on emotional tone. movie reviews Learning (Logistic
Regression)

Named Entity Identifies and Recognizing SpaCy, NLTK


Recognition classifies proper people, locations,
nouns. dates in a
sentence.

Text Condenses text into Summarizing an TF-IDF, Word


Summarization shorter, meaningful article into 3-4 key Frequencies
summaries. sentences

Unit 4: Semantics and Sentiment Analysis - Continued

4.7 Sentiment Analysis Movie Review Project (Expanded)

Objective: The goal of this project is to classify movie reviews as either positive or negative
using machine learning techniques. In this section, we'll break down the project pipeline into
detailed steps with relevant code, insights, and explanations.

Steps for Movie Review Sentiment Classification:

1. Data Collection:
○ We'll use the IMDb movie review dataset, a commonly used dataset for
sentiment analysis.
○ This dataset contains reviews labeled as positive or negative, which helps train
a classification model.

python

import pandas as pd

# Load IMDb dataset (CSV file with reviews and sentiment labels)
data = pd.read_csv('IMDB_Dataset.csv')

# Inspect the first few rows of the data
print(data.head())
2. Sample Data:

Review                                                                                     | Sentiment
"I loved this movie! The acting was amazing and the story was gripping."                  | Positive
"This was the worst movie I have ever seen. I regret watching it."                        | Negative
"An absolute masterpiece with brilliant performances by the entire cast."                 | Positive
"Terrible plot, bad acting, and a complete waste of time. Avoid this movie at all costs." | Negative

3. Preprocessing:
○ Before training the model, we need to clean the data:
■ Lowercasing: Convert all text to lowercase to ensure uniformity.
■ Removing Punctuation: Strip out punctuation marks that don’t carry
meaning.
■ Tokenization: Split text into individual words or tokens.
■ Stop Word Removal: Remove common words like "the", "is", "in", which
don’t contribute to sentiment.

python

from nltk.corpus import stopwords
import re
from nltk.tokenize import word_tokenize

# Preprocess function to clean and tokenize text
def preprocess_text(text):
    # Remove punctuation and convert to lowercase
    text = re.sub(r'[^\w\s]', '', text.lower())
    # Tokenize
    tokens = word_tokenize(text)
    # Remove stop words
    stop_words = set(stopwords.words('english'))
    tokens = [word for word in tokens if word not in stop_words]
    return ' '.join(tokens)

# Apply preprocessing to the dataset
data['cleaned_review'] = data['review'].apply(preprocess_text)
4. Example of Preprocessing:
   Original Review: "I loved this movie! The acting was amazing and the story was gripping."
   Preprocessed: "loved movie acting amazing story gripping"
5. Vectorization (Converting Text to Numerical Form):
○ We’ll use the TF-IDF (Term Frequency-Inverse Document Frequency) model
to convert the text into a numerical format that machine learning algorithms can
work with.

python

from sklearn.feature_extraction.text import TfidfVectorizer

# Convert text data into TF-IDF vectors
tfidf = TfidfVectorizer(max_features=5000)  # Limit the vocabulary to 5000 words
X = tfidf.fit_transform(data['cleaned_review'])
y = data['sentiment'].map({'positive': 1, 'negative': 0})  # Map 'positive' to 1 and 'negative' to 0

6. Splitting the Data:
   ○ We divide the dataset into a training set (to train the model) and a test set (to evaluate its performance).

python

from sklearn.model_selection import train_test_split

# Split the data into training and test sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

7. Training the Model:
   ○ We'll use Logistic Regression as our classification model. Logistic regression is well-suited for binary classification tasks.

python

from sklearn.linear_model import LogisticRegression

# Train the Logistic Regression classifier
classifier = LogisticRegression()
classifier.fit(X_train, y_train)

8. Testing and Evaluating the Model:
   ○ After training, we'll evaluate the model using metrics like accuracy, precision, recall, and F1-score.

python

from sklearn.metrics import accuracy_score, classification_report

# Predict on the test set
y_pred = classifier.predict(X_test)

# Evaluate the model
print(f"Accuracy: {accuracy_score(y_test, y_pred)}")
print(classification_report(y_test, y_pred, target_names=['Negative', 'Positive']))

Output Example:

Accuracy: 0.87
              precision    recall  f1-score   support

    Negative       0.88      0.86      0.87       245
    Positive       0.86      0.88      0.87       255

    accuracy                           0.87       500

4.8 Twitter Sentiment Analysis (Expanded)

Objective: Analyze the sentiment of real-time tweets on a specific topic or hashtag using the
Twitter API and classify them as positive, negative, or neutral.

Steps for Twitter Sentiment Analysis:

1. Setting Up the Twitter API:
   ○ First, create a Twitter Developer Account and get access tokens and API keys.
   ○ Use the Tweepy library to authenticate and fetch tweets.
python

import tweepy

# Replace these with your own API keys from Twitter Developer account
API_KEY = 'your_api_key'
API_SECRET = 'your_api_secret'
ACCESS_TOKEN = 'your_access_token'
ACCESS_SECRET = 'your_access_secret'

# Authenticate using Tweepy
auth = tweepy.OAuthHandler(API_KEY, API_SECRET)
auth.set_access_token(ACCESS_TOKEN, ACCESS_SECRET)
api = tweepy.API(auth)

2. Fetching Real-Time Tweets:
   ○ We can fetch tweets based on specific hashtags or keywords.

python

# Fetch tweets based on a hashtag
keyword = "#AI"
tweets = api.search(q=keyword, lang="en", count=100)

# Print the text of the first 5 tweets
for tweet in tweets[:5]:
    print(tweet.text)

3. Preprocessing Tweets:
   ○ Just like the movie review dataset, we preprocess tweets to remove unnecessary characters such as URLs, hashtags, mentions, and punctuation.

python

import re

def preprocess_tweet(text):
    text = re.sub(r"http\S+", "", text)  # Remove URLs
    text = re.sub(r"#\w+", "", text)     # Remove hashtags
    text = re.sub(r"@\w+", "", text)     # Remove mentions
    text = re.sub(r'[^\w\s]', '', text)  # Remove remaining punctuation
    return text.lower()

# Preprocess the fetched tweets
cleaned_tweets = [preprocess_tweet(tweet.text) for tweet in tweets]

4. Sentiment Analysis of Tweets:
   ○ We'll use the VADER sentiment analyzer from NLTK to classify the sentiment of each tweet as positive, negative, or neutral.

python

from nltk.sentiment.vader import SentimentIntensityAnalyzer

# Initialize the VADER sentiment analyzer
sid = SentimentIntensityAnalyzer()

# Analyze the sentiment of each tweet
for tweet in cleaned_tweets:
    sentiment = sid.polarity_scores(tweet)
    print(f"Tweet: {tweet} | Sentiment: {sentiment['compound']}")

5. Visualizing Sentiment Distribution:
   ○ You can use matplotlib to visualize the distribution of sentiments (positive, negative, neutral) across the tweets.

python

import matplotlib.pyplot as plt

# Count the number of positive, neutral, and negative sentiments
sentiment_counts = {'positive': 0, 'neutral': 0, 'negative': 0}

for tweet in cleaned_tweets:
    sentiment = sid.polarity_scores(tweet)
    if sentiment['compound'] >= 0.05:
        sentiment_counts['positive'] += 1
    elif sentiment['compound'] <= -0.05:
        sentiment_counts['negative'] += 1
    else:
        sentiment_counts['neutral'] += 1

# Plot the sentiment distribution
labels = ['Positive', 'Neutral', 'Negative']
sizes = [sentiment_counts['positive'], sentiment_counts['neutral'], sentiment_counts['negative']]
plt.pie(sizes, labels=labels, autopct='%1.1f%%', startangle=90, colors=['green', 'gray', 'red'])
plt.axis('equal')  # Equal aspect ratio ensures the pie chart is drawn as a circle
plt.title(f"Sentiment Distribution for {keyword} Tweets")
plt.show()

6. Applications of Twitter Sentiment Analysis:

● Brand Monitoring: Tracking how users perceive a brand or product based on real-time feedback on social media.
● Political Sentiment: Analyzing public opinion on political issues or candidates.
